Opennebula, ZFS and Xen – Part 2 (Instant cloning)

The basic NFS-powered setup suggested in the OpenNebula documentation works, but it has a few deficiencies:

  • It doesn’t allow for decoupled storage and frontend nodes, unless one mixes multiple protocols (e.g. iSCSI between the storage and frontend nodes, NFS between the cluster and frontend nodes)
  • With a decoupled storage node it can be quite slow. Consider, for example, the time it takes to copy a 25G sparse image (only 803M actually allocated) over a 1Gbps LAN:
frontend-node$ ls -lh disk.0
-rw-r--r-- 1 oneadmin cloud 25G Sep 14 17:45 disk.0
frontend-node$ ls -sh disk.0
803M disk.0
frontend-node$ time cp disk.0 disk.test
real    37m49.798s
user    0m7.141s
sys     16m51.376s
  • It is grossly inefficient. If each VM is a mostly identical copy of a master image, you shouldn’t need to burn another 800GB of storage just for the per-VM copies.

One may alleviate the above problems with a ridiculously expensive setup that combines FCP or iSCSI over a dedicated 10Gbps storage VLAN with a SAN offering deduplication capabilities. Alternatively, one may leverage the power of ZFS, which provides lightweight clones at practically zero performance cost.

Intro

Instant cloning relies on the ZFS cloning capability, which allows creating a new ZFS dataset (a clone) from an existing snapshot. Consider, for example, a ZFS dataset that contains the disk image of the OpenNebula sample VM:

# zfs list rpool/export/home/cloud/images/ttylinux
NAME                                      USED  AVAIL  REFER  MOUNTPOINT
rpool/export/home/cloud/images/ttylinux  28.0M   109G  27.9M  /srv/cloud/images/ttylinux

One can grab a snapshot of this ZFS dataset in less than a second:

# time zfs snapshot rpool/export/home/cloud/images/ttylinux@test

real    0m0.477s
user    0m0.004s
sys     0m0.008s
# zfs list rpool/export/home/cloud/images/ttylinux@test
NAME                                           USED  AVAIL  REFER  MOUNTPOINT
rpool/export/home/cloud/images/ttylinux@test      0      -  27.9M  -

And create a clone from this snapshot, again in an instant:

# time zfs clone rpool/export/home/cloud/images/ttylinux@test \
   rpool/export/home/cloud/one/var/testclone

real    0m0.294s
user    0m0.024s
sys     0m0.048s
# zfs list rpool/export/home/cloud/one/var/testclone
NAME                                        USED  AVAIL  REFER  MOUNTPOINT
rpool/export/home/cloud/one/var/testclone     1K   109G  27.9M  /srv/cloud/one/var/testclone
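
The clone is mounted right away and exposes the same files as its origin dataset, so the cloned disk image is immediately usable; a quick check (output omitted, it simply mirrors the origin):

# ls -lh /srv/cloud/one/var/testclone/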

Besides being “instant”, snapshots and clones share a couple of extra interesting properties:

  • They initially occupy next to no space: thanks to copy-on-write, a snapshot or clone only consumes additional storage for the blocks that later diverge from its origin (note the 0 and 1K USED figures above).
  • A clone is a regular read-write dataset, so each VM can modify its disk image freely without affecting the master image or any other clone.

The above properties make ZFS the optimal choice for a VM storage subsystem.
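
A quick way to verify this is to ask ZFS for a clone’s origin and space accounting (a sketch using the testclone dataset created above; exact figures will vary):

# zfs get origin,used,referenced rpool/export/home/cloud/one/var/testclone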

Preparing the storage node

It should be clear by now that in a ZFS-based OpenNebula setup the optimal approach is to keep a snapshot of each “master image” and clone it for every VM instantiated from a template based upon it.

1. Delegate the appropriate ZFS permissions to the oneadmin user

storage-node# zfs allow oneadmin clone,create,mount,share,sharenfs rpool/export/home/cloud
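
To double-check that the delegation took effect, list the permissions now set on the dataset:

storage-node# zfs allow rpool/export/home/cloud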

2. Create a separate ZFS dataset for each master image you need to support. The following example creates one dataset for a Solaris 10 Update 8 master image and one for the OpenNebula sample VM:

storage-node# zfs create rpool/export/home/cloud/images/S10U8
storage-node# zfs create rpool/export/home/cloud/images/ttylinux

3. Copy the master disk image into the dataset directory, using a “disk.0” filename; it is important not to use a different filename (an example copy is sketched after the listing below).

storage-node# ls -lh /srv/cloud/images/S10U8/
total 803M
-rw-r--r-- 1 oneadmin cloud 670 Aug 16 20:17 S10U8.template
-rw-r--r-- 1 oneadmin cloud 25G Sep 14 17:45 disk.0
-rw-r--r-- 1 oneadmin cloud 924 Sep 23 15:50 s10u8.one
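
For example, copying a master image into place could look like the following (the /path/to/s10u8-master.img source path is of course hypothetical):

storage-node# cp /path/to/s10u8-master.img /srv/cloud/images/S10U8/disk.0
storage-node# chown oneadmin:cloud /srv/cloud/images/S10U8/disk.0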

4. Grab a “golden snapshot” of your image once it is ready:

storage-node# zfs snapshot rpool/export/home/cloud/images/S10U8@gold
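
To verify the snapshot is in place:

storage-node# zfs list -t snapshot -r rpool/export/home/cloud/images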

5. Create the ZFS dataset that will host the ZFS clones (using -p so that any missing parent datasets get created as well):

storage-node# zfs create -p rpool/export/home/cloud/one/var/images

6. Make sure that the oneadmin user can log in non-interactively from the frontend node to the storage node using SSH key-based authentication:


oneadmin@frontend-node$ ssh storage-node echo > /dev/null && echo success
success
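
If key-based authentication is not in place yet, a minimal sketch looks as follows (assuming the storage node resolves as “storage-node” and the oneadmin account exists on both ends):

oneadmin@frontend-node$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
oneadmin@frontend-node$ cat ~/.ssh/id_rsa.pub | ssh storage-node 'cat >> ~/.ssh/authorized_keys'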

Instant cloning

Having prepared the storage server, it’s time to customize the NFS transfer driver so that the VM instance image is created as a ZFS clone instead of a cp(1) over NFS. The NFS cloning script (tm_clone.sh) essentially takes two arguments similar to the following:

  • frontend-node:/srv/cloud/images/S10U8/disk.0
  • cluster-node:/srv/cloud/one/var/${VMID}/images/disk.0

and after some straightforward parsing executes the following commands:

oneadmin@frontend$ mkdir -p /srv/cloud/one/var/${VMID}/images/
oneadmin@frontend$ cp /srv/cloud/images/S10U8/disk.0 /srv/cloud/one/var/${VMID}/images/disk.0

Essentially we want to tweak the cloning script to run the following commands instead:

oneadmin@frontend$ ssh storage-server zfs create rpool/export/home/cloud/one/var/${VMID}
oneadmin@frontend$ ssh storage-server zfs create rpool/export/home/cloud/one/var/${VMID}/images
oneadmin@frontend$ ssh storage-server zfs clone rpool/export/home/cloud/images/S10U8@gold \
                   rpool/export/home/cloud/one/var/${VMID}/images

Unfortunately the above commands will not work. The reason is that OpenNebula uses mkdir(1) to create “/srv/cloud/one/var/${VMID}” before calling the cloning script (not really certain if I should file a bug for it), so we would be creating a dataset whose mountpoint directory already exists, something that may lead to funny behavior.

This is the reason we created a separate dataset to host our clones. Having done that we can slightly revise the above commands:

oneadmin@frontend$ ssh storage-server zfs clone rpool/export/home/cloud/images/S10U8@gold \
                   rpool/export/home/cloud/one/var/images/${VMID}
oneadmin@frontend$ ln -s /srv/cloud/one/var/images/${VMID} /srv/cloud/one/var/${VMID}/images

As simple as that. It does add some extra parsing logic to figure out the ZFS dataset path, but the results, as evidenced by oned.log, are astounding:


Thu Sep 23 11:06:15 2010 [TM][D]: Message received: LOG - 73 tm_clone.sh: opennebula.sil.priv:/srv/cloud/images/S10U8/disk.0 10.8.3.218:/srv/cloud/one/var/73/images/disk.0
Thu Sep 23 11:06:15 2010 [TM][D]: Message received: LOG - 73 tm_clone.sh: Cloning ZFS rpool/export/home/cloud/images/S10U8@gold to
Thu Sep 23 11:06:15 2010 [TM][D]: Message received: LOG - 73 tm_clone.sh: Executed "chmod a+w /srv/cloud/one/var/73/images/disk.0".

Thu Sep 23 11:06:15 2010 [TM][D]: Message received: TRANSFER SUCCESS 73 -

A cloning time of one second. ONE!

The tm_clone.sh that implements the above commands follows:

#!/bin/bash

# -------------------------------------------------------------------------- #
# Copyright 2002-2009, Distributed Systems Architecture Group, Universidad   #
# Complutense de Madrid (dsa-research.org)                                   #
#                                                                            #
# Licensed under the Apache License, Version 2.0 (the "License"); you may    #
# not use this file except in compliance with the License. You may obtain    #
# a copy of the License at                                                   #
#                                                                            #
# http://www.apache.org/licenses/LICENSE-2.0                                 #
#                                                                            #
# Unless required by applicable law or agreed to in writing, software        #
# distributed under the License is distributed on an "AS IS" BASIS,          #
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.   #
# See the License for the specific language governing permissions and        #
# limitations under the License.                                             #
#--------------------------------------------------------------------------- #

SRC=$1
DST=$2

ZFS_HOST=10.8.30.191
ZFS_POOL=rpool
ZFS_BASE_PATH=/export/home/cloud        ## this is the path that maps to /srv/cloud
ZFS_IMAGES_PATH=one/var/images          ## relative to ZFS_BASE_PATH
ZFS_CMD=/usr/sbin/zfs
ZFS_SNAPSHOT_NAME=gold

if [ -z "${ONE_LOCATION}" ]; then
    TMCOMMON=/usr/lib/one/mads/tm_common.sh
else
    TMCOMMON=$ONE_LOCATION/lib/mads/tm_common.sh
fi

. $TMCOMMON

get_vmdir

# Map a path under /srv/cloud to the ZFS dataset that contains it
function arg_zfs_path
{
    dirname `echo $1 | sed -e "s#^/srv/cloud#${ZFS_POOL}${ZFS_BASE_PATH}#"`
}

# Strip the pool name from a ZFS dataset path
function zfs_strip_pool
{
    echo $1 | sed -e "s/${ZFS_POOL}//"
}

# Given .../${VMID}/images/disk.0, return the .../${VMID} VM directory
function get_vm_path
{
    dirname `dirname $1`
}

SRC_PATH=`arg_path $SRC`
DST_PATH=`arg_path $DST`
VM_PATH=`get_vm_path $DST_PATH`
VM_ID=`basename $VM_PATH`

fix_paths

ZFS_SRC_PATH=`arg_zfs_path $SRC_PATH`
TMPPP=`arg_zfs_path $DST_PATH`
ZFS_DST_MNT_PATH=`zfs_strip_pool $TMPPP`
ZFS_DST_PATH=${ZFS_POOL}${ZFS_BASE_PATH}/${ZFS_IMAGES_PATH}/${VM_ID}

DST_DIR=`dirname $DST_PATH`

log "Cloning ZFS ${ZFS_SRC_PATH}@${ZFS_SNAPSHOT_NAME} to ${ZFS_DST_CLONE_PATH}"
exec_and_log "ssh ${ZFS_HOST} ${ZFS_CMD} clone ${ZFS_SRC_PATH}@${ZFS_SNAPSHOT_NAME} ${ZFS_DST_PATH}"
exec_and_log "ln -s ${VMDIR}/images/${VM_ID} ${VMDIR}/${VM_ID}/images"

exec_and_log "chmod a+w $DST_PATH"

VM deletion
Once a VM is torn down, one may dispose of its image files. The NFS transfer driver does so by removing the images directory altogether, executing a command similar to the following:

oneadmin@frontend-node$ rm -rf /srv/cloud/one/var/${VMID}/images

This kind of works, but it is suboptimal: the ZFS clone dataset holding the VM instance image sticks around. Ideally the transfer driver deletion script should run the following command instead:

oneadmin@frontend-node$ ssh storage-server zfs destroy rpool/export/home/cloud/one/var/images/${VMID}

The script implementing the above follows:


#!/bin/bash

# -------------------------------------------------------------------------- #
# Copyright 2002-2009, Distributed Systems Architecture Group, Universidad   #
# Complutense de Madrid (dsa-research.org)                                   #
#                                                                            #
# Licensed under the Apache License, Version 2.0 (the "License"); you may    #
# not use this file except in compliance with the License. You may obtain    #
# a copy of the License at                                                   #
#                                                                            #
# http://www.apache.org/licenses/LICENSE-2.0                                 #
#                                                                            #
# Unless required by applicable law or agreed to in writing, software        #
# distributed under the License is distributed on an "AS IS" BASIS,          #
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.   #
# See the License for the specific language governing permissions and        #
# limitations under the License.                                             #
#--------------------------------------------------------------------------- #

SRC=$1
DST=$2

ZFS_HOST=10.8.30.191
ZFS_POOL=rpool
ZFS_BASE_PATH=/export/home/cloud        ## this is the path that maps to /srv/cloud
ZFS_IMAGES_PATH=one/var/images          ## relative to ZFS_BASE_PATH
ZFS_CMD=/usr/sbin/zfs
ZFS_SNAPSHOT_NAME=gold

if [ -z "${ONE_LOCATION}" ]; then
    TMCOMMON=/usr/lib/one/mads/tm_common.sh
else
    TMCOMMON=$ONE_LOCATION/lib/mads/tm_common.sh
fi

. $TMCOMMON

get_vmdir

# Map a path under /srv/cloud to the corresponding ZFS dataset path
function arg_zfs_path
{
    echo $1 | sed -e "s#^/srv/cloud#${ZFS_POOL}${ZFS_BASE_PATH}#"
}
function zfs_strip_pool
{
    echo $1 | sed -e "s/${ZFS_POOL}//"
}

function get_vm_path
{
    dirname `dirname $1`
}

SRC_PATH=`arg_path $SRC`        # strip the host: prefix from the source argument

fix_src_path

log $SRC_PATH
VM_ID=`basename \`dirname $SRC_PATH\``
DST_PATH=${VMDIR}/images/${VM_ID}
ZFS_DST_PATH=`arg_zfs_path ${DST_PATH}`

log "Destroying ${ZFS_DST_PATH} dataset"
exec_and_log "ssh ${ZFS_HOST} ${ZFS_CMD} destroy ${ZFS_DST_PATH}"

Miscellaneous notes

There are various miscellaneous enhancements one could make to the above scripts (for instance, the ZFS variables and helper functions should be defined in a single location shared by both). They are provided as-is, without any commitment that they will work in your environment (though they probably will with minimal changes).
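
For instance, the shared settings could live in a small file sourced by both scripts right after tm_common.sh, say /usr/lib/one/mads/tm_zfs.conf (the name and location are arbitrary):

# /usr/lib/one/mads/tm_zfs.conf -- settings shared by the ZFS-aware TM scripts
ZFS_HOST=10.8.30.191
ZFS_POOL=rpool
ZFS_BASE_PATH=/export/home/cloud        ## this is the path that maps to /srv/cloud
ZFS_IMAGES_PATH=one/var/images          ## relative to ZFS_BASE_PATH
ZFS_CMD=/usr/sbin/zfs
ZFS_SNAPSHOT_NAME=gold

Each script would then replace its local ZFS_* block with a single “. /usr/lib/one/mads/tm_zfs.conf” line.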

References

Opennebula, ZFS and Xen – Part 1 (Get going)

ZFS admin guide


8 Responses to “Opennebula, ZFS and Xen – Part 2 (Instant cloning)”

  1. Opennebula, ZFS and Xen – Part 3 (Oracle VM server) « ~mperedim/weblog Says:

    […] ~mperedim/weblog Just another WordPress.com weblog « Opennebula, ZFS and Xen – Part 2 (Instant cloning) […]

  2. Alex Says:

    Nice guide 🙂 What about creating a similar solution with btrfs, which is natively supported on Linux?

  3. mperedim Says:

    I will admit that, being mostly a Solaris guy, I haven’t toyed with btrfs, nor is it on my radar.

    I may end up doing this eventually but don’t hold your breath 😉

  4. Humberto Says:

    Interesting solution! Have you experimented with live migration in this setup? Does it work? Thanks!

  5. mperedim Says:

    Hi Humberto,

    No, I didn’t experiment with live migration in this setup, and unfortunately the setup is no longer available for me to give it a shot. That said, from the VM hosts’ perspective the storage backend remains a “Shared FS” common across all hosts, which is the main requirement for live migration. Hence, I don’t see any reason why it wouldn’t work.

  6. Humberto Says:

    Hi again and sorry if I ask too much 🙂

    I have been wanting to give a try to your setup since I first read this post, but I have not had time yet. Luckily, it seems I will have some time in the coming days.
    So before I get my hands dirty, I would like to have everything clear. If I understood it right, points 1 to 6 must be done in advance for the system to work with the tm files you provide, right? Points 1, 5 and 6 can be done once at installation time. But points 2 to 4 need to be done for every new image that we are going to use with Opennebula, right? I guess that the “oneimage register” command could then be modified to run the commands in points 2 to 4, right? In that way the whole process would be automated.

    Cheers,
    Humberto

  7. mperedim Says:

    Hi Humberto,

    Correct: {1,5,6} are do-once instructions, {2,3,4} are per-image.

    I will admit that I haven’t had a chance to toy around with newer Opennebula releases, so I guess you are right about the “oneimage register” command and steps 2-4, but this is strictly based on what I’ve read in the documentation.

    Note that I am not certain how well these instructions will work out in a 3.0 setup (they were written for a 1.4 installation). My suggestion is to figure out their “spirit” rather than follow them to the “letter”.

  8. OpenNebula with ZFS datastore | Rogierm's Blog Says:

    […] on this blog article I created an updated datastore driver that allows you to use a ZFS backend with OpenNebula. This […]
