The basic NFS-powered setup suggested in the OpenNebula documentation works, but it has a few deficiencies:
- It doesn't allow for decoupled storage and frontend nodes, unless one mixes multiple protocols (e.g. iSCSI between the storage and frontend nodes, NFS between the cluster and frontend nodes).
- With a decoupled storage node it can be quite slow. Consider, for example, the time it takes to copy a 24G sparse image over a 1Gbps LAN:
    frontend-node$ ls -lh disk.0
    -rw-r--r-- 1 oneadmin cloud 25G Sep 14 17:45 disk.0
    frontend-node$ ls -sh disk.0
    803M disk.0
    frontend-node$ time cp disk.0 disk.test

    real    37m49.798s
    user    0m7.141s
    sys     16m51.376s
- It is grossly inefficient. If each VM is a mostly identical copy of a master image, you shouldn’t need another 800GB for the VMs.
One may alleviate the above problems with a ridiculously expensive setup that combines FCP or iSCSI over a dedicated 10Gbps storage VLAN with an expensive SAN with deduplication capabilities. Alternatively, one may leverage the power of ZFS, which provides lightweight clones at zero performance cost.
Intro
Instant cloning relies on the relevant ZFS capability, which allows creating a new ZFS dataset (a clone) based on an existing snapshot. Consider for example a ZFS dataset that contains the disk image of the OpenNebula sample VM:
    # zfs list rpool/export/home/cloud/images/ttylinux
    NAME                                      USED  AVAIL  REFER  MOUNTPOINT
    rpool/export/home/cloud/images/ttylinux  28.0M   109G  27.9M  /srv/cloud/images/ttylinux
One can grab a snapshot of this ZFS dataset in less than a second:
    # time zfs snapshot rpool/export/home/cloud/images/ttylinux@test

    real    0m0.477s
    user    0m0.004s
    sys     0m0.008s
    root@qa-x2100-3:~# zfs list rpool/export/home/cloud/images/ttylinux@test
    NAME                                           USED  AVAIL  REFER  MOUNTPOINT
    rpool/export/home/cloud/images/ttylinux@test      0      -  27.9M  -
And create a clone of this snapshot again in an instant:
    # time zfs clone rpool/export/home/cloud/images/ttylinux@test \
        rpool/export/home/cloud/one/var/testclone

    real    0m0.294s
    user    0m0.024s
    sys     0m0.048s
    # zfs list rpool/export/home/cloud/one/var/testclone
    NAME                                        USED  AVAIL  REFER  MOUNTPOINT
    rpool/export/home/cloud/one/var/testclone     1K   109G  27.9M  /srv/cloud/one/var/testclone
Besides being "instant", snapshots and clones share a couple of other interesting properties:
- they are cheap: as the USED column shows, they consume minimal disk space upon creation
- they are lightweight: ZFS has been designed to support thousands of filesystems and snapshots
The above properties make ZFS the optimal choice for a VM storage subsystem.
Preparing the storage node
It should be clear by now that in a ZFS-based OpenNebula setup the optimal storage choice is to grab a snapshot of each "master image", which will then be cloned for every VM based upon it.
1. Delegate the appropriate ZFS permissions to the oneadmin user
    storage-node# zfs allow oneadmin clone,create,mount,share,sharenfs rpool/export/home/cloud
2. Create a separate ZFS dataset for each master image you need to support. The following example creates one dataset for a Solaris 10 Update 8 master image and one for the OpenNebula sample VM:
    storage-node# zfs create rpool/export/home/cloud/images/S10U8
    storage-node# zfs create rpool/export/home/cloud/images/ttylinux
3. Copy the master disk image into the dataset directory under the filename "disk.0". It is important not to use a different filename.
    storage-node# ls -lh /srv/cloud/images/S10U8/
    total 803M
    -rw-r--r-- 1 oneadmin cloud  670 Aug 16 20:17 S10U8.template
    -rw-r--r-- 1 oneadmin cloud  25G Sep 14 17:45 disk.0
    -rw-r--r-- 1 oneadmin cloud  924 Sep 23 15:50 s10u8.one
4. Grab a "golden snapshot" of your image once it's ready:

    storage-node# zfs snapshot rpool/export/home/cloud/images/S10U8@gold
5. Create the ZFS dataset that will store the ZFS clones.
    storage-node# zfs create rpool/export/home/cloud/one/var
6. Make sure that the oneadmin user can do a non-interactive login from the frontend to the storage node with SSH key-based authentication.
    oneadmin@frontend-node$ ssh storage-node echo > /dev/null && echo success
    success
Instant cloning
Having prepared the storage server, it's time to customize the NFS transfer driver so that it creates the VM instance image via a ZFS clone instead of a cp(1) over NFS. The NFS driver essentially takes two arguments, similar to the following:
- frontend-node:/srv/cloud/images/S10U8/disk.0
- cluster-node:/srv/cloud/one/var/${VMID}/images/disk.0
and after some straightforward parsing executes the following commands:
    oneadmin@frontend$ mkdir -p /srv/cloud/one/var/${VMID}/images/
    oneadmin@frontend$ cp /srv/cloud/images/S10U8/disk.0 /srv/cloud/one/var/${VMID}/images/disk.0
Essentially we want to tweak the cloning script to run the following commands instead:
    oneadmin@frontend$ ssh storage-server zfs create rpool/export/home/cloud/var/${VMID}
    oneadmin@frontend$ ssh storage-server zfs create rpool/export/home/cloud/var/${VMID}/images
    oneadmin@frontend$ ssh storage-server zfs clone rpool/export/home/cloud/images/S10U8@gold \
        rpool/export/home/cloud/var/${VMID}/images
Unfortunately the above commands will not work. The reason is that OpenNebula uses mkdir(1) to create "/srv/cloud/var/${VMID}" before calling the cloning script (not really certain if I should file a bug for it), so the clone would create a dataset under an already existing directory, something that may lead to funny behavior.
This is the reason we created a separate dataset to host our clones. Having done that we can slightly revise the above commands:
    oneadmin@frontend$ ssh storage-server zfs clone rpool/export/home/cloud/images/S10U8@gold \
        rpool/export/home/cloud/one/var/images/${VMID}
    oneadmin@frontend$ ln -s /srv/cloud/one/var/images/${VMID} /srv/cloud/one/var/${VMID}/images
As simple as that. It does require some extra parsing logic to figure out the ZFS dataset paths, but the results, as evidenced by oned.log, are astounding:
    Thu Sep 23 11:06:15 2010 [TM][D]: Message received: LOG - 73 tm_clone.sh: opennebula.sil.priv:/srv/cloud/images/S10U8/disk.0 10.8.3.218:/srv/cloud/one/var/73/images/disk.0
    Thu Sep 23 11:06:15 2010 [TM][D]: Message received: LOG - 73 tm_clone.sh: Cloning ZFS rpool/export/home/cloud/images/S10U8@gold to
    Thu Sep 23 11:06:15 2010 [TM][D]: Message received: LOG - 73 tm_clone.sh: Executed "chmod a+w /srv/cloud/one/var/73/images/disk.0".
    Thu Sep 23 11:06:15 2010 [TM][D]: Message received: TRANSFER SUCCESS 73 -
Cloning speed of one second. ONE!
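The extra parsing is essentially a path translation from the NFS-visible /srv/cloud hierarchy to the backing ZFS datasets. Here is a minimal standalone sketch of that translation; the pool and base path mirror the variables in tm_clone.sh, and the sample path is purely illustrative.

```shell
# Minimal sketch of the path translation used by the modified driver.
# ZFS_POOL and ZFS_BASE_PATH mirror the variables in tm_clone.sh; the
# sample path is illustrative only.

ZFS_POOL=rpool
ZFS_BASE_PATH=/export/home/cloud   # the dataset path that maps to /srv/cloud

# Translate an NFS path such as /srv/cloud/images/S10U8/disk.0 into the
# ZFS dataset that backs its directory.
arg_zfs_path() {
    dirname "$(echo "$1" | sed -e "s#^/srv/cloud#${ZFS_POOL}${ZFS_BASE_PATH}#")"
}

arg_zfs_path /srv/cloud/images/S10U8/disk.0
# prints: rpool/export/home/cloud/images/S10U8
```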
The tm_clone.sh that implements the above commands follows:
    #!/bin/bash

    # -------------------------------------------------------------------------- #
    # Copyright 2002-2009, Distributed Systems Architecture Group, Universidad   #
    # Complutense de Madrid (dsa-research.org)                                   #
    #                                                                            #
    # Licensed under the Apache License, Version 2.0 (the "License"); you may    #
    # not use this file except in compliance with the License. You may obtain    #
    # a copy of the License at                                                   #
    #                                                                            #
    # http://www.apache.org/licenses/LICENSE-2.0                                 #
    #                                                                            #
    # Unless required by applicable law or agreed to in writing, software        #
    # distributed under the License is distributed on an "AS IS" BASIS,          #
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.   #
    # See the License for the specific language governing permissions and        #
    # limitations under the License.                                             #
    # -------------------------------------------------------------------------- #

    SRC=$1
    DST=$2

    ZFS_HOST=10.8.30.191
    ZFS_POOL=rpool
    ZFS_BASE_PATH=/export/home/cloud     ## the path that maps to /srv/cloud
    ZFS_IMAGES_PATH=one/var/images       ## relative to ZFS_BASE_PATH
    ZFS_CMD=/usr/sbin/zfs
    ZFS_SNAPSHOT_NAME=gold

    if [ -z "${ONE_LOCATION}" ]; then
        TMCOMMON=/usr/lib/one/mads/tm_common.sh
    else
        TMCOMMON=$ONE_LOCATION/lib/mads/tm_common.sh
    fi

    . $TMCOMMON

    get_vmdir

    # Map an NFS path under /srv/cloud to the ZFS dataset backing its directory
    function arg_zfs_path
    {
        dirname `echo $1 | sed -e "s#^/srv/cloud#${ZFS_POOL}${ZFS_BASE_PATH}#"`
    }

    function zfs_strip_pool
    {
        echo $1 | sed -e "s/${ZFS_POOL}//"
    }

    function get_vm_path
    {
        dirname `dirname $1`
    }

    SRC_PATH=`arg_path $SRC`
    DST_PATH=`arg_path $DST`
    VM_PATH=`get_vm_path $DST_PATH`
    VM_ID=`basename $VM_PATH`

    fix_paths

    ZFS_SRC_PATH=`arg_zfs_path $SRC_PATH`
    TMPPP=`arg_zfs_path $DST_PATH`
    ZFS_DST_MNT_PATH=`zfs_strip_pool $TMPPP`
    ZFS_DST_PATH=${ZFS_POOL}${ZFS_BASE_PATH}/${ZFS_IMAGES_PATH}/${VM_ID}
    DST_DIR=`dirname $DST_PATH`

    log "Cloning ZFS ${ZFS_SRC_PATH}@${ZFS_SNAPSHOT_NAME} to ${ZFS_DST_PATH}"
    exec_and_log "ssh ${ZFS_HOST} ${ZFS_CMD} clone ${ZFS_SRC_PATH}@${ZFS_SNAPSHOT_NAME} ${ZFS_DST_PATH}"
    exec_and_log "ln -s ${VMDIR}/images/${VM_ID} ${VMDIR}/${VM_ID}/images"
    exec_and_log "chmod a+w $DST_PATH"
VM deletion
Once a VM is torn down one may dispose of its image files. The NFS transfer driver does so by disposing of the images directory altogether, executing a command similar to the following:
    oneadmin@frontend-node$ rm -rf /srv/cloud/one/var/${VMID}/images
This kind of works but is suboptimal, since the ZFS clone dataset holding the VM instance image remains around. Ideally the transfer driver deletion script should run the following command instead:
    oneadmin@frontend-node$ ssh storage-server zfs destroy rpool/export/home/cloud/one/var/images/${VMID}
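The only subtle part is deriving the clone dataset name from the images directory path that the driver receives. A minimal sketch, with variable names mirroring the delete script and an illustrative VM id of 73:

```shell
# Sketch: derive the ZFS clone dataset to destroy from the images
# directory path handed to the delete driver. All values are illustrative.

ZFS_POOL=rpool
ZFS_BASE_PATH=/export/home/cloud
VMDIR=/srv/cloud/one/var                 # OpenNebula VM directory

SRC_PATH=${VMDIR}/73/images              # what the driver receives

# The VM id is the name of the directory holding "images" ...
VM_ID=$(basename "$(dirname "$SRC_PATH")")
# ... and the clone lives under the shared images dataset.
DST_PATH=${VMDIR}/images/${VM_ID}
ZFS_DST_PATH=$(echo "$DST_PATH" | sed -e "s#^/srv/cloud#${ZFS_POOL}${ZFS_BASE_PATH}#")

echo "$ZFS_DST_PATH"
# prints: rpool/export/home/cloud/one/var/images/73
```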
The script implementing the above follows:
    #!/bin/bash

    # -------------------------------------------------------------------------- #
    # Copyright 2002-2009, Distributed Systems Architecture Group, Universidad   #
    # Complutense de Madrid (dsa-research.org)                                   #
    #                                                                            #
    # Licensed under the Apache License, Version 2.0 (the "License"); you may    #
    # not use this file except in compliance with the License. You may obtain    #
    # a copy of the License at                                                   #
    #                                                                            #
    # http://www.apache.org/licenses/LICENSE-2.0                                 #
    #                                                                            #
    # Unless required by applicable law or agreed to in writing, software        #
    # distributed under the License is distributed on an "AS IS" BASIS,          #
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.   #
    # See the License for the specific language governing permissions and        #
    # limitations under the License.                                             #
    # -------------------------------------------------------------------------- #

    SRC=$1
    DST=$2

    ZFS_HOST=10.8.30.191
    ZFS_POOL=rpool
    ZFS_BASE_PATH=/export/home/cloud     ## the path that maps to /srv/cloud
    ZFS_IMAGES_PATH=one/var/images       ## relative to ZFS_BASE_PATH
    ZFS_CMD=/usr/sbin/zfs
    ZFS_SNAPSHOT_NAME=gold

    if [ -z "${ONE_LOCATION}" ]; then
        TMCOMMON=/usr/lib/one/mads/tm_common.sh
    else
        TMCOMMON=$ONE_LOCATION/lib/mads/tm_common.sh
    fi

    . $TMCOMMON

    get_vmdir

    function arg_zfs_path
    {
        echo $1 | sed -e "s#^/srv/cloud#${ZFS_POOL}${ZFS_BASE_PATH}#"
    }

    function zfs_strip_pool
    {
        echo $1 | sed -e "s/${ZFS_POOL}//"
    }

    function get_vm_path
    {
        dirname `dirname $1`
    }

    fix_src_path

    log $SRC_PATH

    VM_ID=`basename \`dirname $SRC_PATH\``
    DST_PATH=${VMDIR}/images/${VM_ID}
    ZFS_DST_PATH=`arg_zfs_path ${DST_PATH}`

    log "Destroying ${ZFS_DST_PATH} dataset"
    exec_and_log "ssh ${ZFS_HOST} ${ZFS_CMD} destroy ${ZFS_DST_PATH}"
Miscellaneous notes
There are various miscellaneous enhancements one could make to the above scripts (for instance, the ZFS variables and helper functions should be defined in a single location). They are provided as-is, without any commitment that they will work in your environment (though they probably will, with minimal changes).
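For instance, the shared variables and helpers could live in a single file sourced by both scripts. A hypothetical sketch, say tm_zfs_common.sh (the file name is an assumption; host, pool and paths are the example values used throughout this post):

```shell
# tm_zfs_common.sh -- hypothetical shared configuration, sourced by both
# the clone and delete scripts with:  . `dirname $0`/tm_zfs_common.sh
# Adjust the values below to match your environment.

ZFS_HOST=10.8.30.191
ZFS_POOL=rpool
ZFS_BASE_PATH=/export/home/cloud     ## the path that maps to /srv/cloud
ZFS_IMAGES_PATH=one/var/images       ## relative to ZFS_BASE_PATH
ZFS_CMD=/usr/sbin/zfs
ZFS_SNAPSHOT_NAME=gold

# Map an NFS-visible path under /srv/cloud to its backing ZFS dataset.
zfs_dataset_for() {
    echo "$1" | sed -e "s#^/srv/cloud#${ZFS_POOL}${ZFS_BASE_PATH}#"
}

# Strip the pool component, yielding the path relative to the pool root.
zfs_strip_pool() {
    echo "$1" | sed -e "s/^${ZFS_POOL}//"
}
```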
References
Opennebula, ZFS and Xen – Part 1 (Get going)
ZFS admin guide
Tags: opennebula, virtualization, xen, zfs
July 14, 2011 at 4:42 pm |
Nice guide 🙂 What about creating a similar solution with btrfs, that is compatible natively with Linux?
July 14, 2011 at 5:16 pm |
I will admit that, being mostly a Solaris guy, I haven't toyed with btrfs, nor is it on my radar.
I may end up doing this eventually but don’t hold your breath 😉
October 13, 2011 at 9:01 am |
Interesting solution! Have you experimented with live migration in this setup? Does it work? Thanks!
October 13, 2011 at 6:00 pm |
Hi Humberto,
No, I didn't experiment with live migration in this setup, and unfortunately it is no longer around to give it a shot. That said, from the VM hosts' perspective the storage backend remains a "Shared FS" common across all hosts, which is the main requirement for live migration. Hence, I don't see any reason why it wouldn't work.
October 21, 2011 at 11:52 pm |
Hi again and sorry if I ask too much 🙂
I have been wanting to give a try to your setup since I first read this post, but I have not had time yet. Luckily, it seems I will have some time in the coming days.
So before I get my hands dirty, I would like to have everything clear. If I understood it right, points 1 to 6 must be done in advance for the system to work with the tm files you provide, right? Points 1, 5 and 6 can be done once at installation time. But points 2 to 4 need to be done for every new image that we are going to use with Opennebula, right? I guess that the “oneimage register” command could then be modified to run the commands in points 2 to 4, right? In that way the whole process would be automated.
Cheers,
Humberto
October 22, 2011 at 2:29 am |
Hi Humberto,
Correct: {1,5,6} are do-once instructions, {2,3,4} are per-image.
I will admit that I haven’t got a chance to toy around with newer Opennebula releases so I guess you are right about the “oneimage register” command and steps 2-4, but this is strictly based on what I’ve read in the documentation.
Note that I am not certain how well these instructions will work out in a 3.0 setup (they were written for a 1.4 installation). My suggestion is to figure out their “spirit” rather than follow them to the “letter”.