Wednesday, July 14, 2010

qcow2 (performance) considered harmful?

qcow2 is a virtual disk format available with qemu (and therefore kvm and xen). It's designed to provide a rich set of features, including snapshotting, encryption and compression for virtual machines stored within it. It also allows disk files to be created empty and then grow as data is written to the virtual disk.
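
To see the grow-on-demand behaviour for yourself, you can create an empty qcow2 image and compare its allocated size on disk with its virtual size (the file name and 32GB size here are just examples):

qemu-img create -f qcow2 disk0.qcow2 32G
ls -lsh disk0.qcow2

The freshly created file only occupies a few hundred KB of actual disk space, despite presenting a 32GB disk to the guest.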

What some people may not realize is that these features are all implemented by storing additional metadata within the .qcow2 file that you see on disk. This is not the case with alternative raw image formats, where a flat file on disk stores a block-for-block image of the virtual disk.

I was aware of the presence of metadata, but not of its extent. I had a number of VM images sitting on qcow2 format virtual disks. What I found was that when trying to start a number of VMs simultaneously, KVM was actually timing out the VM creation and aborting. It seemed possible that this was just due to CPU load, but I investigated further.

With a test disk image I had available, which was 16GB in size for a 32GB virtual disk, I used the command:

qemu-img info disk0.qcow2

This took over 10 seconds to run to completion. Re-running the same command immediately after caused it to complete in well under a second. Clearly there was some caching effect going on at the OS level. To identify what it was, I ran the qemu-img command through strace:

strace -o qemu-image.strace qemu-img info disk0.qcow2

Analyzing the resulting strace file shows that the disk image is opened, followed by a large number of reads:

[snip]
17018 read(5, "\0\0\0\0\0\0020\0\0\0\0\0\0\200\20\0\0\0\0\0\1\0\20\0\0\0\0\0\1\200\20\0\0"..., 20480) = 20480
17018 lseek(5, 143360, SEEK_SET) = 143360
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 8392704, SEEK_SET) = 8392704
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 16781312, SEEK_SET) = 16781312
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 25169920, SEEK_SET) = 25169920
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 33558528, SEEK_SET) = 33558528
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 41947136, SEEK_SET) = 41947136
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 50335744, SEEK_SET) = 50335744
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 58724352, SEEK_SET) = 58724352
[etc]

For my 16GB file, those reads of 4,096 bytes are repeated a whopping 1,935 times. The reason this runs quickly the second time is that 1,935 blocks of 4KB each is only about 8MB of data to cache - not a large amount for current systems. Once the cache has been warmed up by the first run, the data is still there when the same command is run again shortly afterwards.
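
If you want to see the cold/warm difference for yourself without rebooting, a Linux host lets you throw away the page cache between runs and time the two cases (drop_caches is Linux-specific and needs root):

sync
echo 3 > /proc/sys/vm/drop_caches
time qemu-img info disk0.qcow2
time qemu-img info disk0.qcow2

The first timed run has to pull all of that metadata back in from disk; the second is served almost entirely from the page cache.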

This work seems to be done whenever the qcow2 format image is opened, and appears to account for why qcow2 format VMs can take a number of seconds to initialize prior to boot, particularly if the host has just been rebooted itself.

If this were a sequential read of 8MB, the whole process would always run quickly; even fairly old disks can read tens of MB per second without trouble. Unfortunately, these reads are scattered around the disk, with a seek between each one, so the disk head has to move 1,935 times just to read these metadata blocks. If the host filesystem metadata isn't cached either, there are additional forced seeks simply to find those blocks on the disk.

Modern 7,200 RPM SATA hard disks (like the ones I can afford for my lab!) can only service around 120 random I/Os per second (IOPS). Based on this figure, you'd expect any operation that 'stats' the sample qcow2 disk to take around 16 seconds (1,935 / 120) or longer. Now it's possible to understand why the VMs were timing out during a cold start: starting 4 VMs at once could easily take over 60 seconds, just to perform the disk I/Os needed to get the disk info.
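
Spelled out, the back-of-the-envelope arithmetic is:

  1,935 seeks / 120 IOPS = ~16 seconds per image with a cold cache
  4 images x ~16 seconds = ~64 seconds of seeking before any guest even starts booting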

I was able to work around my issue by converting my qcow2 disk images to raw format. Now I can start several VMs at the same time; the system load is high because of all the work being done, but there has been no sign of VM creation timing out. Of course, this loses all of the feature advantages of the qcow2 format. Fortunately, I still get snapshotting because I use ZFS as my underlying filesystem over a NAS; YMMV.

To convert the disk format, you can do the following (a fuller worked example is below the list):

  • shut down the VM!
  • qemu-img convert disk0.qcow2 disk0.raw
  • update your VM definition (e.g. for KVM/libvirt, the domain XML file) to point to the new .raw file
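
Spelling out the middle two steps with explicit formats, plus a sanity check before switching the guest over (the myguest domain name is just an example, and the virsh step assumes you're managing the VM with libvirt):

qemu-img convert -f qcow2 -O raw disk0.qcow2 disk0.raw
qemu-img info disk0.raw
virsh edit myguest

The qemu-img info output should report "file format: raw" with the same virtual size as before; in the domain XML, change the disk source (and driver type, if set) to point at disk0.raw.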

In summary, qcow2 is extremely feature-rich, but in specific corner cases (like a cold start) it can have some interesting performance issues. Choose the format you use carefully, or host your VMs on an SSD, where there's less of a performance penalty for random vs. sequential reads.

Thursday, July 8, 2010

Stacked ZFS filesystem deadlocking

Years ago, I posted how to create an encrypted filesystem on Solaris using the old CFS code as an encryption engine "shim", with ZFS layered on top to provide full POSIX semantics, scaling, integrity checks, etc.

The stack would look something like this:

ZFS (plaintext files) -> CFS encrypted mount -> UFS filesystem (encrypted files) -> raw devices

That was in 2006, when not many people were using ZFS for storage management. However, now that ZFS is likely to be the default choice for most people, you end up with a stack that looks like this:

ZFS (plaintext files) -> CFS encrypted mount -> ZFS root pool (encrypted files) -> raw devices

It turns out that when you layer ZFS like this, you can cause what appears to be a kernel deadlock under high I/O load when writing files into the top-level ZFS mount. The writing process becomes hung and unkillable, with the underlying filesystem unresponsive. If the underlying ZFS filesystem is in the root pool, this means that in practice the whole system is unusable and has to be hard reset.
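
For the record, the sort of layering I'm describing can be built up along these lines (the directory, file and pool names are purely illustrative, and I'm assuming /export/home lives in the ZFS root pool; the CFS setup itself is covered in more detail in the 2006 post below):

cmkdir /export/home/secure
cattach /export/home/secure secure
mkfile 1g /crypt/secure/store0001.zpool
zpool create -m /secure securepool /crypt/secure/store0001.zpool

Driving heavy writes into the top-level pool (mounted at /secure here) is the scenario that triggers the hang described above.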

While this is a slightly contrived scenario, it gives me a good insight into why the use of files (rather than devices) is not supported for creating zpools in production use. So remember, in this case, layering considered harmful!

Unfortunately, because of this, I would no longer suggest using this approach to creating encrypted filesystems on Solaris.

Sunday, August 16, 2009

Configuring PuTTY to look like a traditional xterm

PuTTY is a really useful terminal emulator and ssh client for Windows, but I'm not a huge fan of its default appearance. For years (15 or so...) I've had a preferred configuration for my xterm sessions on Unix, using the 9x15 font with an 'Ivory' coloured background. Here are my notes on how to get the same look under Windows with PuTTY.

1) Download the 9x15 font to Desktop
2) Install 9x15 font by double clicking and selecting 'Install'
3) Run PuTTY
4) Set Connection -> Seconds between keepalives = 60 (to avoid firewall timeouts)
5) Set Window -> Appearance -> Font = 9x15-ISO8859-1, 11 point
6) Set Window -> Translation -> Handling of line drawing characters = Font has XWindows Encoding
7) Uncheck "Bolded text is a different colour" in Window -> Colours
8) Set following Colours in Window -> Colours (Colour = Red, Green, Blue)
- Default Foreground = 0, 0, 0
- Default Bold Foreground = 0, 0, 0
- Default Background = 255, 255, 240
- Default Bold Background = 255, 255, 240
- Cursor Text = 255, 255, 240
- Cursor Colour = 0, 0, 0
9) Save as "Default Settings" profile in Session

All profiles created after this will include these settings.

Credit goes to Andrés Kievsky of http://www.ank.com.ar/fonts/ for making this font (and others) available in a usable form.

Thursday, December 25, 2008

Converting split vmdk files to qcow2

I've been playing around with VM environments recently. I've got some legacy golden VMs created with VMware Server that I'd like to run under the Linux KVM environment. There are a couple of challenges here:

1) convert disk format
2) make the VM boot under the different VM environment

For now, I'm addressing item #1.  The standard recipe is to run qemu-img on the .vmdk file to create a qcow2 disk image.  Unfortunately this doesn't work in my scenario because I'm using a vmdk split into sparse (2GB) files, which causes the standard invocation to fail.

Assuming that you have both the VMware and QEMU binaries available, this can be achieved with a two-step process:

1) convert the split files into a single, growable image file. The invocation for this is:

vmware-vdiskmanager -r "source.vmdk" -t 0 "temp.vmdk"

The option "-t 0" tells vdiskmanager to create a destination disk image which is growable and monolithic

2) use the standard approach to convert the monolithic file into a .qcow2:

qemu-img convert -f vmdk -O qcow2 "temp.vmdk" "destination.qcow2"

At this point, you have a disk image that is accessible to software expecting a qcow2 format file. After verifying that you can access it, feel free to delete the temporary file temp.vmdk.
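
It's worth sanity-checking the new image before deleting anything; for example (the boot test is optional, and the exact QEMU/KVM binary name varies by distribution):

qemu-img info destination.qcow2
qemu-system-x86_64 -hda destination.qcow2 -m 512

The info output should report "file format: qcow2" with the virtual disk size you expect.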

Sunday, November 19, 2006

Solaris 10 filesystem encryption

Update: please note that this was written a decade ago, and described what was essentially a hack that could be used to achieve slow storage encryption on Solaris 10. I do not recommend using this approach now or in the future.

Like many people today, I wanted to set up filesystem encryption on Solaris 10. It didn't need to be fast (just as well, since hardware encryption is currently out of the question). At a minimum, I wanted Data At Rest (DAR) storage encryption supporting a filesystem layered on top. The requirements I came up with were:

1) no plaintext stored on disk
2) encryption must be transparent to applications
3) filesystem must provide full POSIX semantics
4) availability is less important than integrity (if data is returned, it must be correct)

On Linux, this would be easy to do. Create a LUKS-format encrypted block device, map it through dm-crypt, then stick a filesystem on top with mkfs.  My laptop is actually protected this way already.
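
For comparison, the whole Linux stack is only a handful of commands (the device and mount point names are examples, and luksFormat will destroy any existing data on the partition):

cryptsetup luksFormat /dev/sdb1
cryptsetup luksOpen /dev/sdb1 crypt0
mkfs.ext3 /dev/mapper/crypt0
mount /dev/mapper/crypt0 /secure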

On Solaris, it should be this easy.  Encrypted filesystems are becoming increasingly necessary in Enterprise environments, and Enterprise computing shops are by far the biggest customers for Sun.  It will be this easy if Sun ever release the encrypted lofi drivers they've been talking about.  Unfortunately, it isn't currently this easy.

Considering the problem, the old CFS code from Matt Blaze sprang to mind as a possibility. It was years since I'd looked at this code - if it worked, it would make my encryption problems go away (and potentially cause my availability problems to start...). Unfortunately, some quick research determined that this wouldn't work, because the code (even if it was stable) doesn't support full POSIX semantics (e.g. chmod, chown, etc. don't work by design since the filesystem owner is the only user granted access).

Some web research failed to turn up a good alternative solution.

Further pondering. Perhaps I could use the CFS code to provide an encrypted file which I could then use as the backend for a filesystem through lofi? That might just work.

This left major concerns about data integrity. Using code that hasn't been maintained in 5 years and almost certainly has never been compiled on an Opteron with Solaris 10/amd64 by its author seems risky.  ZFS places integrity above performance (why else would they checksum every block?). zpools can also be constructed directly on top of files from another filesystem. Could ZFS on CFS be a match made in heaven?

It turns out that you can indeed create zpools on top of files encrypted with CFS. CFS has no largefiles support, so if you need the zpool to be larger than 2GB, you'll need to create multiple backing store files and add them as separate vdevs into the zpool.

This leaves us with a way to create POSIX filesystems on Solaris 10 with data integrity assured by ZFS and with all data 3DES encrypted through CFS.

If you're worried about exposing rpcbind or cfsd to the network, block the traffic using IP Filter (which is almost certainly installed on your Solaris 10 box already).
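
As a sketch, a couple of rules along these lines in /etc/ipf/ipf.conf will stop remote hosts talking to rpcbind (bge0 is an example interface name); to cover cfsd as well, a default-deny inbound policy is simpler than trying to enumerate its port:

block in quick on bge0 proto tcp from any to any port = 111
block in quick on bge0 proto udp from any to any port = 111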

The caveat to all of this is that if ZFS does discover corruption, you've probably lost your data, so this is only going to work for storage of transient data - don't stick your Oracle 10g environment on this. Also, and potentially more limiting, even on modern hardware it's still slow as a dog: ~4MB/sec for 3DES encryption of sequential I/O on a 2 GHz Opteron.

HOWTO

1) download cfs-1.4.1 (original downloaded from: http://www.crypto.com/software/)
2) install it, configure it and mount a CFS filesystem as /crypt
3) within CFS, create as many 2GB files as you need to hold your data
4) create a zpool from these (zpool create -m none crypt /crypt/space/store0001.zpool /crypt/space/store0002.zpool ...)
5) create a filesystem on this (zfs create crypt/test; zfs set mountpoint=/test crypt/test)
6) play with your encrypted filesystem
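
Put together, a minimal session looks something like the following (directory names and sizes are illustrative; 2000m keeps each backing file safely under the 2GB limit):

cmkdir /export/crypt_store
cattach /export/crypt_store space
mkfile 2000m /crypt/space/store0001.zpool
mkfile 2000m /crypt/space/store0002.zpool
zpool create -m none crypt /crypt/space/store0001.zpool /crypt/space/store0002.zpool
zfs create crypt/test
zfs set mountpoint=/test crypt/test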

For extra integrity, consider using RAIDZ when creating the pool.  That way, in the limited failure scenario of CFS corrupting one backing file, the data should still be recoverable.
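
For example, a raidz variant of the pool above might be created along these lines (three backing files shown; again, the paths are illustrative):

zpool create -m none crypt raidz /crypt/space/store0001.zpool /crypt/space/store0002.zpool /crypt/space/store0003.zpool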