Wednesday, July 14, 2010

qcow2 (performance) considered harmful?

qcow2 is a virtual disk format available with qemu (and therefore kvm and xen). It's designed to provide a rich set of features, including snapshotting, encryption and compression, for virtual machines stored within it. It also allows disk files to be created empty and to grow only as data is written to the virtual disk.

What some people may not realize is that these features are all implemented by storing additional metadata within the .qcow2 file that you see on disk. This is not the case with the alternative raw image format, where a flat file on disk stores a block-for-block image of the virtual disk.
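As a rough illustration of the difference (the file names and 32GB size are just examples), you can create one image of each type with qemu-img and compare them:

qemu-img create -f qcow2 disk0.qcow2 32G
qemu-img create -f raw disk1.raw 32G
ls -ls disk0.qcow2 disk1.raw

The qcow2 file starts out as little more than a header plus its internal allocation tables and grows as guest blocks are written; the raw file is created sparse at the full virtual size and carries no metadata of its own.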

I was aware of the presence of metadata, but not of the extent of it. I had a number of VM images sitting on qcow2 format virtual disks, and I found that when trying to start several VMs simultaneously, KVM was actually timing out the VM creation and aborting. It seemed possible that this was just due to CPU load, but I investigated further.

With a test disk image I had available, a qcow2 file that was 16GB on disk backing a 32GB virtual disk, I used the command:

qemu-img info disk0.qcow2
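For reference, the output looks something like this (exact fields vary between qemu versions, and the figures here are illustrative rather than copied from my image):

image: disk0.qcow2
file format: qcow2
virtual size: 32G (34359738368 bytes)
disk size: 16G
cluster_size: 65536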

The first run of this command took over 10 seconds to complete. Re-running the same command immediately afterwards finished in well under a second, so clearly there was some caching effect going on at the OS level. To identify what it was, I ran the qemu-img command through strace:

strace -o qemu-image.strace qemu-img info disk0.qcow2

Analyzing the resulting strace file shows that the disk image is opened, followed by a large number of reads:

[snip]
17018 read(5, "\0\0\0\0\0\0020\0\0\0\0\0\0\200\20\0\0\0\0\0\1\0\20\0\0\0\0\0\1\200\20\0\0"..., 20480) = 20480
17018 lseek(5, 143360, SEEK_SET) = 143360
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 8392704, SEEK_SET) = 8392704
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 16781312, SEEK_SET) = 16781312
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 25169920, SEEK_SET) = 25169920
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 33558528, SEEK_SET) = 33558528
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 41947136, SEEK_SET) = 41947136
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 50335744, SEEK_SET) = 50335744
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 58724352, SEEK_SET) = 58724352
[etc]
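If you want to reproduce the count below, a quick-and-dirty grep over the strace output (using the file name from the strace command above) is enough; the pattern just counts the 4,096-byte reads:

grep -c 'read(.*, 4096) = 4096' qemu-image.strace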

For my 16GB file, those 4,096-byte reads are repeated a whopping 1,935 times. The reason the command runs quickly the second time is that 1,935 blocks of 4KB each is only about 8MB of data to cache - not a large amount for current systems. After the cache is warmed up by the first run, the data is still there when the same command is run again shortly afterwards.
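To get back to the cold-cache case without rebooting the host, you can drop the Linux page cache and time the command again (this flushes caches system-wide, so don't do it on a busy production machine):

sync
echo 3 > /proc/sys/vm/drop_caches   # needs root
time qemu-img info disk0.qcow2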

This work seems to be done whenever the qcow2 format image is opened, and appears to account for why qcow2 format VMs can take a number of seconds to initialize prior to boot, particularly if the host has just been rebooted itself.

If this were a sequential read of 8MB, the whole process would always run quickly; even fairly old disks can read tens of MB per second without trouble. Unfortunately, these reads are scattered around the disk, with a seek between each one, so the disk head has to move 1,935 times just to read these metadata blocks. And if the host filesystem's own metadata isn't cached, there are yet more forced seeks simply to locate those blocks on disk.

Modern 7,200 RPM SATA hard disks (like the ones I can afford for my lab!) can only service around 120 random I/Os per second (IOPS). Based on that figure, you'd expect any operation that 'stats' the sample qcow2 disk to take around 16 seconds (1,935 / 120) or longer. Now it's possible to understand why the VMs were timing out during a cold start: starting 4 VMs at once could easily take over 60 seconds, just to perform the disk I/O needed to read the image metadata.

I was able to work around my issue by converting my qcow2 disk images to raw format. Now I can start several VMs at the same time; the system load is high because of all the work being done, but there has been no sign of VM creation timing out. Of course, this loses all of the feature advantages of the qcow2 format. Fortunately, I still get snapshotting because I use ZFS as the underlying filesystem on my NAS - YMMV.
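For reference, snapshotting (and rolling back) at the ZFS layer then looks something like this - the pool and dataset names here are made up for illustration:

zfs snapshot tank/vmimages@pre-upgrade
zfs rollback tank/vmimages@pre-upgrade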

To convert the disk format, you can do:

  • shut down the VM!
  • qemu-img convert disk0.qcow2 disk0.raw
  • update your VM definition (e.g. for KVM with libvirt, the guest's XML file) to point to the new .raw file, as sketched below
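Putting that together, and assuming a libvirt-managed KVM guest (the guest name vm01 here is hypothetical), the sequence looks roughly like this:

virsh shutdown vm01
qemu-img convert -O raw disk0.qcow2 disk0.raw
virsh edit vm01   # point the disk <source file='...'/> at disk0.raw and change the driver type to 'raw'

The convert step reads the whole image, so expect it to take a while, and make sure there's enough free space for the raw copy (up to the full virtual disk size).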

In summary, qcow2 is extremely feature-rich, but in specific corner cases (such as a cold start) it can have some interesting performance issues. Choose your disk format carefully, or host your VMs on an SSD, where there's much less of a penalty for random versus sequential reads.

Thursday, July 8, 2010

Stacked ZFS filesystem deadlocking

Years ago, I posted how to create an encrypted filesystem on Solaris using the old CFS code as an encryption engine "shim", with ZFS layered on top to provide full POSIX semantics, scaling, integrity checks, etc.

The stack would look something like this:

ZFS (plaintext files) -> CFS encrypted mount -> UFS filesystem (encrypted files) -> raw devices

That was in 2006, when not many people were using ZFS for storage management. However, now that ZFS is likely to be the default choice for most people, you end up with a stack that looks like this:

ZFS (plaintext files) -> CFS encrypted mount -> ZFS root pool (encrypted files) -> raw devices
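Purely as a rough sketch (the directory, attach and pool names are all illustrative, and the exact CFS invocations may differ on your system), the layering was built up along these lines:

cmkdir /export/secure          # create the CFS encrypted directory (prompts for a passphrase)
cattach /export/secure vault   # attach it, giving a cleartext view under /crypt/vault
mkfile 2g /crypt/vault/vol0    # backing file for the pool; its contents land on disk encrypted
zpool create securepool /crypt/vault/vol0
zfs create securepool/private  # plaintext files are written here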

It turns out that when you layer ZFS like this, you can trigger what appears to be a kernel deadlock under high I/O load when writing files into the top-level ZFS mount. The writing process hangs and becomes unkillable, with the underlying filesystem unresponsive. If the underlying ZFS filesystem is in the root pool, this means that in practice the whole system is unusable and has to be hard reset.

While this is a slightly contrived scenario, it gave me a good insight into why using files (rather than devices) as vdevs is not supported for production zpools. So remember: in this case, layering considered harmful!

Unfortunately, because of this, I would no longer suggest using this approach to creating encrypted filesystems on Solaris.