Wednesday, July 14, 2010

qcow2 (performance) considered harmful?

qcow2 is a virtual disk format available with qemu (and therefore KVM and Xen). It's designed to provide a rich set of features, including snapshotting, encryption and compression, for virtual machines stored within it. It also allows disk files to be created empty and then grow as data is written to the virtual disk.
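For reference, a grow-on-demand qcow2 image like the ones discussed here is created with a single command (the file name and 32GB size here are just examples):

qemu-img create -f qcow2 disk0.qcow2 32G

The resulting file starts out tiny and only consumes real space as the guest writes data.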

What some people may not realize is that these features are all implemented by storing additional metadata within the .qcow2 file that you see on disk. This is not the case with the alternative raw image format, where a flat file on disk stores a block-for-block image of the virtual disk.
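By contrast, a raw image is just a flat (and usually sparse) file. Creating one and comparing its apparent size against the blocks actually allocated shows there's nothing extra inside it (the exact output will vary by filesystem):

qemu-img create -f raw disk0.raw 32G
ls -lsh disk0.raw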

I was aware of the presence of metadata, but not of its extent. I had a number of VM images sitting on qcow2 format virtual disks, and found that when trying to start several VMs simultaneously, KVM would actually time out the VM creation and abort. It seemed possible that this was just down to CPU load, but I investigated further.

With a test disk image I had available (16GB on disk for a 32GB virtual disk), I used the command:

qemu-img info disk0.qcow2

This took over 10 seconds to run to completion, yet re-running the same command immediately afterwards completed in well under a second. Clearly some caching effect was going on at the OS level. To identify what it was, I ran the qemu-img command through strace:
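Incidentally, the slow cold-cache case can be reproduced on demand, rather than by rebooting, by dropping the Linux page cache between runs (this assumes a Linux host and root access):

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
time qemu-img info disk0.qcow2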

strace -o qemu-image.strace qemu-img info disk0.qcow2

Analyzing the resulting strace file shows that the disk image is opened, followed by a large number of reads:

[snip]
17018 read(5, "\0\0\0\0\0\0020\0\0\0\0\0\0\200\20\0\0\0\0\0\1\0\20\0\0\0\0\0\1\200\20\0\0"..., 20480) = 20480
17018 lseek(5, 143360, SEEK_SET) = 143360
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 8392704, SEEK_SET) = 8392704
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 16781312, SEEK_SET) = 16781312
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 25169920, SEEK_SET) = 25169920
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 33558528, SEEK_SET) = 33558528
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 41947136, SEEK_SET) = 41947136
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 50335744, SEEK_SET) = 50335744
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 58724352, SEEK_SET) = 58724352
[etc]

For my 16GB file, those 4,096-byte reads are repeated a whopping 1,935 times. The reason this runs quickly the second time is that 1,935 blocks of 4KB each is only about 8MB of data to cache - not a large amount for current systems. Once the cache has been warmed the first time through, the data is still there when the same command is run shortly afterwards.
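Those figures can be pulled straight out of the strace log; something like the following counts the reads and totals the bytes (file descriptor 5 happens to be the image in this particular run, so adjust to match yours):

grep -c 'read(5,' qemu-image.strace
awk '/read\(5,/ { total += $NF } END { print total, "bytes" }' qemu-image.strace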

This work seems to be done whenever a qcow2 format image is opened, which would account for why qcow2 format VMs can take a number of seconds to initialize prior to boot, particularly if the host itself has just been rebooted and its page cache is empty.

If this were a sequential read of 8MB, the whole process would always run quickly; even fairly old disks can read tens of megabytes per second without trouble. Unfortunately, these reads are scattered around the disk, with a seek between each one, so the disk head has to move 1,935 times just to read these blocks. And if the host filesystem's own metadata isn't cached, there are additional forced seeks simply to locate those blocks on the disk.

Modern 7,200 RPM SATA hard disks (like the ones I can afford for my lab!) can only service around 120 random I/Os per second (IOPS). At that rate, you'd expect any operation that 'stats' the sample qcow2 disk to take around 16 seconds (1,935 / 120) or longer. Now it's possible to understand why the VMs were timing out during a cold start: starting 4 VMs at once could easily take over 60 seconds, just to perform the disk I/O needed to read the disk info.

I was able to work around my issue by converting my qcow2 disk images to raw format. Now I can start several VMs at the same time; the system load is high because of all the work being done, but there has been no sign of VM creation timing out. Of course, this loses all of the feature advantages of the qcow2 format. Fortunately, I still get snapshotting because I use ZFS as the underlying filesystem on my NAS - YMMV.
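For the curious, that kind of snapshot is a one-liner at the ZFS level; something like this snapshots the dataset holding the image files (the pool and dataset names are illustrative):

zfs snapshot tank/vmimages@before-convert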

To convert the disk format, you can do:

  • shut down the VM!
  • qemu-img convert -O raw disk0.qcow2 disk0.raw
  • update your VM definition (e.g. for kvm/libvirt, the domain's XML file) to point to the new .raw file - see the worked sequence below
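Putting that together for a libvirt-managed KVM guest, the whole sequence looks roughly like this (the guest name 'myvm' and file paths are illustrative):

virsh shutdown myvm
qemu-img convert -O raw disk0.qcow2 disk0.raw
qemu-img info disk0.raw     # sanity-check the converted image
virsh edit myvm             # point <source file='...'/> at disk0.raw and set the driver type to 'raw'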

In summary, qcow2 is extremely feature-rich, but in specific corner cases (such as a cold start) it can have some interesting performance issues. Choose your disk format carefully, or host your VMs on an SSD, where the penalty for random versus sequential reads is far smaller.
