Thursday, December 8, 2011

Broken aptitude when running in xterm

With recent versions of Ubuntu, and using xterm as your terminal emulator, the package selection tool aptitude has a nasty habit of corrupting the display as it's used.  For example, running aptitude, then searching for "test" produces the following:

[screenshot: aptitude display corrupted by leftover text after searching for "test"]

As the display is updated, text which should have been overwritten with blank space is left behind.  This makes the tool difficult to use, as you're left sorting out the real, current text from the gobbledygook remnants of previous screens.  The fix for this problem is to change the TERM environment variable to xterm-color rather than the default xterm.  Unfortunately, this causes another issue, because some tools (such as vim) have their own display corruption issues when run with TERM set to xterm-color.
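
To check which terminal type you're currently advertising, and to try the fix for a single run before making anything permanent, something like this does the job:

echo $TERM                     # typically "xterm" under a stock xterm
TERM=xterm-color aptitude      # one-off test of the workaround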

The solution is to put the following in your .bashrc:


if [ "$TERM" = "xterm" ]; then
        alias aptitude="TERM=xterm-color sudo aptitude"
else
        alias aptitude="sudo aptitude"
fi

The reason for the embedded sudo, and for defining the alias even when TERM isn't xterm?  sudo doesn't expand shell aliases or functions, so sudo has to be embedded in the alias rather than typed in front of it.  Defining the alias even when TERM is already fine simply keeps the behaviour consistent, i.e. you never need to type sudo manually to invoke aptitude.

After doing this, aptitude now behaves correctly when searching:

[screenshot: aptitude displaying search results without corruption]

Good stuff!

Sunday, November 27, 2011

Concatenating PDFs

Concatenating PDF files should be pretty straightforward.  On Linux, there are several tools that can do this, including pdftk, pdf2ps and ImageMagick's convert, the latter two of which ultimately drive Ghostscript.  Unfortunately, I had a batch of files that I wanted to concatenate for ease of use on my tablet, and none of these tools were working.  pdftk failed repeatedly with the useful error message:


Error: Failed to open PDF file: 
   input.pdf

Using pdf2ps did produce a merged copy of the input files, but it was HUGE: the pages had been rendered as bitmap images, losing the text in the process.  ImageMagick's convert never ran to completion; I terminated it after it had eaten over 3GB of memory, presumably while rendering the text into images as well.

Ultimately, I was able to successfully create a high quality, merged copy of my files by resorting to manually invoking Ghostscript:

gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=output.pdf *.pdf

The resulting file is actually smaller than the combined total of the input files, storing text as text, rather than horrid, pre-rendered bitmaps.  Ghostscript used a sane amount of memory, and it ran to completion in a sensible amount of time.
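
One caveat: the *.pdf glob is expanded by the shell in alphabetical order, so if the documents need to appear in a particular sequence it's safer to list the inputs explicitly (the file names below are just placeholders):

gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
   -sOutputFile=output.pdf part1.pdf part2.pdf part3.pdf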

To Ghostscript, bravo!  To the others, a big "Why?"

Thursday, November 10, 2011

Ubuntu branding #fail

Install Ubuntu 11.10 Server, add Xfce4 desktop environment and reboot.  Result?  Debian space theme branding on the grub and boot screens.  Quality Assurance, anyone?  Perhaps images from upstream packages need a little more vetting before importing...


Debian space-themed boot screen/GRUB menu on Ubuntu 11.10 "Oneiric"

Tuesday, November 8, 2011

Linux ICMP redirects

It seems that Ubuntu 11.10 ships with a sample /etc/sysctl.conf which contains the following statement, intended to tell the system not to originate ICMP redirects when acting as a router:

# Do not send ICMP redirects (we are not a router)
net.ipv4.conf.all.send_redirects = 0

Unfortunately, with kernel 3.0 as shipped with Oneiric (at least), even after setting this and activating it with 'sysctl -p', it doesn't work.  Symptoms are noisy kernel log records such as:

[611513.083432] host 192.168.0.100/if2 ignores redirects for 8.8.8.8 to 192.168.0.1.

If you actually want to disable sending ICMP redirects, you have to set this explicitly per interface in /etc/sysctl.conf (it appears the kernel enables redirect sending on an interface if either the 'all' setting or the per-interface one allows it, so zeroing 'all' on its own isn't enough), by doing:

# Do not send ICMP redirects (we are not a router)
net.ipv4.conf.eth0.send_redirects = 0
net.ipv4.conf.eth1.send_redirects = 0
net.ipv4.conf.eth2.send_redirects = 0

etc.
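
To apply the setting immediately to every interface that currently exists, rather than enumerating them by hand, a quick loop over /proc (as root) does the same job; a minimal sketch:

for f in /proc/sys/net/ipv4/conf/*/send_redirects; do
    echo 0 > "$f"    # covers 'all', 'default' and every individual interface
done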

Tuesday, October 25, 2011

Ubuntu service management

Running 'service --status-all' gives the following output:

 [ - ]  apparmor
 [ ? ]  apport
 [ + ]  apt-cacher-ng
 [ ? ]  atd
 [ + ]  bind9
(snipped)

Who on earth thought it was a good idea for the status characters to be symbols that have special meaning in a regex?  It makes doing 'service --status-all | grep ...' less trivial than it should be.
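
The simplest workaround is to make grep treat the pattern as a fixed string rather than a regex, for example:

service --status-all 2>&1 | grep -F '[ + ]'    # running services only; 2>&1 in case some lines land on stderr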

Wednesday, March 30, 2011

Ping!

In both Linux and the BSD variants, the default behavior of the ICMP echo-based ping command is to enter an infinite loop, sending a probe once per second.  This results in output something like:

ubuntu:~> ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_req=1 ttl=52 time=35.9 ms
64 bytes from 8.8.8.8: icmp_req=2 ttl=52 time=37.3 ms
64 bytes from 8.8.8.8: icmp_req=3 ttl=52 time=36.2 ms
64 bytes from 8.8.8.8: icmp_req=4 ttl=52 time=37.0 ms
64 bytes from 8.8.8.8: icmp_req=5 ttl=52 time=36.4 ms
^C
--- 8.8.8.8 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4004ms
rtt min/avg/max/mdev = 35.936/36.594/37.308/0.548 ms

In Solaris, on the other hand, the same command's default behavior has always been to print '<machine> is alive', or 'no answer from <machine>' after a timeout.  Per the Solaris ping manual page, the Linux/BSD behavior is known as statistics mode, and has to be enabled by running ping with the -s flag.
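
In other words, on Solaris you have to ask for the Linux/BSD-style behaviour explicitly (the hostname below is a placeholder):

ping somehost        # prints "somehost is alive", or "no answer from somehost"
ping -s somehost     # statistics mode: one probe per second until interrupted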

Now, it seems that an easter egg was added to the Solaris 11/Express ping program at build 33, and it remains in later builds.  If you set the shell environment variable MACHINE_THAT_GOES_PING (I'm not joking!) to any non-null value, the default ping behavior changes to statistics mode.  I've confirmed this on my b127 Solaris Express host.
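
For example, in a Bourne-style shell:

export MACHINE_THAT_GOES_PING=1   # any non-null value will do
ping 8.8.8.8                      # now defaults to statistics mode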

Thanks to John Beck for the tip!

Tuesday, March 22, 2011

Thought for the (last) decade?

Just found a wonderful quote from the original July 1974 Communications of the ACM paper on UNIX, written by DMR and Ken. They say:

"Perhaps the most important achievement of UNIX is to demonstrate that a powerful operating system for interactive use need not be expensive either in equipment or in human effort"

This is perhaps something SCO should have considered over a decade ago, when they first decided to pursue Linux because (they felt) it had been created far too quickly to have been done without stealing (what they believed was their) UNIX IP.

Wednesday, February 16, 2011

Oracle and ZFS shenanigans

With the latest Solaris 10 release or recommended patch cluster, there are significant updates to ZFS. By patching or reinstalling with Solaris 10 9/10, you can get close to the zpool version that was previously only available in Solaris Express Community Edition, now Solaris 11 Express.

But there is a potential catch. Each incremental feature change to zpool capabilities causes the zpool version number to be incremented. You can see what versions are supported on your local install by typing:

zpool upgrade -v

which lists all available features on the current driver, along with their version number. After applying the most recent Solaris 10 patch cluster, you'll see the following:

cara:~> zpool upgrade -v
This system is currently running ZFS pool version 22.

The following versions are supported:

VER DESCRIPTION
--- --------------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 Snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
21 Reserved
22 Received properties

Note that version 21 is 'Reserved'. If you run the same command on a system running the Express kernel, version 21 shows as:

21 Deduplication
22 Received properties
23 Slim ZIL


The whole point of zpool versioning is that a pool with a given version number should be mountable on any system whose ZFS kernel code supports at least that version of the pool. Sun went to great lengths to enable this, even specifying that ZFS be endian-independent: writes are done in the local byte order, but reads honor data in either big- or little-endian order. You can move a pool from a SPARC to an x86 platform, and it just works.
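
The mechanics of moving a pool between architectures are the usual export/import dance (pool name below is a placeholder):

zpool export tank    # on the SPARC box
zpool import tank    # on the x86 box, after the disks have been moved across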

This was going to be a blog post about the evils of Oracle Corporation breaking this compatibility. Version 21 is deduplication on the Express version, but reserved on the release version. I was going to rant about the dangers of creating a deduplicated filesystem under Solaris Express, then trying to import it into a release version of Solaris 10.

But I can't quite do that.

After performing an experiment, it seems that Solaris 10 can in fact correctly mount pools created on Express which have deduplication enabled. However, Solaris 10 won't continue to dedup newly written data, since that's not a supported feature. This at least makes sense as a compromise: compatibility is preserved across pool versions, to the extent that you won't see any nasty side effects like kernel panics if you accidentally mount a deduped filesystem on a release version of Solaris 10. You won't get any further benefit from the unsupported feature, but it shouldn't kill you either.
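
If you want a pool created on Express to remain a first-class citizen on Solaris 10, one defensive option is to create it pinned at the older pool version, so that unsupported features can never be enabled on it in the first place; a rough sketch (pool and device names are made up):

zpool create -o version=22 tank c0t1d0   # stay at the highest version Solaris 10 supports
zpool get version tank                   # confirm the pool version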

So my only question is this: is the dedup feature left out of these updates because Oracle wants to provide a compelling reason to move to Solaris 11 (which may also feature significantly different license terms)? Or are they leaving it out because there's a concern about bugs which impact integrity, availability, or both in the current version of the software?

Time will tell.

Saturday, February 12, 2011

Forcing a Linux Reboot

Linux zfs-fuse is an extremely useful piece of software, but this morning it crashed on me to the point where even 'reboot -f' was failing to reboot the server due to kernel confusion.

Fortunately, there is a way to use the Linux SysRq mechanism to force an immediate reboot. This won't sync the disks, and certainly won't wait for processes to terminate (which is why it works in this case), but it saved me a trip to the data center to intervene manually.

To do an emergency reboot on Linux, perform the following two steps as root:

echo 1 > /proc/sys/kernel/sysrq   # enable the magic SysRq interface
echo b > /proc/sysrq-trigger      # 'b': reboot immediately, without syncing or unmounting

This causes an immediate reboot of the system. Of course, if the thing causing the problem was a corrupted root filesystem, the server may not boot, but that would be the case regardless :-)
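
If the kernel is still healthy enough to attempt it, asking for an emergency sync before pulling the trigger gives the filesystems a fighting chance:

echo s > /proc/sysrq-trigger   # emergency sync (best effort)
echo b > /proc/sysrq-trigger   # then reboot immediately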

More details on SysRq are available at the Linux Kernel site.

Wednesday, July 14, 2010

qcow2 (performance) considered harmful?

qcow2 is a virtual disk format available with qemu (and therefore kvm and xen). It's designed to provide a rich set of features, including snapshotting, encryption and compression for the virtual machines stored within it. It also allows disk files to be created empty and then grow as data is written to the virtual disk.

What some people may not realize is that these features are all implemented by storing additional metadata within the .qcow2 file that you see on disk. This is not the case with the alternative raw image format, where a flat file on disk stores a block-for-block image of the virtual disk.

I was aware of the presence of metadata, but not of the extent of it. I had a number of VM images sitting on qcow2 format virtual disks. What I found was that when I tried to start a number of VMs simultaneously, KVM was actually timing out the VM creation and aborting. It seemed possible that this was just due to CPU load, but I investigated further.

With a test disk image I had available, which was 16GB in size for a 32GB virtual disk, I used the command:

qemu-img info disk0.qcow2

This took over 10 seconds to run to completion. Re-running the same command immediately after caused it to complete in well under a second. Clearly there was some caching effect going on at the OS level. To identify what it was, I ran the qemu-img command through strace:

strace -o qemu-image.strace qemu-img info disk0.qcow2

Analyzing the resulting strace file shows that the disk image is opened, followed by a large number of reads:

[snip]
17018 read(5, "\0\0\0\0\0\0020\0\0\0\0\0\0\200\20\0\0\0\0\0\1\0\20\0\0\0\0\0\1\200\20\0\0"..., 20480) = 20480
17018 lseek(5, 143360, SEEK_SET) = 143360
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 8392704, SEEK_SET) = 8392704
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 16781312, SEEK_SET) = 16781312
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 25169920, SEEK_SET) = 25169920
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 33558528, SEEK_SET) = 33558528
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 41947136, SEEK_SET) = 41947136
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 50335744, SEEK_SET) = 50335744
17018 read(5, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0"..., 4096) = 4096
17018 lseek(5, 58724352, SEEK_SET) = 58724352
[etc]

For my 16GB file, those reads of 4,096 bytes are repeated a whopping 1,935 times. The reason this runs quickly the second time is that 1,935 blocks of 4KB in size is only actually about 8MB of data to cache - not a large amount for current systems. After the cache is warmed up the first time, the data is still there when the same command is run shortly afterwards.
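
The cold-versus-warm behaviour can be reproduced by dropping the page cache between runs (as root):

time qemu-img info disk0.qcow2      # warm cache: well under a second
sync
echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
time qemu-img info disk0.qcow2      # cold cache: back to 10+ seconds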

This work seems to be done whenever the qcow2 format image is opened, and appears to account for why qcow2 format VMs can take a number of seconds to initialize prior to boot, particularly if the host has just been rebooted itself.

If this was a sequential read of 8MB, the whole process would always run quickly: even fairly old disks can read tens of megabytes per second with no trouble. Unfortunately, these reads are scattered around the disk, with a seek between each one, which means the disk head had to move 1,935 times just to read these data blocks. And if the host filesystem metadata wasn't cached, there would be additional forced seeks simply to find those data blocks on the disk.

Modern 7,200 RPM SATA hard disks (like the ones I can afford for my lab!) can only service around 120 random I/Os per second (IOPS). Based on that figure, you'd expect any operation that 'stats' the sample qcow2 disk to take around 16 seconds (1,935 / 120) or longer. Now it's possible to understand why the VMs were timing out during a cold start: starting 4 VMs at once could easily take over 60 seconds, just to perform the disk I/Os needed to get the disk info.

I was able to work around my issue by converting my qcow2 disk images to raw format. Now I can start several VMs at the same time; the system load is high because of all the work being done, but there has been no sign of VM creation timing out. Of course, this loses all of the feature advantages of the qcow2 format. Fortunately, I still get snapshots because I use ZFS as the underlying filesystem on my NAS; YMMV.

To convert the disk format, you can do the following (a fuller sketch for a libvirt-managed guest follows the list):

  • shut down the VM!
  • qemu-img convert disk0.qcow2 disk0.raw
  • update your VM definition (e.g. for KVM under libvirt, the guest's domain XML) to point to the new .raw file
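
Put together, for a KVM guest managed by libvirt the whole dance looks roughly like this (guest and file names are placeholders, and the exact location of the domain XML depends on your setup):

virsh shutdown myguest                          # 1) shut down the VM
qemu-img convert -O raw disk0.qcow2 disk0.raw   # 2) convert the image
virsh edit myguest                              # 3) point the <disk> source at disk0.raw
                                                #    and set the driver type to 'raw'
virsh start myguest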

In summary, qcow2 is extremely feature rich, but in specific corner cases (cold start) it can have some interesting performance issues. Choose what format you use carefully, or host your VMs on an SSD where there's less of a performance penalty for doing random vs. sequential reads.

Thursday, July 8, 2010

Stacked ZFS filesystem deadlocking

Years ago, I posted how to create an encrypted filesystem on Solaris using the old CFS code as an encryption engine "shim", with ZFS layered on top to provide full POSIX semantics, scaling, integrity checks, etc.

The stack would look something like this:

ZFS (plaintext files) -> CFS encrypted mount -> UFS filesystem (encrypted files) -> raw devices

That was in 2006, when not many people were using ZFS for storage management. However, now that ZFS is likely to be the default choice for most people, you end up with a stack that looks like this:

ZFS (plaintext files) -> CFS encrypted mount -> ZFS root pool (encrypted files) -> raw devices
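
Concretely, the top layer of that stack gets built on a plain file living inside the CFS mount, roughly like this (paths and sizes are made up):

mkfile 10g /crypt/zfs-backing            # /crypt is the CFS mount; its ciphertext lives in the root pool
zpool create secure /crypt/zfs-backing   # top-level pool backed by a file, not a device
zfs create secure/data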

It turns out that when you layer ZFS like this, you can cause what appears to be a kernel deadlock under high I/O load when writing files into the top-level ZFS mount.  The writing process hangs and becomes unkillable, with the underlying filesystem unresponsive.  If the underlying ZFS filesystem is in the root pool, this means that in practice the whole system is now unusable and has to be hard reset.

While this is a slightly contrived scenario, it gives me a good insight into why the use of files (rather than devices) is not supported for creating zpools in production use. So remember, in this case, layering considered harmful!

Unfortunately, because of this, I would no longer suggest using this approach to creating encrypted filesystems on Solaris.

Sunday, August 16, 2009

Configuring PuTTY to look like a traditional xterm

PuTTY is a really useful terminal emulator and ssh client for Windows, but I'm not a huge fan of its default appearance. For years (15 or so...) I've had a preferred configuration for my xterm sessions on Unix, using the 9x15 font with an 'Ivory' coloured background. Here are my notes on how to get the same look under Windows with PuTTY.
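
For reference, the xterm look I'm trying to reproduce is essentially:

xterm -fn 9x15 -bg ivory -fg black -cr black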

1) Download the 9x15 font to Desktop
2) Install 9x15 font by double clicking and selecting 'Install'
3) Run PuTTY
4) Set Connection -> Seconds between keepalives = 60 (to avoid firewall timeouts)
5) Set Window -> Appearance -> Font = 9x15-ISO8859-1, 11 point
6) Set Window -> Translation -> Handling of line drawing characters = Font has XWindows Encoding
7) Uncheck "Bolded text is a different colour" in Window -> Colours
8) Set following Colours in Window -> Colours (Colour = Red, Green, Blue)
- Default Foreground = 0, 0, 0
- Default Bold Foreground = 0, 0, 0
- Default Background = 255, 255, 240
- Default Bold Background = 255, 255, 240
- Cursor Text = 255, 255, 240
- Cursor Colour = 0, 0, 0
9) Save as "Default Settings" profile in Session

All profiles created after this will include these settings.

Credit goes to Andrés Kievsky of http://www.ank.com.ar/fonts/ for making this font (and others) available in a usable form.

Thursday, December 25, 2008

Converting split vmdk files to qcow2

I've been playing around with VM environments recently.  I've got some legacy golden VMs created with VMware Server that I'd like to run under the Linux KVM environment.  There are a couple of challenges here:

1) convert disk format
2) make the VM boot under the different VM environment

For now, I'm addressing item #1.  The standard recipe is to run qemu-img on the .vmdk file to create a qcow2 disk image.  Unfortunately this doesn't work in my scenario because I'm using a vmdk split into sparse (2GB) files, which causes the standard invocation to fail.

Assuming you have both the VMware and QEMU binaries available, this can be achieved with a two-step process:

1) convert the split files into a single, growable image file.  The invocation for this is:

vmware-vdiskmanager -r "source.vmdk" -t 0 "temp.vmdk"

The option "-t 0" tells vdiskmanager to create a destination disk image which is growable and monolithic

2) use the standard approach to convert the monolithic file into a .qcow2:

qemu-img convert -f vmdk -O qcow2 "temp.vmdk" "destination.qcow2"

At this point, you have a disk image that will be accessible by software looking for a qcow2 format file.  After verifying that you can access it, feel free to delete the temporary file temp.vmdk.
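
A quick sanity check before deleting anything might be:

qemu-img info "destination.qcow2"   # should report file format: qcow2 and the expected virtual size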