Red Hat Bugzilla – Bug 466681
pygrub uses cached and eventually outdated grub.conf, kernel and initrd
Last modified: 2010-12-20 13:13:06 EST
Description of problem:
When booting a VM, or more specifically when running pygrub against a volume (I haven't tested files), pygrub may return information from the host's cache. Because the VM itself bypasses that cache, pygrub will eventually read outdated information.
This leads to counter-intuitive restart behavior on kernel upgrades and weird behavior when changing anything in a VM's /boot.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. create a LV using lvcreate
2. use virt-install to install a VM onto that LV
3. boot the vm
4. change anything in /boot/grub/grub.conf
5. reboot the vm
Actual results:
pygrub sees an old version of grub.conf
Expected results:
pygrub should see the most recent version of grub.conf
Additional info:
1. This leads to a security issue when going through the usual install-upgrade-reboot procedure as the reboot eventually boots an older kernel and you'll end up with an unpatched kernel.
2. This can be worked around by echo 1 > /proc/sys/vm/drop_caches before the reboot, so it is most likely a caching issue.
3. I think the correct behavior would be to make the fsimage library use O_DIRECT so the cache is bypassed.
4. When you're in a clustered environment using clvm and clustered LVs, you'll have incoherent caching behavior for the LVs, which might lead to even stranger behavior where a VM booted on node1 would load kernel A and the same VM booted on node2 would load kernel B (I could reproduce this in our lab).
Hm, I was not able to reproduce this at all. I'm running an i686 dom0, kernel-xen-2.6.18-118.el5 and xen-3.0.3-73.el5. I did:
# virt-install -n test -r 1024 -f /dev/VolGroup00/disk1 -p --nographics -l http://server/released/RHEL-5-Server/U2/i386/os
Which completed successfully. Then I booted into the installed VM, and edited /boot/grub/grub.conf (I just added some string to the default name), and then did "poweroff" inside the guest. Now, I did:
# xm create -c test
on the dom0, and I saw the updated grub. I then installed a new kernel inside the guest, and did a "reboot" inside the guest; when it booted, it booted to the new kernel. So it seems to me that we are re-reading the grub.conf correctly on every boot.
Can you give us some more details, please? It's possible that this was fixed between -64 (which you are running) and -73 (which I am running), but nothing in the changelog jumps out at me. What version of the kernel are you running in the guest and the host, and what architecture (i386, x86_64, etc.)?
*** Bug 466895 has been marked as a duplicate of this bug. ***
I've just noticed that one of our nfs servers that was affected by #453094 was
still having the problem even though kernel-xen-2.6.18-92.1.13.el5 was
installed. A closer look revealed that the machine is still booting the older
kernel (2.6.18-92.1.10.el5xen) because of this bug :(
Xen is at -64
dom0 x86_64 running 2.6.18-92.1.10.el5xen
domU x86_64 trying to run 2.6.18-92.1.13.el5xen but ends up at 2.6.18-92.1.10.el5xen
I'm running a fully patched 5.2 x86_64 on two nodes that are clustered with cman, including shared FC storage.
The VMs use volumes from a clustered VG (maybe this is cached differently on the host?).
The guest system on which I first encountered the issue is a copied 4.7 i686. (In fact, I almost went crazy trying to figure out why pygrub didn't find anything.)
What I really did was rsync the 4.7 install to its new filesystem structure using an x86_64 guest running 5.2.
However, doing the edits on the booted 4.7 guest didn't help either.
When I did a dd to overwrite xvda1 with zeroes in the guest - the host didn't see the changes.
When I remade the filesystem on xvda1 the host didn't see any changes either.
The only way to make it work was issuing "echo 1 > /proc/sys/vm/drop_caches" on the host, which immediately made every one of the steps above behave as expected.
Rethinking the whole thing leads to the educated guess that the guest system kernel cannot be involved. If it was, that would mean that a malicious xen guest kernel could poison the host system's VFS cache.
Btw., what probably really helps to reproduce the issue is a box with enough memory that the cached data doesn't vanish under cache pressure.
Both of my systems have 16 GB of memory, and when I saw the issue the hosts were running only two VMs with 1 GB allocated, so the host still had around 14 GB of memory for its VFS cache.
As you're trying to reproduce on 2.6.18-118.el5 while I'm running 2.6.18-92.1.13.el5, maybe the kernel crew did something cache-related between 92 and 118 that actually fixed this issue?
It's certainly possible, although I don't recall us putting anything in particular for this problem. I'll have to try again with an x86_64 machine; it could be something particular there. I also have a 16GB box to test with here, so hopefully I can reproduce the behavior.
In case it is any help.. In both machines that I had this issue dom0 is booted with dom0_mem= to limit the memory available to it. Also in both "disk = [ 'phy:/dev/disk/..." is used to define the disk available to domU.
I guess the question here is if writes from domU invalidate the file cache in dom0 or not.
I'm sorry, I still can't reproduce this behavior. I've gone to an x86_64 box, booted the -92 kernel, and also am now using the xen-3.0.3-64 package, and every time I make a change in the /boot/grub/grub.conf in the guest, it shows up immediately on the next boot when pygrub is run.
Here's *exactly* what I did:
1. lvcreate -n disk1 -l192 /dev/VolGroup00 (creates ~6GB LVM volume). Now, this is *not* on clustered stuff, so that might be a difference.
2. virt-install -n rhel5test -r 1024 -f /dev/VolGroup00/disk1 -p --nographics -l http://server/RHEL-5-Server/U2/x86_64/os/
3. Complete the install as usual.
4. After installation, the guest rebooted. At this point, I modified /boot/grub/grub.conf (just changed the name of the default grub entry), and then did "poweroff" inside the guest.
5. Now, I do "xm create -c rhel5test", and the updated grub entry is right there.
For reference, my machine is AMD, and has 8 cores and 16GB of memory, 1GB of which is given to the guest:
[root@amd1 ~]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0    13853     8 r-----     62.6
rhel5test                                  3     1023     1 -b----     30.3
And my guest config looks like:
[root@amd1 ~]# cat /etc/xen/rhel5test
name = "rhel5test"
uuid = "14621e31-c550-fae3-9859-49da0c918302"
maxmem = 1024
memory = 1024
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ ]
disk = [ "phy:/dev/VolGroup00/disk1,xvda,w" ]
vif = [ "mac=00:16:3e:61:4c:06,bridge=xenbr0" ]
1) Kostas, is your LVM storage on clustered LVM as well?
2) Kostas or Andreas, can you give me:
a) Exact host configuration, including package versions
b) Exact guest configuration, including kernel versions inside the guest, and the /etc/xen/ guest configuration file
c) Exact storage configuration
d) Exact steps you use to reproduce the problem.
No, no clustered LVM here. Note that I've seen this error under an F8 dom0 as well. In both machines I am exporting a whole disk to xen. Now that I think about it, both machines on which I've seen the problem use Intel CPUs (Core2, Xeon E5335). Note that this doesn't happen every time in my case, although once it happens it seems to stick until you drop the caches.
I'll try to see if I can reliably reproduce this...
I have tried on two machines and can reproduce this bug on both. They are both obviously outdated and I have not had a chance to test on a more recent system.
Machine 1 (i686):
Machine 2 (x86_64):
Both are configured to boot from a LVM logical volume, eg:
disk = [ 'phy:/dev/vg/oj,xvda,w', ]
Dropping the VM cache does indeed fix this problem.
I'm experiencing what I believe is the same issue as well. Instead of using LVMs though, I'm using plain ext3 on a RAID1 using two iscsi targets (viewable to the guest as physical disks) as my disks:
disk = [ "phy:/dev/disk/by-path/ip-XXX.XXX.XXX.XXX:3260-iscsi-iqn.1986-03.com.sun:02:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-lun-0,xvda,w", "phy:/dev/disk/by-path/ip-XXX.XXX.XXX.XXX:3260-iscsi-iqn.1986-03.com.sun:02:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX-lun-0,xvdb,w" ]
I am limiting my Dom0 memory usage and my Dom0 is running on RHEL5.3 x86_64 on Dual-Core AMD Opterons with:
The workaround of 'echo 1 > /proc/sys/vm/drop_caches' seems to work for me, but I have to re-issue this command every time I update grub.conf in order for it to show up in pygrub.
I'm experiencing this problem too.
I run several HA clusters with xen VMs as services. Most of them currently use LVM volumes for the virtual disks and do not experience the problem (at least I haven't observed it).
But some of the newer HA clusters use LUNs from an EMC Clariion storage system for the virtual disks. On these VMs I observe the problem regularly.
As a workaround, I would suggest adding this line to pygrub:
os.system('echo 1 > /proc/sys/vm/drop_caches')
somewhere inside get_fs_offset
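For illustration, here is a minimal sketch of the same workaround without spawning a shell. It assumes Python 3 and root privileges on the host; the function name drop_page_cache and its path parameter are hypothetical and not part of pygrub. It calls sync first because drop_caches only discards clean pages.

```python
import os


def drop_page_cache(path="/proc/sys/vm/drop_caches"):
    """Ask the kernel to drop the page cache (hypothetical helper, not in pygrub).

    Writing "1" frees page-cache pages only. Dirty pages are not freeable,
    so sync() runs first to write them back.
    """
    os.sync()
    with open(path, "w") as f:
        f.write("1\n")
```

Like the os.system call, this affects the whole host, so it remains a blunt instrument.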
One of our customers reported this after doing the following:
1) Reinstalling a previously installed system without changing the system name or the LUN on the SAN.
2) The last step in their kickstart process is to register with RHN Satellite and update the system. Part of this update was a new kernel.
When the system finished updating and restarted, the old kernel booted instead of the new one.
Chris - One of the common points between most reproductions seems to be the use of a SAN, whereas it seems like your tests are on local storage, is that correct? Could this be significant?
If I analyze the problem correctly, the situation is the following:
pygrub has to access the files directly, so it uses the caches of the Host system.
On updating kernel and grub.conf on the guest the following happens:
image on lvm or file: the guest does the access through the host I/O system,
    the host cache is updated -> everything OK
image on a SAN LUN: the guest does the I/O more directly,
    the host cache is _not_ updated -> problem appears
A workaround is to drop caches inside pygrub, as I suggested before, but this is a little bit crude. A more selective solution would be nice.
(In reply to comment #17)
> If I analyze the problem correctly, the situation is the following:
> pygrub has to access the files directly, so it uses the caches of the Host
> On updating kernel and grub.conf on the guest the following happens:
> image on lvm or file: the guest does the access through the host I/O system,
> the host cache is updated -> anything OK
> image on a SAN LUN: the guest does the I/O more directly
> the host cache is _not_ updated -> problem appears
Yes, this may be just what's happening.
> A workaround is to drop caches inside pygrub, as I suggested before, but this
> is a little bit crude. A more selective solution would be nice.
Right. Dropping the caches is a bit of a big hammer solution, and impacts the entire system, which is not very nice at all. However, doing O_DIRECT to bypass the host cache completely is also ugly, since everything needs to be page-aligned to have a hope of working (and this is nigh-impossible to achieve in python).
I think we'll have to come up with some solution in C that does O_DIRECT accesses to the device, and then export that solution in a python wrapper. I'll see what I can do.
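To make the alignment constraint concrete, here is a rough sketch of an O_DIRECT read, assuming Linux and a modern Python 3 (os.preadv needs 3.7 or later, which was not available to pygrub at the time). It is not the patch attached to this bug. The names BLOCK, aligned_span and read_direct are hypothetical, and the 4096-byte alignment is an assumption; the real requirement depends on the device and filesystem. An anonymous mmap is used as the buffer because it is page-aligned.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement is device-dependent


def aligned_span(offset, length, block=BLOCK):
    """Return (start, size) covering [offset, offset+length) on block boundaries."""
    start = (offset // block) * block
    end = -(-(offset + length) // block) * block  # round up to a block multiple
    return start, end - start


def read_direct(path, offset, length, block=BLOCK):
    """Read `length` bytes at `offset` from `path`, bypassing the page cache."""
    if length <= 0:
        return b""
    start, size = aligned_span(offset, length, block)
    buf = mmap.mmap(-1, size)  # anonymous mapping: page-aligned memory
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], start)  # aligned offset, length and buffer
    finally:
        os.close(fd)
    skip = offset - start
    data = buf[skip:min(got, skip + length)]  # trim the alignment padding
    buf.close()
    return data
```

Both the file offset and the transfer length are rounded to block boundaries here, not just the buffer address, since O_DIRECT typically requires all three to be aligned.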
Jim, would it be hard to backport O_DIRECT support to RHEL's dd? That would be very useful for this bug.
> However, doing O_DIRECT to
> bypass the host cache completely is also ugly, since everything needs to be
> page-aligned to have a hope of working (and this is nigh-impossible to achieve
> in python).
> I think we'll have to come up with some solution in C that does O_DIRECT
> accesses to the device, and then export that solution in a python wrapper.
> I'll see what I can do.
Nearly all the pygrub filesystem reading code is already in C, done via the libfsimage.so library. The only place doing I/O in python is the code that probes for the partition table, which could easily be moved into the C layer too since it's pretty trivial code. So using O_DIRECT shouldn't be too intractable for pygrub.
(In reply to comment #19)
> Jim, would it be hard to backport O_DIRECT support to RHEL's dd? That would be
> very useful for this bug.
RHEL-5's dd already supports the O_DIRECT flag (oflag=direct). However, I don't really want to shell out to it in the python code, so I'm not sure that's a huge improvement.
(In reply to comment #20)
> Nearly all the pygrub filesystem reading code is already in C, done via the
> libfsimage.so library. The only place doing I/O in python is that which probes
> for partition table, which could easily be moved into the C layer too since its
> pretty trivial code. So using O_DIRECT shouldn't be too intractable for pygrub
But this is not a bad idea, maybe that's the way to go. Looking at this.
Created attachment 358609
Use O_DIRECT in pygrub
OK, here's a patch to force pygrub to use O_DIRECT everywhere it accesses the guest disk file. I'm not super happy with it, so I can't guarantee that this will be the final patch, but I am interested in whether this helps the people who are affected by this bug. Could the reporters of this bug please try the packages at:
And see if it helps the problem for them?
I tested your patch on one system, but it doesn't help; again I have to drop caches before pygrub sees changes on the LUN.
(In reply to comment #23)
> I tested your patch on one system, but it doesn't help, again I have to drop
> caches before pygrub sees changes on the LUN
Did you reboot (or at least restart xend) after updating the package? If not, that could be the problem.
If you did restart xend, then that's quite confusing; accessing the disk via O_DIRECT should be bypassing the host page-cache completely. That leads to the thought that all of the data is not being flushed to the disk as it should, which would be worrying in other ways.
After having my test package installed, can you run:
# strace -e open -o /tmp/pygrub-strace.out /usr/bin/pygrub /path/to/disk/image
and attach the results here?
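As an aid for reading that output, here is a small sketch that lists open() calls on the guest device that lack O_DIRECT. It assumes Python 3 and strace lines of the form open("path", FLAGS) = fd; the function name opens_without_direct is hypothetical.

```python
import re

# Matches lines such as: open("/dev/sdb", O_RDONLY|O_DIRECT) = 3
OPEN_RE = re.compile(r'^open\("(?P<path>[^"]+)",\s*(?P<flags>[^,)]+)')


def opens_without_direct(lines, device):
    """Return the flag strings of open() calls on `device` that lack O_DIRECT."""
    missing = []
    for line in lines:
        m = OPEN_RE.match(line.strip())
        if m and m.group("path") == device:
            flags = m.group("flags").strip()
            if "O_DIRECT" not in flags.split("|"):
                missing.append(flags)
    return missing
```

A possible use would be opens_without_direct(open("/tmp/pygrub-strace.out"), "/path/to/disk/image"); an empty list would mean every open of the device used O_DIRECT.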
Ah, sorry, I didn't reboot or restart xend.
Hi, I'm having the same problem detailed in this bug report. I'm trying to use fiber channel mpath devices as backing for xen domu guests. If I use pygrub, update a domu kernel residing on the mpath device and reboot the domu, pygrub displays the outdated grub.conf.
I've had this problem using xen-3.0.3-94.el5 and kernel-xen-2.6.18-164.el5 on an x86_64 server.
I've tested the packages provided at http://people.redhat.com/clalance/bz466681 and have seen very strange results. I did the following:
1. Installed patched packages
2. Rebooted dom0
3. Zeroed mpath device with dd
4. Ran fresh koan install of domu to mpath device
5. Shutdown newly-installed guest domu
6. Started-up and shutdown guest domu several times to verify it boots fine with the patched packages
7. Booted domu and installed an additional kernel rpm
8. Rebooted domu and got the following python traceback:
[bobbyz@sakura ~]$ sudo xm create /etc/xen/mgmt -c
Using config file "/etc/xen/mgmt".
Traceback (most recent call last):
  File "/usr/bin/pygrub", line 609, in ?
    chosencfg = run_grub(file, entry, fs)
  File "/usr/bin/pygrub", line 465, in run_grub
  File "/usr/lib64/python2.4/curses/wrapper.py", line 44, in wrapper
    return func(stdscr, *args, **kwds)
  File "/usr/bin/pygrub", line 461, in run_main
    sel = g.run()
  File "/usr/bin/pygrub", line 355, in run
  File "/usr/bin/pygrub", line 380, in run_main
    self.screen.addstr(20, 5, "Will boot selected entry in %2d seconds"
_curses.error: addstr() returned ERR
No handlers could be found for logger "xend"
Error: Boot loader didn't return any data!
*) At this point even after a fresh dom0 reboot, I get the above pygrub traceback.
It appears something is causing problems with the addstr method, so I commented this out and replaced it with 'pass'. Now I could get pygrub to display, but if I press any keys, it dies with another traceback relating to addstr (the other conditional portion that displays blank text instead of a countdown). If I boot with the commented addstr and don't press anything, I can boot the domu. However, if I then remove one of my kernel rpms and reboot the domu, I still see the old cached grub.conf. I tried the strace provided and it looks like there is one read of my mpath device without O_DIRECT. I've added an attachment, pygrub-strace.out_20090914, with the strace output.
I'd be happy to test any more patches/packages, as I'd love to resolve this caching issue. Thanks!
Created attachment 361028
pygrub strace output
(In reply to comment #26)
> It appears something is causing problems with the addstr method, so I
> commented this out and replaced it with 'pass'. Now I could get pygrub to
> display, but if I press any keys, it dies with another traceback relating to
> addstr (the other conditional portion that displays blank text instead of a
> countdown). If I boot with the commented addstr and don't press anything, I
> can boot the domu. However, if I then remove one of my kernel rpms and reboot
> the domu, I still see the old cached grub.conf. I tried the strace provided
> and it looks like there is one read of my mpath device without O_DIRECT. I've
Well, that's part of the problem. The rest of your traceback more likely than not comes because we are opening the device O_DIRECT, but not properly aligning a memory region that we are trying to read into.
What's odd is that when I first did this patch, it worked just fine for me. Then "after a while", it stopped working and failed the same way it did for you. I don't quite understand why, but I'll have to dig further to see what's going on.
Thanks for the testing,
just in case this is relevant...
When I hear about oddities with O_DIRECT, I'm reminded of a recent fix in coreutils where I learned that it's not just the alignment of the buffer that matters, but also its length. Its length must be a multiple of something power-of-2-ish, maybe the FS block size.
This bug made it so using dd to write a file with O_DIRECT would fail for any final portion of the file when the length of the final buffer was not a multiple of 512 (FS- and system-dependent):
The very same problem with "Direct / No Direct" IO hit me now with Windows HVM. I'm using the open source PV driver for Windows (GPLPV). The first part of the boot process uses qemu; later on, when loading drivers, it switches over to the GPLPV driver and accesses the disk via paravirtualisation.
qemu probably uses the system cache, but the PV driver later on uses the LUN directly, and so I got very curious defects on Windows systems until I also used "drop_caches" before rebooting a Windows HVM.
I just want to reiterate that this bug does not seem to affect _only_ LVs on LUNs. I'm running ext3 partitions on top of Software RAID1 (No LVs involved) on LUNs and still have this problem (see comment #12). So I don't believe the suggestion about dropping caches related to just the LV would help in this case.
If you're using multipath backed devices, you can simply flush the map and re-instate it. While still not ideal, it is less extreme than drop_caches.
Any updates on this? New packages to test?
Can you take this bug over? The latest news here is that the patch in this BZ converts almost all uses over to O_DIRECT. Unfortunately, while doing an strace on pygrub (strace -e open /usr/bin/pygrub /path/to/guest/image), I found that there is one additional open that does *not* use O_DIRECT. That may end up being the source of the problem. I wasn't able to track down exactly where that open() was coming from; my guess is that it's from e2fsprogs-libs (or e4fsprogs-libs, now), but I'm not entirely certain of that. It needs more investigation.
I guess most of the work for O_DIRECT has been done, but I wonder if it might be simpler to call fadvise(fd, FADV_DONTNEED, 0, (loff_t)~0ULL) on the block device to cause the kernel to invalidate the caches for that device?
Just a thought.
One more word of warning about grub: in general, reading a mounted filesystem's block device directly isn't guaranteed to be safe in terms of consistency. If anything is writing to that filesystem, the block device reader may not get a consistent view. Not exactly this bug, but worth thinking about in this context ...
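A minimal sketch of the fadvise idea, assuming Python 3 on Linux; the helper name drop_cached_pages is hypothetical. Note that the actual call is posix_fadvise(fd, offset, len, advice), where a len of 0 means "to the end of the file". The advice is only a hint, so whether it would have been sufficient for this bug is not established here.

```python
import os


def drop_cached_pages(path):
    """Hint the kernel to discard cached pages for `path` (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0: apply the advice from the start to the end
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Unlike drop_caches, this targets a single file or block device rather than the whole host.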
(In reply to comment #39)
> I guess most of the work for O_DIRECT has been done, but I wonder if it might
> be simpler to call fadvise(fd, FADV_DONTNEED, 0, (loff_t)~0ULL) on the block
> device to cause the kernel to invalidate the caches for that device?
> Just a thought.
> One more word of warning about grub; in general reading a mounted filesystem's
> block device directly isn't guaranteed to be safe, in terms of consistency, if
> anything is writing to that filesystem then the block device reader may not get
> a consistent view. Not exactly this bug but worth thinking about in this
> context ...
Thanks for your reply, Eric, but my patches for both xen (based on Chris' one and similar to it) and e4fsprogs were already done. Check your e-mail for the e4fsprogs one.
I've created new versions of the e4fsprogs and xen packages that open the device directly. Chris' original patch went in the right direction, but it was not enough and one more component had to be patched as well.
This is really hard because I was unable to reproduce it at all, but from a technical point of view we did patch those components to open the file directly and read the data.
Andreas, could you please download and install the new test e4fsprogs and xen packages from http://people.redhat.com/minovotn/xen/ (e4fsprogs can be found in the deps subdirectory) for the architecture you're using, do some testing with those versions of both packages, and provide us with the test results in a new BZ comment?
The machines where I was hit by the bug are all in production right now. I'll see if I have a Test/Preproduction Box where I can reproduce the issue. Sadly, I don't have a spare Dell PE1950...
I prodded hch into looking at xen a little.
It sounds like xen is submitting bios directly and bypassing the normal routines which -would- have handled the cache coherency.
Fixing that, or at least flushing the block device caches on startup/shutdown is probably a far better solution than changing e2fsprogs, which only fixes one instance of the problem, and not 100% safely, at that.
If by any chance the host had -written- to the blk device, that could be written back at any time, clobbering changes the guest made and causing corruption. Or vice versa...
That may not be the normal use case but in general what xen is doing does not seem safe.
OK, after a lot of consulting with Eric, Christoph, and Ric, I think we've come to the conclusion that doing this flushing in userspace is the wrong thing to do. Essentially what Eric said in comment #52 stands, that this would only fix it for ext*, and thus seems the wrong solution. Christoph looked at it more closely and determined that what the xen blkback driver in the kernel is doing is unsafe, and therefore the fix should be in the xen blkback driver. To that end, I've put together a patch which should invalidate all in-memory buffers on a device when we first open up the blkback thread for doing device I/O. I'll attach the patch to this BZ.
As I still can't reproduce this issue, I'll need one of the reporters to test out the patch and let me know if it fixes the problem for them. I've uploaded test kernels here:
Please let me know if this kernel fixes your problem.
Created attachment 384643
Kernel patch to invalidate memory before starting blkback thread
I believe this is what happens:
1) User runs "xm create"
2) pygrub runs, and at this point the guest disk blocks are cached in dom0 memory
3) guest actually starts using the kernel/initrd/config fetched from the guest
4) user modifies the guest grub.conf from the guest
5) user shuts down the guest
6) user runs "xm create" again to re-start the guest
7) pygrub runs, and now uses the cached old blocks from dom0 memory cache
8) guest starts with wrong/old information
So this new blkback patch should fix the step 7) by flushing the blocks from dom0 cache already in step 3) ?
(In reply to comment #55)
> I believe this is what happens:
> 1) User runs "xm create"
> 2) pygrub runs, and at this point the guest disk blocks are cached in dom0
> 3) guest actually starts using the kernel/initrd/config fetched from the guest
> 4) user modifies the guest grub.conf from the guest
> 5) user shuts down the guest
> 6) user runs "xm create" again to re-start the guest
> 7) pygrub runs, and now uses the cached old blocks from dom0 memory cache
> 8) guest starts with wrong/old information
> So this new blkback patch should fix the step 7) by flushing the blocks from
> dom0 cache already in step 3) ?
Right. The real problem is that blkback is not using the normal Linux routines for submitting data to disk; if it were, this problem would never have come up. What blkback is doing instead is submitting bios directly to the underlying block device, and not properly taking into account the dom0 page cache.
This patch actually corrects things in a slightly different way than you point out above, due to where I placed the invalidate code. Remembering that pygrub runs purely in dom0, here's what this patch should cause to happen:
1) xm create
2) pygrub runs, reads grub.conf which is now cached in dom0 memory
3) Guest starts, which causes a new blkback thread to be created
4) The blkback thread invalidates the blocks in dom0 memory
5) Guest runs
6) User modifies grub.conf in the guest
7) User shuts down the guest
8) xm create again
9) pygrub runs again, but because we invalidated the dom0 pages in step 4), we read the real grub.conf data off of the block device
10) goto 3)
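The sequence above can be sketched as a toy model. This is only an illustration in Python 3, not kernel code: the class Dom0, its methods, and boot_twice are hypothetical names, and the "disk" is just a dictionary.

```python
class Dom0:
    """Toy model: pygrub reads through a cache, the guest writes around it."""

    def __init__(self, disk, invalidate_on_blkback_start):
        self.disk = disk    # block name -> contents on the real device
        self.cache = {}     # dom0 page cache
        self.invalidate = invalidate_on_blkback_start

    def pygrub_read(self, block):
        # Reads from dom0 go through the page cache.
        if block not in self.cache:
            self.cache[block] = self.disk[block]
        return self.cache[block]

    def start_blkback(self):
        # The patched behavior: drop cached blocks when the thread starts.
        if self.invalidate:
            self.cache.clear()

    def guest_write(self, block, data):
        # blkback submits I/O straight to the device, bypassing the cache.
        self.disk[block] = data


def boot_twice(patched):
    dom0 = Dom0({"grub.conf": "kernel A"}, patched)
    first = dom0.pygrub_read("grub.conf")      # steps 1-2
    dom0.start_blkback()                       # steps 3-4
    dom0.guest_write("grub.conf", "kernel B")  # steps 5-7
    second = dom0.pygrub_read("grub.conf")     # steps 8-9
    return first, second
```

In this model, boot_twice(False) returns the stale "kernel A" on the second boot, while boot_twice(True) returns "kernel B".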
I just ran tests on a system with FC storage. Your kernel patch bz466681 indeed solves the problem with pygrub.
Will there be a backport to RHEL 5.4?
(In reply to comment #57)
> Hi Chris,
> I just ran tests on a system with FC Storage. Your kernel patch bz466681
> indeed solves the problem, with pygrub.
> Will there be a backport to RHEL 5.4?
Awesome, thanks for the test. I'll be submitting this internally for the next version (5.5). If you need it for 5.4, please open a support ticket requesting it for the 5.4 z-stream.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.