Bug 835466
Summary: [nova] injection of ssh authorized_keys sometimes fails when injecting other files too.

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | David Naori <dnaori> |
| Component: | openstack-nova | Assignee: | Alan Pevec <apevec> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Yaniv Kaul <ykaul> |
| Severity: | high | Priority: | unspecified |
| Version: | 1.0 (Essex) | CC: | jkt, pbrady, rjones |
| Target Milestone: | --- | Target Release: | 1.0 Essex |
| Hardware: | x86_64 | OS: | Linux |
| Fixed In Version: | openstack-nova-2012.1-13.el6 | Doc Type: | Bug Fix |
| Clones: | 838592 (view as bug list) | Last Closed: | 2012-12-13 13:39:55 UTC |
| Type: | Bug | Bug Depends On: | 838609 |
(In reply to comment #0)
> 2) only mount guest image once when injecting files (attached Pádraig's
> workaround patch)

In openstack-nova-2012.1-13.el6.

Duplicate of bug 832511?

Note the sleep(3) also works for the very similar upstream issue seen on RHEL 6.1. Note there are really two issues in the upstream report, https://bugs.launchpad.net/nova/+bug/1013689. The unsynchronised guest issue is very similar and is also addressed with a sleep(3). However, my "inject_once" patch attached here does not address their issue, which is not that surprising since it was only minimising traversal of the problematic code path.

Verified in openstack-nova-2012.1-13.el6:

    # for i in {1..10}; do nova boot test$i --flavor 1 --image 7e04542d-d996-43bb-bdec-f05d27d0df61 --file '/root/david2=/root/test' --security_groups 'allow-icmp-and-ssh' --key_name david; sleep 1; done

    # for ip in `nova list | awk '/demonet/{print $8}' | sed -e s/demonet=//g`; do ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i david.priv root@$ip md5sum /root/.ssh/authorized_keys /root/david2; done

    Warning: Permanently added '192.168.1.163' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.158' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.159' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.161' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.156' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.165' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.162' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.157' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.160' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2
    Warning: Permanently added '192.168.1.164' (RSA) to the list of known hosts.
    b58e9c494f6ea298d17e70a845277066  /root/.ssh/authorized_keys
    35f19e72dd33950659c0222922fc046a  /root/david2

Created attachment 596883 [details]
mount once patch to avoid issue

So this patch is only avoiding the issue. Richard Jones fixed some sync issues with libguestfs recently that may well be the root cause. For details of the sync issues addressed see:
https://www.redhat.com/archives/libguestfs/2012-July/msg00020.html

It would be great to reproduce this issue by using an earlier version of nova, or just:

    (cd /usr/lib/python2.6/site-packages/nova && patch -R -p1 < $this_patch)

and then updating libguestfs to see if the issue is addressed directly. Richard has released Fedora RPMs, though I don't see RHEL RPMs at present. So this is just a reminder note for when they become available. Cheers.

(In reply to comment #6)
> Created attachment 596883 [details]
> mount once patch to avoid issue
>
> So this patch is only avoiding the issue.
> Richard Jones fixed some sync issues with libguestfs recently that may well be
> the root cause.
> For details of the sync issues addressed see:
> https://www.redhat.com/archives/libguestfs/2012-July/msg00020.html

Unless you can prove that qemu is segfaulting, this is not the same thing. Read the conditions (a) through (d) in the above email message carefully.

We can't get any further on this bug until someone captures the debug information from libguestfs:

http://libguestfs.org/guestfs-faq.1.html#debugging_libguestfs

I'm sorry, but everything written about this bug so far has been broad speculation. We need facts.

Created attachment 597041 [details]
compute.log with guestmount --verbose

(In reply to comment #7)
> We can't get any further on this bug until someone captures
> the debug information from libguestfs:
>
> http://libguestfs.org/guestfs-faq.1.html#debugging_libguestfs

Added --verbose to guestmount and logged STDERR to compute.log (attached).

Thanks to some discussion with David on IRC we have a plausible cause for this:

(1) When Nova has finished with the mountpoint, it calls fusermount -u /tmp/mpXX.
(2) fusermount informs the kernel, which begins the process of guestmount exiting, but it doesn't wait for guestmount to actually exit (ie. finish writing and closing the disk image, shutting down qemu etc).
(3) Nova immediately continues by launching the guest.
(4) Since guestmount hasn't exited, another qemu instance runs, probably using the old copy of the disk image.
(5) For some reason all hell *doesn't* break loose at this point. I'm not clear why that is. But anyway it's not good.

The only way to fix this properly that I can see is to use the mount-local API. However, an interim workaround would be to spin on 'pgrep guestmount' just after calling fusermount and wait for the guestmount process to exit (for the hack it doesn't matter which one exits, as long as none are running before we proceed).

David is going to collect the output of 'pgrep fusermount' before and after fusermount to see if the situation above is likely.
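The interim workaround suggested here — spinning on pgrep until no guestmount process remains — could be sketched roughly as follows. This is a hedged illustration, not the actual Nova patch: the helper name is mine, it uses modern Python's subprocess module, and the Essex-era code would have gone through nova.utils.execute and nova-rootwrap instead.

```python
import subprocess
import time


def wait_for_process_exit(name, timeout=10.0, interval=0.2):
    """Poll `pgrep <name>` until no matching process remains.

    Returns True if every matching process exited within `timeout`
    seconds, False otherwise.  pgrep exits non-zero when nothing
    matches, which is exactly the condition we are waiting for.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        found = subprocess.run(
            ["pgrep", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0
        if not found:
            return True
        time.sleep(interval)
    return False


# In Nova this would run right after the unmount request returns,
# e.g. (hypothetical placement):
#     utils.execute('fusermount', '-u', self.mount_dir, run_as_root=True)
#     wait_for_process_exit('guestmount')
```

As the comment notes, the hack does not care which guestmount exits; it only needs none to be running before the guest is launched.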
I meant 'pgrep guestmount' in that last paragraph.

Created attachment 597068 [details]
rhbz835466.sh
Standalone reproducer for this bug. It requires the guestmount, guestfish,
qemu-img and fusermount programs and nothing else.
To run it:
chmod +x rhbz835466.sh
./rhbz835466.sh
There are various possible outputs:
(1)
No conclusion. You must do
fusermount -u /tmp/testmp
and rerun the test.
means that fusermount failed because something jumped
into the mountpoint. Follow the instructions and rerun
the test.
(2) "rhbz 835466 was reproduced"
means that the bug was reproduced. A file was written to
the mountpoint, but the underlying disk file was not
finalized before we managed to copy it away.
(3)
RHBZ#835466 could not be reproduced. This might be luck
or you might not be affected by it.
Also try uncommenting the "sleep 10" line to make the bug go away. This indicates that the magical fix that went upstream just changed the timing of things, as I predicted.

As Richard suspected, guestmount is running both before and after fusermount:

    2012-07-09 15:59:16 DEBUG nova.utils [req-cb983e39-61a1-4774-9249-cddef6ae3dad 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap pgrep guestmount from (pid=8472) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-07-09 15:59:16 DEBUG nova.utils [req-cb983e39-61a1-4774-9249-cddef6ae3dad 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] STDOUT=8717 , STDERR= from (pid=8472) execute /usr/lib/python2.6/site-packages/nova/utils.py:235
    2012-07-09 15:59:16 DEBUG nova.utils [req-cb983e39-61a1-4774-9249-cddef6ae3dad 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap fusermount -u /tmp/tmpEZz4xc from (pid=8472) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-07-09 15:59:16 DEBUG nova.utils [req-cb983e39-61a1-4774-9249-cddef6ae3dad 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] STDOUT=, STDERR= from (pid=8472) execute /usr/lib/python2.6/site-packages/nova/utils.py:235
    2012-07-09 15:59:16 DEBUG nova.utils [req-cb983e39-61a1-4774-9249-cddef6ae3dad 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap pgrep guestmount from (pid=8472) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-07-09 15:59:16 DEBUG nova.utils [req-cb983e39-61a1-4774-9249-cddef6ae3dad 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] STDOUT=8717 , STDERR= from (pid=8472) execute /usr/lib/python2.6/site-packages/nova/utils.py:235

For completeness re attachment 597041 [details], disregard the guestmount commands returning with 247. They were manually killed due to --trace causing them to hang.
Search instead for the two "guestmount --verbose --rw" commands run in sequence. David confirms that the ssh keys injected in the first run are not present.
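As a side note, the before/after pgrep evidence in the log excerpt above can be checked mechanically. The helper below is purely illustrative (the function name and the assumptions about the log format are mine, not part of Nova): it collects the PIDs that `pgrep` reports before a `fusermount -u` command and flags any that are still reported afterwards.

```python
def guestmount_survives_fusermount(log_lines):
    """Scan nova.utils debug output for evidence that a guestmount
    process reported by pgrep before `fusermount -u` is still
    reported by pgrep afterwards.

    Assumes the log format shown above: pgrep's output appears in
    lines containing 'STDOUT=<pid> ,'.
    """
    pids_before = set()
    seen_fusermount = False
    for line in log_lines:
        if "fusermount -u" in line:
            seen_fusermount = True
        elif "STDOUT=" in line:
            # pgrep output looks like "... STDOUT=8717 , STDERR= ..."
            out = line.split("STDOUT=", 1)[1].split(",", 1)[0].strip()
            if not out.isdigit():
                continue  # empty STDOUT (e.g. from fusermount itself)
            if not seen_fusermount:
                pids_before.add(out)
            elif out in pids_before:
                return True  # same PID survived the unmount
    return False
```

Run over the six debug lines above, this reports that PID 8717 survives the unmount, matching what David observed.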
So `fusermount -u` is asynchronous. That's unexpected.
I don't suppose there's an easy way to get fuse to
wait for guestmount to exit? Is there ever a case you wouldn't want that?
I wondered whether we could wait until the mount_dir was unused,
though since we can already os.rmdir(self.mount_dir),
I guess it's released immediately while guestmount is left running?
Manually waiting for guestmount to return wouldn't be too bad,
though it's awkward to get the pid of the actual guestmount process at present.
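To make the failure mode under discussion concrete, here is a small self-contained Python sketch. It is purely illustrative — a background thread stands in for guestmount (which finishes flushing the disk image only after the unmount request returns), not real FUSE machinery. Copying the "image" immediately, as Nova does when it launches the guest, picks up stale contents; waiting for the writer to exit first picks up the final contents.

```python
import shutil
import tempfile
import threading
import time


def demonstrate_async_unmount_race():
    """Toy model of the race between `fusermount -u` returning and
    guestmount actually finishing its writes."""
    src = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".img")
    src.write("stale")
    src.close()

    def late_flush():
        # Stands in for guestmount still finalising the image after
        # the unmount request has already returned to the caller.
        time.sleep(0.5)
        with open(src.name, "w") as f:
            f.write("fresh")

    writer = threading.Thread(target=late_flush)
    writer.start()

    dst = src.name + ".copy"
    shutil.copy(src.name, dst)   # racy: copy right after the "unmount"
    racy = open(dst).read()

    writer.join()                # correct: wait for the writer to exit
    shutil.copy(src.name, dst)
    safe = open(dst).read()
    return racy, safe
```

The racy read returns the stale contents while the post-join read returns the final ones, which is the same distinction as between launching the guest immediately and waiting for the guestmount process to exit.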
(In reply to comment #14)
> I don't suppose there's an easy way to get fuse to
> wait for guestmount to exit? Is there ever a case you wouldn't want that?

Still investigating the fuse & guestmount code to see if this is possible.

> I wondered could we wait until the mount_dir was unused,
> though since we can already os.rmdir(self.mount_dir),
> I guess that's immediately released, but guestmount is left running?

Yes, I don't think this would ever be possible. After all, we *have* to release the mountpoint before closing libguestfs, because if we closed libguestfs first then how would we handle the FUSE requests that Nova is making on the mountpoint? The only question is whether we can make fusermount hang synchronously until guestmount has completely finished, which is what I'm looking at. In any case, mount-local solves this properly, since Nova would wait for g.mount_local_run to return and it gets to close the libguestfs handle itself.

(In reply to comment #15)
> (In reply to comment #14)
> > I don't suppose there's an easy way to get fuse to
> > wait for guestmount to exit? Is there ever a case you wouldn't want that?
>
> Still investigating the fuse & guestmount code to see if this
> is possible.

I don't think so. fusermount calls the umount2 syscall, which eventually gets us to fuse_abort_conn in the kernel, which as far as I can tell simply sends a close message to the userspace process but doesn't wait for any answer.

Filed a bug to add guestmount --pid-file: https://bugzilla.redhat.com/show_bug.cgi?id=838592

I think David is going to work up a Nova patch which will wait for the guestmount process to exit after the fusermount.

Created attachment 597371 [details]
A more direct workaround
I think this should address the issue more directly.
To verify that it does, the "mount once" patch needs to be backed out of course,
but there are benefits to both, so I'll propose both upstream.
Note that on the Folsom branch I'll also remove the sync as part of the same patch.
Patch looks good to me.
Created attachment 594410 [details]
Compute.log, [PATCH] only mount guest image once when injecting files

Description of problem:
Injection of ssh authorized_keys sometimes fails when injecting other files too.

Scenario:
* SELinux disabled on host and guest.

    [root@camel-vdsa nova]# getenforce
    Permissive

    # nova boot test --flavor 1 --image aaddc2ad-7180-48c2-865b-033bd418c9e3 --file '/root/david2=/root/test' --security_groups 'allow-icmp-and-ssh' --key_name david

    [root@test ~]# getenforce
    Permissive

    [root@test ~]# ll -a /root/
    total 32
    dr-xr-x---.  3 root root 4096 Jun 26 05:04 .
    dr-xr-xr-x. 22 root root 4096 Feb 29 06:01 ..
    -rw-r--r--.  1 root root   18 Feb  9  2011 .bash_logout
    -rw-r--r--.  1 root root  176 Feb  9  2011 .bash_profile
    -rw-r--r--.  1 root root  176 Feb  9  2011 .bashrc
    -rw-r--r--.  1 root root  100 Feb  9  2011 .cshrc
    -rw-r--r--.  1 root root   12 Jun 26 05:04 david2
    -rw-r--r--.  1 root root  129 Feb  9  2011 .tcshrc

    2012-06-26 12:03:58 INFO nova.virt.libvirt.connection [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] [instance: 64b68fb6-86e0-441a-8f17-217198cbf5d6] Injecting key into image aaddc2ad-7180-48c2-865b-033bd418c9e3
    2012-06-26 12:03:58 INFO nova.virt.libvirt.connection [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] [instance: 64b68fb6-86e0-441a-8f17-217198cbf5d6] Injecting admin_password into image aaddc2ad-7180-48c2-865b-033bd418c9e3
    2012-06-26 12:03:58 DEBUG nova.virt.disk.api [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] nbd unavailable: module not loaded from (pid=26137) mount /usr/lib/python2.6/site-packages/nova/virt/disk/api.py:205
    2012-06-26 12:03:58 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap guestmount --rw -a /var/lib/nova/instances/instance-00000067/disk -i /tmp/tmpUeYZU6 from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:05 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap mkdir -p /tmp/tmpUeYZU6/root/.ssh from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:06 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap chown root /tmp/tmpUeYZU6/root/.ssh from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:06 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap chmod 700 /tmp/tmpUeYZU6/root/.ssh from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:06 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap tee -a /tmp/tmpUeYZU6/root/.ssh/authorized_keys from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:06 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmpUeYZU6/etc/passwd /tmp/tmpBkL7pN from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:06 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmpUeYZU6/etc/shadow /tmp/tmpJYvFFJ from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:06 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmpBkL7pN /tmp/tmpUeYZU6/etc/passwd from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:07 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap cp /tmp/tmpJYvFFJ /tmp/tmpUeYZU6/etc/shadow from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:07 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap fusermount -u /tmp/tmpUeYZU6 from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:07 DEBUG nova.virt.disk.api [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] nbd unavailable: module not loaded from (pid=26137) mount /usr/lib/python2.6/site-packages/nova/virt/disk/api.py:205
    2012-06-26 12:04:07 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap guestmount --rw -a /var/lib/nova/instances/instance-00000067/disk -i /tmp/tmpEBDePc from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:14 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap mkdir -p /tmp/tmpEBDePc/root from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:14 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap tee /tmp/tmpEBDePc/root/david2 from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:14 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap fusermount -u /tmp/tmpEBDePc from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:16 DEBUG nova.virt.libvirt.connection [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] [instance: 64b68fb6-86e0-441a-8f17-217198cbf5d6] Instance is running from (pid=26137) spawn /usr/lib/python2.6/site-packages/nova/virt/libvirt/connection.py:938
    2012-06-26 12:04:16 DEBUG nova.utils [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] Running cmd (subprocess): sudo nova-rootwrap tee /sys/class/net/vnet0/brport/hairpin_mode from (pid=26137) execute /usr/lib/python2.6/site-packages/nova/utils.py:220
    2012-06-26 12:04:16 DEBUG nova.compute.manager [req-310f14a6-2d11-419c-bba4-e4d5d55bc6af 29af43364ced4611bf6e093a04d4c144 3f1e394b238d4f26b742e9685a9a1a57] [instance: 64b68fb6-86e0-441a-8f17-217198cbf5d6] Checking state from (pid=26137) _get_power_state /usr/lib/python2.6/site-packages/nova/compute/manager.py:272
    2012-06-26 12:04:17 INFO nova.virt.libvirt.connection [-] [instance: 64b68fb6-86e0-441a-8f17-217198cbf5d6] Instance spawned successfully.

Version-Release number of selected component (if applicable):
openstack-nova-2012.1-12.el6.noarch

How reproducible:
80%

Workarounds tested:

1) Add sleep before and after fusermount -u:

    import time
    time.sleep(3)
    utils.execute('fusermount', '-u', self.mount_dir,
                  run_as_root=True, attempts=4)
    time.sleep(3)

2) Only mount the guest image once when injecting files (attached Pádraig's workaround patch).

Additional info: attached Compute.log, [PATCH] only mount guest image once when injecting files