Bug 1095891
| Summary: | systemd-212-4 causes Live images to hang on boot in 1 CPU guests | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Josh Boyer <jwboyer> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rawhide | CC: | awilliam, bcl, bruno, gczarcinski, harald, johannbg, kay, kevin, lnie, lnykryn, msekleta, pbrobinson, plautrba, samuel-rhbugs, s, systemd-maint, vpavlin, wgianopoulos, zbyszek, zing |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-05-28 19:47:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Josh Boyer
2014-05-08 18:52:06 UTC
Created attachment 893731 [details]
systemd.log_level=debug output of hang
Created attachment 893732 [details]
systemd.log_level=debug output of boot when CPUs=2
The attachments above are from the exact same VM instance, both using the XFCE 20140508 live image ISO. The only difference between them is that the hang situation has 1 CPU in the VM and the working boot has 2 CPUs in the VM.

Seeing this also in my rawhide qemu-kvm install.

Single cpu:
systemd-212-4.fc21.x86_64 +
kernel-3.15.0-0.rc4.git1.1.fc21.x86_64 - boots
kernel-3.15.0-0.rc4.git2.1.fc21.x86_64 - hangs at Reached Basic System
kernel-3.15.0-0.rc4.git3.1.fc21.x86_64 - hangs at Reached Basic System

Two cpu:
systemd-212-4.fc21.x86_64 +
kernel-3.15.0-0.rc4.git1.1.fc21.x86_64 - boots
kernel-3.15.0-0.rc4.git2.1.fc21.x86_64 - boots
kernel-3.15.0-0.rc4.git3.1.fc21.x86_64 - boots

*** Bug 1096386 has been marked as a duplicate of this bug. ***

(In reply to Josh Boyer from comment #3)
> The attachments above are from the exact same VM instance, both using the
> XFCE 20140508 live image ISO. The only difference between them is that the
> hang situation has 1 CPU in the VM and the working boot has 2 CPUs in the VM.

I am curious as to what the last systemd version this worked with was. Reason I ask is that between 212-2, which I am assuming works and 212-4 there were 2 changes both in separate builds. So, does this work with 212-3?

It seems to me the uuidd change is more likely to be causing my issue as it is more related to the way I am booting.

(In reply to Bill Gianopoulos from comment #6)
> (In reply to Josh Boyer from comment #3)
> > The attachments above are from the exact same VM instance, both using the
> > XFCE 20140508 live image ISO. The only difference between them is that the
> > hang situation has 1 CPU in the VM and the working boot has 2 CPUs in the VM.
>
> I am curious as to what the last systemd version this worked with was.
> Reason I ask is that between 212-2, which I am assuming works and 212-4
> there were 2 changes both in separate builds. So, does this work with
> 212-3?
> It seems to me the uuidd change is more likely to be causing my
> issue as it is more related to the way I am booting.

The reason I say this is that I am not doing a network boot but am doing a boot using UUIDs. So just thinking perhaps the UUID patch is more relevant to my issue.

Confirming this with F21 virt host here and a live image composed from today's Rawhide: consistently fails to boot with a guest with a single CPU. Haven't checked with a bare metal system yet.

*** Bug 1097606 has been marked as a duplicate of this bug. ***

What is the status of this issue? This is kind of important to fix, as it is not possible to get new users running rawhide on a single CPU system. Also, it is keeping people like me from testing the latest kernel because I am stuck on version 3.15.0-0.rc4.git1.1. If I try to update the kernel, the resultant kernel will not boot. I can, at least, test other packages.

I tried downgrading systemd to see if that would help, but I can't get that to work using yum because of a cyclic dependency issue.

OK I figured out my dependency issue and downgraded systemd to 212-3 and then re-installed the 3.15.0-0.rc5.git2.9 kernel and that results in a successful boot. Therefore this issue is definitely a result of the change between systemd 212-3 and 212-4 which, according to the changelog, is:

* Wed May 07 2014 Kay Sievers <kay> - 212-4
- add netns udev workaround

Just to be excruciatingly clear here: the purpose of re-installing the kernel was to force a rebuild of initramfs with the downgraded version of systemd.

OK this has gone on long enough. A patch that the description defines as a workaround, so not a proper fix for anything, is preventing single CPU systems from being able to boot. This "fix" needs to be reverted ASAP.

lennart and kay are travelling ATM, and harald's been off work lately, that's why systemd/udev stuff is taking longer than usual.
the rest of us are usually reluctant to touch those bits unless we're really sure what we're doing, but i might try a systemd build with the changes from 3 to 4 reverted later.

the change between 3 and 4 has a rather different description upstream, btw:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=9ea28c55a2488e6cd4a44ac5786f12b71ad5bc9f

"udev: remove seqnum API and all assumptions about seqnums"

(In reply to Adam Williamson from comment #14)
> lennart and kay are travelling ATM, and harald's been off work lately,
> that's why systemd/udev stuff is taking longer than usual. the rest of us
> are usually reluctant to touch those bits unless we're really sure what
> we're doing, but i might try a systemd build with the changes from 3 to 4
> reverted later.

Sorry, sometimes I get a bit impatient. In order to fix a different issue I wanted to do a new clean install, and there has just been no way to do that for a while now.

OK, so I hit a small road bump reproducing this for testing purposes - it doesn't seem to happen, at least for me, with non-debug kernels. But it looks like it's reliably reproducible with debug kernels.

Today's (2014-05-26) Xfce nightly reliably reproduces this issue in a single-CPU KVM guest for me: six boot attempts, six fails. I built an Xfce live locally with the same kernel (3.15.0-0.rc6.git1.1.fc21.x86_64) but with a systemd scratch build with the patch from -4 dropped. Tried five boots on the same KVM, got five successes. Seems pretty definitive.

I don't know what that patch is intended to fix, or why it was considered sufficiently important to be backported, but I can't imagine that it could be something *worse* than this, so I'm going to go ahead and push out a systemd -5 with the patch reverted.

Thanks, Bill, for identifying the offending component.
as this seems like a very serious issue, I've reported it directly to upstream as https://bugs.freedesktop.org/show_bug.cgi?id=79283 just to be safe (though I'm sure Kay would look after it upstream in any case).

Without this patch, the installer will hang or not work, because of the way network namespaces are implemented in the kernel: they break udev by "stealing" expected seqnums from the host's primary namespace.

The base OS recently started to use network namespaces (PrivateNetwork=yes in unit files), so this will show up again.

Is there a simple way to reproduce the "single CPU" issue? It sounds pretty strange.

Are we sure that it hangs forever, not only for a few minutes and then continues?

Could you try to boot with plymouth disabled on the kernel command line?

(Lennart and I are still on vacation this week, without proper internet.)

"Is there a simple way to reproduce the "single CPU" issue? It sounds pretty strange."

Very simple. Grab the nightly I linked above. Set up a normal KVM (I'm using virt-manager) with a single CPU. Try and boot it. Profit. Add a CPU, it'll boot fine. Use systemd 212-3 or 212-5, it'll boot fine.

"Are we sure that it hangs forever, not only for a few minutes and then continues?"

I didn't leave mine for terribly long, don't know about the other reporters. I can leave one sitting here while I make dinner.

"Could you try to boot with plymouth disabled on the kernel command line?"

The hang is before plymouth kicks in, I think, but sure, easy enough to try...

...boot without 'rhgb quiet' and with 'rd.plymouth=0 plymouth.enable=0' still hangs. I'll leave this attempt sitting here for a while.

(In reply to Adam Williamson from comment #20)
> "Is there a simple way to reproduce the "single CPU" issue? It sounds pretty
> strange."
>
> Very simple. Grab the nightly I linked above.

Care to add an exact link here? I don't see it. We are in China, downloading large files might not work too well, so it would be nice to get the right one.
:)

my guess is, that it needs also
http://cgit.freedesktop.org/systemd/systemd/commit/?id=83be2c398589a3d64db5999cfd5527c5219bff46
which fixes the "udevadm settle" issue introduced by 9ea28c55a2488e6cd4a44ac5786f12b71ad5bc9f

hanging forever ... well, a couple of hours is close enough

kay: sorry, the link was in the other bug report:

http://kojipkgs.fedoraproject.org/work/tasks/1167/6891167/Fedora-Live-Xfce-x86_64-rawhide-20140526.iso

Created attachment 899832 [details]
screenshot
Adding "debug" to the kernel commandline shows the output like in the
attached screenshot. Dracut hangs in a loop. It looks like the issue
Harald pointed out above.
I really have no idea what kind of voodoo is going on that makes it
behave differently with one or more CPUs.
A new release is coming out today, and should have the fix.
Submitted to rawhide.

I tested a live image with systemd 213-3 and kernel 3.15.0-0.rc7.git1.1.fc21.x86_64 (a debug kernel). Booted successfully three times in a row, install seemed fine up until it hit https://bugzilla.redhat.com/show_bug.cgi?id=1101557 , which looks to have been going on for a while. So I'd say this is looking fixed.
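
For anyone wanting to re-check the single-CPU versus two-CPU comparison described in this report, here is a minimal sketch that only builds the two guest command lines. It is not what the reporters ran: they used virt-manager, so invoking qemu-system-x86_64 directly is an assumption, as are the 2048 MB memory size and the helper name qemu_live_boot_argv (hypothetical). The ISO name is the nightly linked above and is assumed to have been downloaded to the current directory.

```python
import shlex
from typing import List


def qemu_live_boot_argv(iso_path: str, cpus: int, memory_mb: int = 2048) -> List[str]:
    """Build a qemu command line that boots a live ISO with a given CPU count.

    Hypothetical helper for illustration; the memory size is an assumed default.
    """
    if cpus < 1:
        raise ValueError("a guest needs at least one CPU")
    return [
        "qemu-system-x86_64",
        "-enable-kvm",          # use KVM acceleration, as in the reports above
        "-smp", str(cpus),      # the only setting that differs between runs
        "-m", str(memory_mb),
        "-cdrom", iso_path,
        "-boot", "d",           # boot from the CD-ROM device
    ]


if __name__ == "__main__":
    iso = "Fedora-Live-Xfce-x86_64-rawhide-20140526.iso"
    # Print one command per CPU count so the two boots can be compared by hand.
    for cpus in (1, 2):
        print(shlex.join(qemu_live_boot_argv(iso, cpus)))
```

The script prints the commands rather than launching the guests, so the two boots can be started by hand and the hang observed on the console.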