Bug 1095891

Summary: systemd-212-4 causes Live images to hang on boot in 1 CPU guests
Product: [Fedora] Fedora Reporter: Josh Boyer <jwboyer>
Component: systemd Assignee: systemd-maint
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhide CC: awilliam, bcl, bruno, gczarcinski, harald, johannbg, kay, kevin, lnie, lnykryn, msekleta, pbrobinson, plautrba, samuel-rhbugs, s, systemd-maint, vpavlin, wgianopoulos, zbyszek, zing
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-05-28 19:47:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description: systemd.log_level=debug output of hang (Flags: none)
Description: systemd.log_level=debug output of boot when CPUs=2 (Flags: none)
Description: screenshot (Flags: none)

Description Josh Boyer 2014-05-08 18:52:06 UTC
Description of problem:

Downloading today's live images (20140508) and attempting to boot them in a KVM guest with 1 CPU results in a hang at the Basic System target.  The images from 20140507 boot fine, and those are composed with systemd-212-2.

Version-Release number of selected component (if applicable):

systemd-212-4

How reproducible:

Always (tested Workstation and XFCE live images)


Steps to Reproduce:
1. Download live image
2. Create KVM guest with 1 CPU
3. Boot

Actual results:

Boot hangs at "Reached target Basic System."

Expected results:

Boots to the live desktop

Additional info:

I tried modifying the amount of memory allocated first, but that didn't seem to make a difference.  The images clearly work if the KVM guest has 2 CPUs, but not if it has one.

I noticed there was a rather large patch to udev in this systemd release.  It's possible that broke booting in this scenario.

Comment 1 Josh Boyer 2014-05-08 18:59:46 UTC
Created attachment 893731 [details]
systemd.log_level=debug output of hang

Comment 2 Josh Boyer 2014-05-08 19:00:44 UTC
Created attachment 893732 [details]
systemd.log_level=debug output of boot when CPUs=2

Comment 3 Josh Boyer 2014-05-08 19:01:51 UTC
The attachments above are from the exact same VM instance, both using the XFCE 20140508 live image ISO.  The only difference between them is that the hang situation has 1 CPU in the VM and the working boot has 2 CPUs in the VM.

Comment 4 Zing 2014-05-09 17:47:43 UTC
Seeing this also in my rawhide qemu-kvm install.

Single cpu:

systemd-212-4.fc21.x86_64 +
kernel-3.15.0-0.rc4.git1.1.fc21.x86_64 - boots
kernel-3.15.0-0.rc4.git2.1.fc21.x86_64 - hangs at Reached Basic System
kernel-3.15.0-0.rc4.git3.1.fc21.x86_64 - hangs at Reached Basic System

Two cpu:

systemd-212-4.fc21.x86_64 +
kernel-3.15.0-0.rc4.git1.1.fc21.x86_64 - boots
kernel-3.15.0-0.rc4.git2.1.fc21.x86_64 - boots
kernel-3.15.0-0.rc4.git3.1.fc21.x86_64 - boots
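
The results above can be written down as a small table to make the pattern explicit. This is only a sketch that transcribes the six observations from this comment; the function names are made up for illustration.

```python
# Results transcribed from the comment above.
# All runs used systemd-212-4.fc21.x86_64 in a qemu-kvm guest.
RESULTS = {
    (1, "3.15.0-0.rc4.git1.1"): "boots",
    (1, "3.15.0-0.rc4.git2.1"): "hangs",
    (1, "3.15.0-0.rc4.git3.1"): "hangs",
    (2, "3.15.0-0.rc4.git1.1"): "boots",
    (2, "3.15.0-0.rc4.git2.1"): "boots",
    (2, "3.15.0-0.rc4.git3.1"): "boots",
}


def hanging_configs(results):
    """Return the (cpus, kernel) pairs that hung, sorted for stable output."""
    return sorted(key for key, outcome in results.items() if outcome == "hangs")


def cpu_counts_that_hang(results):
    """Return the set of CPU counts for which at least one boot hung."""
    return {cpus for cpus, _kernel in hanging_configs(results)}


if __name__ == "__main__":
    for cpus, kernel in hanging_configs(RESULTS):
        print(f"{cpus} CPU, kernel {kernel}: hangs")
```

In this data set the hang needs both a single CPU and a kernel newer than git1.1; neither condition alone triggers it.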

Comment 5 Bill Gianopoulos 2014-05-10 14:11:21 UTC
*** Bug 1096386 has been marked as a duplicate of this bug. ***

Comment 6 Bill Gianopoulos 2014-05-11 17:02:24 UTC
(In reply to Josh Boyer from comment #3)
> The attachments above are from the exact same VM instance, both using the
> XFCE 20140508 live image ISO.  The only difference between them is that the
> hang situation has 1 CPU in the VM and the working boot has 2 CPUs in the VM.

I am curious as to what the last systemd version this worked with was.  Reason I ask is that between 212-2, which I am assuming works and 212-4 there were 2 changes both in separate builds.  So, does this work with 212-3?  It seems to me the uuidd change is more likely to be causing my issue as it is more related to the way I am booting.

Comment 7 Bill Gianopoulos 2014-05-11 17:57:54 UTC
(In reply to Bill Gianopoulos from comment #6)
> (In reply to Josh Boyer from comment #3)
> > The attachments above are from the exact same VM instance, both using the
> > XFCE 20140508 live image ISO.  The only difference between them is that the
> > hang situation has 1 CPU in the VM and the working boot has 2 CPUs in the VM.
> 
> I am curious as to what the last systemd version this worked with was. 
> Reason I ask is that between 212-2, which I am assuming works and 212-4
> there were 2 changes both in separate builds.  So, does this work with
> 212-3?  It seems to me the uuidd change is more likely to be causing my
> issue as it is more related to the way I am booting.

The reason I say this is that I am not doing a network boot but am doing a boot using UUIDs.  So I am just thinking that perhaps the UUID patch is more relevant to my issue.

Comment 8 Adam Williamson 2014-05-13 00:04:14 UTC
Confirming this with F21 virt host here and a live image composed from today's Rawhide: consistently fails to boot with a guest with a single CPU. Haven't checked with a bare metal system yet.

Comment 9 David Shea 2014-05-14 15:53:35 UTC
*** Bug 1097606 has been marked as a duplicate of this bug. ***

Comment 10 Bill Gianopoulos 2014-05-16 15:20:02 UTC
What is the status of this issue?  This is kind of important to fix, as it is not possible to get new users running rawhide on a single CPU system.

Also, it is keeping people like me from testing the latest kernel because I am stuck on version 3.15.0-0.rc4.git1.1.  If I try to update the kernel, the resultant kernel will not boot.  I can, at least, test other packages.

I tried downgrading systemd to see if that would help, but I can't get that to work using yum because of a cyclic dependency issue.

Comment 11 Bill Gianopoulos 2014-05-18 12:28:18 UTC
OK I figured out my dependency issue and downgraded systemd to 212-3 and then re-installed the 3.15.0-0.rc5.git2.9 kernel and that results in a successful boot.  Therefore this issue is definitely a result of the change between systemd 212-3 and 212-4 which, according to the changelog, is:

* Wed May 07 2014 Kay Sievers <kay> - 212-4 - add netns udev workaround

Comment 12 Bill Gianopoulos 2014-05-18 12:54:21 UTC
Just to be excruciatingly clear here.  The purpose of re-installing the kernel was to force a rebuild of initramfs with the downgraded version of systemd.
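
The comment above rebuilt the initramfs by reinstalling the kernel. Assuming the image is generated by dracut (which the later comments suggest), a more direct route would be to call dracut itself. The sketch below only builds the command line; the function name is hypothetical and the image path follows the usual Fedora naming, which is an assumption here.

```python
import os


def dracut_rebuild_command(kernel_version):
    """Build a dracut command that overwrites the initramfs for one kernel.

    The /boot/initramfs-<version>.img path is an assumed naming convention.
    """
    image = f"/boot/initramfs-{kernel_version}.img"
    # --force lets dracut overwrite an existing image file.
    return ["dracut", "--force", image, kernel_version]


if __name__ == "__main__":
    # Print the command for the running kernel instead of executing it,
    # since actually running it requires root.
    print(" ".join(dracut_rebuild_command(os.uname().release)))
```

Printing the command rather than running it keeps the sketch safe to try on any machine.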

Comment 13 Bill Gianopoulos 2014-05-24 01:43:24 UTC
OK this has gone on long enough.  A patch that the description defines as a workaround, so not a proper fix for anything, is preventing single CPU systems from being able to boot.  This "fix" needs to be reverted ASAP.

Comment 14 Adam Williamson 2014-05-24 02:27:14 UTC
lennart and kay are travelling ATM, and harald's been off work lately, that's why systemd/udev stuff is taking longer than usual. the rest of us are usually reluctant to touch those bits unless we're really sure what we're doing, but i might try a systemd build with the changes from 3 to 4 reverted later.

Comment 15 Adam Williamson 2014-05-24 02:32:20 UTC
the change between 3 and 4 has a rather different description upstream, btw:

http://cgit.freedesktop.org/systemd/systemd/commit/?id=9ea28c55a2488e6cd4a44ac5786f12b71ad5bc9f

"udev: remove seqnum API and all assumptions about seqnums"

Comment 16 Bill Gianopoulos 2014-05-24 10:15:20 UTC
(In reply to Adam Williamson from comment #14)
> lennart and kay are travelling ATM, and harald's been off work lately,
> that's why systemd/udev stuff is taking longer than usual. the rest of us
> are usually reluctant to touch those bits unless we're really sure what
> we're doing, but i might try a systemd build with the changes from 3 to 4
> reverted later.

Sorry, sometimes I get a bit impatient.  In order to fix a different issue I wanted to do a new clean install, and there has just been no way to do that for a while now.

Comment 17 Adam Williamson 2014-05-27 00:02:42 UTC
OK, so I hit a small road bump reproducing this for testing purposes - it doesn't seem to happen at least for me with non-debug kernels. But it looks like it's reliably reproducible with debug kernels.

Today's (2014-05-26) Xfce nightly reliably reproduces this issue in a single-CPU KVM guest for me: six boot attempts, six fails. I built an Xfce live locally with the same kernel (3.15.0-0.rc6.git1.1.fc21.x86_64) but with a systemd scratch build with the patch from -4 dropped. Tried five boots on the same KVM, got five successes. Seems pretty definitive.

I don't know what that patch is intended to fix, or why it was considered sufficiently important to be backported, but I can't imagine that it could be something *worse* than this, so I'm going to go ahead and push out a systemd -5 with the patch reverted. Thanks, Bill, for identifying the offending component.

Comment 18 Adam Williamson 2014-05-27 00:10:45 UTC
as this seems like a very serious issue, I've reported it directly to upstream as https://bugs.freedesktop.org/show_bug.cgi?id=79283 just to be safe (though I'm sure Kay would look after it upstream in any case).

Comment 19 Kay Sievers 2014-05-27 00:40:14 UTC
Without this patch, the installer will hang or not work, because the way
network namespaces are implemented in the kernel, they break udev by "stealing"
expected seqnums from the host's primary namespace.

The base OS recently started to use network namespaces, PrivateNetwork=yes
in unit files, so this will show up again.
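
A toy model may help picture the "stolen" seqnums. It is not udev or kernel code: the class and function names are invented, and the only assumption taken from the description above is that events for another namespace consume numbers from the same global counter without being delivered to the host listener.

```python
import itertools


class ToyKernel:
    """Hands out one global, increasing sequence number per event."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.last_seqnum = 0

    def emit(self, namespace):
        """Number a new event and tag it with the namespace that receives it."""
        self.last_seqnum = next(self._counter)
        return (self.last_seqnum, namespace)


def host_view(events):
    """Sequence numbers the host listener actually receives."""
    return [seq for seq, namespace in events if namespace == "host"]


def missing_seqnums(seen, last_seqnum):
    """Numbers a listener would still wait for if it assumed no gaps."""
    return sorted(set(range(1, last_seqnum + 1)) - set(seen))


kernel = ToyKernel()
events = [kernel.emit(ns) for ns in ("host", "host", "netns", "host", "netns")]
seen = host_view(events)
print(seen)                                       # [1, 2, 4]
print(missing_seqnums(seen, kernel.last_seqnum))  # [3, 5]
```

A listener that waits until it has seen every number up to the kernel's latest one would never finish in this model, because 3 and 5 went to the other namespace.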

Is there a simple way to reproduce the "single CPU" issue? It sounds pretty
strange.

Are we sure that it hangs forever, and not only for a few minutes before it
continues?

Could you try to boot with plymouth disabled on the kernel command
line?

(Lennart and I are still on vacation this week, without proper internet.)

Comment 20 Adam Williamson 2014-05-27 01:15:11 UTC
"Is there a simple way to reproduce the "single CPU" issue? It sounds pretty
strange."

Very simple. Grab the nightly I linked above. Set up a normal KVM (I'm using virt-manager) with a single CPU. Try and boot it. Profit. Add a CPU, it'll boot fine. Use systemd 212-3 or 212-5, it'll boot fine.

"Are we sure that is hangs for forever, not only for a few minutes and the
continues?"

I didn't leave mine for terribly long, don't know about the other reporters. I can leave one sitting here while I make dinner.

"Could you try to boot with plymouth disabled on the kernel command
line?"

The hang is before plymouth kicks in, I think, but sure, easy enough to try...

...boot without 'rhgb quiet' and with 'rd.plymouth=0 plymouth.enable=0' still hangs. I'll leave this attempt sitting here for a while.
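
For anyone repeating this test, the command-line change described above can be sketched as a small helper. The function name and the sample command line are hypothetical; only the dropped and added options come from this comment.

```python
def disable_plymouth(cmdline):
    """Drop 'rhgb' and 'quiet', then append the Plymouth-disabling options."""
    drop = {"rhgb", "quiet"}
    add = ["rd.plymouth=0", "plymouth.enable=0"]
    args = [arg for arg in cmdline.split() if arg not in drop]
    # Avoid duplicating an option that is already present.
    args += [opt for opt in add if opt not in args]
    return " ".join(args)


if __name__ == "__main__":
    # Hypothetical live-image command line, for illustration only.
    sample = "root=live:CDLABEL=Example rd.live.image rhgb quiet"
    print(disable_plymouth(sample))
```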

Comment 21 Kay Sievers 2014-05-27 10:59:15 UTC
(In reply to Adam Williamson from comment #20)
> "Is there a simple way to reproduce the "single CPU" issue? It sounds pretty
> strange."
> 
> Very simple. Grab the nightly I linked above.

Care to add an exact link here, I don't see it. We are in China, downloading large files might not work too well, so it would be nice to get the
right one. :)

Comment 22 Harald Hoyer 2014-05-27 11:48:57 UTC
my guess is, that it needs also
http://cgit.freedesktop.org/systemd/systemd/commit/?id=83be2c398589a3d64db5999cfd5527c5219bff46

which fixes the "udevadm settle" issue introduced by 9ea28c55a2488e6cd4a44ac5786f12b71ad5bc9f

Comment 23 Gene Czarcinski 2014-05-27 14:53:17 UTC
hanging forever ... well, a couple of hours is close enough

Comment 24 Adam Williamson 2014-05-27 14:59:38 UTC
kay: sorry, the link was in the other bug report: http://kojipkgs.fedoraproject.org/work/tasks/1167/6891167/Fedora-Live-Xfce-x86_64-rawhide-20140526.iso

Comment 25 Kay Sievers 2014-05-28 06:07:27 UTC
Created attachment 899832 [details]
screenshot

Adding "debug" to the kernel commandline shows the output like in the
attached screenshot. Dracut hangs in a loop. It looks like the issue
Harald pointed out above.

I really have no idea what kind of voodoo is going on that makes it
behave differently with one or more CPUs.

A new release is coming out today, and should have the fix.

Comment 26 Kay Sievers 2014-05-28 11:00:36 UTC
Submitted to rawhide.

Comment 27 Adam Williamson 2014-05-28 19:47:58 UTC
I tested a live image with systemd 213-3 and kernel 3.15.0-0.rc7.git1.1.fc21.x86_64 (a debug kernel). Booted successfully three times in a row, install seemed fine up until it hit https://bugzilla.redhat.com/show_bug.cgi?id=1101557 , which looks to have been going on for a while. So I'd say this is looking fixed.