Red Hat Bugzilla – Bug 1110968
dbus does not run on i686 Rawhide with 3.16 kernel
Last modified: 2014-06-23 23:33:35 EDT
Testing Fedora-Live-Workstation-i686-rawhide-20140618.iso in a VM. (URL: https://kojipkgs.fedoraproject.org/work/tasks/3758/7053758/Fedora-Live-Workstation-i686-rawhide-20140618.iso ). It consistently fails to boot (to graphical or multi-user). Seems to be because systemd-logind.service fails to start.
If I redirect journal output to the console, I see this when logind is starting up:
systemd: Failed to register match for Disconnected message: Connection timed out
systemd-logind: Failed to add match for NameOwnerChanged: Connection timed out
systemd-logind: Failed to fully start up daemon: Connection timed out
I don't know what it is it's failing to "connect" to, though. Can't find any indication. I don't see any obviously related messages earlier in boot, either.
Nominating as an Alpha blocker per "All release-blocking images must boot in their supported configurations".
This looks like a problem with dbus tbh, which appears not to reply?
Does "busctl" work correctly and instantly if you run it in this case?
I haven't managed to get to a point where I can interact with the system at the time of the problem. I can't (somehow) boot, enable debug-shell.service, and restart because this is a live image. I can't boot to rescue.target / 'single' because live images have no root password and the boot process can't cope with that. I tried booting to init=/bin/sh , manually symlinking debug-shell.service into basic.target.wants, and then running 'exec /usr/lib/systemd/systemd' , but for some reason this caused systemd to reboot the system. That was the point at which I gave up and just started redirecting journal output to the console instead.
I guess I could hack up a test live image with debug-shell.service enabled, that might be viable...
OK, so couple of things:
1) this is i686 specific
2) 'busctl' in a debug-shell fails when i hit the bug
going for lunch. more details later.
OK, so yeah, I built i686 and x86_64 images from today's Rawhide with debug-shell.service enabled. x86_64 image boots fine to a desktop, i686 image fails in this way.
It does seem like something odd is going on with dbus, going by the messages on the console. It seems to keep starting dbus - it always reports success, and no indication that it's being stopped, but then why does it *keep* starting it?
From the debug console, 'systemctl status dbus.service' shows it as failed, having been killed with signal SEGV.
Running 'systemctl restart dbus.service' from the debug shell hangs, at least beyond my patience - I've had it running for like 10 minutes now and nothing. Looks like you can ctrl-C out of it.
Running the command from /usr/lib/systemd/system/dbus.service:
dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
from the debug shell gives:
Failed to start message bus: No socket received.
that's all I've got for now.
oh, and just to confirm - 'ps aux | grep dbus' shows no running dbus daemon.
well apparently no-one's running Rawhide on i686, because this appears to affect an installed system too. Interestingly, though, an installed minimal system can crawl on its hands and knees to a tty and you can log in - boot pauses for a long time, all sorts of services fail to start, and there's a long delay on login, but you *can* log in.
Running 'busctl' at that point takes about 30 seconds to return "Failed to list names: Connection timed out".
ps aux | grep dbus shows a <defunct> dbus-daemon process (pid 537).
So I did a couple of tests to try and narrow this down to dbus for sure.
A major new dbus version landed recently - we went from 1.6.18 to 1.8.4 on 2014-06-12 ( http://koji.fedoraproject.org/koji/buildinfo?buildID=537883 ).
A live image from 2014-06-11 boots mostly normally, a live image from 2014-06-13 displays what looks like this issue. So that tends to indicate dbus.
However, it doesn't seem so straightforward testing with an installed system. If I install F20 and then upgrade to Rawhide but exclude dbus and dbus-libs from the transaction, the upgraded system still seems to show the bug. The same applies if I use systemd 213 instead of 214. So it doesn't seem to be *just* dbus 1.8.4 or systemd 214. I wasn't able to hit on a precise update that triggers the bug in the 'installed system' case, yet.
I'm pretty sure kernel is to blame here.
The crash occurs when signal handler for RT signal 33 (used internaly by NPTL/pthreads) returns to address zero (not sure why).
This only happens with 3.16 kernel, works fine with 3.15.
Seems to be related to v3.16's arch/x86 VDSO changes. Works fine when booting with vdso=0.
Andy, some more potential fallout from the 3.16 vdso changes.
Lubomir, do you have the actual backtrace from d-bus?
(In reply to Josh Boyer from comment #12)
> Lubomir, do you have the actual backtrace from d-bus?
Well, the stack is essentially empty:
(gdb) run --nofork --nopidfile --system
Starting program: /usr/bin/dbus-daemon --nofork --nopidfile --system
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
^^^ I had to remove libthread_db, since gdb detects it does not work with 3.16 and refuses to proceed with debugging.
[New LWP 883]
Program received signal SIG33, Real-time event 33.
[Switching to LWP 883]
0xb7fd9afc in ?? ()
[tcsetpgrp failed in terminal_inferior: Inappropriate ioctl for device]
Program received signal SIG33, Real-time event 33.
0x00000000 in ?? ()
#0 0x00000000 in ?? ()
(gdb) info registers
eax 0x0 0
ecx 0x0 0
edx 0x0 0
ebx 0x0 0
esp 0x0 0x0
ebp 0x0 0x0
esi 0x36f 879
edi 0xfffffffa -6
eip 0x0 0x0
eflags 0x202 [ IF ]
cs 0x3 3
ss 0x3 3
ds 0x0 0
es 0x21 33
fs 0xaacc 43724
gs 0x0 0
I'm currently running this in a very miminal environment to bisect the bad commit, thus no debugging symbols. Only proc, sysfs and selinuxfs are mounted and the selinux policy is loaded: proc is needed for capability drop to work and libselinux is what uses a pthread that hits the issue.
I'll try to provide a more meaningful trace later.
^^^ git bisect points to this as the first broken commit
Created attachment 911021 [details]
This is a seriously embarrassing typo. Try this.
I have no idea how this survived any testing at all. Apparently rt signals just aren't that common.
(In reply to Andy Lutomirski from comment #15)
> Created attachment 911021 [details]
> Likely fix
> This is a seriously embarrassing typo. Try this.
> I have no idea how this survived any testing at all. Apparently rt signals
> just aren't that common.
This fixes the issue for me.
thanks for sorting this out, folks! indeed the first 3.16 kernel build for Rawhide happened in the suspect timeframe (2014-06-11), so that all adds up nicely.
I think this is the first straightforward problem I caused with my vdso and vsyscall fiddling over the last couple years. Everything else was some subtle corner case involving userspace doing strange things. It's refreshing :)
I can confirm the latest 3.16 kernel causes this issue and that the kernel-3.15.0-1.fc21 boots fine
I've added Andy's patch to today's rawhide build. Should be fixed with the 3.16-rc2 kernel. Thanks all!
Thanks, it fixed my broken laptop.