Bug 1110968

Summary: dbus does not run on i686 Rawhide with 3.16 kernel
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rawhideCC: awilliam, gansalmon, itamar, johannbg, jonathan, kernel-maint, lkundrak, lnykryn, lpoetter, luto, madhu.chinakonda, mchehab, msekleta, pbrobinson, robatino, s, systemd-maint, vpavlin, walters, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-23 14:17:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1043119    
Attachments:
Description Flags
Likely fix none

Description Adam Williamson 2014-06-19 00:05:48 UTC
Testing Fedora-Live-Workstation-i686-rawhide-20140618.iso in a VM. (URL: https://kojipkgs.fedoraproject.org/work/tasks/3758/7053758/Fedora-Live-Workstation-i686-rawhide-20140618.iso ). It consistently fails to boot (to graphical or multi-user). Seems to be because systemd-logind.service fails to start.

If I redirect journal output to the console, I see this when logind is starting up:

systemd[1]: Failed to register match for Disconnected message: Connection timed out
systemd-logind[758]: Failed to add match for NameOwnerChanged: Connection timed out
systemd-logind[758]: Failed to fully start up daemon: Connection timed out

I don't know what it is it's failing to "connect" to, though. Can't find any indication. I don't see any obviously related messages earlier in boot, either.

Comment 1 Adam Williamson 2014-06-19 00:06:24 UTC
Nominating as an Alpha blocker per "All release-blocking images must boot in their supported configurations".

Comment 2 Lennart Poettering 2014-06-20 10:13:56 UTC
This looks like a problem with dbus tbh, which appears not to reply?

Does "busctl" work correctly and instantly if you run it in this case?

Comment 3 Adam Williamson 2014-06-20 19:09:54 UTC
I haven't managed to get to a point where I can interact with the system at the time of the problem. I can't (somehow) boot, enable debug-shell.service, and restart because this is a live image. I can't boot to rescue.target / 'single' because live images have no root password and the boot process can't cope with that. I tried booting to init=/bin/sh , manually symlinking debug-shell.service into basic.target.wants, and then running 'exec /usr/lib/systemd/systemd' , but for some reason this caused systemd to reboot the system. That was the point at which I gave up and just started redirecting journal output to the console instead.

I guess I could hack up a test live image with debug-shell.service enabled, that might be viable...

Comment 4 Adam Williamson 2014-06-20 20:51:52 UTC
OK, so couple of things:

1) this is i686 specific
2) 'busctl' in a debug-shell fails when i hit the bug

going for lunch. more details later.

Comment 5 Adam Williamson 2014-06-20 22:51:13 UTC
OK, so yeah, I built i686 and x86_64 images from today's Rawhide with debug-shell.service enabled. x86_64 image boots fine to a desktop, i686 image fails in this way.

It does seem like something odd is going on with dbus, going by the messages on the console. It seems to keep starting dbus - it always reports success, and no indication that it's being stopped, but then why does it *keep* starting it?

From the debug console, 'systemctl status dbus.service' shows it as failed, having been killed with signal SEGV.

Running 'systemctl restart dbus.service' from the debug shell hangs, at least beyond my patience - I've had it running for like 10 minutes now and nothing. Looks like you can ctrl-C out of it.

Running the command from /usr/lib/systemd/system/dbus.service:

dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation

from the debug shell gives:

Failed to start message bus: No socket received.

that's all I've got for now.

Comment 6 Adam Williamson 2014-06-20 22:52:08 UTC
oh, and just to confirm - 'ps aux | grep dbus' shows no running dbus daemon.

Comment 7 Adam Williamson 2014-06-21 04:25:29 UTC
well apparently no-one's running Rawhide on i686, because this appears to affect an installed system too. Interestingly, though, an installed minimal system can crawl on its hands and knees to a tty and you can log in - boot pauses for a long time, all sorts of services fail to start, and there's a long delay on login, but you *can* log in.

Running 'busctl' at that point takes about 30 seconds to return "Failed to list names: Connection timed out".

ps aux | grep dbus shows a <defunct> dbus-daemon process (pid 537).

Comment 8 Adam Williamson 2014-06-21 07:13:16 UTC
So I did a couple of tests to try and narrow this down to dbus for sure.

A major new dbus version landed recently - we went from 1.6.18 to 1.8.4 on 2014-06-12 ( http://koji.fedoraproject.org/koji/buildinfo?buildID=537883 ).

A live image from 2014-06-11 boots mostly normally, a live image from 2014-06-13 displays what looks like this issue. So that tends to indicate dbus.

However, it doesn't seem so straightforward testing with an installed system. If I install F20 and then upgrade to Rawhide but exclude dbus and dbus-libs from the transaction, the upgraded system still seems to show the bug. The same applies if I use systemd 213 instead of 214. So it doesn't seem to be *just* dbus 1.8.4 or systemd 214. I wasn't able to hit on a precise update that triggers the bug in the 'installed system' case, yet.

Comment 9 Lubomir Rintel 2014-06-21 08:58:37 UTC
I'm pretty sure kernel is to blame here.

The crash occurs when signal handler for RT signal 33 (used internaly by NPTL/pthreads) returns to address zero (not sure why).

This only happens with 3.16 kernel, works fine with 3.15.

Comment 10 Lubomir Rintel 2014-06-21 09:31:26 UTC
Seems to be related to v3.16's arch/x86 VDSO changes. Works fine when booting with vdso=0.

Comment 11 Josh Boyer 2014-06-21 13:24:58 UTC
Andy, some more potential fallout from the 3.16 vdso changes.

Comment 12 Josh Boyer 2014-06-21 13:35:11 UTC
Lubomir, do you have the actual backtrace from d-bus?

Comment 13 Lubomir Rintel 2014-06-21 13:54:28 UTC
(In reply to Josh Boyer from comment #12)
> Lubomir, do you have the actual backtrace from d-bus?

Well, the stack is essentially empty:

(gdb) run --nofork --nopidfile --system
Starting program: /usr/bin/dbus-daemon --nofork --nopidfile --system
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

^^^ I had to remove libthread_db, since gdb detects it does not work with 3.16 and refuses to proceed with debugging.

[New LWP 883]

Program received signal SIG33, Real-time event 33.
[Switching to LWP 883]
0xb7fd9afc in ?? ()
(gdb) c
Continuing.
[tcsetpgrp failed in terminal_inferior: Inappropriate ioctl for device]

Program received signal SIG33, Real-time event 33.
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
(gdb) info registers 
eax            0x0	0
ecx            0x0	0
edx            0x0	0
ebx            0x0	0
esp            0x0	0x0
ebp            0x0	0x0
esi            0x36f	879
edi            0xfffffffa	-6
eip            0x0	0x0
eflags         0x202	[ IF ]
cs             0x3	3
ss             0x3	3
ds             0x0	0
es             0x21	33
fs             0xaacc	43724
gs             0x0	0
(gdb) 

I'm currently running this in a very miminal environment to bisect the bad commit, thus no debugging symbols. Only proc, sysfs and selinuxfs are mounted and the selinux policy is loaded: proc is needed for capability drop to work and libselinux is what uses a pthread that hits the issue.

I'll try to provide a more meaningful trace later.

Comment 14 Lubomir Rintel 2014-06-21 14:02:17 UTC
http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=6f121e548f83674ab4920a4e60afb58d4f61b829

^^^ git bisect points to this as the first broken commit

Comment 15 Andy Lutomirski 2014-06-21 14:26:18 UTC
Created attachment 911021 [details]
Likely fix

This is a seriously embarrassing typo.  Try this.

I have no idea how this survived any testing at all.  Apparently rt signals just aren't that common.

Comment 16 Lubomir Rintel 2014-06-21 15:35:15 UTC
(In reply to Andy Lutomirski from comment #15)
> Created attachment 911021 [details]
> Likely fix
> 
> This is a seriously embarrassing typo.  Try this.
> 
> I have no idea how this survived any testing at all.  Apparently rt signals
> just aren't that common.

This fixes the issue for me.

Thank you!

Comment 17 Adam Williamson 2014-06-21 16:06:17 UTC
thanks for sorting this out, folks! indeed the first 3.16 kernel build for Rawhide happened in the suspect timeframe (2014-06-11), so that all adds up nicely.

Comment 18 Andy Lutomirski 2014-06-21 16:22:07 UTC
I think this is the first straightforward problem I caused with my vdso and vsyscall fiddling over the last couple years.  Everything else was some subtle corner case involving userspace doing strange things.  It's refreshing :)

Comment 19 Peter Robinson 2014-06-22 12:59:23 UTC
I can confirm the latest 3.16 kernel causes this issue and that the kernel-3.15.0-1.fc21 boots fine

Comment 20 Josh Boyer 2014-06-23 14:17:43 UTC
I've added Andy's patch to today's rawhide build.  Should be fixed with the 3.16-rc2 kernel.  Thanks all!

Comment 21 Christopher Meng 2014-06-24 03:33:35 UTC
Thanks, it fixed my broken laptop.