Bug 1463745

Summary:

systemctl commands timeout

Product:

[Fedora] Fedora

Reporter:

John Ellson <john.ellson>

Component:

systemd

Assignee:

systemd-maint

Status:

CLOSED DUPLICATE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

rawhide

CC:

johannbg, lnykryn, msekleta, muadda, ssahani, s, systemd-maint, zbyszek

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2017-07-02 21:23:11 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
output of "journalctl -b -1"	none
output of: gdb -p 1	none
gdb output of: bt full	none
gdb -p 1 ... bt full	none

Description John Ellson 2017-06-21 15:55:12 UTC

Description of problem:
systemctl commands timeout

This bug has been reported on Ubuntu and Arch Linux, and is now occurring on Fedor27
According to:
https://unix.stackexchange.com/questions/98128/what-are-the-systemd-command-line-options-switched-root-and-deserialize
https://serverfault.com/questions/712928/systemctl-commands-timeout-when-ran-as-root
this bug is related to the use of --switched-root --deserialize 24 options
use when root starts systemd

Problem occurs on both x86_64, and i686 fedora27 architectures
Problem is not currently occurring on fedore26 or earlier.

Version-Release number of selected component (if applicable):
systemd-233-4.fc27

How reproducible:
100%

Steps to Reproduce:
1. normal reboot
2. login as root
3. ps auxwq 1
4. systemctl

Actual results:

[root@bld-fedora27 ~]# ps auxwq 1
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.2 0.5 158188 11852 ? Ss 11:22 0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
[root@bld-fedora27 ~]# systemctl
Failed to list units: Connection timed out
[root@bld-fedora27 ~]#

.... also ....
The Fedora icon on boot screen never completes. If "rhgb quiet" is removed from the kernel line then the bug doesn't happen. See https://bugzilla.redhat.com/show_bug.cgi?id=1452808

.... also ...
ssh -vvv logins from other systems hang for ~60 seconds after "pledge: exec" while waiting for
debug3: receive packet: type 80
debug1: client_input_global_request: rtype hostkeys-00
Other reports suggest that UsePAM=0 avoids this hang. I've not tried, but maybe pam is involved here. Again, removing "rhgb quiet" stops the problem
See https://bugzilla.redhat.com/show_bug.cgi?id=1452804
https://serverfault.com/questions/792486/ssh-connection-takes-forever-to-initiate-stuck-at-pledge-network

Expected results:
None of the above

Additional info:
Related bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1452804
https://bugzilla.redhat.com/show_bug.cgi?id=1452808
https://unix.stackexchange.com/questions/98128/what-are-the-systemd-command-line-options-switched-root-and-deserialize
https://serverfault.com/questions/712928/systemctl-commands-timeout-when-ran-as-root
https://serverfault.com/questions/792486/ssh-connection-takes-forever-to-initiate-stuck-at-pledge-network

Comment 1 Zbigniew Jędrzejewski-Szmek 2017-06-21 20:07:44 UTC

Hm, hanging systemctl usually means that PID1 crashed. Can you show 'journalctl -b' output?

Comment 2 John Ellson 2017-06-26 15:14:16 UTC

Created attachment 1292011 [details]
output of "journalctl -b -1"

output contains this crash:

Jun 26 10:51:00 bld-fedora27 kernel: ------------[ cut here ]------------
Jun 26 10:51:00 bld-fedora27 kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo_util.c:589!
Jun 26 10:51:00 bld-fedora27 kernel: invalid opcode: 0000 [#1] SMP
Jun 26 10:51:00 bld-fedora27 kernel: Modules linked in: sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq
Jun 26 10:51:00 bld-fedora27 kernel: CPU: 1 PID: 339 Comm: plymouthd Not tainted 4.12.0-0.rc6.git3.1.fc27.x86_64 #1
Jun 26 10:51:00 bld-fedora27 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Jun 26 10:51:00 bld-fedora27 kernel: task: ffff91dd77990000 task.stack: ffffb69e00720000
Jun 26 10:51:00 bld-fedora27 kernel: RIP: 0010:ttm_bo_kmap+0x1af/0x250 [ttm]
Jun 26 10:51:00 bld-fedora27 kernel: RSP: 0018:ffffb69e00723ba8 EFLAGS: 00010206
Jun 26 10:51:00 bld-fedora27 kernel: RAX: ffff91dd77987b50 RBX: ffff91dd779dc858 RCX: ffff91dd779dcc08
Jun 26 10:51:00 bld-fedora27 kernel: RDX: 0000000000000300 RSI: 0000000000000000 RDI: ffff91dd77430d90
Jun 26 10:51:00 bld-fedora27 kernel: RBP: ffffb69e00723be0 R08: 0000000000010004 R09: ffff91dd779dc928
Jun 26 10:51:00 bld-fedora27 kernel: R10: 0000000000000008 R11: 0000000000000000 R12: ffffb69e00723c48
Jun 26 10:51:00 bld-fedora27 kernel: R13: 0000000000000000 R14: ffff91dd77430d70 R15: 0000000000000000
Jun 26 10:51:00 bld-fedora27 kernel: FS:  00007fa4e051fd00(0000) GS:ffff91dd7d400000(0000) knlGS:0000000000000000
Jun 26 10:51:00 bld-fedora27 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 26 10:51:00 bld-fedora27 kernel: CR2: 00005616290356f8 CR3: 0000000035b8b000 CR4: 00000000001406e0
Jun 26 10:51:00 bld-fedora27 kernel: Call Trace:
Jun 26 10:51:00 bld-fedora27 kernel:  qxl_bo_kmap+0x42/0x70 [qxl]
Jun 26 10:51:00 bld-fedora27 kernel:  qxl_draw_dirty_fb+0x1f5/0x420 [qxl]
Jun 26 10:51:00 bld-fedora27 kernel:  qxl_framebuffer_surface_dirty+0xa0/0xf0 [qxl]
Jun 26 10:51:00 bld-fedora27 kernel:  drm_mode_dirtyfb_ioctl+0x190/0x1d0 [drm]
Jun 26 10:51:00 bld-fedora27 kernel:  drm_ioctl+0x232/0x500 [drm]
Jun 26 10:51:00 bld-fedora27 kernel:  ? drm_mode_getfb+0x110/0x110 [drm]
Jun 26 10:51:00 bld-fedora27 kernel:  ? __might_sleep+0x4a/0x80
Jun 26 10:51:00 bld-fedora27 kernel:  ? selinux_file_ioctl+0x124/0x1e0
Jun 26 10:51:00 bld-fedora27 kernel:  do_vfs_ioctl+0xa6/0x720
Jun 26 10:51:00 bld-fedora27 kernel:  ? getnstimeofday64+0xe/0x20
Jun 26 10:51:00 bld-fedora27 kernel:  ? security_file_ioctl+0x43/0x60
Jun 26 10:51:00 bld-fedora27 kernel:  SyS_ioctl+0x79/0x90
Jun 26 10:51:00 bld-fedora27 kernel:  do_syscall_64+0x6c/0x1c0
Jun 26 10:51:00 bld-fedora27 kernel:  entry_SYSCALL64_slow_path+0x25/0x25
Jun 26 10:51:00 bld-fedora27 kernel: RIP: 0033:0x7fa4df504eb7
Jun 26 10:51:00 bld-fedora27 kernel: RSP: 002b:00007ffca83efbc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 26 10:51:00 bld-fedora27 kernel: RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fa4df504eb7
Jun 26 10:51:00 bld-fedora27 kernel: RDX: 00007ffca83efc00 RSI: 00000000c01864b1 RDI: 0000000000000009
Jun 26 10:51:00 bld-fedora27 kernel: RBP: 00007ffca83efc00 R08: 00007fa4dd42577c R09: 0000000000000010
Jun 26 10:51:00 bld-fedora27 kernel: R10: 00007fa4dd424884 R11: 0000000000000246 R12: 00000000c01864b1
Jun 26 10:51:00 bld-fedora27 kernel: R13: 0000000000000009 R14: 0000555e394ef330 R15: 00007fa4ddf8278c
Jun 26 10:51:00 bld-fedora27 kernel: Code: d0 48 8b bb 80 00 00 00 48 c1 e6 0c f6 43 62 04 74 44 48 03 7b 70 4c 01 ef e8 8e 21 bd ef 49 89 04 24 44 8b 45 d0 e9 18 ff ff ff <0f> 0b 4b 8d 7c 3e 58 89 4
Jun 26 10:51:00 bld-fedora27 kernel: RIP: ttm_bo_kmap+0x1af/0x250 [ttm] RSP: ffffb69e00723ba8
Jun 26 10:51:00 bld-fedora27 kernel: ---[ end trace 13b661e538049e06 ]---

Comment 3 Zbigniew Jędrzejewski-Szmek 2017-06-26 16:14:49 UTC

That oops doesn't explain why systemd itself would not respond.
Looks like a bug in the QXL graphics subsystem, you might want to file that as a separate bug.

Based on the logs, everything seems to be running OK. Is dbus generally functional? E.g. can you run 'loginctl' or 'udisksctl status'?

Comment 4 John Ellson 2017-06-26 19:11:56 UTC

'loginctl' and 'udisksctl status' both appear to run ok.


https://bugzilla.redhat.com/show_bug.cgi?id=1465148
filed for the QXL issue

Comment 5 Zbigniew Jędrzejewski-Szmek 2017-06-28 22:54:59 UTC

I don't see any substantial info in the bug reports and stackoverflow questions you linked. Seems to be a bunch of unrelated issues.

Returning to your problem: it seems that somehow either the communication between systemctl and systemd fail, or systemd fails to provide a response. It would love to figure out what is going on here...

First question: from the logs, I see you are running with selinux in enforcing mode. If you switch to permissive ('setenforce 0'), does that change anything?

Second question: when you run the systemctl command, and it's waiting for the reply, can you start 'gdb -p 1' in another terminal, and paste the output of 'bt full'? This should help with figuring out if its systemd that's hanging.

Comment 6 John Ellson 2017-07-01 19:42:22 UTC

> If you switch to permissive ('setenforce 0'), does that change anything?

I changed to "SELINUX=disabled" in /etc/syslinux/config 
and rebooted (with "rhgb qyuet" present").

Disabling selinux made no difference.

One new observation is that the first ssh -l root, and systemctl, was successful.    On the second attempt,  a couple of minutes later, the ssh login was really slow,  and the systemctl command hung.

Comment 7 John Ellson 2017-07-01 19:44:17 UTC

Created attachment 1293513 [details]
output of:  gdb -p 1

systemctl was running and not yet timed-out when this "gdb -p 1" output was captured.

Comment 8 John Ellson 2017-07-01 19:48:30 UTC

Created attachment 1293514 [details]
gdb output of:  bt full

Comment 9 John Ellson 2017-07-01 19:52:49 UTC

Created attachment 1293515 [details]
gdb -p 1 ...   bt full

You probably wanted those from a single gdb session.

Comment 10 Zbigniew Jędrzejewski-Szmek 2017-07-02 20:41:17 UTC

*** Bug 1452808 has been marked as a duplicate of this bug. ***

Comment 11 Zbigniew Jędrzejewski-Szmek 2017-07-02 20:50:26 UTC

Thanks for the data.

> Jun 26 10:51:22 bld-fedora27 systemd[1]: plymouth-quit.service: Start operation timed out. Terminating.

The pattern here and in #1452808 seems to be the same:
- systemd starts plymouth-quit.service
- plymouth-quit.service fails because of a timeout
- systemd tries to write to the console
- writev() never returns – systemd hangs

The only difference is that you see a QXL backtrace in the logs, but it seems reasonable to assume that in either case the graphics subsystem is hosed.

We'll need to discuss the solution to this upstream. One option would be to set a timeout on the writev, but that's quite messy.

Comment 12 Zbigniew Jędrzejewski-Szmek 2017-07-02 21:23:11 UTC

After discussing this with Lennart, it seems that isn't much that systemd can do. Once the kernel does an oops, any code which calls into the subsystem will also hang. Plymouth is hanging, and systemd will hang too once it calls writev() on the console, even if we used O_NONBLOCK. So the only solution is to fix the kernel code.

*** This bug has been marked as a duplicate of bug 1465148 ***