Bug 1463745 - systemctl commands timeout
systemctl commands timeout
Status: CLOSED DUPLICATE of bug 1465148
Product: Fedora
Classification: Fedora
Component: systemd (Show other bugs)
rawhide
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: systemd-maint
Fedora Extras Quality Assurance
:
: 1452808 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-06-21 11:55 EDT by John Ellson
Modified: 2017-07-02 17:23 EDT (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-02 17:23:11 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
output of "journalctl -b -1" (154.06 KB, text/plain)
2017-06-26 11:14 EDT, John Ellson
no flags Details
output of: gdb -p 1 (7.02 KB, text/plain)
2017-07-01 15:44 EDT, John Ellson
no flags Details
gdb output of: bt full (2.63 KB, text/plain)
2017-07-01 15:48 EDT, John Ellson
no flags Details
gdb -p 1 ... bt full (10.49 KB, text/plain)
2017-07-01 15:52 EDT, John Ellson
no flags Details

  None (edit)
Description John Ellson 2017-06-21 11:55:12 EDT
Description of problem:
systemctl commands timeout

This bug has been reported on Ubuntu and Arch Linux,  and is now occurring on Fedor27
According to:
   https://unix.stackexchange.com/questions/98128/what-are-the-systemd-command-line-options-switched-root-and-deserialize
   https://serverfault.com/questions/712928/systemctl-commands-timeout-when-ran-as-root
this bug is related to the use of --switched-root --deserialize 24  options
use when root starts systemd

Problem occurs on both x86_64, and i686 fedora27 architectures
Problem is not currently occurring on fedore26 or earlier.


Version-Release number of selected component (if applicable):
systemd-233-4.fc27

How reproducible:
100%

Steps to Reproduce:
1. normal reboot
2. login as root
3. ps auxwq 1
4. systemctl

Actual results:

[root@bld-fedora27 ~]# ps auxwq 1
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.5 158188 11852 ?        Ss   11:22   0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
[root@bld-fedora27 ~]# systemctl
Failed to list units: Connection timed out
[root@bld-fedora27 ~]#

.... also ....
The Fedora icon on boot screen never completes.   If "rhgb quiet" is removed from the kernel line then the bug doesn't happen.   See https://bugzilla.redhat.com/show_bug.cgi?id=1452808


.... also ...
ssh -vvv  logins from other systems hang for ~60 seconds after "pledge: exec" while waiting for 
    debug3: receive packet: type 80
    debug1: client_input_global_request: rtype hostkeys-00@openssh.com 
Other reports suggest that UsePAM=0 avoids this hang.  I've not tried, but maybe pam is involved here.  Again, removing "rhgb quiet" stops the problem
See https://bugzilla.redhat.com/show_bug.cgi?id=1452804
    https://serverfault.com/questions/792486/ssh-connection-takes-forever-to-initiate-stuck-at-pledge-network


Expected results:
None of the above

Additional info:
Related bugs:
   https://bugzilla.redhat.com/show_bug.cgi?id=1452804
   https://bugzilla.redhat.com/show_bug.cgi?id=1452808
   https://unix.stackexchange.com/questions/98128/what-are-the-systemd-command-line-options-switched-root-and-deserialize
   https://serverfault.com/questions/712928/systemctl-commands-timeout-when-ran-as-root
   https://serverfault.com/questions/792486/ssh-connection-takes-forever-to-initiate-stuck-at-pledge-network
Comment 1 Zbigniew Jędrzejewski-Szmek 2017-06-21 16:07:44 EDT
Hm, hanging systemctl usually means that PID1 crashed. Can you show 'journalctl -b' output?
Comment 2 John Ellson 2017-06-26 11:14 EDT
Created attachment 1292011 [details]
output of "journalctl -b -1"

output contains this crash:

Jun 26 10:51:00 bld-fedora27 kernel: ------------[ cut here ]------------
Jun 26 10:51:00 bld-fedora27 kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo_util.c:589!
Jun 26 10:51:00 bld-fedora27 kernel: invalid opcode: 0000 [#1] SMP
Jun 26 10:51:00 bld-fedora27 kernel: Modules linked in: sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq
Jun 26 10:51:00 bld-fedora27 kernel: CPU: 1 PID: 339 Comm: plymouthd Not tainted 4.12.0-0.rc6.git3.1.fc27.x86_64 #1
Jun 26 10:51:00 bld-fedora27 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Jun 26 10:51:00 bld-fedora27 kernel: task: ffff91dd77990000 task.stack: ffffb69e00720000
Jun 26 10:51:00 bld-fedora27 kernel: RIP: 0010:ttm_bo_kmap+0x1af/0x250 [ttm]
Jun 26 10:51:00 bld-fedora27 kernel: RSP: 0018:ffffb69e00723ba8 EFLAGS: 00010206
Jun 26 10:51:00 bld-fedora27 kernel: RAX: ffff91dd77987b50 RBX: ffff91dd779dc858 RCX: ffff91dd779dcc08
Jun 26 10:51:00 bld-fedora27 kernel: RDX: 0000000000000300 RSI: 0000000000000000 RDI: ffff91dd77430d90
Jun 26 10:51:00 bld-fedora27 kernel: RBP: ffffb69e00723be0 R08: 0000000000010004 R09: ffff91dd779dc928
Jun 26 10:51:00 bld-fedora27 kernel: R10: 0000000000000008 R11: 0000000000000000 R12: ffffb69e00723c48
Jun 26 10:51:00 bld-fedora27 kernel: R13: 0000000000000000 R14: ffff91dd77430d70 R15: 0000000000000000
Jun 26 10:51:00 bld-fedora27 kernel: FS:  00007fa4e051fd00(0000) GS:ffff91dd7d400000(0000) knlGS:0000000000000000
Jun 26 10:51:00 bld-fedora27 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 26 10:51:00 bld-fedora27 kernel: CR2: 00005616290356f8 CR3: 0000000035b8b000 CR4: 00000000001406e0
Jun 26 10:51:00 bld-fedora27 kernel: Call Trace:
Jun 26 10:51:00 bld-fedora27 kernel:  qxl_bo_kmap+0x42/0x70 [qxl]
Jun 26 10:51:00 bld-fedora27 kernel:  qxl_draw_dirty_fb+0x1f5/0x420 [qxl]
Jun 26 10:51:00 bld-fedora27 kernel:  qxl_framebuffer_surface_dirty+0xa0/0xf0 [qxl]
Jun 26 10:51:00 bld-fedora27 kernel:  drm_mode_dirtyfb_ioctl+0x190/0x1d0 [drm]
Jun 26 10:51:00 bld-fedora27 kernel:  drm_ioctl+0x232/0x500 [drm]
Jun 26 10:51:00 bld-fedora27 kernel:  ? drm_mode_getfb+0x110/0x110 [drm]
Jun 26 10:51:00 bld-fedora27 kernel:  ? __might_sleep+0x4a/0x80
Jun 26 10:51:00 bld-fedora27 kernel:  ? selinux_file_ioctl+0x124/0x1e0
Jun 26 10:51:00 bld-fedora27 kernel:  do_vfs_ioctl+0xa6/0x720
Jun 26 10:51:00 bld-fedora27 kernel:  ? getnstimeofday64+0xe/0x20
Jun 26 10:51:00 bld-fedora27 kernel:  ? security_file_ioctl+0x43/0x60
Jun 26 10:51:00 bld-fedora27 kernel:  SyS_ioctl+0x79/0x90
Jun 26 10:51:00 bld-fedora27 kernel:  do_syscall_64+0x6c/0x1c0
Jun 26 10:51:00 bld-fedora27 kernel:  entry_SYSCALL64_slow_path+0x25/0x25
Jun 26 10:51:00 bld-fedora27 kernel: RIP: 0033:0x7fa4df504eb7
Jun 26 10:51:00 bld-fedora27 kernel: RSP: 002b:00007ffca83efbc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 26 10:51:00 bld-fedora27 kernel: RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fa4df504eb7
Jun 26 10:51:00 bld-fedora27 kernel: RDX: 00007ffca83efc00 RSI: 00000000c01864b1 RDI: 0000000000000009
Jun 26 10:51:00 bld-fedora27 kernel: RBP: 00007ffca83efc00 R08: 00007fa4dd42577c R09: 0000000000000010
Jun 26 10:51:00 bld-fedora27 kernel: R10: 00007fa4dd424884 R11: 0000000000000246 R12: 00000000c01864b1
Jun 26 10:51:00 bld-fedora27 kernel: R13: 0000000000000009 R14: 0000555e394ef330 R15: 00007fa4ddf8278c
Jun 26 10:51:00 bld-fedora27 kernel: Code: d0 48 8b bb 80 00 00 00 48 c1 e6 0c f6 43 62 04 74 44 48 03 7b 70 4c 01 ef e8 8e 21 bd ef 49 89 04 24 44 8b 45 d0 e9 18 ff ff ff <0f> 0b 4b 8d 7c 3e 58 89 4
Jun 26 10:51:00 bld-fedora27 kernel: RIP: ttm_bo_kmap+0x1af/0x250 [ttm] RSP: ffffb69e00723ba8
Jun 26 10:51:00 bld-fedora27 kernel: ---[ end trace 13b661e538049e06 ]---
Comment 3 Zbigniew Jędrzejewski-Szmek 2017-06-26 12:14:49 EDT
That oops doesn't explain why systemd itself would not respond.
Looks like a bug in the QXL graphics subsystem, you might want to file that as a separate bug.

Based on the logs, everything seems to be running OK. Is dbus generally functional? E.g. can you run 'loginctl' or 'udisksctl status'?
Comment 4 John Ellson 2017-06-26 15:11:56 EDT
'loginctl' and 'udisksctl status' both appear to run ok.


https://bugzilla.redhat.com/show_bug.cgi?id=1465148
filed for the QXL issue
Comment 5 Zbigniew Jędrzejewski-Szmek 2017-06-28 18:54:59 EDT
I don't see any substantial info in the bug reports and stackoverflow questions you linked. Seems to be a bunch of unrelated issues.

Returning to your problem: it seems that somehow either the communication between systemctl and systemd fail, or systemd fails to provide a response. It would love to figure out what is going on here...

First question: from the logs, I see you are running with selinux in enforcing mode. If you switch to permissive ('setenforce 0'), does that change anything?

Second question: when you run the systemctl command, and it's waiting for the reply, can you start 'gdb -p 1' in another terminal, and paste the output of 'bt full'? This should help with figuring out if its systemd that's hanging.
Comment 6 John Ellson 2017-07-01 15:42:22 EDT
> If you switch to permissive ('setenforce 0'), does that change anything?

I changed to "SELINUX=disabled" in /etc/syslinux/config 
and rebooted (with "rhgb qyuet" present").

Disabling selinux made no difference.

One new observation is that the first ssh -l root, and systemctl, was successful.    On the second attempt,  a couple of minutes later, the ssh login was really slow,  and the systemctl command hung.
Comment 7 John Ellson 2017-07-01 15:44 EDT
Created attachment 1293513 [details]
output of:  gdb -p 1

systemctl was running and not yet timed-out when this "gdb -p 1" output was captured.
Comment 8 John Ellson 2017-07-01 15:48 EDT
Created attachment 1293514 [details]
gdb output of:  bt full
Comment 9 John Ellson 2017-07-01 15:52 EDT
Created attachment 1293515 [details]
gdb -p 1 ...   bt full

You probably wanted those from a single gdb session.
Comment 10 Zbigniew Jędrzejewski-Szmek 2017-07-02 16:41:17 EDT
*** Bug 1452808 has been marked as a duplicate of this bug. ***
Comment 11 Zbigniew Jędrzejewski-Szmek 2017-07-02 16:50:26 EDT
Thanks for the data.

> Jun 26 10:51:22 bld-fedora27 systemd[1]: plymouth-quit.service: Start operation timed out. Terminating.

The pattern here and in #1452808 seems to be the same:
- systemd starts plymouth-quit.service
- plymouth-quit.service fails because of a timeout
- systemd tries to write to the console
- writev() never returns – systemd hangs

The only difference is that you see a QXL backtrace in the logs, but it seems reasonable to assume that in either case the graphics subsystem is hosed.

We'll need to discuss the solution to this upstream. One option would be to set a timeout on the writev, but that's quite messy.
Comment 12 Zbigniew Jędrzejewski-Szmek 2017-07-02 17:23:11 EDT
After discussing this with Lennart, it seems that isn't much that systemd can do. Once the kernel does an oops, any code which calls into the subsystem will also hang. Plymouth is hanging, and systemd will hang too once it calls writev() on the console, even if we used O_NONBLOCK. So the only solution is to fix the kernel code.

*** This bug has been marked as a duplicate of bug 1465148 ***

Note You need to log in before you can comment on or make changes to this bug.