Bug 1387661 - Kernel traceback appears, zombies replicate like crazy, and SSH becomes inaccessible after a while with 3.10.0-327.36.1.rt56.237.el7.x86_64
Summary: Kernel traceback appears, zombies replicate like crazy, and SSH becomes inacc...
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steven Rostedt
QA Contact: Jiri Kastner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-10-21 13:19 UTC by Radek Bíba
Modified: 2017-01-11 06:54 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-11 06:54:29 UTC
Target Upstream Version:


Attachments
dmesg (9.98 KB, text/plain)
2016-10-21 13:19 UTC, Radek Bíba

Description Radek Bíba 2016-10-21 13:19:06 UTC
Created attachment 1212850 [details]
dmesg

Description of problem:
A system running kernel-3.10.0-327.36.1.rt56.237.el7.x86_64 ran into problems after a few minutes of uptime. A running SSH session was still usable, but no other connections were accepted; 'Write failed: Broken pipe' occurred instead. In the running SSH session, I ran top and ps and found out that there was an increasing number of zombie sshd and crond processes. Eventually, I noticed a traceback on the system console and also in dmesg. Attaching the relevant part of the dmesg output.

Reboot didn't help; the system ran into the same problems in just a few minutes.

I've now booted a non-RT kernel (3.10.0-327.36.2.el7.x86_64) and everything has been OK so far.

Comment 2 Radek Bíba 2016-10-21 15:15:18 UTC
I might add that this happened on a test system with nearly every package in the distro installed. I've installed another system with an almost minimal package set, and it's been running smoothly so far. I'm going to keep it up and running during the weekend.

Comment 3 Jiri Kastner 2016-10-25 18:08:36 UTC
(In reply to Radek Bíba from comment #2)
> I might add that this happened on a test system with nearly every package in
> the distro installed. I've installed another system with an almost minimum
> package set, and it's been running smoothly so far. I'm going to keep it up
> and running during the weekend.

is that other machine also a kvm guest?

Comment 4 Radek Bíba 2016-10-26 06:25:33 UTC
(In reply to Jiri Kastner from comment #3)
> is that another machine also kvm guest?

Yes, https://beaker.engineering.redhat.com/view/bkr-hv03-guest26.dsal.lab.eng.bos.redhat.com#details.

It's doing just fine, BTW.

Comment 5 Daniel Bristot de Oliveira 2016-10-26 16:58:29 UTC
Hi Radek,

I have some questions about this problem:

Is it reproducible?
If it is, could you please try to reproduce the problem and then crash the system to generate a vmcore?
As far as I understood, it took place in a virtual machine, is that correct?

-- Daniel

Comment 6 Radek Bíba 2016-10-27 07:54:37 UTC
Hi Daniel,

Yes, this problem has been reliably reproducible so far. The RT kernel "gets sick" after a while every time.

I booted the kernel, crashed it (echo c > /proc/sysrq-trigger; I hope that's a good way to do it in this case), and got the crash data in /var/crash/127.0.0.1-2016-10-27-06\:53\:14/. The vmcore file is more than a hundred MB, and using "makedumpfile -c -d 31" didn't help much, so I created a zip archive from the full vmcore (deflated 49%). You can get the zip archive as well as the accompanying vmcore-dmesg.txt file from here:

http://nest.test.redhat.com/mnt/qa/scratch/rbiba/kernel-rt-bz1387661/

(The host is only accessible inside the VPN.)
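
For reference, the procedure above boils down to roughly this (a sketch of what I ran, with the paths from this run; output file names are arbitrary):

# trigger the crash once the problem shows up
echo c > /proc/sysrq-trigger

# after reboot, kdump left the dump under /var/crash/<ip>-<timestamp>/
cd /var/crash/127.0.0.1-2016-10-27-06\:53\:14/
makedumpfile -c -d 31 vmcore vmcore.small   # didn't shrink it much here
zip vmcore.zip vmcore                       # deflated ~49%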

The answer to the third question is yes. The system is a virtual machine. I currently don't have access to the hypervisor, though.

HTH.

Comment 7 Daniel Bristot de Oliveira 2016-10-31 14:17:30 UTC
vmcore analysis:

vmcore at:
ssh YOURUSER.rdu2.redhat.com
...
$ retrace-server-interact 483612848 crash

Analysis:

UN (uninterruptible, i.e. D state) tasks, ordered by the last time they ran:

crash> foreach UN ps -m
[0 00:00:00.002] [UN]  PID: 1      TASK: ffff8801166b0000  CPU: 0   COMMAND: "systemd"
[0 00:00:37.578] [UN]  PID: 847    TASK: ffff8800346de9c0  CPU: 0   COMMAND: "systemd-logind"
[0 00:02:42.728] [UN]  PID: 585    TASK: ffff8800345c5240  CPU: 0   COMMAND: "systemd-journal"
[0 00:02:42.784] [UN]  PID: 148    TASK: ffff880034570000  CPU: 0   COMMAND: "kworker/0:2"

The stack of the oldest UN task is:

PID: 148    TASK: ffff880034570000  CPU: 0   COMMAND: "kworker/0:2"
 #0 [ffff88010cbfbc28] __schedule at ffffffff81627d69
 #1 [ffff88010cbfbc90] schedule at ffffffff81628340
 #2 [ffff88010cbfbca8] __rt_mutex_slowlock at ffffffff81628ca5
 #3 [ffff88010cbfbd00] rt_mutex_slowlock at ffffffff81628f1c
 #4 [ffff88010cbfbdc8] rt_mutex_lock at ffffffff816290b1
 #5 [ffff88010cbfbdd8] _mutex_lock at ffffffff81629a2e
 #6 [ffff88010cbfbde8] cgroup_free_fn at ffffffff810e86b8
 #7 [ffff88010cbfbe20] process_one_work at ffffffff81091c36
 #8 [ffff88010cbfbe68] worker_thread at ffffffff81092992
 #9 [ffff88010cbfbec8] kthread at ffffffff8109a0b1
#10 [ffff88010cbfbf50] ret_from_fork at ffffffff816321d8

The task is waiting on the mutex at 0xffffffff81973a00, which is owned by:
crash> px ((struct rt_mutex *)0xffffffff81973a00)->owner
$2 = (struct task_struct *) 0xffff8801166b0001

The second oldest UN task is:

PID: 585    TASK: ffff8800345c5240  CPU: 0   COMMAND: "systemd-journal"
 #0 [ffff880033d4bc90] __schedule at ffffffff81627d69
 #1 [ffff880033d4bcf8] schedule at ffffffff81628340
 #2 [ffff880033d4bd10] __rt_mutex_slowlock at ffffffff81628ca5
 #3 [ffff880033d4bd68] rt_mutex_slowlock at ffffffff81628f1c
 #4 [ffff880033d4be30] rt_mutex_lock at ffffffff816290b1
 #5 [ffff880033d4be40] _mutex_lock at ffffffff81629a2e
 #6 [ffff880033d4be50] proc_cgroup_show at ffffffff810eb786
 #7 [ffff880033d4be98] seq_read at ffffffff811f073a
 #8 [ffff880033d4bf08] vfs_read at ffffffff811cb2cc
 #9 [ffff880033d4bf38] sys_read at ffffffff811cbe1f
#10 [ffff880033d4bf80] system_call_fastpath at ffffffff81632289
    RIP: 00007fead7b564a0  RSP: 00007fffa5f791c8  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffffffff81632289  RCX: 0000000000000003
    RDX: 0000000000000400  RSI: 00007fead8f5c000  RDI: 0000000000000011
    RBP: 000000000000000a   R8: 00000000ffffffff   R9: 0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 00007feadada9410
    R13: 0000000000000000  R14: 0000000000000000  R15: 00007feadada9410
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

which is also waiting on the same mutex:

crash> px ((struct rt_mutex *)0xffffffff81973a00)->owner
$1 = (struct task_struct *) 0xffff8801166b0001

And the third...

PID: 847    TASK: ffff8800346de9c0  CPU: 0   COMMAND: "systemd-logind"
 #0 [ffff8800d97b7c90] __schedule at ffffffff81627d69
 #1 [ffff8800d97b7cf8] schedule at ffffffff81628340
 #2 [ffff8800d97b7d10] __rt_mutex_slowlock at ffffffff81628ca5
 #3 [ffff8800d97b7d68] rt_mutex_slowlock at ffffffff81628f1c
 #4 [ffff8800d97b7e30] rt_mutex_lock at ffffffff816290b1
 #5 [ffff8800d97b7e40] _mutex_lock at ffffffff81629a2e
 #6 [ffff8800d97b7e50] proc_cgroup_show at ffffffff810eb786
 #7 [ffff8800d97b7e98] seq_read at ffffffff811f073a
 #8 [ffff8800d97b7f08] vfs_read at ffffffff811cb2cc
 #9 [ffff8800d97b7f38] sys_read at ffffffff811cbe1f
#10 [ffff8800d97b7f80] system_call_fastpath at ffffffff81632289
    RIP: 00007f14fe0564a0  RSP: 00007ffe20a57998  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffffffff81632289  RCX: ffffffffffffffff
    RDX: 0000000000000400  RSI: 00007f14fefb7000  RDI: 0000000000000014
    RBP: 000000000000000a   R8: 00000000ffffffff   R9: 0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 00007f14ffd94dc0
    R13: 0000000000000000  R14: 0000000000000000  R15: 00007f14ffd94dc0
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

is also waiting on the same lock!

crash> px ((struct rt_mutex *)0xffffffff81973a00)->owner
$3 = (struct task_struct *) 0xffff8801166b0001

The blocked logind is probably what is holding up new logins.

The owner of the mutex is the task below (after clearing the lowest bit of the owner pointer, which is the RT_MUTEX_HAS_WAITERS flag).
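
One way to strip that bit directly in crash (a sketch; crash hands the expression to gdb, so the exact result label will differ):

crash> px (0xffff8801166b0001 & ~1)
$N = 0xffff8801166b0000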

crash> ps -m ffff8801166b0000
[0 00:00:00.002] [UN]  PID: 1      TASK: ffff8801166b0000  CPU: 0   COMMAND: "systemd"

which has the following stack:

crash> bt ffff8801166b0000
PID: 1      TASK: ffff8801166b0000  CPU: 0   COMMAND: "systemd"
 #0 [ffff8801166bbb68] __schedule at ffffffff81627d69
 #1 [ffff8801166bbbd0] schedule at ffffffff81628340
 #2 [ffff8801166bbbe8] schedule_timeout at ffffffff81626609
 #3 [ffff8801166bbc98] wait_for_completion at ffffffff81627244
 #4 [ffff8801166bbce8] wait_rcu_gp at ffffffff81096674
 #5 [ffff8801166bbd40] synchronize_rcu at ffffffff81114859
 #6 [ffff8801166bbd50] mem_cgroup_start_move at ffffffff811bb2ac
 #7 [ffff8801166bbd60] mem_cgroup_reparent_charges at ffffffff811be930
 #8 [ffff8801166bbdd8] mem_cgroup_css_offline at ffffffff811bedef
 #9 [ffff8801166bbe00] cgroup_destroy_locked at ffffffff810e94b5
#10 [ffff8801166bbe48] cgroup_rmdir at ffffffff810e9755
#11 [ffff8801166bbe60] vfs_rmdir at ffffffff811d74d8
#12 [ffff8801166bbe88] do_rmdir at ffffffff811dc525
#13 [ffff8801166bbf70] sys_rmdir at ffffffff811dd6d6
#14 [ffff8801166bbf80] system_call_fastpath at ffffffff81632289
    RIP: 00007f473e54cc37  RSP: 00007ffeb07803a8  RFLAGS: 00000202
    RAX: 0000000000000054  RBX: ffffffff81632289  RCX: ffffffffffffffff
    RDX: 00007f473e81e7b8  RSI: 0000000000000000  RDI: 00007f4740ff31e0
    RBP: 00007f473fc9d710   R8: 0000000000000001   R9: 00007f4741115918
    R10: 0000000000000100  R11: 0000000000000202  R12: ffffffff811dd6d6
    R13: ffff8801166bbf78  R14: 0000000000000001  R15: 00007f4740ff46e0
    ORIG_RAX: 0000000000000054  CS: 0033  SS: 002b

Systemd is doing an rmdir to destroy a cgroup. To destroy structures that contain RCU-protected data, it has to wait for an RCU grace period. But the grace period is either not happening or is being delayed.

By checking the runqueues we can see whether any RCU or timer softirq thread is being delayed, which could explain the delayed RCU grace period:

crash> runq
CPU 0 RUNQUEUE: ffff88011fc18800
  CURRENT: PID: 7030   TASK: ffff8800d4970000  COMMAND: "bash"
  RT PRIO_ARRAY: ffff88011fc188e0
     [no tasks queued]
  CFS RB_ROOT: ffff88011fc188a0
     [120] PID: 7026   TASK: ffff8800d4975240  COMMAND: "sshd"

CPU 1 RUNQUEUE: ffff88011fd18800
  CURRENT: PID: 0      TASK: ffff880116772340  COMMAND: "swapper/1"
  RT PRIO_ARRAY: ffff88011fd188e0
     [no tasks queued]
  CFS RB_ROOT: ffff88011fd188a0
     [no tasks queued]

As there are no -rt tasks queued, there is no starvation, so this is not an RCU stall caused by starvation.

Here is the ps -m output for runnable (RU) tasks:
crash> foreach RU ps -m
[0 00:00:00.000] [RU]  PID: 7030   TASK: ffff8800d4970000  CPU: 0   COMMAND: "bash"
[0 00:00:00.000] [RU]  PID: 7026   TASK: ffff8800d4975240  CPU: 0   COMMAND: "sshd"
[0 00:07:16.254] [RU]  PID: 0      TASK: ffffffff81936440  CPU: 0   COMMAND: "swapper/0"
[0 00:07:16.254] [RU]  PID: 0      TASK: ffff880116772340  CPU: 1   COMMAND: "swapper/1"

They have only just started running, so the system was probably idle before that. These tasks are running because Radek ran echo c > /proc/sysrq-trigger...
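
A further check, not done against this vmcore, would be to look at the RCU grace-period kthreads directly and see where (or whether) they are blocked, roughly:

crash> ps | grep rcu
crash> bt <PID of rcu_preempt or rcu_sched, depending on config>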

This problem looks similar to BZ1376184.

Comment 8 Daniel Bristot de Oliveira 2016-10-31 14:51:26 UTC
Radek,

Could you please add the following parameters to the kernel command line:

"trace_buf_size=10M trace_event=rcu"

And then boot the reproducer kernel and crash it as you did before? The trace data is important for understanding the problem.
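
(One way to add them, assuming grub2/grubby on the guest; adjust the path to the reproducer kernel:)

grubby --update-kernel=/boot/vmlinuz-3.10.0-327.36.1.rt56.237.el7.x86_64 \
       --args="trace_buf_size=10M trace_event=rcu"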

Moreover, could you please try booting with the RHEL-RT-7.3 beta kernel to see if the same problem happens?

Best Regards,
Daniel

Comment 9 Radek Bíba 2016-10-31 15:46:38 UTC
Sure, no problem. I tried booting 3.10.0-514.rt56.420.el7.x86_64 (is that what you meant by the beta kernel?) and it was still fine after more than 15 minutes. That looks promising, doesn't it? Then I booted 3.10.0-514.rt56.420.el7 again, with the additional parameters, and it got sick soon, as usual. Here are the vmcore and vmcore-dmesg.txt files: http://nest.test.redhat.com/mnt/qa/scratch/rbiba/kernel-rt-bz138766.2/ . I'll now boot the beta kernel again and keep an eye on it.

Thanks for looking into this, Daniel!

Comment 10 Radek Bíba 2016-11-01 07:51:28 UTC
I let the system run overnight, with an SSH session and a connection to the console, and here's what I see this morning:

1 - The SSH session is still usable, but no new session can be created:

$ ssh root@HOSTNAME
ssh_exchange_identification: Connection closed by remote host


2 - The console is clear; no traceback appeared. But I can't log in:

Red Hat Enterprise Linux Server 7.2 (Maipo)
Kernel 3.10.0-514.rt56.420.el7.x86_64 on an x86_64

x86-64-7s-v1 login: root
Password: 
login: timed out after 60 seconds


3 - `top' in the SSH session tells me:

top - 03:35:35 up 15:29,  3 users,  load average: 4.00, 4.01, 4.05
Tasks: 3326 total,   1 running, 795 sleeping,   0 stopped, 2530 zombie
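
For reference, one quick way to list the zombies together with their parent PIDs (a sketch using standard ps/awk, nothing specific to this system):

ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/'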

Here are vmcore files from the system in this state in case you need them: http://nest.test.redhat.com/mnt/qa/scratch/rbiba/kernel-rt-bz138766.3/ . This kernel was booted without the extra parameters, though.

I need to make the system available again because it's a semi-production system, so I'm booting a non-RT kernel, but please let me know if you'd like me to boot the RT kernel again and keep it up for a little longer and check something.

Comment 11 Steven Rostedt 2016-12-07 23:37:27 UTC
Radek,

can you reproduce the issue again but this time add to the kernel command line:

"trace_buf_size=10M trace_event=*:*"

This will enable all events.

And try to crash it as soon as you notice that it becomes unresponsive.

Thanks!

Comment 12 Radek Bíba 2016-12-08 09:36:34 UTC
The system is now running the latest released RT kernel (released just the other day) and I can't reproduce the issue. It's been running smoothly for almost an hour.

# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-514.2.2.rt56.424.el7.x86_64 root=UUID=d1e30e68-dc2c-45db-871c-4c7819f41332 ro crashkernel=auto console=ttyS0,115200 trace_buf_size=10M trace_event=*:* LANG=en_US.UTF-8

Were there any changes in that kernel that might have resolved the issue?

Comment 13 Clark Williams 2016-12-08 15:28:20 UTC
(In reply to Radek Bíba from comment #12)
> The system is now running the latest released RT kernel (released just the
> other day) and I can't reproduce the issue. It's been running smoothly for
> almost an hour.
> 
> # cat /proc/cmdline 
> BOOT_IMAGE=/vmlinuz-3.10.0-514.2.2.rt56.424.el7.x86_64
> root=UUID=d1e30e68-dc2c-45db-871c-4c7819f41332 ro crashkernel=auto
> console=ttyS0,115200 trace_buf_size=10M trace_event=*:* LANG=en_US.UTF-8
> 
> Were there any changes in that kernel that might have resolved the issue?

Hard to pin down the changes, since the failing kernel is a 7.2.z kernel and the one you're running now is a 7.3.z kernel. Git shows over 18k commits of difference between them.
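
(For anyone curious, something like the following against the kernel-rt git tree would give that count; the tag names are placeholders, not the exact internal ones:)

git log --oneline <7.2.z-rt-tag>..<7.3.z-rt-tag> | wc -l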

Comment 14 Beth Uptagrafft 2017-01-10 17:56:49 UTC
In talking with Clark and the RT team, we are unsure how to move forward without a way to reproduce and further debug this issue.  Radek, are you okay with closing this BZ?

Comment 15 Radek Bíba 2017-01-11 06:54:29 UTC
Yeah, let's close this bug.

