Bug 1547623 - A few tasks blocked waiting on the rescuer task assigned to the workqueue "close_wq"
Summary: A few tasks blocked waiting on the rescuer task assigned to the workqueue "close_wq"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 2.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Daniel Bristot de Oliveira
QA Contact: Jiri Kastner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-21 16:11 UTC by Aaron Tomlin
Modified: 2021-09-09 13:15 UTC
CC List: 7 users

Fixed In Version: kernel-rt-3.10.0-693.30.1.rt56.617.el6rt
Doc Type: Bug Fix
Doc Text:
Previously, an attempt to join a multicast group without first ensuring that the CMA device exists could crash the system. With this fix, the kernel verifies that the device exists before joining the multicast group, and returns an error if the device does not exist.
Clone Of:
Environment:
Last Closed: 2018-05-29 17:16:39 UTC
Target Upstream Version:
Embargoed:


Links:
Red Hat Product Errata RHBA-2018:1752 (Last Updated: 2018-05-29 17:16:52 UTC)

Comment 3 Daniel Bristot de Oliveira 2018-02-21 17:48:22 UTC
Command to access the crash:

$ crash /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-514.rt56.231.el6rt.x86_64/vmlinux /cores/retrace/tasks/909019666/crash/vmcore

Comment 6 Daniel Bristot de Oliveira 2018-03-01 17:04:30 UTC
Looking at the dmesg I found that there was a BUG related to the rdma
modules:

[7329306.591162] BUG: unable to handle kernel NULL pointer dereference at 0000000000000140             <------------- BUG!
[7329306.591166] IP: [<ffffffffa0584ac6>] rdma_join_multicast+0x106/0x190 [rdma_cm]
[7329306.591167] PGD 1fe52fd067 PUD 1fe5e62067 PMD 0 
[7329306.591168] Oops: 0000 [#1] PREEMPT SMP 
[removed loaded modules line...]
[7329306.591190] CPU: 9 PID: 7694 Comm: RiskEngine Not tainted 3.10.0-514.rt56.231.el6rt.x86_64 #1
[7329306.591191] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/25/2017
[7329306.591191] task: ffff881fe6444d10 ti: ffff881fe6634000 task.ti: ffff881fe6634000
[7329306.591194] RIP: 0010:[<ffffffffa0584ac6>]  [<ffffffffa0584ac6>] rdma_join_multicast+0x106/0x190 [rdma_cm]       <--- at this point, this task should own both mutexes.
[7329306.591194] RSP: 0018:ffff881fe6637d18  EFLAGS: 00010246
[7329306.591194] RAX: 0000000000000000 RBX: ffff881d4e502000 RCX: 0000000000000000
[7329306.591195] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[7329306.591195] RBP: ffff881fe6637d58 R08: ffff881d4e502200 R09: ffffffffa0584a46
[7329306.591195] R10: 0000000000000000 R11: 0000000000000293 R12: ffff881d4e502218
[7329306.591196] R13: ffff881fe30cce40 R14: ffff881fe30cce50 R15: ffff881fe30ccd80
[7329306.591196] FS:  00007f70a1008700(0000) GS:ffff88207fcc0000(0000) knlGS:0000000000000000
[7329306.591197] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7329306.591197] CR2: 0000000000000140 CR3: 0000001a177fe000 CR4: 00000000001407e0
[7329306.591198] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7329306.591199] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[7329306.591199] Stack:
[7329306.591200]  ffff881fe5907180 ffff881fe30ccd80 ffff881fe5907180 ffff881fe6637dd8
[7329306.591201]  ffff881fe5907c00 ffff881fe30ccd80 ffff881fe5907180 ffff881fe30ccdb0
[7329306.591202]  ffff881fe6637dc8 ffffffffa051667f 0000000000000000 0000000000000000
[7329306.591202] Call Trace:
[7329306.591205]  [<ffffffffa051667f>] ucma_process_join+0xff/0x1d0 [rdma_ucm]
[7329306.591207]  [<ffffffffa05167b2>] ucma_join_multicast+0x62/0x90 [rdma_ucm]
[7329306.591208]  [<ffffffffa0514678>] ucma_write+0xf8/0x130 [rdma_ucm]
[7329306.591212]  [<ffffffff811caebd>] ? rw_verify_area+0x5d/0xe0
[7329306.591214]  [<ffffffff811cb008>] vfs_write+0xc8/0x170
[7329306.591215]  [<ffffffff811cbb30>] SyS_write+0x80/0xe0
[7329306.591219]  [<ffffffff8163a9b2>] system_call_fastpath+0x16/0x1b
[7329306.591227] Code: 93 00 02 00 00 48 8d b3 00 02 00 00 4c 89 f7 e8 21 33 d7 e0 4c 89 e7 e8 79 da 0a e1 e8 c4 20 b2 e0 48 8b 13 0f b6 83 b8 01 00 00 <48> 8b 92 40 01 00 00 48 c1 e0 04 8b 54 10 08 f7 c2 00 00 a0 00 
[7329306.591229] RIP  [<ffffffffa0584ac6>] rdma_join_multicast+0x106/0x190 [rdma_cm]
[7329306.591229]  RSP <ffff881fe6637d18>
[7329306.591229] CR2: 0000000000000140
[7329310.561864] ---[ end trace 0000000000000002 ]---
[7329313.582427] bond0: link status up for interface ib0, enabling it in 100 ms
[7329313.682425] bond0: link status definitely up for interface ib0, 0 Mbps full duplex
[8444946.635030] bridge: automatic filtering via arp/ip/ip6tables has been deprecated. Update your scripts to load br_netfilter if you need this.

Then the sysrq-c trigger:

[8445057.149874] SysRq : Trigger a crash
[8445057.149889] BUG: unable to handle kernel NULL pointer dereference at           (null)
[8445057.149894] IP: [<ffffffff813b135b>] sysrq_handle_crash+0x1b/0x30
[8445057.149895] PGD 13fbb94067 PUD 1fe48ab067 PMD 0 
[8445057.149896] Oops: 0002 [#2] PREEMPT SMP 
[8445057.149919] CPU: 3 PID: 19693 Comm: bash Tainted: G      D        ------------   3.10.0-514.rt56.231.el6rt.x86_64 #1
[8445057.149920] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/25/2017
[8445057.149921] task: ffff8801488799b0 ti: ffff880d98c44000 task.ti: ffff880d98c44000
[8445057.149922] RIP: 0010:[<ffffffff813b135b>]  [<ffffffff813b135b>] sysrq_handle_crash+0x1b/0x30
[8445057.149923] RSP: 0018:ffff880d98c47e68  EFLAGS: 00010296
[8445057.149923] RAX: 0000000000000000 RBX: 0000000000000063 RCX: 0000000000000000
[8445057.149923] RDX: 0000000000000001 RSI: 0000000000000009 RDI: 0000000000000063
[8445057.149924] RBP: ffff880d98c47e68 R08: fff7a87ffccdf01a R09: ffffffff81bc56af
[8445057.149924] R10: 0000000000000000 R11: 0000000000020000 R12: ffffffff819dedc0
[8445057.149924] R13: 000000000000000f R14: 0000000000000000 R15: ffff880d98c47f40
[8445057.149925] FS:  00007f5e10e01700(0000) GS:ffff88207fc00000(0000) knlGS:0000000000000000
[8445057.149925] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8445057.149926] CR2: 0000000000000000 CR3: 0000001fe4fe9000 CR4: 00000000001407e0
[8445057.149927] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[8445057.149927] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[8445057.149927] Stack:
[8445057.149928]  ffff880d98c47e98 ffffffff813b14b9 0000000000000002 fffffffffffffffb
[8445057.149929]  ffff880d98c47f40 0000000000000002 ffff880d98c47eb8 ffffffff813b155d
[8445057.149930]  00007f5e10e09000 ffff881ff1a44c00 ffff880d98c47ed8 ffffffff8123adf3
[8445057.149930] Call Trace:
[8445057.149933]  [<ffffffff813b14b9>] __handle_sysrq+0x129/0x190
[8445057.149934]  [<ffffffff813b155d>] write_sysrq_trigger+0x3d/0x40
[8445057.149937]  [<ffffffff8123adf3>] proc_reg_write+0x43/0x70
[8445057.149940]  [<ffffffff811cb008>] vfs_write+0xc8/0x170
[8445057.149942]  [<ffffffff811cbb30>] SyS_write+0x80/0xe0
[8445057.149945]  [<ffffffff8163a9b2>] system_call_fastpath+0x16/0x1b
[8445057.149954] Code: c7 00 1e 00 00 e8 86 3e db ff c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 e8 42 10 ce ff c7 05 48 26 80 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 
[8445057.149955] RIP  [<ffffffff813b135b>] sysrq_handle_crash+0x1b/0x30
[8445057.149955]  RSP <ffff880d98c47e68>
[8445057.149955] CR2: 0000000000000000

The time between the BUG and the sysrq is:
  8444946 − 7329306 = 1115640 seconds, which is 12 days and 21:54:00.
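
A quick shell check of that arithmetic, for reference:

$ s=$((8444946 - 7329306))
$ echo "$s seconds = $((s / 86400)) days, $((s % 86400 / 3600))h $((s % 3600 / 60))m"
1115640 seconds = 12 days, 21h 54m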

That is more or less the time that the tasks have been waiting:

[ 7 03:48:33.369] [UN]  PID: 28131  TASK: ffff880ff57b4d10  CPU: 4   COMMAND: "RiskGateway"
[12 21:55:45.303] [UN]  PID: 32558  TASK: ffff881e57104d10  CPU: 3   COMMAND: "Dispatcher"
[12 21:55:50.558] [UN]  PID: 19686  TASK: ffff88176f478000  CPU: 3   COMMAND: "kworker/u24:0"
[12 21:55:50.559] [UN]  PID: 7671   TASK: ffff881ff5254d10  CPU: 10  COMMAND: "tr.finalize"

So, my first guess here is that the locking problem is a side effect
of the BUG, mainly because this is the stack of the BUG:

[7329306.591190] CPU: 9 PID: 7694 Comm: RiskEngine Not tainted 3.10.0-514.rt56.231.el6rt.x86_64 #1
[7329306.591191] task: ffff881fe6444d10 ti: ffff881fe6634000 task.ti: ffff881fe6634000
[7329306.591194] RIP: 0010:[<ffffffffa0584ac6>]  [<ffffffffa0584ac6>] rdma_join_multicast+0x106/0x190 [rdma_cm]       <--- at this point, this task should own both mutexes.
....
[7329306.591200]  ffff881fe5907180 ffff881fe30ccd80 ffff881fe5907180 ffff881fe6637dd8
[7329306.591201]  ffff881fe5907c00 ffff881fe30ccd80 ffff881fe5907180 ffff881fe30ccdb0
[7329306.591202]  ffff881fe6637dc8 ffffffffa051667f 0000000000000000 0000000000000000
[7329306.591202] Call Trace:
[7329306.591205]  [<ffffffffa051667f>] ucma_process_join+0xff/0x1d0 [rdma_ucm]
[7329306.591207]  [<ffffffffa05167b2>] ucma_join_multicast+0x62/0x90 [rdma_ucm]
[7329306.591208]  [<ffffffffa0514678>] ucma_write+0xf8/0x130 [rdma_ucm]
[7329306.591212]  [<ffffffff811caebd>] ? rw_verify_area+0x5d/0xe0
[7329306.591214]  [<ffffffff811cb008>] vfs_write+0xc8/0x170
[7329306.591215]  [<ffffffff811cbb30>] SyS_write+0x80/0xe0
[7329306.591219]  [<ffffffff8163a9b2>] system_call_fastpath+0x16/0x1b
[7329306.591227] Code: 93 00 02 00 00 48 8d b3 00 02 00 00 4c 89 f7 e8 21 33 d7 e0 4c 89 e7 e8 79 da 0a e1 e8 c4 20 b2 e0 48 8b 13 0f b6 83 b8 01 00 00 <48> 8b 92 40 01 00 00 48 c1 e0 04 8b 54 10 08 f7 c2 00 00 a0 00 
[7329306.591229] RIP  [<ffffffffa0584ac6>] rdma_join_multicast+0x106/0x190 [rdma_cm]
[7329306.591229]  RSP <ffff881fe6637d18>
[7329306.591229] CR2: 0000000000000140
[7329310.561864] ---[ end trace 0000000000000002 ]---

This is the same "ucma_write" path seen in the other UN tasks. As the task that hit the BUG was holding at least one of the mutexes involved in the UN tasks (the one taken in the write path), my first step would be to try to understand why the BUG() took place. The problem is that the task that hit the BUG is:

crash> ps 7694
ps: invalid task or pid value: 7694

   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM

not present anymore. So it is hard to debug it :-(.

Aaron, do you know if they can reproduce this problem again with "panic on BUG" set?

-- Daniel

Comment 7 Alexey Vasyukov 2018-03-07 14:00:14 UTC
Hello Daniel,

I'm the person who (most probably) can reproduce the issue. It will take from several hours to a few days (procedural work), but it is possible. Which configuration should I apply to gather the necessary information? Sorry, but I'm not familiar with the "panic on BUG" option.

Alexey

Comment 8 Aaron Tomlin 2018-03-07 15:30:49 UTC
(In reply to Alexey Vasyukov from comment #7)

Alexey,
 
> I'm the person who (most probably) can reproduce the issue. It will take
> from several hours to few days (procedural work), but it is possible. Which
> configuration should I apply to gather necessary information? Sorry, but I'm
> not familiar with "panic on BUG" option.

The kernel parameter in question is 'panic_on_oops'.

You can turn this feature on at boot-time via 'oops=panic' (see
Documentation/kernel-parameters.txt [1] for more information).

I suspect that, prior to the 'SysRq c', the kernel parameter "panic_on_oops"
was not set. Note that sysrq_handle_crash() explicitly enables panic_on_oops
to force a panic in the event of a 'SysRq c'.

[1]: https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
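
For example, the same behaviour can also be enabled at runtime through the
standard sysctl interface (either form works):

$ sysctl -w kernel.panic_on_oops=1
    or
$ echo 1 > /proc/sys/kernel/panic_on_oops

and made persistent across reboots by adding "kernel.panic_on_oops = 1" to
/etc/sysctl.conf.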

Comment 9 Alexey Vasyukov 2018-03-07 15:51:07 UTC
Got it. I'll return as soon as I get a new dump. As far as I understand, with this option we expect the kernel to die and create the dump immediately on the initial BUG, without the need for a manual crash via sysrq-c later.

Comment 10 Aaron Tomlin 2018-03-07 16:10:41 UTC
(In reply to Daniel Bristot de Oliveira from comment #6)
> Looking at the dmesg I found that there was a BUG related to the rdma
> modules:
> 
> [7329306.591162] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000140             <------------- BUG!
> [7329306.591166] IP: [<ffffffffa0584ac6>] rdma_join_multicast+0x106/0x190

I see.

Task "RiskEngine" (PID 7694) would have been killed i.e. no_context() ->
oops_end(sig == SIGKILL) -> do_exit(). Therefore the UCMA file mutex
[acquired in ucma_process_join()] would not be released. In other words,
any task waiting to acquire the said mutex would be waiting indefinitely.
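
To illustrate the pattern (a simplified sketch, not the verbatim RHEL
source; names follow the upstream ucma code):

	/* ucma_process_join(), simplified */
	mutex_lock(&file->mut);
	...
	ret = rdma_join_multicast(ctx->cm_id, ...);  /* <-- oopsed here */
	...
	mutex_unlock(&file->mut);  /* never reached: oops_end() -> do_exit() */

Once do_exit() reaps the task, the mutex owner is gone for good and every
subsequent waiter blocks forever.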

According to the Oops in question, the exception RIP suggests
rdma_protocol_roce() was called with a NULL pointer i.e.
rdma_join_multicast() -> rdma_protocol_roce(id->device == NULL, ...)

static inline bool rdma_protocol_roce(const struct ib_device *device, ...)
{
        return device->port_immutable[port_num].core_cap_flags &
               ^^^^^^
                     id->device is NULL

          (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP);
}

crash> p &((struct ib_device *)0x0)->port_immutable
$1 = (struct ib_port_immutable **) 0x140
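
The same offset can be confirmed in plain C. This is only a stand-in layout
(the pad size mirrors the 0x140 offset of this kernel build; it is not the
real struct ib_device definition):

#include <stdio.h>
#include <stddef.h>

struct ib_device_layout {          /* stand-in for struct ib_device */
        char pad[0x140];           /* fields preceding port_immutable */
        void **port_immutable;
};

int main(void)
{
        /* Same trick as the crash command above: a member offset taken
         * through a NULL base equals the faulting address CR2 = 0x140. */
        printf("%#zx\n", offsetof(struct ib_device_layout, port_immutable));
        return 0;
}

which prints 0x140.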

We know an IB device object is allocated via ib_alloc_device()
and freed via ib_dealloc_device().

Comment 11 Aaron Tomlin 2018-03-07 16:12:02 UTC
(In reply to Alexey Vasyukov from comment #9)
> Got it. I'll return as soon as I get a new dump. As far as I understand,
> with this option we expect the kernel to die and create the dump on initial
> BUG immediately, without the need of manual crash using sysrq+c later.

Correct. This is indeed what we'd prefer. Thanks.

Comment 12 Daniel Bristot de Oliveira 2018-03-08 17:08:35 UTC
Hi Aaron and Alexey!

First of all, thanks for working on the BZ! :-)

I have just one point that might make things easier once we find the problem.

We do not have ELS in the MRG-RT kernel, so all fixes are applied only to the latest kernel-rt. It would therefore be nice to try to reproduce this problem on the most recent available kernel. This way, we can take advantage of two things:

1) we will have a better understanding of the problem in the kernel to which we will apply/develop the fix;
2) it might be the case that this problem was already fixed in the latest kernel via a fix in the RHEL kernel.

So, Alexey, is it possible to use the latest available kernel-rt in our system?

Thanks in advance.
-- Daniel

Comment 13 Alexey Vasyukov 2018-03-08 17:55:18 UTC
Hello Daniel,

> So, Alexey, is it possible to use the latest available kernel-rt in our system?

Yes, it's possible. I can update our test stand to any version required. 

But I'm a bit lost with versions in MRG Realtime channels. We are on 3.10.0-514.rt56.231.el6rt.x86_64 now. The package has "Build Date: 2017-07-24 19:29:55 UTC".

Should I update to 3.10.0-693.21.1.rt56.607.el6rt (Build Date: 2018-02-23 22:36:53 UTC)?

Alexey

Comment 14 Beth Uptagrafft 2018-03-08 18:10:35 UTC
(In reply to Alexey Vasyukov from comment #13)
> Hello Daniel,
> 
> > So, Alexey, is it possible to use the latest available kernel-rt in our system?
> 
> Yes, it's possible. I can update our test stand to any version required. 
> 
> But I'm a bit lost with versions in MRG Realtime channels. We are on
> 3.10.0-514.rt56.231.el6rt.x86_64 now. The package has "Build Date:
> 2017-07-24 19:29:55 UTC".
> 
> Should I update to 3.10.0-693.21.1.rt56.607.el6rt (Build Date: 2018-02-23
> 22:36:53 UTC)?
> 
> Alexey


Hi Alexey,
It's late for Daniel. Yes, you got it right! I would use 3.10.0-693.21.1.rt56.607 for your testing.
-Beth

Comment 15 Alexey Vasyukov 2018-03-13 18:44:31 UTC
Hello Team.

We are finally back. The kernel was updated on the test machines, and we reproduced the issue. A new vmcore was uploaded to the support ticket that was the reason for this BZ. Hopefully you have access to it.

As far as we can see from vmcore-dmesg, this vmcore is the correct one: the panic was caused by a kernel oops related to rdma_join_multicast.

If you need any additional information, let us know. We can now reproduce the issue relatively fast, and we are ready to do any reasonable experiments.

Alexey

Comment 22 Daniel Bristot de Oliveira 2018-04-05 11:57:57 UTC
Hi

We found that this problem is potentially fixed by a patch present in the upstream kernel (added last March 13, so a recent fix). To be sure it is the correct fix, could you please try the test kernel I attached to the support case?
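
For reference, the check that patch introduces is along these lines (a
sketch based on the Doc Text above, not the verbatim upstream commit): bail
out of the join path if the CMA device is already gone:

	if (!ctx->cm_id->device) {      /* device removed before the join */
		ret = -EINVAL;
		goto err;
	}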

I used that channel because the rpm is too large to attach here.

[ I added a note to the support case ]

Thanks in advance.
-- Daniel

Comment 23 Alexey Vasyukov 2018-04-05 14:03:32 UTC
Hello, Daniel.

Ok, we will retest asap and return with the results.

Best regards,
Alexey

Comment 27 Alexey Vasyukov 2018-04-11 03:50:11 UTC
Hello all.

Preliminary testing showed that, most probably, the issue is fixed. We need several more days to run the "worst case" tests. I'll return asap with the results.

Alexey

Comment 28 Daniel Bristot de Oliveira 2018-04-11 08:16:03 UTC
Hi Alexey!

Thanks for your feedback! Looking forward to the final results!

-- Daniel

Comment 29 Alexey Vasyukov 2018-04-18 13:18:44 UTC
Hi Daniel.

So, the results:

1. The simple test case looks to be fixed by this kernel.

Just for the sake of completeness of the BZ history, I'll copy the simple test case description from the support ticket:

> 1. IB testing procedure
> 
> It's really simple. We have:
>  - 2 servers
>  - 2 IB cards in each server
>  - IB switches in full mesh topology
>  - active-passive bonding on top on 2 IB cards
> 
> We just disable one currently active port (physically unplug the
> cable or disable the port on the switch using the 'ibportstate' tool)
> and wait for the bonding failover. Usually everything goes as expected,
> but sometimes we see these zombie processes after the tests. We can
> confirm that these zombies are the processes that were using
> Infiniband at the moment of failover. From the complete history of
> these tests, it looks like there is a small chance (<10%, but >1%)
> that the IB failover does not go cleanly, leaving zombies.

The case described above looks to be fixed.


2. We encountered a new issue when running the complete set of test cases. The IB setup is the same; all machines were updated to kernel-rt-3.10.0-693.21.1.rt56.607.BZ1547623.

Test scenario:
- start the load that uses IB multicast
- reboot Subnet Manager
- try to start any new process that wants to use IB multicast (the default mckey test tool is enough to reproduce the issue; see the example below)
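
For reference, a minimal mckey invocation for step 3 looks like this (mckey
ships with librdmacm-utils; the group address below is a placeholder, see
mckey(1) for the full option list):

$ mckey -m 239.255.0.1 -s     # sender: joins the group and sends
$ mckey -m 239.255.0.1        # receiver on another node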


Observed behaviour:
- The new process does not start, complaining that multicast is not working
- 'service rdma restart' fixes multicast, but with ~50% probability causes a kernel panic during service stop


Note: this new issue is definitely not reproducible with the 3.10.0-514.rt56.231.el6rt.x86_64 kernel.


The new core 01987193-1524055080-smtest-vmcore.tar.gz has been uploaded to support ticket 01987193; it is the dump created during the 'service rdma restart' attempt.


Please let me know if we should open a new support case and/or BZ for this.


Best regards,
Alexey

Comment 31 Aaron Tomlin 2018-04-23 12:44:04 UTC
(In reply to Alexey Vasyukov from comment #29)

Alexey,

> 1. Simple test case looks to be fixed by this kernel.

Thank you for getting back to us with your findings.

> 2. We encounter a new issue when running complete set of test cases. IB
> setup is the same, all machines are updated to
> kernel-rt-3.10.0-693.21.1.rt56.607.BZ1547623.

This appears to be unrelated to the originally reported issue.

Please open a new support case to avoid confusion.

Kind regards,
Aaron

Comment 45 errata-xmlrpc 2018-05-29 17:16:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1752

