Bug 1905747
| Summary: | kernel-rt-debug: WARNING: possible circular locking dependency detected (&serv->sv_lock -> (softirq_ctrl.lock).lock) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Chunyu Hu <chuhu> |
| Component: | kernel-rt | Assignee: | Wander <wcosta> |
| kernel-rt sub component: | Locking | QA Contact: | Chunyu Hu <chuhu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | ||
| Priority: | unspecified | CC: | bhu, jlelli, kcarcia, lgoncalv, llong, mstowell, wcosta, williams |
| Version: | 8.4 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.4 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | kernel-rt-4.18.0-312.rt7.93.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-11-09 17:28:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1954361, 1975917 | ||
| Bug Blocks: | 1969483 | ||
What happens on CPU 0:

* svc_close_list acquires sv_lock (https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/blob/8.4-rt/net/sunrpc/svc_xprt.c#L1064) and calls svc_xprt_do_enqueue via svc_xprt_enqueue
* svc_xprt_do_enqueue acquires sp_lock by calling spin_lock_bh (https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/blob/8.4-rt/net/sunrpc/svc_xprt.c#L429)
* spin_lock_bh eventually calls __local_bh_disable_ip
* __local_bh_disable_ip acquires softirq_ctrl.lock (https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/blob/8.4-rt/kernel/softirq.c#L166)
* spin_lock_bh then calls rt_spin_lock to acquire the lock passed to it

In this scenario, the thread on CPU 0 acquires locks in the following order:

* sv_lock
* softirq_ctrl.lock
* sp_lock

On CPU 1, svc_sock_update_bufs calls spin_lock_bh on sv_lock (https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/blob/8.4-rt/net/sunrpc/svcsock.c#L1205), so it acquires softirq_ctrl.lock and then sv_lock, resulting in the following locking order:

* softirq_ctrl.lock
* sv_lock

This is the opposite of the order on CPU 0, which triggers the warning. However, softirq_ctrl.lock is a per-CPU lock, so in this case it is a false positive. The remaining question (at least for me) is whether the code enforces the correct behavior when the kernel schedules both threads on the same CPU.

*** Bug 1928777 has been marked as a duplicate of this bug. ***

Waiman, Juri, and I were talking about this bug, and we wonder whether you see any problem with backporting BZ1954361 to 8.4 stock. I can do it if you agree.

(In reply to Wander from comment #9)
> Waiman, Juri, and I were talking about this bug, and we wonder whether you
> see any problem with backporting BZ1954361 to 8.4 stock. I can do it if
> you agree.

I suppose you want to backport the lockdep part of the 8.5 patchset, right? I don't have any problem with that. Go to the 8.4 BZ, set the z-stream? flag, and post a comment on which patches should be backported.

-Longman

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4140
Description of problem:

In CTC1 of 8.4.0, QE was running the debug kernel with the WARNING check enabled through restraintd, and found a warning with the RT debug kernel:

```
[ 2173.950271] NFSD: no clients to reclaim, skipping NFSv4 grace period (net f00000d8)
[ 2205.250044] mount.nfs (111124) used greatest stack depth: 19128 bytes left
[ 2282.352127] ======================================================
[ 2282.352129] WARNING: possible circular locking dependency detected
[ 2282.352135] 4.18.0-259.rt7.24.el8.x86_64+debug #1 Tainted: G OE --------- - -
[ 2282.352136] ------------------------------------------------------
[ 2282.352140] nfsd/111048 is trying to acquire lock:
[ 2282.352142] ffff88901dc21620 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip+0x2a5/0x800
[ 2282.352162] but task is already holding lock:
[ 2282.352164] ffff888f66de80b0 (&serv->sv_lock){+.+.}-{0:0}, at: svc_close_list+0x21/0x100 [sunrpc]
[ 2282.352247] which lock already depends on the new lock.
[ 2282.352249] the existing dependency chain (in reverse order) is:
[ 2282.352250] -> #1 (&serv->sv_lock){+.+.}-{0:0}:
[ 2282.352262]        lock_acquire+0x185/0xa30
[ 2282.352268]        rt_spin_lock+0x2b/0x120
[ 2282.352333]        svc_sock_update_bufs+0x32/0xc0 [sunrpc]
[ 2282.352394]        svc_start_kthreads+0x205/0x530 [sunrpc]
[ 2282.352461]        nfs_callback_up+0x496/0x1030 [nfsv4]
[ 2282.352520]        nfs4_init_client+0x347/0x570 [nfsv4]
[ 2282.352577]        nfs4_set_client+0x258/0x400 [nfsv4]
[ 2282.352633]        nfs4_create_server+0x5c1/0xd60 [nfsv4]
[ 2282.352687]        nfs4_try_get_tree+0xd3/0x200 [nfsv4]
[ 2282.352693]        vfs_get_tree+0x89/0x3c0
[ 2282.352699]        do_mount+0xe33/0x15b0
[ 2282.352704]        ksys_mount+0xb6/0xd0
[ 2282.352709]        __x64_sys_mount+0xba/0x150
[ 2282.352715]        do_syscall_64+0x13c/0x670
[ 2282.352720]        entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[ 2282.352721] -> #0 ((softirq_ctrl.lock).lock){+.+.}-{2:2}:
[ 2282.352732]        check_prevs_add+0x3d0/0x1690
[ 2282.352736]        __lock_acquire+0x2254/0x2d70
[ 2282.352739]        lock_acquire+0x185/0xa30
[ 2282.352743]        rt_spin_lock+0x2b/0x120
[ 2282.352747]        __local_bh_disable_ip+0x2a5/0x800
[ 2282.352816]        svc_xprt_do_enqueue+0x147/0xbe0 [sunrpc]
[ 2282.352881]        svc_close_list+0xbd/0x100 [sunrpc]
[ 2282.352944]        svc_close_net+0x6a/0x4b0 [sunrpc]
[ 2282.353005]        svc_shutdown_net+0x14/0x80 [sunrpc]
[ 2282.353045]        nfsd_destroy+0x1bb/0x300 [nfsd]
[ 2282.353083]        nfsd+0x343/0x400 [nfsd]
[ 2282.353089]        kthread+0x3ce/0x4b0
[ 2282.353093]        ret_from_fork+0x3a/0x50
[ 2282.353095] other info that might help us debug this:
[ 2282.353097] Possible unsafe locking scenario:
[ 2282.353098]        CPU0                    CPU1
[ 2282.353099]        ----                    ----
[ 2282.353100]   lock(&serv->sv_lock);
[ 2282.353104]                               lock((softirq_ctrl.lock).lock);
[ 2282.353107]                               lock(&serv->sv_lock);
[ 2282.353110]   lock((softirq_ctrl.lock).lock);
[ 2282.353113] *** DEADLOCK ***
[ 2282.353117] 3 locks held by nfsd/111048:
[ 2282.353118] #0: ffffffffc1e7d758 (nfsd_mutex){+.+.}-{3:3}, at: nfsd+0x2da/0x400 [nfsd]
[ 2282.353162] #1: ffff888f66de80b0 (&serv->sv_lock){+.+.}-{0:0}, at: svc_close_list+0x21/0x100 [sunrpc]
[ 2282.353231] #2: ffffffffb34d22c0 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x7d/0x120
[ 2282.353241] stack backtrace:
[ 2282.353250] CPU: 56 PID: 111048 Comm: nfsd Kdump: loaded Tainted: G OE --------- - - 4.18.0-259.rt7.24.el8.x86_64+debug #1
[ 2282.353253] Hardware name: Lenovo ThinkSystem SR630 -[7X02CTO1WW]-/-[7X02CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[ 2282.353255] Call Trace:
[ 2282.353265]  dump_stack+0x8e/0xd0
[ 2282.353275]  check_noncircular+0x30f/0x3c0
[ 2282.353284]  ? print_circular_bug+0x360/0x360
[ 2282.353303]  check_prevs_add+0x3d0/0x1690
[ 2282.353316]  ? check_irq_usage+0xa60/0xa60
[ 2282.353326]  ? sched_clock+0x5/0x10
[ 2282.353331]  ? sched_clock_cpu+0x238/0x340
[ 2282.353339]  __lock_acquire+0x2254/0x2d70
[ 2282.353356]  lock_acquire+0x185/0xa30
[ 2282.353362]  ? __local_bh_disable_ip+0x2a5/0x800
[ 2282.353370]  ? rcu_read_unlock+0x50/0x50
[ 2282.353376]  ? lock_acquire+0x185/0xa30
[ 2282.353381]  ? rt_spin_lock+0x7d/0x120
[ 2282.353390]  rt_spin_lock+0x2b/0x120
[ 2282.353395]  ? __local_bh_disable_ip+0x2a5/0x800
[ 2282.353401]  __local_bh_disable_ip+0x2a5/0x800
[ 2282.353466]  ? svc_xprt_do_enqueue+0xff/0xbe0 [sunrpc]
[ 2282.353532]  svc_xprt_do_enqueue+0x147/0xbe0 [sunrpc]
[ 2282.353599]  svc_close_list+0xbd/0x100 [sunrpc]
[ 2282.353664]  svc_close_net+0x6a/0x4b0 [sunrpc]
[ 2282.353735]  svc_shutdown_net+0x14/0x80 [sunrpc]
[ 2282.353776]  nfsd_destroy+0x1bb/0x300 [nfsd]
[ 2282.353817]  nfsd+0x343/0x400 [nfsd]
[ 2282.353860]  ? nfsd_destroy+0x300/0x300 [nfsd]
[ 2282.353866]  kthread+0x3ce/0x4b0
[ 2282.353873]  ? kthread_delayed_work_timer_fn+0x240/0x240
[ 2282.353880]  ret_from_fork+0x3a/0x50
[ 2282.385049] nfsd: last server has exited, flushing export cache
```

Version-Release number of selected component (if applicable):
4.18.0-259.rt7.24.el8.x86_64+debug

How reproducible:
Once so far.

Steps to Reproduce:
1. Run /kernel/errata/Regression with kernel-rt-debug; possibly it is the Test-bz371551 subtest. https://beaker.engineering.redhat.com/recipes/9161108#task118848142,task118848143
2. Check the console log.

Actual results:
A lockdep warning involving the sunrpc and softirq locks.

Expected results:
No lockdep warning.

Additional info:
Did not hit this with the stock kernel.