Bug 2008541
| Summary: | gfs2: schedule while atomic | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Alexander Aring <aahringo> |
| Component: | kernel | Assignee: | Andreas Gruenbacher <agruenba> |
| kernel sub component: | GFS-GFS2 | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | adas, agruenba, gfs2-maint |
| Version: | 9.0 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | kernel-5.14.0-59.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-05-17 15:40:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Is https://listman.redhat.com/archives/cluster-devel/2021-September/msg00082.html a possible solution for it?

Alex, I'm having difficulties following your problem description and what the actual call stack is; I don't see thaw_glock there at all. In general, the glock code is pretty careful not to drop the final reference to a glock while holding rcu_read_lock or a spin lock; instead, whenever it needs to drop a reference that might be the final one in such a context, it delegates that to glock_work_func (see glock_work_func, gfs2_glock_queue_put, __gfs2_glock_queue_work). Do you have any more data?

Hmm, I see now that thaw_glock calls gfs2_glock_put when it hits a glock that doesn't need thawing. So when gfs2_control_func calls gfs2_glock_thaw, that gfs2_glock_put can indeed lead to the bug you describe; it's only the stack trace that's been confusing me. A quick fix is to call gfs2_glock_queue_put instead of gfs2_glock_put in thaw_glock. Taking glock references unnecessarily during glock_hash_walk has always been ugly, though, so maybe we should get rid of that instead.

Clearing needinfo; I think it's no longer necessary.

Regression tests passed with kernel-5.14.0-54.mr242_220204_1900.el9.x86_64

Regression tests passed with kernel-5.14.0-70.2.1.el9_0.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: kernel), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:3907
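A sketch of what the quick fix suggested above (replacing gfs2_glock_put with gfs2_glock_queue_put in thaw_glock) could look like. The surrounding code is paraphrased from the 5.14-era fs/gfs2/glock.c and the actual upstream patch may differ:

```c
static void thaw_glock(struct gfs2_glock *gl)
{
	if (!test_and_clear_bit(GLF_FROZEN, &gl->gl_flags)) {
		/* Previously gfs2_glock_put(), which on the final reference
		 * can reach dlm_unlock() and sleep while glock_hash_walk()
		 * holds rcu_read_lock().  gfs2_glock_queue_put() instead
		 * defers the reference drop to glock_work_func, which runs
		 * in a context that is allowed to sleep. */
		gfs2_glock_queue_put(gl);
		return;
	}
	/* ... thaw the glock as before ... */
}
```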
Description of problem:

In some cases there is a "schedule while atomic" when iterating over the glock hash table, which is done under rcu_read_lock(), and a dlm_unlock() call is made. The problem is that thaw_glock() calls sdp->sd_lockstruct.ls_ops->lm_put_lock(gl), which in the dlm case ends up in the gdlm_put_lock() callback; in some cases this finally calls dlm_unlock(), but dlm_unlock() cannot be called from atomic context because it takes semaphores, mutexes, etc.

How reproducible:

It's hard to reproduce, but it happens on unmount (because thaw_glock() is called then). With the right kernel settings you will get:

```
[ 993.426039] =============================
[ 993.426765] WARNING: suspicious RCU usage
[ 993.427522] 5.14.0-rc2+ #265 Tainted: G W
[ 993.428492] -----------------------------
[ 993.429237] include/linux/rcupdate.h:328 Illegal context switch in RCU read-side critical section!
[ 993.430860] other info that might help us debug this:
[ 993.432304] rcu_scheduler_active = 2, debug_locks = 1
[ 993.433493] 3 locks held by kworker/u32:2/194:
[ 993.434319]  #0: ffff888109c23148 ((wq_completion)gfs2_control){+.+.}-{0:0}, at: process_one_work+0x452/0xad0
[ 993.436135]  #1: ffff888109507e10 ((work_completion)(&(&sdp->sd_control_work)->work)){+.+.}-{0:0}, at: process_one_work+0x452/0xad0
[ 993.438081]  #2: ffffffff85ee05c0 (rcu_read_lock){....}-{1:2}, at: rhashtable_walk_start_check+0x0/0x520
[ 993.439665] stack backtrace:
[ 993.440402] CPU: 13 PID: 194 Comm: kworker/u32:2 Tainted: G W 5.14.0-rc2+ #265
[ 993.441786] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.14.0-1.module+el8.6.0+12648+6ede71a5 04/01/2014
[ 993.443304] Workqueue: gfs2_control gfs2_control_func
[ 993.444147] Call Trace:
[ 993.444565]  dump_stack_lvl+0x56/0x6f
[ 993.445186]  ___might_sleep+0x191/0x1e0
[ 993.445838]  down_read+0x7b/0x460
[ 993.446400]  ? down_write_killable+0x2b0/0x2b0
[ 993.447141]  ? find_held_lock+0xb3/0xd0
[ 993.447794]  ? do_raw_spin_unlock+0xa2/0x130
[ 993.448521]  dlm_unlock+0x9e/0x1a0
[ 993.449102]  ? dlm_lock+0x260/0x260
[ 993.449695]  ? pvclock_clocksource_read+0xdc/0x180
[ 993.450495]  ? kvm_clock_get_cycles+0x14/0x20
[ 993.451210]  ? ktime_get_with_offset+0xc6/0x170
[ 993.451971]  gdlm_put_lock+0x29e/0x2d0
[ 993.452599]  ? gfs2_cancel_delete_work+0x40/0x40
[ 993.453361]  glock_hash_walk+0x16c/0x180
[ 993.454014]  ? gfs2_glock_seq_stop+0x30/0x30
[ 993.454754]  process_one_work+0x55e/0xad0
[ 993.455443]  ? pwq_dec_nr_in_flight+0x110/0x110
[ 993.456219]  worker_thread+0x65/0x5e0
[ 993.456839]  ? process_one_work+0xad0/0xad0
[ 993.457524]  kthread+0x1ed/0x220
[ 993.458067]  ? set_kthread_struct+0x80/0x80
[ 993.458764]  ret_from_fork+0x22/0x30
[ 993.459426] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1352
[ 993.460816] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 194, name: kworker/u32:2
[ 993.462172] 3 locks held by kworker/u32:2/194:
[ 993.462916]  #0: ffff888109c23148 ((wq_completion)gfs2_control){+.+.}-{0:0}, at: process_one_work+0x452/0xad0
[ 993.464542]  #1: ffff888109507e10 ((work_completion)(&(&sdp->sd_control_work)->work)){+.+.}-{0:0}, at: process_one_work+0x452/0xad0
[ 993.466467]  #2: ffffffff85ee05c0 (rcu_read_lock){....}-{1:2}, at: rhashtable_walk_start_check+0x0/0x520
[ 993.468016] CPU: 13 PID: 194 Comm: kworker/u32:2 Tainted: G W 5.14.0-rc2+ #265
[ 993.469378] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.14.0-1.module+el8.6.0+12648+6ede71a5 04/01/2014
```