RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1905749 - kernel-rt-debug: BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:968
Summary: kernel-rt-debug: BUG: sleeping function called from invalid context at kernel...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel-rt
Version: 8.6
Hardware: x86_64
OS: Linux
unspecified
low
Target Milestone: rc
: 8.4
Assignee: Wander
QA Contact: Chunyu Hu
URL:
Whiteboard:
: 2052397 (view as bug list)
Depends On: 1954361 1975917
Blocks: 2047971
TreeView+ depends on / blocked
 
Reported: 2020-12-09 00:24 UTC by Chunyu Hu
Modified: 2022-06-28 14:42 UTC (History)
12 users (show)

Fixed In Version: kernel-4.18.0-369.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2047971 (view as bug list)
Environment:
Last Closed: 2022-05-10 14:41:26 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/rhel/src/kernel rhel-8 merge_requests 2091 0 None None None 2022-01-28 14:56:47 UTC
Red Hat Product Errata RHSA-2022:1975 0 None None None 2022-05-10 14:42:45 UTC

Description Chunyu Hu 2020-12-09 00:24:59 UTC
Description of problem:

QE ran CTC1 full testing, with kernel-rt debug, and since 8.4, enabled the WARNING check of restraintd. Found there's below info:

[  284.527619] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:968
[  284.527626] in_atomic(): 1, irqs_disabled(): 0, pid: 6705, name: mandb
[  284.527631] 4 locks held by mandb/6705:
[  284.527634]  #0: ffff88853f4f2eb0 (&f->f_pos_lock){+.+.}-{0:0}, at: __fdget_pos+0xaf/0xe0
[  284.527662]  #1: ffff8885eb814048 (&sb->s_type->i_mutex_key#13){++++}-{0:0}, at: xfs_ilock+0x17d/0x590 [xfs]
[  284.527838]  #2: ffffffffa24d22c0 (rcu_read_lock){....}-{1:2}, at: blk_add_trace_bio+0x0/0x2f0
[  284.527859]  #3: ffffffffa22177a0 (running_trace_lock){+.+.}-{2:2}, at: __blk_add_trace+0x9ea/0xdf0
[  284.527877] Preemption disabled at:
[  284.527886] [<ffffffff9f936743>] get_lock_stats+0x13/0x120
[  284.527897] CPU: 53 PID: 6705 Comm: mandb Kdump: loaded Not tainted 4.18.0-259.rt7.24.el8.x86_64+debug #1
[  284.527901] Hardware name: Lenovo ThinkSystem SR630 -[7X02CTO1WW]-/-[7X02CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[  284.527904] Call Trace:
[  284.527918]  dump_stack+0x8e/0xd0
[  284.527934]  ? get_lock_stats+0x13/0x120
[  284.527944]  ___might_sleep.cold.58+0x54/0x77
[  284.527964]  rt_spin_lock+0x3e/0x120
[  284.527969]  ? __blk_add_trace+0x9ea/0xdf0
[  284.527980]  __blk_add_trace+0x9ea/0xdf0
[  284.528025]  ? rcu_read_lock_sched_held+0xe0/0xe0
[  284.528048]  blk_add_trace_bio+0x25d/0x2f0
[  284.528074]  ? recalibrate_cpu_khz+0x10/0x10
[  284.528094]  ? blk_add_trace_bio_frontmerge+0x20/0x20
[  284.528107]  generic_make_request_checks+0x1471/0x1cc0
[  284.528143]  ? blk_update_request+0x1610/0x1610
[  284.528172]  ? iomap_readpages_actor+0x4f3/0x900
[  284.528208]  generic_make_request+0x8e/0xa40
[  284.528255]  ? direct_make_request+0x2b0/0x2b0
[  284.528260]  ? iomap_readpage_actor+0xd40/0xd40
[  284.528267]  ? iomap_readpages+0x20b/0x950
[  284.528277]  ? bad_range+0x237/0x410
[  284.528288]  ? trace_event_raw_event_iomap_apply+0x430/0x430
[  284.528329]  ? submit_bio+0xcb/0x440
[  284.528338]  submit_bio+0xcb/0x440
[  284.528358]  ? generic_make_request+0xa40/0xa40
[  284.528387]  ? rcu_read_lock_bh_held+0x100/0x100
[  284.528412]  ? iomap_readpage_actor+0xd40/0xd40
[  284.528422]  iomap_readpages+0x22f/0x950
[  284.528442]  ? iomap_writepage+0xa0/0xa0
[  284.528505]  read_pages+0xf8/0x600
[  284.528535]  ? __alloc_pages_nodemask+0x80d/0xa20
[  284.528554]  ? read_cache_pages+0x6d0/0x6d0
[  284.528560]  ? __alloc_pages_slowpath+0x2a10/0x2a10
[  284.528621]  __do_page_cache_readahead+0x311/0x3c0
[  284.528663]  ? read_pages+0x600/0x600
[  284.528712]  ondemand_readahead+0x443/0xc00
[  284.528717]  ? page_cache_sync_readahead+0x18a/0x3f0
[  284.528757]  generic_file_buffered_read+0xd65/0x1ee0
[  284.528850]  ? read_cache_page_gfp+0xa0/0xa0
[  284.529003]  ? xfs_ilock+0x17d/0x590 [xfs]
[  284.529009]  ? get_lock_stats+0x18/0x120
[  284.529018]  ? put_lock_stats.isra.19+0xb/0xa0
[  284.529045]  ? lock_contended+0xa70/0xa70
[  284.529069]  ? __down_read_trylock+0x53/0x80
[  284.529222]  ? xfs_file_buffered_aio_read+0x146/0x4f0 [xfs]
[  284.529381]  xfs_file_buffered_aio_read+0x151/0x4f0 [xfs]
[  284.529534]  xfs_file_read_iter+0x2bf/0x740 [xfs]
[  284.529545]  ? rt_mutex_slowlock_locked+0x262/0x790
[  284.529572]  new_sync_read+0x39c/0x560
[  284.529582]  ? mark_held_locks+0xb7/0x120
[  284.529594]  ? do_iter_readv_writev+0x6e0/0x6e0
[  284.529660]  ? fsnotify_first_mark+0x150/0x150
[  284.529717]  vfs_read+0x147/0x350
[  284.529745]  ksys_read+0xb8/0x170
[  284.529759]  ? kernel_write+0x130/0x130
[  284.529809]  do_syscall_64+0x13c/0x670
[  284.529830]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  284.529837] RIP: 0033:0x7fbade044045
[  284.529844] Code: fe ff ff 50 48 8d 3d fa c9 09 00 e8 05 fd 01 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 15 45 2d 00 8b 00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
[  284.529848] RSP: 002b:00007fffab1d75f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  284.529856] RAX: ffffffffffffffda RBX: 0000000040000000 RCX: 00007fbade044045
[  284.529860] RDX: 0000000000002000 RSI: 0000564fe832ca10 RDI: 0000000000000005
[  284.529863] RBP: 0000564fe832ca10 R08: 0000564fe81a88f0 R09: 0000000000000003
[  284.529867] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000002000
[  284.529871] R13: 0000564fe82eb060 R14: 00007fffab1d7634 R15: 0000000000001000
[  285.543383] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:968
[  285.543391] in_atomic(): 1, irqs_disabled(): 0, pid: 6802, name: mandb
[  285.543396] 4 locks held by mandb/6802:
[  285.543400]  #0: ffff888577ae3db0 (&f->f_pos_lock){+.+.}-{0:0}, at: __fdget_pos+0xaf/0xe0
[  285.543430]  #1: ffff8884b4b5b488 (&sb->s_type->i_mutex_key#13){++++}-{0:0}, at: xfs_ilock+0x17d/0x590 [xfs]
[  285.543628]  #2: ffffffffa24d22c0 (rcu_read_lock){....}-{1:2}, at: blk_add_trace_bio+0x0/0x2f0
[  285.543652]  #3: ffffffffa22177a0 (running_trace_lock){+.+.}-{2:2}, at: __blk_add_trace+0x9ea/0xdf0
[  285.543672] Preemption disabled at:
[  285.543684] [<ffffffff9f936743>] get_lock_stats+0x13/0x120
[  285.543697] CPU: 13 PID: 6802 Comm: mandb Kdump: loaded Tainted: G        W        --------- -  - 4.18.0-259.rt7.24.el8.x86_64+debug #1
[  285.543702] Hardware name: Lenovo ThinkSystem SR630 -[7X02CTO1WW]-/-[7X02CTO1WW]-, BIOS -[IVE136T-2.10]- 03/22/2019
[  285.543705] Call Trace:
[  285.543720]  dump_stack+0x8e/0xd0
[  285.543737]  ? get_lock_stats+0x13/0x120
[  285.543748]  ___might_sleep.cold.58+0x54/0x77
[  285.543770]  rt_spin_lock+0x3e/0x120
[  285.543776]  ? __blk_add_trace+0x9ea/0xdf0
[  285.543788]  __blk_add_trace+0x9ea/0xdf0
[  285.543839]  ? rcu_read_lock_sched_held+0xe0/0xe0
[  285.543864]  blk_add_trace_bio+0x25d/0x2f0
[  285.543893]  ? recalibrate_cpu_khz+0x10/0x10
[  285.543915]  ? blk_add_trace_bio_frontmerge+0x20/0x20
[  285.543930]  generic_make_request_checks+0x1471/0x1cc0
[  285.543969]  ? blk_update_request+0x1610/0x1610
[  285.544002]  ? iomap_readpages_actor+0x4f3/0x900
[  285.544041]  generic_make_request+0x8e/0xa40
[  285.544075]  ? direct_make_request+0x2b0/0x2b0
[  285.544080]  ? iomap_readpage_actor+0xd40/0xd40
[  285.544089]  ? iomap_readpages+0x20b/0x950
[  285.544099]  ? bad_range+0x237/0x410
[  285.544112]  ? trace_event_raw_event_iomap_apply+0x430/0x430
[  285.544157]  ? submit_bio+0xcb/0x440
[  285.544167]  submit_bio+0xcb/0x440
[  285.544190]  ? generic_make_request+0xa40/0xa40
[  285.544221]  ? rcu_read_lock_bh_held+0x100/0x100
[  285.544248]  ? iomap_readpage_actor+0xd40/0xd40
[  285.544260]  iomap_readpages+0x22f/0x950
[  285.544281]  ? iomap_writepage+0xa0/0xa0
[  285.544351]  read_pages+0xf8/0x600
[  285.544384]  ? __alloc_pages_nodemask+0x80d/0xa20
[  285.544405]  ? read_cache_pages+0x6d0/0x6d0
[  285.544411]  ? __alloc_pages_slowpath+0x2a10/0x2a10
[  285.544478]  __do_page_cache_readahead+0x311/0x3c0
[  285.544525]  ? read_pages+0x600/0x600
[  285.544578]  ondemand_readahead+0x443/0xc00
[  285.544584]  ? page_cache_sync_readahead+0x18a/0x3f0
[  285.544628]  generic_file_buffered_read+0xd65/0x1ee0
[  285.544730]  ? read_cache_page_gfp+0xa0/0xa0
[  285.544888]  ? xfs_ilock+0x17d/0x590 [xfs]
[  285.544895]  ? get_lock_stats+0x18/0x120
[  285.544905]  ? put_lock_stats.isra.19+0xb/0xa0
[  285.544934]  ? lock_contended+0xa70/0xa70
[  285.544961]  ? __down_read_trylock+0x53/0x80
[  285.545119]  ? xfs_file_buffered_aio_read+0x146/0x4f0 [xfs]
[  285.545294]  xfs_file_buffered_aio_read+0x151/0x4f0 [xfs]
[  285.545462]  xfs_file_read_iter+0x2bf/0x740 [xfs]
[  285.545475]  ? rt_mutex_slowlock_locked+0x262/0x790
[  285.545503]  new_sync_read+0x39c/0x560
[  285.545515]  ? mark_held_locks+0xb7/0x120
[  285.545528]  ? do_iter_readv_writev+0x6e0/0x6e0
[  285.545600]  ? fsnotify_first_mark+0x150/0x150
[  285.545663]  vfs_read+0x147/0x350
[  285.545693]  ksys_read+0xb8/0x170
[  285.545708]  ? kernel_write+0x130/0x130
[  285.545763]  do_syscall_64+0x13c/0x670
[  285.545786]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  285.545794] RIP: 0033:0x7fbade044045
[  285.545802] Code: fe ff ff 50 48 8d 3d fa c9 09 00 e8 05 fd 01 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 15 45 2d 00 8b 00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
[  285.545807] RSP: 002b:00007fffab1d75f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  285.545815] RAX: ffffffffffffffda RBX: 0000000040000000 RCX: 00007fbade044045
[  285.545820] RDX: 0000000000002000 RSI: 0000564fe832ca10 RDI: 0000000000000005
[  285.545824] RBP: 0000564fe832ca10 R08: 0000564fe81a88f0 R09: 0000000000000003
[  285.545828] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000002000
[  285.545833] R13: 0000564fe82eb060 R14: 00007fffab1d7634 R15: 0000000000001000


Version-Release number of selected component (if applicable):
4.18.0-259.rt7.24.el8.x86_64+debug

How reproducible:
once currently

Steps to Reproduce:
1. run /kernel/errata/Regression
possible subtest Test-bz502573 
https://beaker.engineering.redhat.com/recipes/9159156#task118831431,task118831439,task118831444,task118831449,task118831450
2.
3.

Actual results:
waning/call trace in the log

Expected results:
Fix such a BUG:.

Additional info:

Comment 1 Ming Lei 2020-12-09 04:30:03 UTC
Hello Chunyu,

It could be casued by acquring 'running_trace_lock' with rcu read lock held, which
disables preempt.

Can you test the following build?

     https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591095

And just the following change is applied:

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 4be52ba6bc61..093b7d2b4879 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -46,7 +46,7 @@ static struct trace_array *blk_tr;
 static bool blk_tracer_enabled __read_mostly;
 
 static LIST_HEAD(running_trace_list);
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
+static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(running_trace_lock);
 
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_BLK_OPT_CLASSIC  0x1

@@ -134,12 +134,12 @@ static void trace_note_tsk(struct task_struct *tsk)
        struct blk_trace *bt;
 
        tsk->btrace_seq = blktrace_seq;
-       spin_lock_irqsave(&running_trace_lock, flags);
+       raw_spin_lock_irqsave(&running_trace_lock, flags);
        list_for_each_entry(bt, &running_trace_list, running_list) {
                trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm,
                           sizeof(tsk->comm), NULL);
        }
-       spin_unlock_irqrestore(&running_trace_lock, flags);
+       raw_spin_unlock_irqrestore(&running_trace_lock, flags);
 }
 
 static void trace_note_time(struct blk_trace *bt)
@@ -687,9 +687,9 @@ static int __blk_trace_startstop(struct request_queue *q, int start)
                        blktrace_seq++;
                        smp_mb();
                        bt->trace_state = Blktrace_running;
-                       spin_lock_irq(&running_trace_lock);
+                       raw_spin_lock_irq(&running_trace_lock);
                        list_add(&bt->running_list, &running_trace_list);
-                       spin_unlock_irq(&running_trace_lock);
+                       raw_spin_unlock_irq(&running_trace_lock);
 
                        trace_note_time(bt);
                        ret = 0;
@@ -697,9 +697,9 @@ static int __blk_trace_startstop(struct request_queue *q, int start)
        } else {
                if (bt->trace_state == Blktrace_running) {
                        bt->trace_state = Blktrace_stopped;
-                       spin_lock_irq(&running_trace_lock);
+                       raw_spin_lock_irq(&running_trace_lock);
                        list_del_init(&bt->running_list);
-                       spin_unlock_irq(&running_trace_lock);
+                       raw_spin_unlock_irq(&running_trace_lock);
                        relay_flush(bt->rchan);
                        ret = 0;
                }

Comment 2 Chunyu Hu 2020-12-09 05:54:44 UTC
(In reply to Ming Lei from comment #1)
> Hello Chunyu,
> 
> It could be casued by acquring 'running_trace_lock' with rcu read lock held,
> which
> disables preempt.
> 
> Can you test the following build?
> 
>      https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591095
> 

The build looks like failed. And the issue I originally hit was with a kernel-rt-debug kernel.

Comment 3 Ming Lei 2020-12-09 06:16:07 UTC
(In reply to Chunyu Hu from comment #2)
> (In reply to Ming Lei from comment #1)
> > Hello Chunyu,
> > 
> > It could be casued by acquring 'running_trace_lock' with rcu read lock held,
> > which
> > disables preempt.
> > 
> > Can you test the following build?
> > 
> >      https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591095
> > 
> 
> The build looks like failed. And the issue I originally hit was with a
> kernel-rt-debug kernel.

oops, the build should have been the following one:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591645

But I don't know how to build kernel-rt yet, :-(

Comment 4 Chunyu Hu 2020-12-09 06:22:24 UTC
(In reply to Ming Lei from comment #3)
> (In reply to Chunyu Hu from comment #2)
> > (In reply to Ming Lei from comment #1)
> > > Hello Chunyu,
> > > 
> > > It could be casued by acquring 'running_trace_lock' with rcu read lock held,
> > > which
> > > disables preempt.
> > > 
> > > Can you test the following build?
> > > 
> > >      https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591095
> > > 
> > 
> > The build looks like failed. And the issue I originally hit was with a
> > kernel-rt-debug kernel.
> 
> oops, the build should have been the following one:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591645
> 
> But I don't know how to build kernel-rt yet, :-(

I have added Juri in the cc list, who works on kernel-rt. I'll submit a sched regression test with the build.

Comment 5 Juri Lelli 2020-12-09 07:21:28 UTC
(In reply to Ming Lei from comment #3)
> (In reply to Chunyu Hu from comment #2)
> > (In reply to Ming Lei from comment #1)
> > > Hello Chunyu,
> > > 
> > > It could be casued by acquring 'running_trace_lock' with rcu read lock held,
> > > which
> > > disables preempt.
> > > 
> > > Can you test the following build?
> > > 
> > >      https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591095
> > > 
> > 
> > The build looks like failed. And the issue I originally hit was with a
> > kernel-rt-debug kernel.
> 
> oops, the build should have been the following one:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33591645
> 
> But I don't know how to build kernel-rt yet, :-(

Same procedure as for kernel, but using kernel-rt repo/branch.

git://git.host.prod.eng.bos.redhat.com/kernel-rt.git rhel-rt-8.4.0

and then

make BUILDID=".<my_id>" rh-brew

Comment 6 Ming Lei 2020-12-09 08:08:43 UTC
(In reply to Juri Lelli from comment #5)
...
> 
> Same procedure as for kernel, but using kernel-rt repo/branch.
> 
> git://git.host.prod.eng.bos.redhat.com/kernel-rt.git rhel-rt-8.4.0
> 
> and then
> 
> make BUILDID=".<my_id>" rh-brew

Hi Chunyu,

Please test the following brew build for kernel-rt?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33597935


Hi Juri,

Thanks for sharing the kernel-rt build steps.

BTW, do you think if the following usage is a problem for PREEMPT_RT?

rcu_read_lock();                   //preempt is disabled
spin_lock_irqsave(lock, flags);    //convert to rt_spin_lock() which might sleep   
...
spin_lock_irqrestore(lock, flags);
rcu_read_unlock();

From the lockdep warning, looks the above is the reason.

I am not so sure is because I do see lots of such usage in block layer, :-)

Thanks,
Ming

Comment 7 Juri Lelli 2020-12-09 10:10:34 UTC
(In reply to Ming Lei from comment #6)
> (In reply to Juri Lelli from comment #5)
> ...
> > 
> > Same procedure as for kernel, but using kernel-rt repo/branch.
> > 
> > git://git.host.prod.eng.bos.redhat.com/kernel-rt.git rhel-rt-8.4.0
> > 
> > and then
> > 
> > make BUILDID=".<my_id>" rh-brew
> 
> Hi Chunyu,
> 
> Please test the following brew build for kernel-rt?
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33597935
> 
> 
> Hi Juri,
> 
> Thanks for sharing the kernel-rt build steps.
> 
> BTW, do you think if the following usage is a problem for PREEMPT_RT?
> 
> rcu_read_lock();                   //preempt is disabled
> spin_lock_irqsave(lock, flags);    //convert to rt_spin_lock() which might
> sleep   
> ...
> spin_lock_irqrestore(lock, flags);
> rcu_read_unlock();
> 
> From the lockdep warning, looks the above is the reason.

Indeed. With PREEMPT_RT spin_locks are converted to mutexes, so they can sleep.

> I am not so sure is because I do see lots of such usage in block layer, :-)

So, yeah. Patterns like the above are problematic. If you are so inclined, you
could have a look at current PREEMPT_RT upstream:

git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-5.10.y-rt

Thanks for looking into this BTW! :-)

Comment 8 Chunyu Hu 2020-12-10 02:36:15 UTC
(In reply to Ming Lei from comment #6)
> (In reply to Juri Lelli from comment #5)
> ...
> > 
> > Same procedure as for kernel, but using kernel-rt repo/branch.
> > 
> > git://git.host.prod.eng.bos.redhat.com/kernel-rt.git rhel-rt-8.4.0
> > 
> > and then
> > 
> > make BUILDID=".<my_id>" rh-brew
> 
> Hi Chunyu,
> 
> Please test the following brew build for kernel-rt?
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=33597935
> 
> 

I submitted two x86 jobs to the kernel-rt-debug, but one of them aborted during installation, the other one finished the test. And in the finished one, I don't see the issue in comment#0, though there are many other wanings on locking/lockdeps. 

https://beaker.engineering.redhat.com/recipes/9166750#task118905648,task118905653,task118905654,task118905657,task118905658,task118905659
subtest Test-bz502573 is good.

I submitted another one this morning for double check, if you think it's enough with the finished result. Feel free to go ahead for your progress. Thanks.

Comment 9 Chunyu Hu 2020-12-14 12:19:40 UTC
(In reply to Chunyu Hu from comment #8)

> 
> I submitted another one this morning for double check, if you think it's
> enough with the finished result. Feel free to go ahead for your progress.
> Thanks.

The other job is good too. The bio warning was not shown.

Comment 10 Ming Lei 2020-12-16 07:43:22 UTC
(In reply to Chunyu Hu from comment #9)
> (In reply to Chunyu Hu from comment #8)
> 
> > 
> > I submitted another one this morning for double check, if you think it's
> > enough with the finished result. Feel free to go ahead for your progress.
> > Thanks.
> 
> The other job is good too. The bio warning was not shown.

Hi Chunyu,

The patch shouldn't make a difference, because rcu_read_lock() doesn't disable preempt
in kernel-rt because CONFIG_PREEMPT_RCU is enabled.

I tried to reproduce the issue, but can't succeed in runing simple blktrace.

So could you share us your test steps?


Thanks,
Ming

Comment 11 Chunyu Hu 2020-12-16 08:17:06 UTC

> 
> Hi Chunyu,
> 
> The patch shouldn't make a difference, because rcu_read_lock() doesn't
> disable preempt
> in kernel-rt because CONFIG_PREEMPT_RCU is enabled.
> 
> I tried to reproduce the issue, but can't succeed in runing simple blktrace.
> 
> So could you share us your test steps?

First you need to run kernel-rt-debug:
[root@hp-dl380pg8-17 Regression]# uname -r
4.18.0-260.rt7.25.el8.x86_64+debug


I checked the test sub script, it's indeed just doing blktrace on root device. And I can
reproduce it with my host currently running.

blktrace -w 15 -d /dev/mapper/rhel_hp--dl380pg8--17-root

To run this with kernel-kernel-errata-Regression

1) yum -y install kernel-kernel-errata-Regression
2) cd /mnt/tests/kernel/errata/Regression
3) make run BZLIST=502573 INST=NEVER
4) dmesg

[root@hp-dl380pg8-17 Regression]# dmesg
[  168.697555] dracut-install (3376) used greatest stack depth: 19376 bytes left
[  918.327322] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:968
[  918.327332] in_atomic(): 1, irqs_disabled(): 0, pid: 7661, name: mandb
[  918.327339] 2 locks held by mandb/7661:
[  918.327343]  #0: ffffffffaecd22c0 (rcu_read_lock){....}-{1:2}, at: blk_add_trace_bio+0x0/0x2f0
[  918.327376]  #1: ffffffffaea177a0 (running_trace_lock){+.+.}-{2:2}, at: __blk_add_trace+0x9ea/0xdf0
[  918.327397] Preemption disabled at:
[  918.327408] [<ffffffffac136743>] get_lock_stats+0x13/0x120
[  918.327422] CPU: 3 PID: 7661 Comm: mandb Kdump: loaded Tainted: G          I      --------- -  - 4.18.0-260.rt7.25.el8.x86_64+debug #1
[  918.327426] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 07/01/2015
[  918.327431] Call Trace:
[  918.327447]  dump_stack+0x8e/0xd0
[  918.327465]  ? get_lock_stats+0x13/0x120
[  918.327475]  ___might_sleep.cold.58+0x54/0x77
[  918.327495]  rt_spin_lock+0x3e/0x120
[  918.327501]  ? __blk_add_trace+0x9ea/0xdf0
[  918.327513]  __blk_add_trace+0x9ea/0xdf0
[  918.327561]  ? rcu_read_lock_sched_held+0xe0/0xe0
[  918.327584]  blk_add_trace_bio+0x25d/0x2f0


[root@hp-dl380pg8-17 Regression]# grep 502573\( -A 18 /mnt/tests/kernel/errata/Regression/tests/rhel5.sh 
function bz502573(){
	local dir=$(pwd)
	local bin_file=$DIR_BIN/${FUNCNAME}_test
	local log_file=$DIR_LOG/${FUNCNAME}.log
	local src_file=$DIR_SRC/${FUNCNAME}_test.c
	local test_info="blktrace run makes the machine unresponsive"
	[ -n "$BZ_DEBUG" ] && local RUNTIME=25s || local RUNTIME=5m

	rlPhaseStartTest "Test ${FUNCNAME}"
		rlLogInfo "$test_info"
		rpm -q blktrace || rlRun "yum -y install blktrace"

		rlAssertRpm kernel
		rlAssertRpm blktrace
		mount |grep debugfs |grep /sys/kernel/debug || 
		rlRun "mount -t debugfs debugfs /sys/kernel/debug" 0 "Mounting debugfs"
		rlRun "Device=\`mount | grep 'on / type' | cut -d' ' -f1\`" 0 "Finding root device"
		local Runtime=15
		rlRun "blktrace -w $Runtime -d $Device" 0 "blktrace"

Comment 12 Chunyu Hu 2020-12-16 08:23:58 UTC
(In reply to Chunyu Hu from comment #11)
> 
> > 
> > Hi Chunyu,
> > 
> > The patch shouldn't make a difference, because rcu_read_lock() doesn't
> > disable preempt
> > in kernel-rt because CONFIG_PREEMPT_RCU is enabled.
> > 
> > I tried to reproduce the issue, but can't succeed in runing simple blktrace.
> > 
> > So could you share us your test steps?
> 
> First you need to run kernel-rt-debug:
> [root@hp-dl380pg8-17 Regression]# uname -r
> 4.18.0-260.rt7.25.el8.x86_64+debug
> 
> 
> I checked the test sub script, it's indeed just doing blktrace on root
> device. And I can
> reproduce it with my host currently running.
> 
> blktrace -w 15 -d /dev/mapper/rhel_hp--dl380pg8--17-root

Looks like run this directly won't trigger the mandb execution, so can't reproduce.

> 
> To run this with kernel-kernel-errata-Regression
> 
> 1) yum -y install kernel-kernel-errata-Regression
> 2) cd /mnt/tests/kernel/errata/Regression
> 3) make run BZLIST=502573 INST=NEVER
> 4) dmesg
> 


mandb ran during the process, not sure where it came from. But run as errata/Regression sub test, it's reproducible.

Comment 13 Chunyu Hu 2020-12-16 08:40:50 UTC
(In reply to Ming Lei from comment #10)
> (In reply to Chunyu Hu from comment #9)
> > (In reply to Chunyu Hu from comment #8)
> > 
> > > 
> > > I submitted another one this morning for double check, if you think it's
> > > enough with the finished result. Feel free to go ahead for your progress.
> > > Thanks.
> > 
> > The other job is good too. The bio warning was not shown.
> 
> Hi Chunyu,
> 
> The patch shouldn't make a difference, because rcu_read_lock() doesn't
> disable preempt
> in kernel-rt because CONFIG_PREEMPT_RCU is enabled.

But it indeed didn't see the same issue in the console.log in the job of comment#8.
[root@hp-dl380pg8-17 Regression]# curl -LkO https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2020/12/47999/4799924/9166750/console.log
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2618k  100 2618k    0     0   714k      0  0:00:03  0:00:03 --:--:--  714k

[root@hp-dl380pg8-17 Regression]# grep submit_bio console.log 
[root@hp-dl380pg8-17 Regression]# 

^^^^^ got nothing


While with the job in comment#0, easily got the trace in console log:

[root@hp-dl380pg8-17 Regression]# curl -LkO https://beaker.engineering.redhat.com/recipes/9159156/logs/console.log
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0   417    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 3197k  100 3197k    0     0   600k      0  0:00:05  0:00:05 --:--:--  972k
[root@hp-dl380pg8-17 Regression]# grep submit_bio console.log 
[  284.528329]  ? submit_bio+0xcb/0x440 
[  284.528338]  submit_bio+0xcb/0x440 
[  285.544157]  ? submit_bio+0xcb/0x440 
[  285.544167]  submit_bio+0xcb/0x440 
[  286.579390]  ? submit_bio+0xcb/0x440 
[  286.579399]  submit_bio+0xcb/0x440 
[  287.620651]  ? submit_bio+0xcb/0x440 
[  287.620661]  submit_bio+0xcb/0x440 
[  293.137160]  ? submit_bio+0xcb/0x440 
[  293.137169]  submit_bio+0xcb/0x440 
[  294.240292]  ? submit_bio+0xcb/0x440 
[  294.240302]  submit_bio+0xcb/0x440 

There's another explanation that it's not 100% reproducible.

Comment 32 Wander 2022-01-28 14:59:04 UTC
Notice this affects RHEL 9 as well.

Comment 35 Chunyu Hu 2022-01-28 23:51:56 UTC
(In reply to Wander from comment #32)
> Notice this affects RHEL 9 as well.

Thanks. I cloned a rhel9.0 bz2047971, Wander, would you take that?

Comment 45 errata-xmlrpc 2022-05-10 14:41:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel-rt security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1975

Comment 46 Valentin Schneider 2022-06-28 14:42:18 UTC
*** Bug 2052397 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.