1550617 – s390/dasd: triggers WARN_ON in blk_mq, hw queue runs without being mapped

Bug 1550617 - s390/dasd: triggers WARN_ON in blk_mq, hw queue runs without being mapped

Summary: s390/dasd: triggers WARN_ON in blk_mq, hw queue runs without being mapped

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	28
Hardware:	s390x
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Dan Horák
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	ZedoraTracker 1584165
TreeView+	depends on / blocked

Reported:	2018-03-01 16:00 UTC by Michal Schmidt
Modified:	2018-09-06 11:44 UTC (History)
CC List:	20 users (show)
Fixed In Version:
Clone Of:	1538550
Clones:	1584165 (view as bug list)
Environment:
Last Closed:	2018-08-29 15:14:13 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	165279	0	None	None	None	2019-06-04 09:34:01 UTC

Description Michal Schmidt 2018-03-01 16:00:45 UTC

The dasd driver triggered a kernel warning in blk_mq.

The full syslog is at:
https://bugzilla.redhat.com/attachment.cgi?id=1392197
referenced from:
https://bugzilla.redhat.com/show_bug.cgi?id=1538550#c3

+++ This bug was initially created as a clone of Bug #1538550 +++

--- Additional comment from Vendula Poncova on 2018-02-06 17:09 CET ---

I am able to reproduce the bug, but I am not sure what is the cause. However, there are some errors in the attached syslog:

[...]

15:24:37,332 WARNING kernel:Call Trace:
15:24:37,332 WARNING kernel:([<040000007cb6fc30>] 0x40000007cb6fc30)
15:24:37,332 WARNING kernel: [<00000000004f7664>] blk_mq_run_hw_queue+0x4c/0xb0 
15:24:37,332 WARNING kernel: [<00000000004f7730>] blk_mq_run_hw_queues+0x68/0x88 
15:24:37,332 WARNING kernel: [<00000000001e0610>] flush_smp_call_function_queue+0xc0/0x160 
15:24:37,332 WARNING kernel: [<0000000000116830>] smp_handle_ext_call+0xb8/0xc0 
15:24:37,332 WARNING kernel: [<000000000010bc74>] do_ext_interrupt+0xbc/0xe0 
15:24:37,332 WARNING kernel: [<00000000001b222c>] __handle_irq_event_percpu+0x64/0x240 
15:24:37,332 WARNING kernel: [<00000000001b2446>] handle_irq_event_percpu+0x3e/0x88 
15:24:37,332 WARNING kernel: [<00000000001b7154>] handle_percpu_irq+0x6c/0x98 
15:24:37,332 WARNING kernel: [<00000000001b11e2>] generic_handle_irq+0x42/0x60 
15:24:37,332 WARNING kernel: [<000000000010c08c>] do_IRQ+0x84/0xb8 
15:24:37,332 WARNING kernel: [<00000000008488d4>] ext_int_handler+0x100/0x104 
15:24:37,332 WARNING kernel: [<0000000000102be6>] enabled_wait+0x4e/0xe0 
15:24:37,332 WARNING kernel:([<0000000000000000>]           (null))
15:24:37,332 WARNING kernel: [<0000000000102f22>] arch_cpu_idle+0x32/0x48 
15:24:37,332 WARNING kernel: [<00000000001942f8>] do_idle+0xe8/0x170 
15:24:37,332 WARNING kernel: [<0000000000194556>] cpu_startup_entry+0x3e/0x48 
15:24:37,332 WARNING kernel: [<0000000000116760>] smp_start_secondary+0x120/0x138 
15:24:37,332 WARNING kernel: [<0000000000848cd6>] restart_int_handler+0x62/0x78 
15:24:37,332 WARNING kernel: [<0000000000000000>]           (null) 
15:24:37,332 WARNING kernel:Last Breaking-Event-Address:
15:24:37,332 WARNING kernel: [<00000000004f75e4>] __blk_mq_delay_run_hw_queue+0xbc/0xd8

[...]
--- Additional comment from Michal Schmidt on 2018-02-23 17:06:31 CET ---
[...]

The kernel warning looks more important. It's likely a bug in the dasd driver. 
The warning is this one:

static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
                                        unsigned long msecs)
{
>>>>    if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
                return;


dasd is not expressly implicated in the call trace, but it's possible for the flow to go like this:
__dasd_cleanup_cqr
  blk_mq_complete_request
    __blk_mq_complete_request
      rq->csd.func = __blk_mq_complete_request_remote;
      smp_call_function_single_async(ctx->cpu, &rq->csd);

Then a CPU gets the IPI and handles it:
flush_smp_call_function_queue
  __blk_mq_complete_request_remote
    rq->q->softirq_done_fn(rq);
      // tail call. softirq_done_fn was dasd_request_done.
      dasd_request_done
        // tail call
        blk_mq_run_hw_queues

The above only explains why the call trace looks the way it does. It does not explain the actual bug.

commit 673514aff5d797b0cdb3f09097ae0e253b5667c3
("s390/dasd: fix race during dasd initialization") supposedly fixed a similar bug before (already in v4.14).

Comment 1 Dan Horák 2018-03-01 16:25:42 UTC

the original report was on kernel-4.15.0-1.fc28.s390x, but I see the same traceback in kernel-4.16.0-0.rc2.git0.1.fc28.s390x too

[    3.353901] WARNING: CPU: 1 PID: 0 at block/blk-mq.c:1400 __blk_mq_delay_run_hw_queue+0x32/0xd8
[    3.353926] Modules linked in: sha256_s390 sha_common dasd_eckd_mod dasd_mod
[    3.353957] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.16.0-0.rc2.git0.1.fc28.s390x #1
[    3.353976] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[    3.353979] Krnl PSW : 00000000fede11d5 00000000a37d6269 (__blk_mq_delay_run_hw_queue+0x32/0xd8)
[    3.353987]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[    3.353991] Krnl GPRS: 0000000077643710 0000000000000000 00000000018d0000 0000000000000001
[    3.353994]            0000000000000000 00000000795fcab4 0000000001d13e00 0000000001d13eb0
[    3.353998]            000000007cb2bda8 000000007cb2bda8 0000000000000001 0000000000000000
[    3.354001]            00000000018d0000 00000000772b1e98 000000007cb2bc20 000000007cb2bbe8
[    3.354021] Krnl Code: 00000000004f89cc: e31021b80012        lt      %r1,440(%r2)
                          00000000004f89d2: a774000a            brc     7,4f89e6
                         #00000000004f89d6: a7f40001            brc     15,4f89d8
                         >00000000004f89da: ebaff0a00004        lmg     %r10,%r15,160(%r15)
                          00000000004f89e0: 07fe                bcr     15,%r14
                          00000000004f89e2: 47000000            bc      0,0
                          00000000004f89e6: b90400c2            lgr     %r12,%r2
                          00000000004f89ea: e31021f00002        ltg     %r1,496(%r2)
[    3.354058] Call Trace:
[    3.354062] ([<000000000000006e>] 0x6e)
[    3.354066]  [<00000000004f8b62>] blk_mq_run_hw_queue+0x82/0x158 
[    3.354071]  [<00000000004f8ca0>] blk_mq_run_hw_queues+0x68/0x88 
[    3.354076]  [<00000000004f7ece>] __blk_mq_complete_request+0x10e/0x1d8 
[    3.354081]  [<00000000004f8022>] blk_mq_complete_request+0x8a/0xc0 
[    3.354097]  [<000003ff80031c64>] dasd_block_tasklet+0x14c/0x4a0 [dasd_mod] 
[    3.354104]  [<0000000000149852>] tasklet_hi_action+0x92/0x118 
[    3.354111]  [<0000000000860da4>] __do_softirq+0xf4/0x340 
[    3.354115]  [<00000000001492f2>] irq_exit+0xba/0xc8 
[    3.354121]  [<000000000010bf98>] do_IRQ+0x80/0xb8 
[    3.354125]  [<00000000008600d4>] io_int_handler+0x130/0x280 
[    3.354128]  [<0000000000102c2e>] enabled_wait+0x4e/0xd8 
[    3.354131] ([<0000000000000000>]           (null))
[    3.354135]  [<0000000000102fa2>] arch_cpu_idle+0x42/0x58 
[    3.354141]  [<000000000019334a>] do_idle+0x15a/0x178 
[    3.354146]  [<0000000000193546>] cpu_startup_entry+0x3e/0x48 
[    3.354151]  [<0000000000116940>] smp_start_secondary+0x120/0x138 
[    3.354155]  [<00000000008607ae>] restart_int_handler+0x62/0x78 
[    3.354158]  [<0000000000000000>]           (null) 
[    3.354160] Last Breaking-Event-Address:
[    3.354165]  [<00000000004f89d6>] __blk_mq_delay_run_hw_queue+0x2e/0xd8
[    3.354167] ---[ end trace a3344d0b9198d3e7 ]---

Comment 2 Justin M. Forbes 2018-07-23 15:09:17 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.17.7-200.fc28.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 3 Justin M. Forbes 2018-08-29 15:14:13 UTC

*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 4 Dan Horák 2018-09-06 11:44:18 UTC

For the record - I think it went away in 4.17 kernels. I saw it in <=4.16.

Note You need to log in before you can comment on or make changes to this bug.

airlied
brueckner
bskeggs
bugproxy
dan
ewk
hannsj_uhl
hdegoede
ichavero
itamar
jarodwilson
jglisse
john.j5live
jonathan
josef
kernel-maint
linville
mchehab
mjg59
steved