| Summary: | [RHEL6] Kernel crash in bdi_remove_from_list with EMC PowerPath | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Hidehiko Matsumoto <hmatsumo> |
| Component: | kernel | Assignee: | Ewan D. Milne <emilne> |
| kernel sub component: | Storage | QA Contact: | Storage QE <storage-qe> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | coughlan, djeffery, hideki.miyajima, kearnan_keith, nyamashi, revers, tatsu-ab1 |
| Version: | 6.4 | Flags: | hmatsumo: needinfo? (djeffery) |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-14 18:45:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description   Hidehiko Matsumoto   2016-10-03 14:20:05 UTC
The crash and warnings look likely to be side effects of how PowerPath grafts itself into the SCSI stack in a non-standard and racy way.

The backing_dev_info structure which triggered the crash is part of a request_queue at 0xffff882feed924f8. This request_queue was for the scsi_device for sdhx/5:0:1:21. Each points to the other as expected:

crash> p ((struct scsi_device *)0xffff882fda0bd800)->request_queue
$20 = (struct request_queue *) 0xffff882feed924f8
crash> p ((struct scsi_device *)0xffff882fda0bd800)->request_queue->queuedata
$21 = (void *) 0xffff882fda0bd800

However, this isn't the request_queue as found by sdhx's gendisk:

crash> dev -d | grep sdhx
  134 ffff882feb6eac00 sdhx   ffff88160556a278       0     0     0     0

The gendisk for sdhx shows a request_queue of 0xffff88160556a278. This is because of PowerPath: it is injecting its own request_queue structures into the normal SCSI gendisk structures, which is not part of the SCSI and block layer design. Swapping out the request queues is racy, which may also be the cause of the WARNING messages seen in the logs. PowerPath changing a request_queue while a task is accessing a SCSI dev node can create race conditions and could break reference counting.

In addition to the crash-triggering task, two more tasks, PIDs 399 and 406, were also manipulating request_queue and backing_dev_info structures from PowerPath functions. EMC would need to examine how they can interact with the storage stack in a non-racy way.

<snip>
BDI registration happens first in add_disk in fixed kernels:
/* Register BDI before referencing it from bdev */
bdi = &disk->queue->backing_dev_info;
bdi_register_dev(bdi, disk_devt(disk));
blk_register_region(disk_devt(disk), disk->minors, NULL,
exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
The likely cause of the warnings is PowerPath. Code in the emcp module appears to be doing things like unregistering a bdi and then registering a bdi, from emcp_reenable_io, while the disk is live:
crash> bt
PID: 413 TASK: ffff882feecfd500 CPU: 6 COMMAND: "scsi_wq_5"
#0 [ffff882fec7b14b0] machine_kexec at ffffffff81035d6b
#1 [ffff882fec7b1510] crash_kexec at ffffffff810c0e22
#2 [ffff882fec7b15e0] oops_end at ffffffff81511cb0
#3 [ffff882fec7b1610] die at ffffffff8100f19b
#4 [ffff882fec7b1640] do_general_protection at ffffffff815117b2
#5 [ffff882fec7b1670] general_protection at ffffffff81510f85
[exception RIP: bdi_remove_from_list+47]
RIP: ffffffff8113c1bf RSP: ffff882fec7b1720 RFLAGS: 00010282
RAX: dead000000200200 RBX: ffff882feed92658 RCX: 0000000000000158
RDX: ffff8814de8bef88 RSI: 0000000000000246 RDI: ffffffff81fb7520
RBP: ffff882fec7b1730 R8: 0000000000000000 R9: 0000000000000000
R10: 00000000beefdead R11: 0000000000000000 R12: 0000000000000246
R13: ffff882feed927c0 R14: ffff882feb6eac00 R15: 0000000008600070
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#6 [ffff882fec7b1738] bdi_unregister at ffffffff8113c2a1
#7 [ffff882fec7b1768] emcp_bdi_unregister at ffffffffa02580ee [emcp]
#8 [ffff882fec7b1778] emcp_reenable_io at ffffffffa0261fff [emcp]
#9 [ffff882fec7b1818] emcp_map_device at ffffffffa026ce3c [emcp]
#10 [ffff882fec7b18a8] emcp_add at ffffffffa026ec3e [emcp]
#11 [ffff882fec7b1918] emcp_chg_device_notify at ffffffffa0270f35 [emcp]
#12 [ffff882fec7b1938] notifier_call_chain at ffffffff81513cb5
#13 [ffff882fec7b1978] __blocking_notifier_call_chain at ffffffff8109cfba
#14 [ffff882fec7b19c8] blocking_notifier_call_chain at ffffffff8109cff6
#15 [ffff882fec7b19d8] driver_bound at ffffffff8135f26c
#16 [ffff882fec7b19f8] driver_probe_device at ffffffff8135f39f
#17 [ffff882fec7b1a28] __device_attach at ffffffff8135f693
#18 [ffff882fec7b1a48] bus_for_each_drv at ffffffff8135e5c4
#19 [ffff882fec7b1a88] device_attach at ffffffff8135f774
#20 [ffff882fec7b1ab8] bus_probe_device at ffffffff8135e36d
#21 [ffff882fec7b1ac8] device_add at ffffffff8135c677
#22 [ffff882fec7b1b48] scsi_sysfs_add_sdev at ffffffff8137fe59
#23 [ffff882fec7b1b88] scsi_probe_and_add_lun at ffffffff8137d2a0
#24 [ffff882fec7b1cc8] __scsi_scan_target at ffffffff8137ddfc
#25 [ffff882fec7b1db8] scsi_scan_target at ffffffff8137e6c5
#26 [ffff882fec7b1e08] fc_scsi_scan_rport at ffffffffa008a8ad [scsi_transport_fc]
#27 [ffff882fec7b1e38] worker_thread at ffffffff81090be0
#28 [ffff882fec7b1ee8] kthread at ffffffff81096a36
#29 [ffff882fec7b1f48] kernel_thread at ffffffff8100c0ca
</snip>
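
For reference, here is a minimal sketch, not taken from the dump and with a hypothetical helper name, of the linkage the crash> commands in the description were checking. In the stock sd/block-layer design, the scsi_device, its request_queue, and the gendisk all refer to the same queue; in this vmcore, sdhx's gendisk instead pointed at a separate, PowerPath-provided queue (0xffff88160556a278 vs. 0xffff882feed924f8).

#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <scsi/scsi_device.h>

/* Hypothetical helper: true when the queue wiring matches what the SCSI
 * midlayer and sd set up on their own, i.e. what the crash output above
 * was verifying for sdhx. */
static bool queue_linkage_is_consistent(struct scsi_device *sdev,
                                        struct gendisk *disk)
{
	struct request_queue *q = sdev->request_queue;

	/* The midlayer points queuedata back at the owning scsi_device
	 * (matches $21 above). */
	if (q->queuedata != sdev)
		return false;

	/* sd's gendisk is expected to use that same queue; in the dump
	 * it did not, because PowerPath substituted its own. */
	return disk->queue == q;
}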
Hi Tom,

This seems like it could be similar to the issue from https://access.redhat.com/solutions/1212373. Could this be related? I have no visibility into https://bugzilla.redhat.com/show_bug.cgi?id=1111683.

-Keith

The systems in https://access.redhat.com/solutions/1212373 are exhibiting similar behavior during device remove. In this case, it's not remove but device add. We appear to be getting the WARNINGs and the crash from PowerPath removing and adding bdi connections while a block device and its gendisk are already active and visible. It's not a race on delete but a violation of the block layer's expectation that the block device and its bdi will be stable while the device exists.

Yes, this panic happened during device add, not removal. But isn't the behavior similar in both cases? In this case too, bdi_unregister was called, which in turn called bdi_remove_from_list to remove a BDI which had already been removed. From the dump, BDI_pending is set on the bdi, which means that when line 612 was executed, BDI_pending was not set; by the time line 618, bdi_remove_from_list(), was executing, BDI_pending had been set by some other thread. From the code, the BDI_pending bit is set in only one place, in bdi_add_default_flusher_task() by the default flusher task. Isn't this issue due to a race between bdi_unregister and the bdi-default thread, as in BZ 1111683? Should we ask the customer to upgrade to a kernel version which has the fix for BZ 1111683? Appreciate your comments.

602 static void bdi_wb_shutdown(struct backing_dev_info *bdi)
603 {
609         /*
610          * If setup is pending, wait for that to complete first
611          */
612         wait_on_bit(&bdi->state, BDI_pending, bdi_sched_wait,
613                     TASK_UNINTERRUPTIBLE);
614
615         /*
616          * Make sure nobody finds us on the bdi_list anymore
617          */
618         bdi_remove_from_list(bdi);

501 static void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
502 {
..
518         if (!test_and_set_bit(BDI_pending, &bdi->state)) {
519                 list_del_rcu(&bdi->bdi_list);   // Removed from the list here
520
521                 /*
522                  * We must wait for the current RCU period to end before
523                  * moving to the pending list. So schedule that operation
524                  * from an RCU callback.
525                  */
526                 call_rcu(&bdi->rcu_head, bdi_add_to_pending);

Yes, the behavior in BZ 1111683 is similar to the crash, and its fix may correct enough of the issue. It will not stop the WARN messages from removing the bdi of an active and accessible device, which creates other problems I am not sure BZ 1111683 would fully protect against. Once available, testing such a kernel may not be a bad idea. Patches for BZ 1111683 have already been pulled once for creating a new issue.

We asked for more explanation on the panic, as the last comment by David Jeffery at 2016-10-25 16:32:33 EDT ("the behavior in BZ1111683 is similar to the crash.") is not very clear. Here is what we got from Red Hat Japan: "This has nothing to do with that EMC bug. The EMC PowerPath module appears to be doing something that is completely unsupportable and broken." I think the *something* refers to the WARNINGs, but can you please tell us what that something is? Is this about the WARNINGs or the panic? As for the panic, which causes it, the kernel or PowerPath?

The warnings look to be PowerPath related. PowerPath appears to be unregistering and registering bdi devices on a live, fully accessible device. This is not intended behavior and can create races which will trigger the warnings. The crash isn't as clear. There is a similar crash which is being fixed in RHEL. However, PowerPath's methods of injecting itself into SCSI disk structures as part of it claiming a disk are not standard, from its unregister/register of the bdi device to creating alternate gendisks and request_queues. Without a full view of what PowerPath is doing, we cannot tell if it is the same issue or just a similar one triggered by PowerPath's methods of inserting itself into the disk stack in ways that were never intended by the SCSI/block layer's design.

Regarding these two statements:

- powerpath's methods of injecting itself into SCSI disk structures as part of it claiming a disk are not standard, from its unregister/register of bdi device to creating alternate gendisks and request_queues.
- powerpath's methods of inserting itself into the disk stack in ways that were never intended by the SCSI/block layer's design.

Can you tell us what makes you think PowerPath is behaving this way? I would appreciate it if you could show the stack or any other evidence based on the core. Thank you.

Closing as a duplicate of BZ 1111683. There are fixes being made to correct issues with device removal which, according to our analysis of the crash dumps, appear to be the cause of the crashes. EMC appears to concur with this analysis. There is a pre-beta test kernel available for partners, but not for customer use; see BZ 1111683 for details. Please provide any testing feedback if it is used. If for some reason the test kernel and/or RHEL 6.9 GA does not solve this issue for the customer, then re-open this bug with further details at that point.

*** This bug has been marked as a duplicate of bug 1111683 ***
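
For reference, the following is a minimal userspace sketch of the interleaving described in the comments above. It uses stand-in names and a forced, deterministic ordering rather than real kernel code, so it is an illustration of the analysis, not a reproducer: the bdi_wb_shutdown() side passes its BDI_pending check, the bdi-default side then sets BDI_pending and removes the bdi from the list, and the shutdown side finally deletes an entry whose prev pointer has already been poisoned, which corresponds to the GPF in bdi_remove_from_list with RAX = dead000000200200 (LIST_POISON2).

/*
 * Build with: gcc -o bdi_race_sketch bdi_race_sketch.c   (x86_64)
 * "flusher" mimics bdi_add_default_flusher_task(); "shutdown" mimics
 * bdi_wb_shutdown().  The flag check and the list deletion are not one
 * atomic step, so the same entry can be deleted twice.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define LIST_POISON2 ((struct list_head *)0xdead000000200200UL)

struct list_head { struct list_head *next, *prev; };

static struct list_head bdi_list = { &bdi_list, &bdi_list };
static struct list_head bdi_entry;
static bool bdi_pending;

/* Same shape as the kernel's list_del_rcu(): unlink, then poison prev. */
static void list_del_rcu_like(struct list_head *entry, const char *who)
{
	if (entry->prev == LIST_POISON2) {
		/* In the kernel, this is the write through LIST_POISON2
		 * that faulted at bdi_remove_from_list+47. */
		printf("%s: double delete, prev is already poisoned\n", who);
		exit(1);
	}
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->prev = LIST_POISON2;
}

int main(void)
{
	/* Put the entry on the list, as registration would. */
	bdi_entry.next = &bdi_list;
	bdi_entry.prev = &bdi_list;
	bdi_list.next  = &bdi_entry;
	bdi_list.prev  = &bdi_entry;

	/*
	 * Shutdown side (line 612): BDI_pending is clear, so wait_on_bit()
	 * returns at once and the task heads for bdi_remove_from_list().
	 *
	 * Before it gets there, the flusher side (lines 518-519) wins
	 * the race:
	 */
	if (!bdi_pending) {			/* test_and_set_bit(BDI_pending) */
		bdi_pending = true;
		list_del_rcu_like(&bdi_entry, "flusher");
	}

	/* Shutdown side resumes (line 618) and deletes the same entry. */
	list_del_rcu_like(&bdi_entry, "shutdown");

	return 0;
}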