Bug 673058
Summary: | kernel panic in pg_init_done - pgpath already deleted | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Menny Hamburger <Menny_Hamburger> |
Component: | kernel | Assignee: | Mike Snitzer <msnitzer> |
Status: | CLOSED ERRATA | QA Contact: | Gris Ge <fge> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.5 | CC: | babu.moger, bdonahue, bmr, dhoward, dwa, fge, hang_shi, jpirko, lvm-team, martinez, Menny_Hamburger, msnitzer, skito |
Target Milestone: | rc | Keywords: | ZStream |
Target Release: | 5.7 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
A race could occur when an internal multipath structure (pgpath) was freed before it was used to signal the path group initialization was complete (via pg_init_done). This update includes a number of fixes that address this issue. multipath is now increasingly robust when multipathd restarts are combined with I/O operations to multipath devices and storage failures.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2011-07-21 09:27:31 UTC | Type: | --- |
Bug Blocks: | 683443 | ||
Attachments: |
Description
Menny Hamburger
2011-01-27 08:10:55 UTC
Babu.Moger suggested I try the following:
http://git.kernel.org/linus/2bded7bd7e8b12a913b0b58167a48220560e1514
I patched the kernel by only adding the call to multipath_wait_for_pg_init_completion after flush_workqueue(kmpath_handlerd) in the multipath_dtr code. This did not solve the problem.
Created attachment 475994 [details]
Fix panic in pg_init_done called from within send_mode_select
This patch is against the RHEL5.4 code but is also suitable for RHEL5.5 and RHEL5.6.
The code that waits for pg_init completion was moved to just before the multipath is deleted.
This fix makes me wonder why some of the multipath release code is also run in postsuspend (in the vanilla kernel).
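The idea behind the patch (do not free the multipath while a pg_init is still outstanding) can be illustrated with a small user-space analogy. This is only a sketch in Python, not the kernel code: the class `Multipath`, the methods `start_pg_init` and `destroy`, and the helper `run_demo` are hypothetical names chosen for illustration, and a `threading.Condition` stands in for the kernel wait queue.

```python
import threading


class Multipath:
    """Hypothetical user-space stand-in for the kernel's multipath object."""

    def __init__(self):
        # One lock protects the counter and the 'freed' flag; the condition
        # variable plays the role of a wait queue.
        self._cond = threading.Condition()
        self.pg_init_in_progress = 0
        self.freed = False

    def start_pg_init(self):
        # Called when a path group initialization is handed to a handler.
        with self._cond:
            self.pg_init_in_progress += 1

    def pg_init_done(self):
        # Completion callback. Touching a freed object here is the analogue
        # of the panic in this report.
        with self._cond:
            if self.freed:
                raise RuntimeError("pg_init_done ran after the object was freed")
            self.pg_init_in_progress -= 1
            if self.pg_init_in_progress == 0:
                self._cond.notify_all()

    def destroy(self):
        # Wait until no pg_init is outstanding, then mark the object freed.
        # Both steps happen under the same lock so nothing can slip between.
        with self._cond:
            while self.pg_init_in_progress:
                self._cond.wait()
            self.freed = True


def run_demo(delay=0.05):
    m = Multipath()
    m.start_pg_init()
    # The completion arrives later, from another thread.
    timer = threading.Timer(delay, m.pg_init_done)
    timer.start()
    m.destroy()  # blocks until pg_init_done has run
    timer.join()
    return m.freed, m.pg_init_in_progress
```

Calling `run_demo()` returns only after the delayed completion has run, so the completion never sees a freed object. Without the wait loop in `destroy`, the timer thread would hit the `RuntimeError` branch instead.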
Sorry, the above panic is from another bug. Here is the correct one:
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Unable to handle kernel paging request at 0000000100000010 RIP:
2011 Jan 20 02:53:08 node0 INFO: kernel: sd 350:0:0:0: rdac: array MD3200i-c7-d7, ctlr 1, MODE_SELECT completed
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff881b713f>] :dm_multipath:pg_init_done+0x2f/0x1c0
2011 Jan 20 02:53:08 node0 ALERT: kernel: Unable to handle kernel paging request at 0000000100000010 RIP:
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: PGD 0
2011 Jan 20 02:53:08 node0 ALERT: kernel: [<ffffffff881b713f>] :dm_multipath:pg_init_done+0x2f/0x1c0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Oops: 0000 [1] SMP
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: last sysfs file: /devices/platform/host353/session76/target353:0:0/353:0:0:0/rev
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: CPU 3
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Modules linked in: iscsi_tcp(U) libiscsi_tcp(U) libiscsi2(U) scsi_transport_iscsi2(U) scsi_transport_iscsi(U) ipmi_si(U) dell_rbu(U) scsi_dh_rdac(U) dm_rdac(U) dm_queue_depth(U) dm_round_robin(U) xt_tcpudp(U) ipt_SYSRQ(U) netconsole(U) iptable_filter(U) ip_tables(U) vfat(U) fat(U) usb_storage(U) mptctl(U) exa_ioctls(U) nfs(U) lockd(U) nfs_acl(U) x_tables(U) sunrpc(U) ipmi_devintf(U) ipmi_msghandler(U) bonding1(U) bonding(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) ac(U) sr_mod(U) cdrom(U) joydev(U) sg(U) bnx2(U) pcspkr(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Pid: 23040, comm: kmpath_rdacd Tainted: G 2.6.18-164sys #1
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: RIP: 0010:[<ffffffff881b713f>] [<ffffffff881b713f>] :dm_multipath:pg_init_done+0x2f/0x1c0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: RSP: 0018:ffff810081e71d40 EFLAGS: 00010293
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: RAX: ffffffff881b7110 RBX: ffff81004c488780 RCX: ffff810080721c00
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: RDX: ffff81004c488780 RSI: 0000000000000000 RDI: ffff810071521ea0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: R10: 0000000000000000 R11: ffffffff80087ee0 R12: ffff810080721c00
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: R13: ffff810087a247d8 R14: 0000000100000000 R15: ffff810071521e80
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: FS: 0000000000000000(0000) GS:ffff810107f5da40(0000) knlGS:0000000000000000
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: CR2: 0000000100000010 CR3: 00000000592df000 CR4: 00000000000006e0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Process kmpath_rdacd (pid: 23040, threadinfo ffff810081e70000, task ffff81008a0f5080)
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Stack: 0000000000000282 ffff81004c488780 ffff81006c14fc00 ffff810080721c00
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: ffff810087a247d8 0000000000000000 ffff81006c14fc62 ffffffff883bcdf7
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: 0000000010008040 ffff81008c7f65c0 ffff810081e71e70 ffff810081e71dd0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Call Trace:
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff883bcdf7>] :scsi_dh_rdac:send_mode_select+0x477/0x4b0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff80032183>] __wake_up+0x43/0x70
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff883bc980>] :scsi_dh_rdac:send_mode_select+0x0/0x4b0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff800561a3>] run_workqueue+0xb3/0x110
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff80052180>] worker_thread+0x0/0x150
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff800b33b0>] keventd_create_kthread+0x0/0xa0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff80052291>] worker_thread+0x111/0x150
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff8009ca00>] default_wake_function+0x0/0x10
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff80052180>] worker_thread+0x0/0x150
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff800373c9>] kthread+0xd9/0x120
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff80068fb1>] child_rip+0xa/0x11
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff800b33b0>] keventd_create_kthread+0x0/0xa0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff800372f0>] kthread+0x0/0x120
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: [<ffffffff80068fa7>] child_rip+0x0/0x11
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: Code: 49 8b 5e 10 77 7b 89 f0 ff 24 c5 a8 8c 1b 88 c7 44 24 04 00
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: RIP [<ffffffff881b713f>] :dm_multipath:pg_init_done+0x2f/0x1c0
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: RSP <ffff810081e71d40>
2011 Jan 20 02:53:08 172.19.58.130 NOTICE: CR2: 0000000100000010
2011 Jan 20 02:53:08 172.19.58.130 EMERG: Kernel panic - not syncing: Fatal exception
2011 Jan 20 02:53:08 172.19.58.130 EMERG: Rebooting in 1 seconds..
Created attachment 477286 [details]
Add a wait for pg init completion in multipath destructor
Created attachment 477288 [details]
Flush workqueues on postsuspend as well as at the destructor
After running additional tests, it seems that flushing at postsuspend is also required; otherwise some queues will still be doing work when the multipath is freed.
Note:
Both patches are over RHEL5.6
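Why a flush is needed at both points can be shown with a user-space analogy of a work queue. The following is only a sketch in Python under stated assumptions, not the dm-multipath code: `Workqueue`, `Target`, `trigger`, and `run_demo` are hypothetical names, and `queue.Queue.join()` stands in for the kernel's flush_workqueue().

```python
import queue
import threading


class Workqueue:
    """Hypothetical single-threaded work queue (analogy for a kernel workqueue)."""

    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn = self._q.get()
            try:
                fn()
            finally:
                self._q.task_done()

    def queue_work(self, fn):
        self._q.put(fn)

    def flush(self):
        # Blocks until every queued item has finished running.
        self._q.join()


class Target:
    """Hypothetical stand-in for a multipath target that owns queued work."""

    def __init__(self, wq):
        self.wq = wq
        self.freed = False
        self.handled = 0
        self.late_accesses = 0  # work items that ran after the target was freed

    def _work(self):
        if self.freed:
            self.late_accesses += 1
        else:
            self.handled += 1

    def trigger(self):
        self.wq.queue_work(self._work)

    def postsuspend(self):
        # Drain outstanding work before the suspend completes.
        self.wq.flush()

    def destroy(self):
        # Drain again: work may have been queued since the last suspend.
        self.wq.flush()
        self.freed = True


def run_demo(n=5):
    wq = Workqueue()
    t = Target(wq)
    for _ in range(n):
        t.trigger()
    t.postsuspend()
    after_suspend = t.handled
    t.trigger()  # work queued after the suspend
    t.destroy()
    return after_suspend, t.handled, t.late_accesses
```

In this sketch, dropping the flush from `destroy` would let the last work item race with the free, which is the analogue of queues still doing work when the multipath is freed.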
The patch from comment#4 is missing 'init_waitqueue_head(&m->pg_init_wait);' in alloc_multipath().
Taking a step back, it seems we'd want a faithful backport of the following upstream commits:
6380f26 dm mpath: add mutex to synchronize adding and flushing work
67a46da dm mpath: prevent io from work queue while suspended
c2f3d24 dm mpath: reject messages when device is suspended
83c0d5d dm mpath: pass struct pgpath to pg init done
f7b934c dm mpath: skip activate_path for failed paths -- unrelated but worthwhile
d0259bf dm mpath: hold io until all pg_inits completed
2bded7b dm mpath: wait for pg_init completion when suspending
fb61264 dm mpath: refactor pg_init
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patches that backport the following upstream commits are available here:
http://people.redhat.com/msnitzer/patches//.rhel5.7/bz673058/
6df400a v2.6.33-rc1 dm mpath: flush workqueues before suspend completes
6380f26 v2.6.33-rc1 dm mpath: add mutex to synchronize adding and flushing work
67a46da v2.6.33-rc1 dm mpath: prevent io from work queue while suspended
83c0d5d v2.6.34-rc1 dm mpath: pass struct pgpath to pg init done
f7b934c v2.6.34-rc1 dm mpath: skip activate_path for failed paths
d0259bf v2.6.34-rc1 dm mpath: hold io until all pg_inits completed
2bded7b v2.6.34-rc1 dm mpath: wait for pg_init completion when suspending
6bbf79a v2.6.36-rc1 dm mpath: fix NULL pointer dereference when path parameters missing
Patch(es) available in kernel-2.6.18-250.el5
Detailed testing feedback is always welcomed.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
A race could occur when an internal multipath structure (pgpath) was freed before it was used to signal the path group initialization was complete (via pg_init_done). This update includes a number of fixes that address this issue. multipath is now increasingly robust when multipathd restarts are combined with I/O operations to multipath devices and storage failures.
I tried running these commands at the same time, but was not able to hit the problem:
=========
for X in `seq 1 100`; do
    multipath -F
    service multipathd restart
    sleep 10
done
=========
for Y in `seq 1 100`; do
    sudo ./include_4_python.sh FC_Switch_Link_Trigger_By_WWPN down "0x10000000c990be2b"
    sudo ./include_4_python.sh FC_Switch_Link_Trigger_By_WWPN up "0x10000000c990be2b"
done
=========
Environment:
NetApp 3170 ALUA via lpfc. 50 LUNs over 8 paths (1 HBA port was disabled during testing, so only 4 paths from 1 HBA port were used in this test).
Does this bug need the "rdac" hardware handler?
Code reviewed. Sanity Only.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2011-1065.html