Red Hat Bugzilla – Bug 680140
emc_clariion error handler panics with multiple failures
Last modified: 2011-05-19 08:05:36 EDT
Description of problem: While testing an EMC Clariion configured with device mapper multipath using SCSI-3 reservations to lock the storage (setting the hardware handler to "3 emc 0 1" to honor reservations) if multiple paths fail at the same time (powering off FC switch) then frequently a panic occurs (not 100% of the time). The stack trace follows: Modules linked in: ext3 jbd tg3 autofs4 sunrpc p4_clockmod freq_table speedstep_lib ipv6 dm_mirror dm_region_hash dm_log dm_r] Modules linked in: ext3 jbd tg3 autofs4 sunrpc p4_clockmod freq_table speedstep_lib ipv6 dm_mirror dm_region_hash dm_log dm_r]Pid: 2511, comm: kmpath_handlerd Not tainted 2.6.32-71.el6.x86_64 #1 eserver xSeries 346 -[8840A3Z]- RIP: 0010:[<ffffffff8125f9d9>] [<ffffffff8125f9d9>] strnlen+0x9/0x40 RSP: 0018:ffff880232de3b50 EFLAGS: 00010086 RAX: ffffffff8165722e RBX: ffffffff81ad9f80 RCX: 0000000000000002 RDX: 779425f86d9de564 RSI: ffffffffffffffff RDI: 779425f86d9de564 RBP: ffff880232de3b50 R08: 0000000000000073 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81ad9ba5 R13: 779425f86d9de564 R14: 00000000ffffffff R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff880028300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000037164d41c0 CR3: 0000000226252000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kmpath_handlerd (pid: 2511, threadinfo ffff880232de2000, task ffff880234da0040) Stack: ffff880232de3b90 ffffffff81261820 0000000000000001 ffffffff81ad9ba5 <0> ffffffffa02c977a ffffffffa02c9778 ffff880232de3d00 ffffffff81ad9f80 <0> ffff880232de3c30 ffffffff812628e0 0000000000000004 0000000affffffff Call Trace: [<ffffffff81261820>] string+0x40/0x100 [<ffffffff812628e0>] vsnprintf+0x310/0x5f0 [<ffffffff81262cd1>] vscnprintf+0x11/0x30 [<ffffffff8106c463>] vprintk+0xc3/0x4f0 [<ffffffff814c7c23>] ? printk+0x41/0x46 [<ffffffffa0265fb0>] ? pg_init_done+0x0/0x220 [dm_multipath] [<ffffffff814c7c23>] printk+0x41/0x46 [<ffffffffa02c8c7e>] clariion_activate+0xde/0x1b0 [scsi_dh_emc] [<ffffffffa0265fb0>] ? pg_init_done+0x0/0x220 [dm_multipath] [<ffffffff8135277b>] scsi_dh_activate+0x9b/0xe0 [<ffffffffa0265f80>] ? activate_path+0x0/0x30 [dm_multipath] [<ffffffffa0265fae>] activate_path+0x2e/0x30 [dm_multipath] [<ffffffff8108c610>] worker_thread+0x170/0x2a0 [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8108c4a0>] ? worker_thread+0x0/0x2a0 [<ffffffff81091936>] kthread+0x96/0xa0 [<ffffffff810141ca>] child_rip+0xa/0x20 [<ffffffff810918a0>] ? kthread+0x0/0xa0 [<ffffffff810141c0>] ? child_rip+0x0/0x20 Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 RIP [<ffffffff8125f9d9>] strnlen+0x9/0x40 RSP <ffff880232de3b50> do_IRQ: 0.169 No irq handler for vector (irq -1) do_IRQ: 0.73 No irq handler for vector (irq -1) \uffffirq 16: nobody cared (try booting with the "irqpoll" option) handlers: [<ffffffff81383700>] (usb_hcd_irq+0x0/0x90) Disabling IRQ #16 lpfc 0000:04:04.0: 1:1303 Link Up Event x1 received Data: x1 xf7 x8 x0 x0 x0 0 [drm:radeon_legacy_get_ext_tmds_info_from_combios] *ERROR* Unsupported gpio 0 [drm:radeon_combios_external_tmds_setup] *ERROR* Unknown id 1 [drm:radeon_combios_external_tmds_setup] *ERROR* Unknown id 3 [drm:radeon_combios_external_tmds_setup] *ERROR* Unknown id 7 lpfc 0000:04:04.0: 1:(0):2754 PRLI failure DID:011600 Status:x9/x90000 Kernel panic - not syncing: Waiting TID List traversal panic occurred, switching back to text console The panic appears to occur in scsi_dh_emc.c in clariion_activate() where sdev_printk() is called after a failure. It seems that the TRESPASS command is being attempted on a path that is in the process of being failed. The device seems to be cleared before it can gather the information to log. It seems that in the failed cases it should just call the fh() and return. I build a module that does that and I could not then reproduce the failure (I was able to successfully perform the same test multiple times without a failure). I found in the bugzilla database 467709 that also deals with TRESPASS issues. I tried the updated device-mapper-multipath referenced in that bugzilla (0.4.9-37.el6.bz467709). With this version the problem did not seem to duplicate as easily but I was still able to duplicate the panic after several attempts. Version-Release number of selected component (if applicable): I am using stock RHEL 6.0 kernel/modules initially to duplicate the problem and duplicated the failure with 0.4.9-37.el6.bz467709 version of device-mapper-multipath and kpartx. In my testing I have the LUN locked with SCSI-3 persistent reservations. I am not sure if this is required to duplicate. I will do some testing without the reservations to see if the problem continues to duplicate. How reproducible: With stock RHEL 6.0 packages the panic occurs probably 90% of the time when there are multiple failures at the same time where an active path is one of them. Steps to Reproduce: 1. start activity to the Clariion array using device-mapper-multipath 2. power off fibre switch where the switch has one connection to the server and two connections to the Clariion array. Note that there is a second fibre channel switch with similar connections for from device-mapper-multipath's perspective there are 4 paths to the storage. Also note that the fibre switch that is powered off needs to be the one where the IO path is active. 3. when device-mapper-multipath detects the failed path it will try to activate the next path. Actual results: System panics when the path it tries to activate also fails. Expected results: The next path should be activated and IO's rerouted. Additional info: With the stock RHEL 6.0 bits I used the following multipath.conf setting for the Clariion: device { vendor "DGC" product ".*" product_blacklist "LUNZ" path_grouping_policy group_by_prio getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n" path_selector "round-robin 0" path_checker emc_clariion features "1 queue_if_no_path" #hardware_handler "0" hardware_handler "3 emc 0 1" prio emc #failback immediate failback 10 rr_weight uniform #no_path_retry 60 no_path_retry 12 rr_min_io 1000 } With the 0.4.9-37.el6.bz467709 version I used the following multipath.conf: device { vendor "DGC" product ".*" product_blacklist "LUNZ" path_grouping_policy group_by_prio getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n" path_selector "round-robin 0" path_checker emc_clariion features "1 queue_if_no_path" #hardware_handler "0" hardware_handler "3 emc 0 1" prio emc #failback immediate #failback 10 failback followover rr_weight uniform #no_path_retry 60 no_path_retry 12 rr_min_io 1000 }
Created attachment 480763 [details] serial port output captured from panic after FC switch powered off I duplicated the scenario where I powered a FC switch off and the system paniced. Note that the panic was different in that I did not get a stack trace. I have the log level set to 7 (echo 7 > /proc/sys/kernel/printk). I have the kernel configured to output to the serial console: title Red Hat Enterprise Linux (2.6.32-71.el6.x86_64) with serial console root (hd0,0) kernel /vmlinuz-2.6.32-71.el6.x86_64 ro root=/dev/mapper/vg_fiji-lv_root rd_LVM_LV=vg_fiji/lv_root rd_LVM_LV=vg_fiji/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=auto rhgb quiet console=ttyS0,9600n8 console=tty0 initrd /initramfs-2.6.32-71.el6.x86_64.img
I turned reservations off and was able to duplicate the same panic in clariion_activate. I will put my patched scsi_dh_emc.ko and verify that with the new device-mapper-multipath this still resolves the panic.
(In reply to comment #3) > I turned reservations off and was able to duplicate the same panic in > clariion_activate. I will put my patched scsi_dh_emc.ko and verify that with > the new device-mapper-multipath this still resolves the panic. OK, thanks for all your work on this. Please attach your fix to this bz once you've verified all works as expected.
Created attachment 480855 [details] patch to scsi_dh_emc.c to avoid panic in clariion_activate() So for with the attached patch to scsi_dh_emc.c I have not had the panic. I am instead seeing a core on multipathd on occasion though. I have not looked into the core yet. If the daemon is restarted then everything continues OK. The patch is to simply avoid printing the message in clariion_activate when there is a failure. This is perhaps fixing the symptom rather than the problem. The issue here is that it looks like the code it touching the device after it has been freed. Should there be a lock held to avoid this so rather than changing the message the right fix is to lock the device so it does not disappear while we are working on it?
Created attachment 481021 [details] gdb output from multipathd coredump during double-path failure testing either by physically powering off a fibre channel switch or simulating it the multipathd will sometimes coredump. The failure is a segmentation fault in sysfs_get_timeout() in libmultipath/discovery.c Should I write a separate bugzilla for this?
(In reply to comment #6) > Created attachment 481021 [details] > gdb output from multipathd coredump > > during double-path failure testing either by physically powering off a fibre > channel switch or simulating it the multipathd will sometimes coredump. The > failure is a segmentation fault in sysfs_get_timeout() in > libmultipath/discovery.c > > Should I write a separate bugzilla for this? I created bug 680480, thanks.
Has there been any work on this issue at Red Hat? Is the patch I provided accepted as the fix or is there a better fix that Red Hat will implement? Thanks
I posted a slightly revised patch (based on comment#5) to linux-scsi: http://marc.info/?l=linux-scsi&m=130073744426591&w=2 It'll hopefully get included in 2.6.39. This patch should be posted for inclusion in RHEL6.1 by this Thursday (making it eligible for RHEL6.1 snapshot2).
(In reply to comment #9) > I posted a slightly revised patch (based on comment#5) to linux-scsi: > http://marc.info/?l=linux-scsi&m=130073744426591&w=2 > > It'll hopefully get included in 2.6.39. This patch should be posted for > inclusion in RHEL6.1 by this Thursday (making it eligible for RHEL6.1 > snapshot2). The above referenced patch has proven unnecessary. But the following refcounting fix is needed (discovered while chasing this issue): http://marc.info/?l=linux-scsi&m=130074943509100&w=2
Patch(es) available on kernel-2.6.32-128.el6
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Deleting an SCSI (Small Computer System Interface) device attached to a device handler caused applications running in user space, which were performing I/O operations on that device, to become unresponsive. This was due to the fact that the SCSI device handler's activation did not propagate the SCSI device deletion via an error code and a callback to the Device-Mapper Multipath. With this update, deletion of an SCSI device attached to a device handler is properly handled and no longer causes certain applications to become unresponsive.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html