Bug 597789
Summary: rhel: multipathd[2424]: segfault at 00000000000004a4 rip 000000000041eb98 rsp 00000000
Product: Red Hat Enterprise Linux 5
Reporter: yeylon <yeylon>
Component: device-mapper-multipath
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA
QA Contact: Gris Ge <fge>
Severity: high
Docs Contact:
Priority: high
Version: 5.5
CC: abaron, agk, bmarzins, bmr, christophe.varoqui, coughlan, cpelland, cward, dwysocha, fge, heinzm, junichi.nomura, kueda, llim, lmb, mbroz, mshao, prajnoha, prockai, raud, rmusil, srevivo
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: Storage
Fixed In Version: device-mapper-multipath-0.4.7-36.el5
Doc Type: Bug Fix
Doc Text:
Adding and removing a path in quick succession could cause the path to be removed while multipathd was still using it, leading to a segmentation fault. This has been resolved: multipathd no longer crashes when a path is added and quickly removed again.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-01-13 23:03:55 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 607911
Attachments:
Description
yeylon@redhat.com
2010-05-30 14:04:50 UTC
Sorry, I'll add the information here:

Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Linux blond-vdsd.qa.lab.tlv.redhat.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

[root@blond-vdsd ~]# rpm -qa | grep mapper
device-mapper-event-1.02.39-1.el5_5.2
device-mapper-multipath-0.4.7-34.el5_5.1
device-mapper-1.02.39-1.el5_5.2
device-mapper-1.02.39-1.el5_5.2

I have a scenario where, when trying to add a second host to a data center using rhevm, after the initial attempt the host moves to non-operational (connection to the storage fails). After the second attempt, multipathd gets a segfault:

multipathd[2424]: segfault at 00000000000004a4 rip 000000000041eb98 rsp 00000000

1. the data center is iscsi
2. it has 2 iscsi data domains
3. one NFS ISO domain
4. one NFS export domain
5. enter the first host into the domain
6. see that it gets the SPM role
7. add the second host
8. first it will fail
9. activate it once again
10. check dmesg

Created attachment 418032 [details]
messages
Created attachment 418033 [details]
vds log
Similar issue, similar environment (/var/log/messages output):

May 30 18:43:52 brown-vdsb kernel: device-mapper: table: 253:5: multipath: error getting device
May 30 18:43:52 brown-vdsb kernel: device-mapper: ioctl: error adding target to table
May 30 18:43:52 brown-vdsb kernel: device-mapper: table: 253:5: multipath: error getting device
May 30 18:43:52 brown-vdsb kernel: device-mapper: ioctl: error adding target to table
May 30 18:43:52 brown-vdsb kernel: scsi 6:0:0:0: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:64.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:1: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:48.
May 30 18:43:52 brown-vdsb kernel: scsi 7:0:0:9: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 65:80.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:2: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:16.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:3: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:80.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:4: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:96.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:5: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:112.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:6: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:128.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:7: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:144.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:8: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:160.
May 30 18:43:53 brown-vdsb kernel: scsi 5:0:0:10: rejecting I/O to dead device
May 30 18:43:53 brown-vdsb kernel: device-mapper: multipath: Failing path 8:192.
May 30 18:43:53 brown-vdsb multipathd: 14f504e46494c45006173726a4d6a2d6e6665462d4e474f32: failed to access path sdn
May 30 18:43:53 brown-vdsb kernel: multipathd[1989]: segfault at 00000000000044b4 rip 000000000041ec9c rsp 0000000041dfee90 error 4

Adding the blocker flag - and once approved we'll probably want it in 5.5z - but first of all we'd be happy to understand the implications. I'm seeing it in a pure iscsi environment.

Attached gdb to multipathd and waited for it to crash. Got:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x4170d940 (LWP 8827)]
0x000000000041ec9c in select_pgfailback (mp=0x1422850) at propsel.c:71
71          if (mp->hwe && mp->hwe->pgfailback != FAILBACK_UNDEF) {
(gdb) bt
Cannot access memory at address 0x4170ced8
(gdb) p mp
$1 = (struct multipath *) 0x1422850
(gdb) p mp->hwe
Cannot access memory at address 0x14231f0

and just before that crash, I see in /var/log/messages:

Jun 3 17:18:02 blond-vdsd multipathd: 14f504e46494c45003554695756332d71696e732d53545961: failed to access path sdh

We cannot reproduce this issue on rhev-hypervisor-5.5-2.2.(4.1), using rhevm(sm74). First I have hostA "up" in iscsi storage, then register hostB in this iscsi storage, and hostB moves to "up".

Note: this is device-mapper-multipath-0.4.7-34.el5, not device-mapper-multipath-0.4.7-34.el5_5.1:

[root@amd-1216-8-4 ~]# rpm -qa | grep mapper
device-mapper-multipath-0.4.7-34.el5
device-mapper-1.02.39-1.el5_5.2
device-mapper-event-1.02.39-1.el5_5.2

I followed these steps to check:

1.
There is only one iscsi data domain, and one NFS ISO domain:
(1) Create an iscsi data center
(2) Create a cluster that belongs to the iscsi data center
(3) Register hostA (rhev-h) into this iscsi data center
(4) Attach an iscsi data domain to this iscsi data center
(5) Attach an NFS ISO domain to this iscsi data center
(6) Then you can see the rhev-h has the SPM role
(7) Register hostB (the second rhev-h) into this iscsi data center. Then you can see hostB move to "up" status
(8) Run the "dmesg" command; cannot find "multipathd[2424]: segfault"
(9) Run the "cat /var/log/messages" command; cannot find "multipath: Failing path **"
(10) Then remove hostB from rhevm, and re-register 3 times. Each time, hostB moves to "up" status in the iscsi data center
(11) Re-install hostB, also register 3 times; hostB also moves to "up"

2. There are two iscsi data domains, one NFS ISO domain, and one NFS Export domain:
(1) Create an iscsi data center
(2) Create a cluster that belongs to the iscsi data center
(3) Register a rhev-h into this iscsi data center
(4) Attach two iscsi data domains to this iscsi data center
(5) Attach an NFS ISO domain to this iscsi data center
(6) Attach an NFS Export domain to this iscsi data center
(7) Then you can see the rhev-h has the SPM role
(8) Register hostB (the second rhev-h) into this iscsi data center. Then you can see hostB move to "up" status
(9) Run the "dmesg" command; cannot find "multipathd[2424]: segfault"
(10) Run the "cat /var/log/messages" command; cannot find "multipath: Failing path **"
(11) Then remove hostB from rhevm, and re-register 3 times. Each time, hostB moves to "up" status in the iscsi data center
(12) Re-install hostB, also register 3 times; hostB also moves to "up"

Getting the following, which I'm not sure is problematic:

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x41eed940 (LWP 13991)]
0x00000036cfc0bc05 in pthread_sigmask (how=<value optimized out>, newmask=0x41eece00, oldmask=<value optimized out>) at ../nptl/sysdeps/pthread/pthread_sigmask.c:49
49        int result = INTERNAL_SYSCALL (rt_sigprocmask, err, 4, how, newmask,
(gdb) bt
#0  0x00000036cfc0bc05 in pthread_sigmask (how=<value optimized out>, newmask=0x41eece00, oldmask=<value optimized out>) at ../nptl/sysdeps/pthread/pthread_sigmask.c:49
#1  0x000000000042f6b0 in unblock_signals () at waiter.c:85
#2  0x000000000042f915 in waiteventloop (waiter=0x1764a010) at waiter.c:126
#3  0x000000000042fc3e in waitevent (et=0x1764a010) at waiter.c:193
#4  0x00000036cfc0673d in start_thread (arg=<value optimized out>) at pthread_create.c:301
#5  0x00000036cf0d3d1d in clone () from /lib64/libc.so.6
(gdb) c
Continuing.
[New Thread 0x41efd940 (LWP 1394)]
Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x41d6b940 (LWP 16451)]
0x00000036cfc0bc05 in pthread_sigmask (how=<value optimized out>, newmask=0x41d6ae00, oldmask=<value optimized out>) at ../nptl/sysdeps/pthread/pthread_sigmask.c:49
49        int result = INTERNAL_SYSCALL (rt_sigprocmask, err, 4, how, newmask,
(gdb) c

Comment #9 looks perfectly normal if a multipath device is being removed. It's stopping the waiter thread for the device.

I have an idea of what call path might be involved when this happens, but no idea why it is happening. If you could get a core dump, that would be ideal. Otherwise, I'll work up a debug package that will hopefully help me figure out why things are going wrong.

Also, I see in comment #8 that this didn't recreate with device-mapper-multipath-0.4.7-34.el5, and according to earlier comments, it did with device-mapper-multipath-0.4.7-34.el5_5.1. If you could test and see whether changing this package changes whether or not the error occurs, that would be really helpful.

I have a test package that will provide some debugging information to help track down this issue.
You can get it at http://porkchop.devel.redhat.com/brewroot/scratch/bmarzins/task_2519472/

This package also has the fix for 584742. If you could install this before you try to reproduce this again, it will print out additional information to /var/log/messages. The only annoyance with this package is that it has a 1 second sleep that gets called when maps are created or reloaded. This is to make sure that the debug messages get flushed out of the log daemon buffer to syslog before you hit the segfault.

Looking at blond for 505669, I noticed that this had happened as well, and finally figured out what is causing this, although I still need to figure out exactly how to fix it. But I don't need any more debugging information. What is happening is that the path that is getting added gets removed in verify_paths, and then multipathd is still using that path.

OK, I should have a fix for this. Can you try running the device-mapper-multipath-0.4.7-34.el5.2_597789_2 packages? You can download them at http://porkchop.devel.redhat.com/brewroot/scratch/bmarzins/task_2533442/ These packages also include the fix for 584742. If you could install them on your machines and verify that you can no longer reproduce these two issues, that would be great.

I've built an official package with this fix.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Adding and removing a path in quick succession could cause the path to be removed while multipathd was still using it, leading to a segmentation fault. This has been resolved: multipathd no longer crashes when a path is added and quickly removed again.

Question for Yaniv Kaul: Can you please confirm that the steps described in comment#8 should reproduce the issue?
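The cause described above - a newly added path freed by verify_paths while another part of the daemon still holds a pointer to it - is a classic use-after-free. The sketch below is a toy illustration, not the actual multipath-tools source: the structs are simplified and the function names (store_path, find_path, verify_paths, add_path_fixed) are stand-ins for the real ones. It shows the defensive pattern the fix implies: re-look the path up after any call that may have removed it, instead of reusing the raw pointer.

```c
/* Toy sketch of the add/remove race -- NOT the real multipathd code. */
#include <stdlib.h>
#include <string.h>

struct path {
    char dev[16];
    int failed;            /* set when the kernel rejects I/O to the path */
};

struct vectors {
    struct path *slots[8]; /* toy fixed-size path vector */
};

/* Add a path to the vector and return a pointer to it. */
static struct path *store_path(struct vectors *vecs, const char *dev)
{
    for (int i = 0; i < 8; i++) {
        if (!vecs->slots[i]) {
            struct path *pp = calloc(1, sizeof(*pp));
            if (!pp)
                return NULL;
            strncpy(pp->dev, dev, sizeof(pp->dev) - 1);
            vecs->slots[i] = pp;
            return pp;
        }
    }
    return NULL;
}

/* Look a path up by device name. */
static struct path *find_path(struct vectors *vecs, const char *dev)
{
    for (int i = 0; i < 8; i++)
        if (vecs->slots[i] && !strcmp(vecs->slots[i]->dev, dev))
            return vecs->slots[i];
    return NULL;
}

/* Drop (and free) any path that can no longer be accessed. */
static void verify_paths(struct vectors *vecs)
{
    for (int i = 0; i < 8; i++) {
        if (vecs->slots[i] && vecs->slots[i]->failed) {
            free(vecs->slots[i]);
            vecs->slots[i] = NULL;
        }
    }
}

/* The buggy pattern kept using `pp` after verify_paths() may have
 * freed it.  The fixed pattern shown here re-looks the path up and
 * bails out if it is gone.  Returns 0 on success, 1 if the path
 * vanished, -1 on allocation failure. */
static int add_path_fixed(struct vectors *vecs, const char *dev)
{
    struct path *pp = store_path(vecs, dev);
    if (!pp)
        return -1;
    pp->failed = 1;            /* simulate the path dying immediately */
    verify_paths(vecs);        /* may free pp */
    pp = find_path(vecs, dev); /* re-lookup instead of reusing pp */
    if (!pp)
        return 1;              /* path vanished: stop, don't touch it */
    return 0;
}
```

When the path dies in the window between store_path and verify_paths, add_path_fixed returns 1 and never dereferences the stale pointer; that window is exactly where the unpatched daemon crashed.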
(In reply to comment #17)
> Question for Yaniv Kaul: Can you please confirm that the steps described in
> comment#8 should lead to reproduce the issue?

I prefer the steps in comment 1.

Yaniv Eylon: can you please confirm whether or not this patch fixed the problem you reported, as I'm unable to verify it myself. Thanks.

Code review: the proper patch is present and sane, and the package compiles with it.

(In reply to comment #20)
> Yaniv Eylon: can you please confirm that you are or that you are not able to
> confirm that this patch fixed the problem you have reported as I'm unable to do
> it myself.
>
> Thanks.

I have not been able to reproduce this issue for the last few weeks.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0074.html
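For reference, the gdb transcript earlier in this report faults inside select_pgfailback() on the line `if (mp->hwe && mp->hwe->pgfailback != FAILBACK_UNDEF)`. The sketch below uses simplified stand-in types, not the real propsel.c, to show why that check cannot defend against the bug: testing mp->hwe already reads through mp, so if mp itself points at freed memory the fault happens before any NULL check can help. The check only protects against hwe legitimately being NULL on a live multipath struct.

```c
/* Simplified stand-ins for the multipath-tools types -- the real
 * struct multipath and hwentry have many more fields. */
#include <stddef.h>

enum {
    FAILBACK_UNDEF     = -1,
    FAILBACK_MANUAL    = 0,
    FAILBACK_IMMEDIATE = 1,
};

struct hwentry {
    int pgfailback;        /* failback policy from the hardware table */
};

struct multipath {
    struct hwentry *hwe;   /* matching hardware-table entry, may be NULL */
    int pgfailback;        /* resolved failback policy */
};

/* Resolve the failback policy the way propsel.c does: prefer the
 * hardware-table entry, otherwise fall back to a default.  This is
 * only safe while `mp` itself is a live allocation: the `mp->hwe`
 * test dereferences `mp`, which is where the reported segfault hit
 * once the struct had been freed out from under the caller. */
static int select_pgfailback(struct multipath *mp)
{
    if (mp->hwe && mp->hwe->pgfailback != FAILBACK_UNDEF)
        mp->pgfailback = mp->hwe->pgfailback;
    else
        mp->pgfailback = FAILBACK_MANUAL;
    return mp->pgfailback;
}
```

On a live struct this behaves as intended for both the hwe and no-hwe cases; the crash in this bug came from calling it with an mp that had already been freed, which no check inside the function can detect.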