Bug 597789

Summary: rhel: multipathd[2424]: segfault at 00000000000004a4 rip 000000000041eb98 rsp 00000000
Product: Red Hat Enterprise Linux 5 Reporter: yeylon <yeylon>
Component: device-mapper-multipath Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA QA Contact: Gris Ge <fge>
Severity: high Docs Contact:
Priority: high    
Version: 5.5 CC: abaron, agk, bmarzins, bmr, christophe.varoqui, coughlan, cpelland, cward, dwysocha, fge, heinzm, junichi.nomura, kueda, llim, lmb, mbroz, mshao, prajnoha, prockai, raud, rmusil, srevivo
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: Storage
Fixed In Version: device-mapper-multipath-0.4.7-36.el5 Doc Type: Bug Fix
Doc Text:
Adding and removing a path in quick succession could cause the path to be removed while multipathd was still using it, which led to a segmentation fault. This has been resolved, and multipathd no longer crashes when a path is added and quickly removed again.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-01-13 23:03:55 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 607911    
Attachments:
Description Flags
messages none
vds log none

Description yeylon@redhat.com 2010-05-30 14:04:50 UTC
Additional info:

SCSI device sda: 156250000 512-byte hdwr sectors (80000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
device-mapper: table: 253:14: multipath: error getting device
device-mapper: ioctl: error adding target to table
device-mapper: table: 253:14: multipath: error getting device
device-mapper: ioctl: error adding target to table
scsi 5:0:0:1: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:128.
scsi 5:0:0:2: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:144.
scsi 5:0:0:3: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:160.
scsi 5:0:0:4: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:176.
scsi 6:0:0:6: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:208.
scsi 5:0:0:5: rejecting I/O to dead device
device-mapper: multipath: Failing path 8:192.
multipathd[2424]: segfault at 00000000000004a4 rip 000000000041eb98 rsp 00000000

Comment 1 yeylon@redhat.com 2010-05-30 14:11:14 UTC
Sorry, I'll add the information here:

Red Hat Enterprise Linux Server release 5.5 (Tikanga) - Linux blond-vdsd.qa.lab.tlv.redhat.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

[root@blond-vdsd ~]# rpm -qa | grep mapper
device-mapper-event-1.02.39-1.el5_5.2
device-mapper-multipath-0.4.7-34.el5_5.1
device-mapper-1.02.39-1.el5_5.2
device-mapper-1.02.39-1.el5_5.2

I have a scenario in which I try to add a second host to a data center using rhevm. After the initial attempt, the host moves to non-operational (the connection to the storage fails).

After the second attempt, multipathd segfaults:

multipathd[2424]: segfault at 00000000000004a4 rip 000000000041eb98 rsp 00000000

1. The data center is iSCSI.
2. I gave it two iSCSI data domains,
3. one NFS ISO domain, and
4. one NFS export domain.

5. Add the first host to the domain.
6. See that it gets the SPM role.
7. Add the second host.
8. At first it will fail.
9. Activate it once again.
10. Check dmesg.
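
For step 10, a small script can pick the segfault line out of dmesg or /var/log/messages text. This is only a sketch: the regular expression is fitted to the two segfault lines quoted in this report, and the names SEGFAULT_RE and find_segfaults are made up for illustration.

```python
import re

# Pattern fitted to the segfault lines quoted in this report, for example:
#   multipathd[2424]: segfault at 00000000000004a4 rip 000000000041eb98 rsp 00000000
# The trailing "error N" field is optional because only one quoted line has it.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+)"
    r" rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+)(?: error (?P<error>\d+))?"
)


def find_segfaults(log_text, comm="multipathd"):
    """Return one dict per segfault line logged for the given process name."""
    hits = []
    for line in log_text.splitlines():
        m = SEGFAULT_RE.search(line)
        if m is None or m.group("comm") != comm:
            continue
        hits.append({
            "pid": int(m.group("pid")),
            "addr": int(m.group("addr"), 16),  # faulting address
            "rip": int(m.group("rip"), 16),    # instruction pointer at the fault
            "error": None if m.group("error") is None else int(m.group("error")),
        })
    return hits
```

Pass it the captured output of dmesg or the contents of /var/log/messages; an empty list means no matching segfault line was found.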

Comment 2 yeylon@redhat.com 2010-05-30 14:11:49 UTC
Created attachment 418032 [details]
messages

Comment 3 yeylon@redhat.com 2010-05-30 14:13:15 UTC
Created attachment 418033 [details]
vds log

Comment 4 Yaniv Kaul 2010-05-30 16:08:24 UTC
Similar issue, similar environment (/var/log/messages output)
May 30 18:43:52 brown-vdsb kernel: device-mapper: table: 253:5: multipath: error getting device
May 30 18:43:52 brown-vdsb kernel: device-mapper: ioctl: error adding target to table
May 30 18:43:52 brown-vdsb kernel: device-mapper: table: 253:5: multipath: error getting device
May 30 18:43:52 brown-vdsb kernel: device-mapper: ioctl: error adding target to table
May 30 18:43:52 brown-vdsb kernel: scsi 6:0:0:0: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:64.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:1: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:48.
May 30 18:43:52 brown-vdsb kernel: scsi 7:0:0:9: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 65:80.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:2: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:16.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:3: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:80.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:4: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:96.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:5: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:112.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:6: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:128.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:7: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:144.
May 30 18:43:52 brown-vdsb kernel: scsi 5:0:0:8: rejecting I/O to dead device
May 30 18:43:52 brown-vdsb kernel: device-mapper: multipath: Failing path 8:160.
May 30 18:43:53 brown-vdsb kernel: scsi 5:0:0:10: rejecting I/O to dead device
May 30 18:43:53 brown-vdsb kernel: device-mapper: multipath: Failing path 8:192.
May 30 18:43:53 brown-vdsb multipathd: 14f504e46494c45006173726a4d6a2d6e6665462d4e474f32: failed to access path sdn
May 30 18:43:53 brown-vdsb kernel: multipathd[1989]: segfault at 00000000000044b4 rip 000000000041ec9c rsp 0000000041dfee90 error 4
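
To relate the kernel's "Failing path MAJOR:MINOR" lines to the sdX names that multipathd prints, the numbers can be converted with a sketch like the one below. It assumes the standard Linux SCSI disk numbering (majors 8, 65-71 and 128-135, with 16 minors per disk); the name sd_name is made up for illustration.

```python
# Block majors assigned to SCSI disks; each major covers 16 disks,
# and each disk owns 16 consecutive minors (whole disk plus 15 partitions).
SD_MAJORS = [8] + list(range(65, 72)) + list(range(128, 136))


def sd_name(major, minor):
    """Convert a major:minor pair from the kernel log into an sd device name."""
    if major not in SD_MAJORS:
        raise ValueError("not a SCSI disk major: %d" % major)
    index = SD_MAJORS.index(major) * 16 + minor // 16
    partition = minor % 16

    # Disk letters count a..z, then aa, ab, ... (bijective base 26).
    letters = ""
    n = index + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("a") + rem) + letters

    name = "sd" + letters
    return name + str(partition) if partition else name


print(sd_name(8, 128))  # sdi
print(sd_name(65, 80))  # sdv
```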

Comment 5 Yaniv Kaul 2010-05-31 11:55:16 UTC
Adding the blocker flag - and once approved we'll probably want it in 5.5z - but first of all we'd be happy to understand the implications.

I'm seeing it in a pure iscsi environment.

Comment 6 Yaniv Kaul 2010-06-03 14:37:00 UTC
Attached gdb to multipathd and waited for it to crash. Got:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x4170d940 (LWP 8827)]
0x000000000041ec9c in select_pgfailback (mp=0x1422850) at propsel.c:71
71              if (mp->hwe && mp->hwe->pgfailback != FAILBACK_UNDEF) {
(gdb) bt
Cannot access memory at address 0x4170ced8
(gdb) p mp
$1 = (struct multipath *) 0x1422850
(gdb) p mp->hwe
Cannot access memory at address 0x14231f0

Comment 7 Yaniv Kaul 2010-06-03 14:52:20 UTC
And just before that crash, I see in /var/log/messages:
Jun  3 17:18:02 blond-vdsd multipathd: 14f504e46494c45003554695756332d71696e732d53545961: failed to access path sdh

Comment 8 XinSun 2010-06-11 11:45:40 UTC
We cannot reproduce this issue on rhev-hypervisor-5.5-2.2.(4.1), using rhevm (sm74).
First I have hostA "up" in iSCSI storage, then I register hostB in the same iSCSI storage, and hostB moves to "up".
Note: this is with "device-mapper-multipath-0.4.7-34.el5", not "device-mapper-multipath-0.4.7-34.el5_5.1".

[root@amd-1216-8-4 ~]# rpm -qa | grep mapper
device-mapper-multipath-0.4.7-34.el5
device-mapper-1.02.39-1.el5_5.2
device-mapper-event-1.02.39-1.el5_5.2

I followed these steps to check:

1. With only one iSCSI data domain and one NFS ISO domain:
(1) Create an iSCSI data center.
(2) Create a cluster that belongs to the iSCSI data center.
(3) Register hostA (rhev-h) into this iSCSI data center.
(4) Attach an iSCSI data domain to this iSCSI data center.
(5) Attach an NFS ISO domain to this iSCSI data center.
(6) The rhev-h host then has the SPM role.
(7) Register hostB (the second rhev-h) into this iSCSI data center. hostB moves to "up" status.
(8) Run the "dmesg" command; "multipathd[2424]: segfault" does not appear.
(9) Run "cat /var/log/messages"; "multipath: Failing path **" does not appear.
(10) Remove hostB from rhevm and register it again 3 times. Each time, hostB moves to "up" status in the iSCSI data center.
(11) Reinstall hostB and register it 3 more times; hostB also moves to "up".


2. With two iSCSI data domains, one NFS ISO domain, and one NFS Export domain:

(1) Create an iSCSI data center.
(2) Create a cluster that belongs to the iSCSI data center.
(3) Register an rhev-h host into this iSCSI data center.
(4) Attach two iSCSI data domains to this iSCSI data center.
(5) Attach an NFS ISO domain to this iSCSI data center.
(6) Attach an NFS Export domain to this iSCSI data center.
(7) The rhev-h host then has the SPM role.
(8) Register hostB (the second rhev-h) into this iSCSI data center. hostB moves to "up" status.
(9) Run the "dmesg" command; "multipathd[2424]: segfault" does not appear.
(10) Run "cat /var/log/messages"; "multipath: Failing path **" does not appear.
(11) Remove hostB from rhevm and register it again 3 times. Each time, hostB moves to "up" status in the iSCSI data center.
(12) Reinstall hostB and register it 3 more times; hostB also moves to "up".

Comment 9 Yaniv Kaul 2010-06-13 09:51:06 UTC
Getting these, which I'm not sure are problematic:
Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x41eed940 (LWP 13991)]
0x00000036cfc0bc05 in pthread_sigmask (how=<value optimized out>,
    newmask=0x41eece00, oldmask=<value optimized out>)
    at ../nptl/sysdeps/pthread/pthread_sigmask.c:49
49        int result = INTERNAL_SYSCALL (rt_sigprocmask, err, 4, how, newmask,
(gdb) bt
#0  0x00000036cfc0bc05 in pthread_sigmask (how=<value optimized out>,
    newmask=0x41eece00, oldmask=<value optimized out>)
    at ../nptl/sysdeps/pthread/pthread_sigmask.c:49
#1  0x000000000042f6b0 in unblock_signals () at waiter.c:85
#2  0x000000000042f915 in waiteventloop (waiter=0x1764a010) at waiter.c:126
#3  0x000000000042fc3e in waitevent (et=0x1764a010) at waiter.c:193
#4  0x00000036cfc0673d in start_thread (arg=<value optimized out>)
    at pthread_create.c:301
#5  0x00000036cf0d3d1d in clone () from /lib64/libc.so.6
(gdb) c
Continuing.
[New Thread 0x41efd940 (LWP 1394)]

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x41d6b940 (LWP 16451)]
0x00000036cfc0bc05 in pthread_sigmask (how=<value optimized out>,
    newmask=0x41d6ae00, oldmask=<value optimized out>)
    at ../nptl/sysdeps/pthread/pthread_sigmask.c:49
49        int result = INTERNAL_SYSCALL (rt_sigprocmask, err, 4, how, newmask,
(gdb) c

Comment 10 Ben Marzinski 2010-06-14 20:52:57 UTC
Comment #9 looks perfectly normal if a multipath device is being removed.  It's stopping the waiter thread for the device.

I have an idea of which call path might be involved when this happens, but no idea why it is happening.  If you could get a core dump, that would be ideal.
Otherwise, I'll work up a debug package that will hopefully help me figure out why things are going wrong.

Also, I see in comment #8 that this didn't recreate with

device-mapper-multipath-0.4.7-34.el5

and according to earlier comments, it did with

device-mapper-multipath-0.4.7-34.el5_5.1

If you could test whether switching between these two packages changes whether the error occurs, that would be really helpful.

Comment 11 Ben Marzinski 2010-06-16 03:44:20 UTC
I have a test package that will provide some debugging information to help track down this issue.  You can get it at

http://porkchop.devel.redhat.com/brewroot/scratch/bmarzins/task_2519472/

This package also has the fix for 584742.  If you could install this before you try to reproduce this again, it will print out additional information to /var/log/messages.

The only annoyance with this package is that it has a 1 second sleep that gets called when maps are created or reloaded.  This is to make sure that the debug messages get flushed out of the log daemon buffer to syslog before you hit the segfault.

Comment 12 Ben Marzinski 2010-06-18 00:50:55 UTC
Looking at blond for 505669, I noticed that this had happened there as well, and finally figured out what is causing it, although I still need to figure out exactly how to fix it.  I don't need any more debugging information.

What is happening is that the path that is getting added gets removed in verify paths, but multipathd continues to use that path afterward.
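
The failure mode described here can be modelled in a few lines. This is only an illustrative sketch in Python, not multipathd code: every name below (Path, verify_paths, add_path_buggy, add_path_checked, use) is hypothetical, the freed flag stands in for freeing the structure in C, and the guard shown is one plausible way to avoid the problem rather than the patch that was actually shipped.

```python
class Path:
    """Stand-in for a path structure; 'freed' models the memory being released."""

    def __init__(self, dev):
        self.dev = dev
        self.freed = False


def verify_paths(pathvec, accessible):
    """Drop every path whose device can no longer be accessed."""
    for path in list(pathvec):
        if path.dev not in accessible:
            pathvec.remove(path)
            path.freed = True  # in C this memory would now be invalid


def use(path):
    """Any later access to the path; a freed path models the segfault."""
    if path.freed:
        raise RuntimeError("use after free: " + path.dev)
    return path.dev


def add_path_buggy(pathvec, dev, accessible):
    path = Path(dev)
    pathvec.append(path)
    verify_paths(pathvec, accessible)  # may remove the path just added
    return use(path)                   # still uses the stale reference


def add_path_checked(pathvec, dev, accessible):
    path = Path(dev)
    pathvec.append(path)
    verify_paths(pathvec, accessible)
    if path not in pathvec:            # the path vanished during verification
        return None
    return use(path)
```

With a device that disappears right after being added, add_path_buggy([], "sdn", set()) raises, while add_path_checked([], "sdn", set()) returns None and leaves the caller to treat the add as failed.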

Comment 13 Ben Marzinski 2010-06-19 02:09:34 UTC
OK, I should have a fix for this.  Can you try running the
device-mapper-multipath-0.4.7-34.el5.2_597789_2 packages?

You can download them at
http://porkchop.devel.redhat.com/brewroot/scratch/bmarzins/task_2533442/

These packages also include the fix for 584742. If you could install them on your machines, and verify that you can no longer reproduce these two issues, that would be great.

Comment 14 Ben Marzinski 2010-06-23 19:23:42 UTC
I've built an official package with this fix.

Comment 16 Jaromir Hradilek 2010-06-29 08:56:18 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Adding and removing a path in quick succession could cause the path to be removed while multipathd was still using it, which led to a segmentation fault. This has been resolved, and multipathd no longer crashes when a path is added and quickly removed again.

Comment 17 michal novacek 2010-07-08 10:26:45 UTC
Question for Yaniv Kaul: Can you please confirm that the steps described in comment#8 should reproduce the issue?

Comment 18 Yaniv Kaul 2010-07-09 11:33:20 UTC
(In reply to comment #17)
> Question for Yaniv Kaul: Can you please confirm that the steps described in
> comment#8 should reproduce the issue?

I prefer the steps in comment 1.

Comment 20 michal novacek 2010-07-12 15:10:20 UTC
Yaniv Eylon: can you please confirm whether this patch fixed the problem you reported? I'm unable to verify it myself.

Thanks.

Comment 21 michal novacek 2010-07-12 15:19:17 UTC
Code review: the proper patch is present and sane, and the package compiles with it.

Comment 22 yeylon@redhat.com 2010-07-12 15:27:16 UTC
(In reply to comment #20)
> Yaniv Eylon: can you please confirm whether this patch fixed the problem you
> reported? I'm unable to verify it myself.
> 
> Thanks.    

I haven't been able to reproduce this issue for the last few weeks.

Comment 25 errata-xmlrpc 2011-01-13 23:03:55 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0074.html