Red Hat Bugzilla – Bug 448942
[NetApp 5.3 bug] multipathd segfaults while stopping
Last modified: 2010-01-11 21:42:41 EST
Description of problem:
multipathd segfaults while stopping on RHEL5.2 GA. This is seen more frequently
with FCP setups than iSCSI.
Version: RHEL5.2 GA
# uname -a
Linux lnx199-115.lab.eng.btc.netapp.in 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15
EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
# rpm -qa | grep device
Steps to Reproduce:
1. Map about 5 LUNs to the host, with 4 paths each (5 x 4 = 20 paths).
2. Discover the LUNs and verify the maps are configured properly.
3. Stop the daemon -
# /etc/init.d/multipathd stop
multipathd segfaults while shutting down.
Below are the extracts from the /var/log/messages -
May 29 18:47:20 lnx199-115 multipathd: --------shut down-------
May 29 18:47:20 lnx199-115 kernel: multipathd: segfault at
000000000000001a rip 000000311d470fe0 rsp 00007fff280217c0 error 4
multipathd should stop gracefully.
My multipath.conf looks like below -
# cat /etc/multipath.conf
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio_callout "/sbin/mpath_prio_ontap /dev/%n"
features "1 queue_if_no_path"
This issue is independent of SELinux config. Seen with both Enforcing and Disabled.
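For readers unfamiliar with the kernel line quoted above, the following is a minimal sketch of how the "segfault at ... error 4" message can be decoded. It assumes the standard x86 page-fault error code bits; the function names `decode_pf_error` and `parse_segfault_line` are hypothetical helpers, not part of any multipath tool. A user-mode read of a not-present page at a very small address such as 0x1a is typically what a NULL pointer dereference at a small offset looks like, though that is an inference and not something the log proves.

```shell
# Sketch: decode the "error N" field of an x86 kernel segfault line.
# Assumed bit meanings (x86 page-fault error code):
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
decode_pf_error() {
    code=$1
    if [ $(( $code & 4 )) -ne 0 ]; then mode="user-mode"; else mode="kernel-mode"; fi
    if [ $(( $code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( $code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="not-present page"; fi
    echo "$mode $access, $cause"
}

# Hypothetical helper: print "<fault address> <error code>" from a log line.
parse_segfault_line() {
    echo "$1" | sed -n 's/.*segfault at \([0-9a-fA-F]*\) rip .* error \([0-9]*\).*/\1 \2/p'
}

# The kernel message from the description, joined onto one line.
line='kernel: multipathd: segfault at 000000000000001a rip 000000311d470fe0 rsp 00007fff280217c0 error 4'
set -- $(parse_segfault_line "$line")
echo "address: 0x$1 ($(( 0x$1 )) decimal)"
echo "error $2: $(decode_pf_error "$2")"
```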
I can't reproduce this. Can you try to hit this while running multipathd under gdb? To do this, you need the multipath debuginfo package.
# service multipathd start
# gdb multipathd <pid>
"in GDB" > continue
In another terminal
# service multipathd stop
If you can recreate this in GDB, can you get a backtrace of the segfaulting thread?
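The interactive steps above could also be scripted. This is only a sketch: it writes a small gdb command file, and the file name and the gdb invocation shown in the comment are assumptions, not something used in this bug. The idea is that `continue` returns when the process stops on SIGSEGV, after which the backtrace command runs.

```shell
# Sketch: write a gdb command file that resumes the attached process,
# waits for it to stop (for example on SIGSEGV), and dumps all backtraces.
write_gdb_script() {
    cat > "$1" <<'EOF'
set pagination off
continue
thread apply all bt
EOF
}

script=${TMPDIR:-/tmp}/multipathd-bt.gdb   # hypothetical file name
write_gdb_script "$script"

# Assumed usage (not run here): attach to the running daemon, then stop the
# daemon from another terminal and read the log afterwards.
#   gdb -batch -x "$script" -p "$(pidof multipathd)" > /tmp/multipathd-bt.log 2>&1
```

If the daemon exits cleanly instead of crashing, the backtrace command has nothing to print, which matches the behaviour reported later in this thread when gdb was attached.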
Can you try to get the information mentioned in comment #2 (a backtrace of the segfaulting thread)?
I will capture the bt and upload them here.
Created attachment 317162 [details]
gdb backtraces while stopping the daemon
I followed the steps you mentioned and attached gdb, but I didn't see the daemon segfault with gdb attached. I have attached the logs which I captured when I did this.
/var/log/messages did not have the segfault message either. I could only see messages up to -
Sep 20 09:37:48 RHEL52-SANboot-110 multipathd: --------shut down-------
But I can consistently reproduce this bug on my box whenever gdb is not attached.
O.k. then, let's try a different method. Segfaults are supposed to generate core files, but most systems disable them by default.
To check, run
# ulimit -c
If this prints 0, core files are disabled. To enable the generation of core files of any size, run
# ulimit -c unlimited
Now get multipathd to segfault. The core file will be located in "/". You can look at it with gdb by running
# gdb multipathd <core_file_name>
Then pull the backtraces from this. Once this is done, you probably want to disable core files again
# ulimit -c 0
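As a small sketch of the check described above, the function below reports how the current shell's core-file limit would be interpreted. The name `core_limit_status` is hypothetical. Note that the limit is inherited only by processes started from the shell where it was set, so the daemon has to be started from that same shell.

```shell
# Sketch: report whether children of the current shell may write core files.
core_limit_status() {
    limit=$(ulimit -c)
    case "$limit" in
        0)         echo "disabled" ;;
        unlimited) echo "enabled (no size limit)" ;;
        *)         echo "enabled (limit: $limit blocks)" ;;
    esac
}

echo "core files: $(core_limit_status)"
```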
I set the ulimit to unlimited and got the daemon to segfault from that terminal. But I do not see the core file in "/":
[root@RHEL52-SANboot-110 ~]# ulimit -c
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd status
multipathd is stopped
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd start
Starting multipathd daemon: [ OK ]
[root@RHEL52-SANboot-110 ~]# date
Fri Sep 26 11:20:05 IST 2008
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd stop
Stopping multipathd daemon: [ OK ]
I see the following messages in syslog:
Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: mpath0: stop event checker thread
Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: --------shut down-------
Sep 26 11:20:24 RHEL52-SANboot-110 kernel: multipathd: segfault at 000000000000001a rip 00000034da070fe0 rsp 00007fff7da6d220 error 4
But I do not see the core dump in "/":
[root@RHEL52-SANboot-110 ~]# date
Fri Sep 26 11:20:28 IST 2008
[root@RHEL52-SANboot-110 ~]# cd /
[root@RHEL52-SANboot-110 /]# ls -a
. bin etc lib64 misc opt sbin sys usr
.. boot home lost+found mnt proc selinux tftpboot var
.autofsck dev lib media net root srv tmp
To see if my settings are proper, I wrote a small C program which segfaults. This dumped core. But somehow multipathd does not dump core.
Is there anything else I should do to get the dump?
That's strange. This could be one of two issues.
1. multipathd just isn't making a core file. I'm not sure what to do in this case. The easiest way to check is to start multipathd, run df, shut down multipathd, and run df again after you see the segfault message. If the size jumped, then you are probably making a core file somewhere. If the size doesn't jump, you can try running multipathd -d, which doesn't background it. I don't know why this would change things, but it's worth a shot. The other test you can do, if you don't see the df size jump when you segfault, is to start multipathd, run df, run
# killall -SEGV multipathd, and then run df again to see if the size jumped. If it did jump when you send a SIGSEGV to the process, but not when it crashes on shutdown, then I doubt that you are going to be able to get a core dump from this instance.
2. If df shows a size jump, then a core file is being written somewhere. The next question is "where?" Run
# cat /proc/sys/kernel/core_pattern
If this has a fully qualified path name, then your core files should be there.
# mkdir /tmp/corefiles
# chmod 777 /tmp/corefiles
# echo "/tmp/corefiles/core_%e_%p" > /proc/sys/kernel/core_pattern
This should cause all future core files to be created as
Let me know what you find out.
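The df comparison suggested above can be sketched as follows. The helper names `used_kb` and `size_jump` are hypothetical, and the root filesystem is used only as an example; a clearly positive difference would suggest that a new file, possibly a core file, was written somewhere on that filesystem.

```shell
# Used space, in 1K blocks, on the filesystem that holds the given path.
used_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $3 }'
}

# Difference between two readings (second minus first).
size_jump() {
    echo $(( ${2:-0} - ${1:-0} ))
}

before=$(used_kb /)
# ... trigger the crash here, for example by stopping the daemon ...
after=$(used_kb /)
echo "used space changed by $(size_jump "$before" "$after") KB"
```

Other activity on the filesystem (logging, temporary files) also changes the reading, so a small difference is not conclusive either way.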
Please find my observations ---
>> 1. multipathd just isn't making a core file. I'm not sure what to do in this case. To check, the easiest way is to startup multipathd, run df, shutdown multipathd, and check df again after you see the segfault message.
The size did not jump. I have set the ulimit -c unlimited.
>> If the size doesn't jump, you can try running multipathd -d, which doesn't background it. I don't know why this would change things, but it's worth a shot.
When I tried without daemonizing multipathd, it never crashed when I ran "/etc/init.d/multipathd stop". It always showed that it shut down properly, with no segfault messages on the terminal or in syslog. No size jump.
>> The other test you can do, if you don't see the df size jump when you segfault, is to start multipathd, run df, run # killall -SEGV multipathd, and then run df again to see if the size jumped.
No messages in syslog. The daemon appears to have segfaulted (the status now shows "multipathd dead but pid file exists"), but I did not see the size jump.
>> If it did jump when you send a SIGSEGV to the process, but not when it crashes on shutdown, then I doubt that you are going to be able to get a core dump from this instance.
I did not see the size jump for either of the cases viz., SIGSEGV and "/etc/init.d/multipathd stop".
>> 2. If looking df shows a size jump, then a core file is being written
>> somewhere. The next question is "where?" run
>> # cat /proc/sys/kernel/core_pattern
The output in default case looks as below ---
# cat /proc/sys/kernel/core_pattern
I changed the pattern for core files ---
# mkdir /coredir
# chmod 777 /coredir
# echo "/coredir/core_%e_%p" > /proc/sys/kernel/core_pattern
# cat /proc/sys/kernel/core_pattern
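To make the "fully qualified path name" check concrete, here is a minimal sketch that classifies a core_pattern value. The function name `classify_core_pattern` is hypothetical. The assumption is the usual kernel behaviour: a pattern starting with "/" names an absolute location, a pattern without a leading "/" is written relative to the crashing process's current working directory, and on kernels that support it a leading "|" pipes the core to a helper program.

```shell
# Sketch: classify a core_pattern value by its first character.
classify_core_pattern() {
    case "$1" in
        /*)  echo "absolute" ;;   # core files go to this fixed location
        \|*) echo "pipe" ;;       # core is piped to a helper program
        *)   echo "relative" ;;   # written in the process's working directory
    esac
}

echo "/coredir/core_%e_%p is $(classify_core_pattern '/coredir/core_%e_%p')"

# Also report the live setting when the file is readable on this system.
pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$pattern_file" ]; then
    echo "current pattern is $(classify_core_pattern "$(cat "$pattern_file")")"
fi
```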
When I started multipathd without daemonizing it and then killed it with SEGV, I could see the segfault message on the terminal and could see the core dump in /coredir/ ---
# cd /coredir/
O.k. So with the coredir set up, the system produces a core file if you start the multipathd process without daemonizing it and send it a SIGSEGV signal. Have you tried to see if the system produces a core file if you start multipathd in daemon mode, now that you've set up the coredir? I'm not sure why things would be different with the coredir, but perhaps it was a permissions thing. Speaking of which, if you have SELinux set to enforcing mode, can you try turning it off and rechecking whether core files are being created? I'm not sure if SELinux interferes with the ability to write core files, but that doesn't seem completely crazy.
If none of that works, I suppose we'll have to do this the hard way. Can you run multipath -ll to show me what your multipath setup looks like? I can try harder to match your exact setup, to see if I can recreate this.
Here's a long shot, but it's worth checking. You can download the latest device-mapper-multipath beta packages from
There were some changes to the code that multipathd runs during shutdown; perhaps some of those changes will fix your problem.
Otherwise, the last resort is to start adding print statements to narrow down where things are going wrong. This is complicated by the fact that multipathd doesn't wait for its messages to hit syslog. I can write a patch to change this and add a bunch of print statements to the shutdown code paths. Are you comfortable compiling the package from source? If so, I can just send you patches. If not, I can build a test package and post it.
Actually, there is a bug in the above packages. Can you please try the packages at
Created attachment 320213 [details]
Configuration Data on RHEL 5.2 sanbooted machine.
> Have you tried to see if the system produces a core file if you start multipathd in daemon mode, now that you've set up the coredir.
Yes. I tried running the multipathd in daemon mode and sending the SIGSEGV, or stopping it using the script. I could not get the dump.
I tried several things with SELinux too. It doesn't matter, I could not get the dump in any case. (Enforced, Disabled, Permissive modes).
> Can you run multipath -ll to show me what your multipath setup looks like.
The attachment has the setup information which I used to reproduce this bug. I am using a SANbooted machine with Emulex LP11002 cards. I have to restart the daemon several times, and I see the daemon segfault roughly once in five times. Once I hit the issue, it's pretty easy to see the segfault frequently.
> You can download the latest device-mapper-multipath beta packages...
I took the rpms from the link given. But since the machine I used is a SANbooted machine (root on LUN), it did not allow me to stop the daemon (version 4.7-19). So I tried to reproduce this bug with root on local disks (first with the GA versions of all packages), but could not reproduce the bug even with GA. So I did not try the new rpms, since I do not see the bug there anyway. The same will be the case with version 4.7-20.
> Otherwise, the last resort is to start adding print statements to narrow down where things are going wrong.
Unfortunately, we are here now. I am fine with the patches. I am comfortable with building packages.
The init script code to stop the daemon doesn't do anything special, it just kills the multipathd process. You can avoid the SANboot check in the initrd by just running
# killall multipathd
This should let you see if you can recreate the problem with the latest package. But I'll start working on a debugging patch for you, on the assumption that the new packages won't fix your problem.
This helped. I used "killall multipathd" to stop the daemon and I could not reproduce the bug with the 4.7-19 rpms. I can still see the crash frequently with the older packages on the same setup, which suggests that this bug has been fixed in the latest version of the package.
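When comparing package versions over repeated restarts like this, counting the kernel segfault lines in syslog gives a rough crash tally. The following is only a sketch: `count_segfaults` is a hypothetical helper, and the sample log lines are copied from earlier in this report into a temporary file so the example does not depend on /var/log/messages.

```shell
# Sketch: count kernel segfault lines for multipathd in a syslog-style file.
count_segfaults() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c 'kernel: multipathd: segfault at' "$1" || true
}

# Sample input taken from the syslog extract quoted earlier in this report.
log=$(mktemp)
cat > "$log" <<'EOF'
Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: mpath0: stop event checker thread
Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: --------shut down-------
Sep 26 11:20:24 RHEL52-SANboot-110 kernel: multipathd: segfault at 000000000000001a rip 00000034da070fe0 rsp 00007fff7da6d220 error 4
EOF
echo "segfaults logged: $(count_segfaults "$log")"
rm -f "$log"
```

On a real system the function would be pointed at /var/log/messages before and after a series of stop/start cycles, and the two counts compared.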
I'm closing this bug. If you think that the latest patches might not have really fixed this, you can reopen it.