Bug 448942

Summary: [NetApp 5.3 bug] multipathd segfaults while stopping
Product: Red Hat Enterprise Linux 5
Reporter: Rajashekhar M A <rajashekhar.a>
Component: device-mapper-multipath
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED WORKSFORME
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.2
CC: agk, andriusb, bmarzins, bmr, christophe.varoqui, clasohm, cmarthal, coughlan, cward, dwysocha, edamato, egoggin, heinzm, junichi.nomura, kueda, lmb, marting, mbroz, prockai, rajashekhar.a, rsarraf, tranlan, vijayakumar, xdl-redhat-bugzilla
Target Milestone: rc
Keywords: OtherQA
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-10-21 20:14:02 UTC
Bug Blocks: 373081
Attachments:
Description Flags
gdb backtraces while stopping the daemon none
Configuration Data on RHEL 5.2 sanbooted machine. none

Description Rajashekhar M A 2008-05-29 16:03:35 UTC
Description of problem:

multipathd segfaults while stopping on RHEL5.2 GA. This is seen more frequently
with FCP setups than iSCSI.

Version:	RHEL5.2 GA
# uname -a
Linux lnx199-115.lab.eng.btc.netapp.in 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep device
device-mapper-1.02.24-1.el5
device-mapper-multipath-0.4.7-17.el5
device-mapper-event-1.02.24-1.el5
device-mapper-1.02.24-1.el5

How reproducible:

Intermittent


Steps to Reproduce:

1. Map about 5 LUNs to the host, with 4 paths each (5 x 4 = 20 paths).
2. Discover the LUNs and verify the maps are configured properly.
3. Stop the daemon -
	# /etc/init.d/multipathd stop

Actual results:
multipathd segfaults while shutting down.
Below are extracts from /var/log/messages:

May 29 18:47:20 lnx199-115 multipathd: --------shut down-------
May 29 18:47:20 lnx199-115 kernel: multipathd[22568]: segfault at 000000000000001a rip 000000311d470fe0 rsp 00007fff280217c0 error 4
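
For reference, the "error 4" at the end of the kernel line can be decoded with a small sketch like the one below. It assumes the standard x86 page-fault error-code bit layout (bit 0: protection violation, bit 1: write access, bit 2: user mode); `decode_pf_error` is a hypothetical helper name, not part of any package mentioned in this report.

```shell
# Hypothetical helper: decode the low three bits of the x86 page-fault
# error code that the kernel prints as "error N" in a segfault line.
decode_pf_error() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_pf_error 4   # prints: user-mode read, page not present
```

Under that reading, the daemon tried to read an unmapped address (0x1a) from user space, which would be consistent with a dereference through a NULL or near-NULL pointer, although the log line alone does not prove that.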

Expected results:
multipathd should stop gracefully.

My multipath.conf is shown below:

# cat /etc/multipath.conf
defaults {
        user_friendly_names yes
        max_fds 4096
}

blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^(hd|xvd)[a-z]*"
        devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
        wwid    1HITACHI_HUS103073FL3800_V3WEJSPA
}

devices {
        device {
                vendor                        "NETAPP"
                product                       "LUN"
                getuid_callout                "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout                  "/sbin/mpath_prio_ontap /dev/%n"
                features                      "1 queue_if_no_path"
                hardware_handler              "0"
                path_grouping_policy          group_by_prio
                failback                      immediate
                rr_weight                     uniform
                rr_min_io                     128
                path_checker                  directio
        }
}


Additional info:

This issue is independent of SELinux config. Seen with both Enforcing and Disabled.

Comment 2 Ben Marzinski 2008-08-26 04:07:29 UTC
I can't reproduce this. Can you try to hit this while running multipathd under gdb? To do this, you need the multipath debuginfo package.

# service multipathd start
# gdb multipathd <pid>
"in GDB" > continue

In another terminal
# service multipathd stop

If you can recreate this in GDB, can you get a backtrace of the segfaulting thread?
> bt
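
The same gdb session could also be run unattended. The following is only a sketch: `write_gdb_cmds` and the file name `/tmp/mpath.gdb` are made up for illustration, and it collects backtraces of all threads rather than only the segfaulting one.

```shell
# Hypothetical helper: write a gdb command file that resumes the attached
# process and, once it stops again (for example on SIGSEGV), prints a
# backtrace of every thread.
write_gdb_cmds() {
    cat > "$1" <<'EOF'
set pagination off
continue
thread apply all bt
EOF
}

# Usage (assumes a single running multipathd process):
#   write_gdb_cmds /tmp/mpath.gdb
#   gdb -batch -x /tmp/mpath.gdb multipathd "$(pidof multipathd)"
```

With the gdb run started in one terminal, "service multipathd stop" is issued from another, as in the manual steps.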

Comment 3 Ben Marzinski 2008-08-28 19:29:55 UTC
Can you try to get the information mentioned in comment #2 (a backtrace of the segfaulting thread)?

Comment 4 Rajashekhar M A 2008-09-05 09:42:32 UTC
I will capture the backtrace and upload it here.

Comment 5 Rajashekhar M A 2008-09-19 07:55:14 UTC
Created attachment 317162 [details]
gdb backtraces while stopping the daemon

Ben,

I followed the steps you mentioned and tried attaching gdb. But I didn't see the daemon segfault with gdb attached. I have attached the logs which I captured when I did this.

/var/log/messages did not have the segfault message either. I could only see messages up to:

Sep 20 09:37:48 RHEL52-SANboot-110 multipathd: --------shut down-------

But I can consistently reproduce this bug on my box if I don't attach gdb.

Comment 6 Ben Marzinski 2008-09-22 19:10:13 UTC
O.k. then, let's try a different method. Segfaults are supposed to generate core files, but most systems disable them by default.

To check, run

# ulimit -c
0

That zero means that core files are disabled. To enable the generation of core files of any size, run

# ulimit -c unlimited

Now get multipathd to segfault.  The core file will be located in "/".  You can look at it with gdb by running

# gdb multipathd <core_file_name>

Then pull the backtraces from this. Once this is done, you probably want to disable core files again

# ulimit -c 0

Thanks.
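
Note that "ulimit -c" reports the limit of the current shell and of the processes it starts; a daemon launched through an init script may end up with a different value. As a sketch, on kernels that expose /proc/<pid>/limits (an assumption; older kernels may not have this file), the limit of the running daemon can be read directly. `soft_core_limit` is a hypothetical helper name.

```shell
# Hypothetical helper: print the soft "Max core file size" value from
# limits text on stdin (the format used by /proc/<pid>/limits).
soft_core_limit() {
    awk '/^Max core file size/ { print $5 }'
}

# Usage (assumes a single multipathd process and a kernel that has the file):
#   soft_core_limit < "/proc/$(pidof multipathd)/limits"
```

A value of 0 would mean the daemon cannot write a core file, regardless of what the login shell reports.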

Comment 7 Rajashekhar M A 2008-09-26 05:37:29 UTC
I set the ulimit to unlimited and got the daemon to segfault from that terminal. But I do not see the core file in "/":

[root@RHEL52-SANboot-110 ~]# ulimit -c
unlimited
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd status
multipathd is stopped
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd start
Starting multipathd daemon:                                [  OK  ]
[root@RHEL52-SANboot-110 ~]# date
Fri Sep 26 11:20:05 IST 2008
[root@RHEL52-SANboot-110 ~]#
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd stop
Stopping multipathd daemon:                                [  OK  ]
[root@RHEL52-SANboot-110 ~]#

I see the following messages in syslog:

Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: mpath0: stop event checker thread
Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: --------shut down-------
Sep 26 11:20:24 RHEL52-SANboot-110 kernel: multipathd[9063]: segfault at 000000000000001a rip 00000034da070fe0 rsp 00007fff7da6d220 error 4

But I do not see the core dump in "/":

[root@RHEL52-SANboot-110 ~]# date
Fri Sep 26 11:20:28 IST 2008
[root@RHEL52-SANboot-110 ~]#
[root@RHEL52-SANboot-110 ~]# cd /
[root@RHEL52-SANboot-110 /]# ls -a
.          bin   etc   lib64       misc  opt   sbin     sys       usr
..         boot  home  lost+found  mnt   proc  selinux  tftpboot  var
.autofsck  dev   lib   media       net   root  srv      tmp
[root@RHEL52-SANboot-110 /]#

To check that my settings are correct, I wrote a small C program which segfaults. This dumped core. But somehow I cannot see multipathd dumping core.
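
A similar settings check can be sketched in shell instead of C. This is an illustration only: `core_dump_selftest` is a hypothetical name, and the check only finds the core file when core_pattern is a relative name such as "core".

```shell
# Hypothetical helper: crash a throwaway shell with SIGSEGV inside a scratch
# directory and report whether the kernel left a core file there.
core_dump_selftest() {
    dir=$(mktemp -d) || return 1
    # The child shell sends SIGSEGV to itself; its output is discarded.
    { ( cd "$dir" && ulimit -c unlimited && sh -c 'kill -SEGV $$' ); } >/dev/null 2>&1 || true
    if ls "$dir"/core* >/dev/null 2>&1; then
        echo "core written in $dir"
    else
        echo "no core in $dir"
        rm -rf "$dir"
    fi
}

core_dump_selftest
```

If this reports a core file but the daemon still leaves none, the difference is more likely in how the daemon is started than in the system-wide settings.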

Is there anything else I should do to get the dump?

Comment 8 Ben Marzinski 2008-09-29 20:54:40 UTC
That's strange. This could be one of two issues.

1. multipathd just isn't making a core file. I'm not sure what to do in this case. To check, the easiest way is to start multipathd, run df, shut down multipathd, and check df again after you see the segfault message. If the size jumped, then you are probably making a core file somewhere. If the size doesn't jump, you can try running multipathd -d, which doesn't background it. I don't know why this would change things, but it's worth a shot. The other test you can do, if you don't see the df size jump when you segfault, is to start multipathd, run df, run
# killall -SEGV multipathd
and then run df again to see if the size jumped. If it did jump when you sent a SIGSEGV to the process, but not when it crashes on shutdown, then I doubt that you are going to be able to get a core dump from this instance.

2. If df shows a size jump, then a core file is being written somewhere. The next question is "where?" Run

# cat /proc/sys/kernel/core_pattern

If this has a fully qualified path name, then your core files should be there.
Otherwise, run

# mkdir /tmp/corefiles 
# chmod 777 /tmp/corefiles 
# echo "/tmp/corefiles/core_%e_%p" > /proc/sys/kernel/core_pattern 

This should cause all future core files to be created as
/tmp/corefiles/core_<executable>_<pid>
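
To preview the file name that such a pattern would produce, a sketch like the following can help. The kernel performs the real substitution itself; `expand_core_pattern` is a hypothetical helper that handles only the %e and %p specifiers, and the pid 1234 is made up.

```shell
# Hypothetical helper: preview how a core_pattern expands for a given
# executable name and pid. Only %e and %p are handled, and the executable
# name is assumed to contain no "/" or "&" characters.
expand_core_pattern() {
    printf '%s\n' "$1" | sed -e "s/%e/$2/g" -e "s/%p/$3/g"
}

expand_core_pattern "/tmp/corefiles/core_%e_%p" multipathd 1234
# prints: /tmp/corefiles/core_multipathd_1234
```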

Let me know what you find out.

Comment 9 Rajashekhar M A 2008-10-02 20:21:09 UTC
Ben,

Please find my observations below:

>> 1. multipathd just isn't making a core file. I'm not sure what to do in this case. To check, the easiest way is to start multipathd, run df, shut down multipathd, and check df again after you see the segfault message.

The size did not jump. I have set the ulimit -c unlimited.

>> If the size doesn't jump, you can try running multipathd -d, which doesn't background it.  I don't know why this would change things, but it's worth a shot.

When I tried without daemonizing multipathd, it never crashed when I ran "/etc/init.d/multipathd stop". It always showed that it shut down properly, with no segfault messages on the terminal or in syslog. No size jump.

>> The other test you can do, if you don't see the df size jump when you segfault, is to start multipathd, run df, run # killall -SEGV multipathd, and then run df again to see if the size jumped. 

No messages in syslog. The daemon appears to have segfaulted (the status now shows "multipathd dead but pid file exists"), but I did not see the size jump.

>> If it did jump when you send a SIGSEGV to the process, but not when it crashes on shutdown, then I doubt that you are going to be able to get a core dump from this instance.

I did not see the size jump in either case, that is, with SIGSEGV or with "/etc/init.d/multipathd stop".

>> 2. If looking df shows a size jump, then a core file is being written
>> somewhere. The next question is "where?"   run
>> # cat /proc/sys/kernel/core_pattern

The output in the default case is:

# cat /proc/sys/kernel/core_pattern
core

I changed the pattern for core files:

# mkdir /coredir
# chmod 777 /coredir
# echo "/coredir/core_%e_%p" > /proc/sys/kernel/core_pattern 
# cat /proc/sys/kernel/core_pattern 
/coredir/core_%e_%p

When I started multipathd without daemonizing it and then killed it with SEGV, I could see the segfault message on the terminal and could see the core dump in /coredir/:

# cd /coredir/
# ls
core_multipathd_10997
#

Comment 10 Ben Marzinski 2008-10-03 21:04:31 UTC
O.k. So with the coredir set up, the system produces a core file if you start the multipathd process without daemonizing it and send it a SIGSEGV signal. Have you tried to see if the system produces a core file if you start multipathd in daemon mode, now that you've set up the coredir? I'm not sure why things would be different with the coredir, but perhaps it was a permissions thing. Speaking of which, if you have selinux set to enforcing mode, can you try turning it off and rechecking whether core files are being created? I'm not sure if selinux interferes with the ability to write core files, but that doesn't seem completely crazy.

If none of that works, I suppose we'll have to do this the hard way. Can you run multipath -ll to show me what your multipath setup looks like? I can try harder to match your exact setup, to see if I can recreate this.

Here's a long shot, but it's worth checking.  You can download the latest device-mapper-multipath beta packages from

http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/

There were some changes to the code that multipathd runs during shutdown; perhaps some of the changes will fix your problem.

Otherwise, the last resort is to start adding print statements to narrow down where things are going wrong.  This is complicated by the fact that multipathd doesn't wait for its messages to hit syslog.  I can write a patch to change this and add a bunch of print statements to the shutdown code paths.  Are you comfortable compiling the package from source?  If so, I can just send you patches.  If not, I can build a test package and post it.

Comment 11 Ben Marzinski 2008-10-10 19:55:52 UTC
Actually, there is a bug in the above packages. Can you please try the packages at

http://people.redhat.com/coughlan/.dm-multipath/RHEL5/

Comment 12 Rajashekhar M A 2008-10-13 18:04:53 UTC
Created attachment 320213 [details]
Configuration Data on RHEL 5.2 sanbooted machine.

Ben,

> Have you tried to see if the system produces a core file if you start multipathd in daemon mode, now that you've set up the coredir?

Yes. I tried running multipathd in daemon mode and sending it SIGSEGV, or stopping it using the script. I could not get the dump.

I tried several things with SELinux too. It doesn't matter; I could not get the dump in any mode (Enforcing, Disabled, Permissive).

> Can you run multipath -ll to show me what your multipath setup looks like?

The attachment has the setup information which I used to reproduce this bug. I am using a SANbooted machine with Emulex LP11002 cards. I have to restart the daemon several times, and I see the daemon segfault about once in five times. Once I hit the issue, it's pretty easy to see the segfault frequently.

> You can download the latest device-mapper-multipath beta packages...

I took the rpms from the link given. But since the machine I used is a SANbooted machine (root on LUN), it did not allow me to stop the daemon (version 4.7-19). So I tried to reproduce this bug with root on local disks (first with the GA versions of all packages), but could not reproduce the bug even with GA. So I did not try the new rpms, since I do not see the bug there anyway. The same will be the case with version 4.7-20.

> Otherwise, the last resort is to start adding print statements to narrow down where things are going wrong. 

Unfortunately, we are here now. I am fine with the patches. I am comfortable with building packages.

Comment 13 Ben Marzinski 2008-10-13 20:06:53 UTC
The init script code to stop the daemon doesn't do anything special; it just kills the multipathd process. You can avoid the SANboot check in the initrd by just running

# killall multipathd

This should let you see if you can recreate the problem with the latest package.  But I'll start working on a debugging patch for you, on the assumption that the new packages won't fix your problem.
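
One way to check each stop attempt is to compare the number of multipathd segfault lines in syslog before and after the kill. The sketch below assumes kernel messages land in /var/log/messages, as in the extracts earlier in this report; both function names and the two-second wait are hypothetical choices for illustration.

```shell
# Hypothetical helper: count kernel segfault lines for multipathd in log
# text read from stdin.
count_segfaults() {
    grep -c 'multipathd\[[0-9]*\]: segfault' || true
}

# Hypothetical helper: stop the daemon with killall and succeed only if a
# new segfault line appeared in the log afterwards.
segfaulted_on_stop() {
    log=${1:-/var/log/messages}
    before=$(count_segfaults < "$log")
    killall multipathd
    sleep 2   # arbitrary wait so syslog can flush the kernel message
    after=$(count_segfaults < "$log")
    [ "$after" -gt "$before" ]
}

# Usage: segfaulted_on_stop && echo "segfault on this stop"
```

Since the crash is intermittent, the start/stop cycle would have to be repeated several times before concluding anything from a clean run.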

Comment 14 Rajashekhar M A 2008-10-14 11:27:01 UTC
Hi Ben,

This helped. I used "killall multipathd" to stop the daemon and could not reproduce the bug with the 4.7-19 rpms. I can still see the crash frequently with the older packages on the same setup, which suggests that this bug may have been fixed in the latest version of the package.

Comment 15 Ben Marzinski 2008-10-21 20:14:02 UTC
I'm closing this bug. If you think that the latest patches might not have really fixed this, you can reopen it.