Bug 448942
Summary: [NetApp 5.3 bug] multipathd segfaults while stopping
Product: Red Hat Enterprise Linux 5
Component: device-mapper-multipath
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Reporter: Rajashekhar M A <rajashekhar.a>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Cluster QE <mspqa-list>
CC: agk, andriusb, bmarzins, bmr, christophe.varoqui, clasohm, cmarthal, coughlan, cward, dwysocha, edamato, egoggin, heinzm, junichi.nomura, kueda, lmb, marting, mbroz, prockai, rajashekhar.a, rsarraf, tranlan, vijayakumar, xdl-redhat-bugzilla
Target Milestone: rc
Keywords: OtherQA
Doc Type: Bug Fix
Last Closed: 2008-10-21 20:14:02 UTC
Bug Blocks: 373081
Description
Rajashekhar M A
2008-05-29 16:03:35 UTC
I can't reproduce this. Can you try to hit this while running multipathd under gdb? To do this, you need the multipath debuginfo package.
# service multipathd start
# gdb multipathd <pid>
(in GDB) > continue
In another terminal:
# service multipathd stop
If you can recreate this in GDB, can you get a backtrace of the segfaulting thread?
> bt
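The attach-and-wait steps above could also be run without an interactive session. The following is only a sketch: `GDB` and `attach_and_backtrace` are hypothetical names introduced here, and the `handle` lines are an assumption, added so that gdb passes the SIGTERM from the stop request (and any other signal) through to the daemon instead of stopping on it before the segfault.

```shell
#!/bin/sh
# Sketch only. GDB and attach_and_backtrace are hypothetical names,
# not part of device-mapper-multipath or gdb.
GDB=${GDB:-gdb}

# attach_and_backtrace PID
# Attaches gdb to PID, lets the process keep running, and prints a
# backtrace of every thread once gdb stops it on SIGSEGV.
attach_and_backtrace() {
    "$GDB" -p "$1" -batch \
        -ex 'handle all nostop noprint pass' \
        -ex 'handle SIGSEGV stop print' \
        -ex 'continue' \
        -ex 'thread apply all bt'
}

# Usage, as root in one terminal:
#   attach_and_backtrace "$(pidof multipathd)" > /tmp/multipathd-bt.txt 2>&1
# then, in another terminal:
#   service multipathd stop
```

The output file name is arbitrary. If the daemon exits cleanly instead of segfaulting, gdb returns without any backtraces to print.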
Can you try and get the information mentioned in comment #2 (a backtrace of the segfaulting thread)?

I will capture the backtraces and upload them here.

Created attachment 317162
gdb backtraces while stoping the daemon
Ben,
I followed the steps you mentioned and tried attaching gdb, but I didn't see the daemon segfault with gdb attached. I have attached the logs which I captured when I did this.
/var/log/messages did not have the segfault message either. I could only see messages up to:
Sep 20 09:37:48 RHEL52-SANboot-110 multipathd: --------shut down-------
But I can consistently reproduce this bug on my box if I don't attach gdb.

O.k. then, let's try a different method. Segfaults are supposed to generate core files, but most systems disable them by default. To check, run
# ulimit -c
0
That zero means that core files are disabled. To enable the generation of any size core files, run
# ulimit -c unlimited
Now get multipathd to segfault. The core file will be located in "/". You can look at it with gdb by running
# gdb multipathd <core_file_name>
Then pull the backtraces from this. Once this is done, you probably want to disable core files again
# ulimit -c 0

Thanks.

I set the ulimit to unlimited and got the daemon to segfault from that terminal. But I do not see the core file in "/":
[root@RHEL52-SANboot-110 ~]# ulimit -c unlimited
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd status
multipathd is stopped
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd start
Starting multipathd daemon: [ OK ]
[root@RHEL52-SANboot-110 ~]# date
Fri Sep 26 11:20:05 IST 2008
[root@RHEL52-SANboot-110 ~]#
[root@RHEL52-SANboot-110 ~]# /etc/init.d/multipathd stop
Stopping multipathd daemon: [ OK ]
[root@RHEL52-SANboot-110 ~]#
I see the following messages in syslog:
Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: mpath0: stop event checker thread
Sep 26 11:20:24 RHEL52-SANboot-110 multipathd: --------shut down-------
Sep 26 11:20:24 RHEL52-SANboot-110 kernel: multipathd[9063]: segfault at 000000000000001a rip 00000034da070fe0 rsp 00007fff7da6d220 error 4
But I do not see the core dump in "/":
[root@RHEL52-SANboot-110 ~]# date
Fri Sep 26 11:20:28 IST 2008
[root@RHEL52-SANboot-110 ~]#
[root@RHEL52-SANboot-110 ~]# cd /
[root@RHEL52-SANboot-110 /]# ls -a
. bin etc lib64 misc opt sbin sys usr
.. boot home lost+found mnt proc selinux tftpboot var
.autofsck dev lib media net root srv tmp
[root@RHEL52-SANboot-110 /]#
To see if my settings are proper, I wrote a small C program which segfaults. This dumped the core. But, somehow I cannot see the multipathd dumping the core.
Is there anything else I should do to get the dump?

That's strange. This could be one of two issues.

1. multipathd just isn't making a core file. I'm not sure what to do in this case. To check, the easiest way is to start up multipathd, run df, shut down multipathd, and check df again after you see the segfault message. If the size jumped, then you are probably making a core file somewhere. If the size doesn't jump, you can try running multipathd -d, which doesn't background it. I don't know why this would change things, but it's worth a shot. The other test you can do, if you don't see the df size jump when you segfault, is to start multipathd, run df, run
# killall -SEGV multipathd
and then run df again to see if the size jumped. If it did jump when you send a SIGSEGV to the process, but not when it crashes on shutdown, then I doubt that you are going to be able to get a core dump from this instance.

2. If looking at df shows a size jump, then a core file is being written somewhere. The next question is "where?" Run
# cat /proc/sys/kernel/core_pattern
If this has a fully qualified path name, then your core files should be there. Otherwise, run
# mkdir /tmp/corefiles
# chmod 777 /tmp/corefiles
# echo "/tmp/corefiles/core_%e_%p" > /proc/sys/kernel/core_pattern
This should cause all future core files to be created as /tmp/corefiles/core_<executable>_<pid>

Let me know what you find out.

Ben,
Please find my observations:

>> 1. multipathd just isn't making a core file. I'm not sure what to do in this case. To check, the easiest way is to start up multipathd, run df, shut down multipathd, and check df again after you see the segfault message.
The size did not jump. I have set the ulimit -c unlimited.

>> If the size doesn't jump, you can try running multipathd -d, which doesn't background it. I don't know why this would change things, but it's worth a shot.
When I tried without daemonizing multipathd, it never crashed when I ran "/etc/init.d/multipathd stop".
It always showed that it shut down properly, with no segfault messages on the terminal or in syslog. No size jump.

>> The other test you can do, if you don't see the df size jump when you segfault, is to start multipathd, run df, run # killall -SEGV multipathd, and then run df again to see if the size jumped.
No messages in syslog. The daemon looks like it segfaulted (the status now shows "multipathd dead but pid file exists"), but I did not see the size jump.

>> If it did jump when you send a SIGSEGV to the process, but not when it crashes on shutdown, then I doubt that you are going to be able to get a core dump from this instance.
I did not see the size jump in either case, viz. SIGSEGV and "/etc/init.d/multipathd stop".

>> 2. If looking df shows a size jump, then a core file is being written
>> somewhere. The next question is "where?" run
>> # cat /proc/sys/kernel/core_pattern
The output in the default case looks as below:
# cat /proc/sys/kernel/core_pattern
core

I changed the pattern for core files:
# mkdir /coredir
# chmod 777 /coredir
# echo "/coredir/core_%e_%p" > /proc/sys/kernel/core_pattern
# cat /proc/sys/kernel/core_pattern
/coredir/core_%e_%p

When I started multipathd without daemonizing it and then killed it with SEGV, I could see the segfault message on the terminal and could see the core dump in /coredir/:
# cd /coredir/
# ls
core_multipathd_10997
#

O.k. So with the coredir set up, the system produces a core file if you start the multipathd process without daemonizing it and send it a SIGSEGV signal. Have you tried to see if the system produces a core file if you start multipathd in daemon mode, now that you've set up the coredir? I'm not sure why things would be different with the coredir, but perhaps it was a permissions thing. Speaking of which, if you have selinux set to enforcing mode, can you try turning it off and rechecking if core files are being created?
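I'm not sure if selinux interferes with the ability to write core files, but that doesn't seem completely crazy. If none of that works, I suppose we'll have to do this the hard way. Can you run multipath -ll to show me what your multipath setup looks like? I can try harder to match your exact setup, to see if I can recreate this.

Here's a long shot, but it's worth checking. You can download the latest device-mapper-multipath beta packages from http://people.redhat.com/rpeterso/Experimental/RHEL5.x/dm-multipath/ There were some changes to the code that multipathd runs during shutdown; perhaps some of the changes will fix your problem. Otherwise, the last resort is to start adding print statements to narrow down where things are going wrong. This is complicated by the fact that multipathd doesn't wait for its messages to hit syslog. I can write a patch to change this and add a bunch of print statements to the shutdown code paths. Are you comfortable compiling the package from source? If so, I can just send you patches. If not, I can build a test package and post it.

Actually, there is a bug in the above packages. Can you please try the packages at http://people.redhat.com/coughlan/.dm-multipath/RHEL5/

Created attachment 320213
Configuration Data on RHEL 5.2 sanbooted machine.

Ben,

> Have you tried to see if the system produces a core file if you start multipathd in daemon mode, now that you've set up the coredir.
Yes. I tried running multipathd in daemon mode and sending the SIGSEGV, or stopping it using the script. I could not get the dump. I tried several things with SELinux too. It doesn't matter; I could not get the dump in any case (Enforced, Disabled, Permissive modes).

> Can you run multipath -ll to show me what your multipath setup looks like.
The attachment has the setup information which I used to reproduce this bug. I am using a SANbooted machine, with Emulex LP11002 cards. I have to restart the daemon several times, and I see the daemon segfaulting about once in five times. Once I hit the issue, it's pretty easy to see the segfault frequently.

> You can download the latest device-mapper-multipath beta packages...
I took the rpms from the link given. But since the machine I used is a SANbooted machine (root on LUN), it did not allow me to stop the daemon (version 4.7-19). So, I tried to reproduce this bug with root on local disks (first, with GA versions of all packages) but could not reproduce the bug with GA itself. So I did not try the new rpms, as I do not see the bug anyway. The same will be the case with version 4.7-20 also.

> Otherwise, the last resort is to start adding print statements to narrow down where things are going wrong.
Unfortunately, we are here now. I am fine with the patches. I am comfortable with building packages.

The init script code to stop the daemon doesn't do anything special, it just kills the multipathd process. You can avoid the SANboot check in the initrd by just running
# killall multipathd
This should let you see if you can recreate the problem with the latest package. But I'll start working on a debugging patch for you, on the assumption that the new packages won't fix your problem.

Hi Ben,
This helped. I used "killall multipathd" to stop the daemon and I could not reproduce the bug with 4.7-19 rpms. I can still see the crash frequently with older packages on the same setup, which suggests that this bug may have been fixed in the latest version of the package.

I'm closing this bug. If you think that the latest patches might not have really fixed this, you can reopen it.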
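
For anyone repeating the core-file checks discussed in this thread, they can be gathered into a small script. This is a minimal sketch, not something used in the bug itself: `classify_core_pattern` and `report_core_settings` are hypothetical helper names, and the "pipe" case is an assumption that covers kernels which accept a core_pattern beginning with "|".

```shell
#!/bin/sh
# Sketch only. Both function names are hypothetical helpers.

# classify_core_pattern PATTERN
# Prints "pipe", "absolute", or "relative" for a core_pattern value.
# A relative pattern means the core file is written in the crashing
# process's current working directory.
classify_core_pattern() {
    case $1 in
        '|'*) echo pipe ;;
        /*)   echo absolute ;;
        *)    echo relative ;;
    esac
}

# report_core_settings
# Prints the core size limit of the current shell and the system-wide
# core_pattern, so both can be checked before reproducing a crash.
report_core_settings() {
    limit=$(ulimit -c)
    pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || echo unknown)
    echo "core size limit: $limit"
    echo "core_pattern: $pattern ($(classify_core_pattern "$pattern"))"
}

# Usage:
#   report_core_settings
```

With the default pattern `core` reported earlier in the thread, the classification is "relative"; with `/coredir/core_%e_%p` it is "absolute". The reported limit is that of the shell running the script, which is not necessarily the limit the daemon itself runs with.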
I have to restart the daemon several times and I see the daemon segfaulting say once in five times. Once I hit the issue, it's pretty easy to see the segfault frequently. > You can download the latest device-mapper-multipath beta packages... I took the rpms from the link given. But since the machine I used is a SANbooted machine (root on LUN), it did not allow me to stop the daemon (version 4.7-19). So, I tried to reproduce this bug with root on localdisks (first, with GA versions of all packages) but could not reproduce the bug with GA itself. So did not try the new rpms, as I anyway do not see the bug. Same will be the case with version 4.7-20 also. > Otherwise, the last resort is to start adding print statements to narrow down where things are going wrong. Unfortunately, we are here now. I am fine with the patches. I am comfortable with building packages. The init script code to stop the daemon doesn't do anything special, it just kills the multipathd process. You can avoid the SANboot check in the initrd by just running # killall multipathd This should let you see if you can recreate the problem with the latest package. But I'll start working on a debugging patch for you, on the assumption that the new packages won't fix your problem. Hi Ben, This helped. I used "killall multipathd" to stop the daemon and I could not reproduce the bug with 4.7-19 rpms. I can still see the crash with older packages on the same setup frequently, which hints us that this bug could have got fixed in the latest version of the package. I'm closing this bug. If you think that the latest patches might not have really fixed this, you can reopen it. |