Bug 497041
| Summary: | Multipathd is dying when there is a failover/failback during IO. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Senthil Kumar V <senthil-kumar.veluswamy> |
| Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.3 | CC: | agk, bloch, bmarzins, bmr, christophe.varoqui, dwysocha, edamato, egoggin, heinzm, junichi.nomura, kueda, lmb, mbroz, prockai, senthil-kumar.veluswamy, tranlan |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-07-17 06:00:23 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 340679 [details]
multipath daemon core dump file
Adding the multipath daemon core dump file.
Any update on this?

The root of this problem is that the multipath device lost its pointer to its hardware entry (the structure that stores the config from the devices section of /etc/multipath.conf). This causes problems when the multipath device goes to reconfigure itself during your test. A recently fixed bug was also caused by multipath not having its hardware entry pointer set correctly. From what I can see in the dump, that fix may also resolve your problem. A package that has the fix for this issue is located at:
http://people.redhat.com/bmarzins/device-mapper-multipath/device-mapper-multipath-debuginfo-0.4.7-23.el5_3.4.x86_64.rpm
If you could try to reproduce the problem with that package and let me know whether this fixes it, that would be great.

Created attachment 343547 [details]
new multipath daemon core dump file
We tried the suggested package, device-mapper-multipath-debuginfo-0.4.7-23.el5_3.4.x86_64.rpm, but the multipath daemon still crashes while running an I/O test with failover/failback. The new multipathd core dump file, core.29890, is also attached. Also, please let us know where you think the issue is and what kind of fix you are looking at.

Here is what I noticed looking at the first dump file. multipathd crashed trying to free its selector string, because it had already been freed. What's more, that selector string wasn't supposed to be freed by the multipath device at all. It belonged to the hardware entry.

Due to some sloppy design in multipathd, if a multipath device gets all of its parameters filled in from the kernel (this could happen if you manually created a multipath device via dmsetup), the multipath structure allocates its own copy of things like its hardware handler string, features string, and selector string. If multipathd creates the multipath device, the multipath structure doesn't allocate the string; it just points to an existing copy. If the multipath device took its selector string from its hardware entry (like in this case), it simply points to the string that was allocated by the hardware entry.

What that all means is this: sometimes when multipath is reconfiguring, it should free the string it has (if it had allocated the string). Sometimes it shouldn't (if it's simply pointing to a string that was allocated by the hardware entry). Multipath decides which to do by checking whether its selector string pointer points to the same location as the one in its hardware entry. This works as long as multipath still remembers what its hardware entry is. Unfortunately, looking at the data in the crash dump, it appears that the multipath's hwe pointer got cleared.
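The pointer-comparison check described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `struct hw_entry`, `struct mpath`, `mpath_would_free_selector()` and `count_wrong_frees()` are hypothetical, simplified names and not the actual multipathd code, and the selector value is just a sample string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the real multipathd structures. */
struct hw_entry {
    char *selector;        /* allocated and owned by the hardware entry */
};

struct mpath {
    struct hw_entry *hwe;  /* NULL models the lost hardware entry pointer */
    char *selector;        /* private copy, or borrowed from hwe */
};

/*
 * The device treats the string as its own unless it is the very same
 * pointer that its hardware entry holds.
 */
bool mpath_would_free_selector(const struct mpath *mpp)
{
    if (mpp->selector == NULL)
        return false;
    if (mpp->hwe != NULL && mpp->selector == mpp->hwe->selector)
        return false;      /* borrowed: must not be freed by the device */
    return true;           /* assumed private: the device would free it */
}

/*
 * Counts how many of two devices sharing one hardware entry would free
 * the shared string once their hwe pointers have been cleared.
 * Returns -1 if the sample string cannot be allocated.
 */
int count_wrong_frees(void)
{
    static const char policy[] = "round-robin 0";  /* illustrative value */
    struct hw_entry hwe = { .selector = malloc(sizeof policy) };
    int wrong = 0;

    if (hwe.selector == NULL)
        return -1;
    memcpy(hwe.selector, policy, sizeof policy);

    struct mpath a = { .hwe = &hwe, .selector = hwe.selector };
    struct mpath b = { .hwe = &hwe, .selector = hwe.selector };

    /* Healthy state: both devices recognise the string as borrowed. */
    assert(!mpath_would_free_selector(&a));
    assert(!mpath_would_free_selector(&b));

    /* Failure state: with hwe lost, there is nothing to compare against. */
    a.hwe = NULL;
    b.hwe = NULL;
    wrong += mpath_would_free_selector(&a);
    wrong += mpath_would_free_selector(&b);

    free(hwe.selector);    /* freed exactly once, by its real owner */
    return wrong;
}
```

In this model `count_wrong_frees()` returns 2: with the hwe pointer cleared, both devices conclude that they own a string that still belongs to the shared hardware entry. The sketch deliberately only counts those decisions instead of calling `free()` twice.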
After this happens, the next time multipath reconfigures, it will think that it allocated the selector string (since there is no hardware entry to match it against), and free it. The first device that does this will appear O.K. However, it will have freed the string pointed to by the hardware entry. The next device that does this will be trying to free memory that has already been freed, and quite possibly reallocated.

There are two ways to fix this:
1. Rewrite a large chunk of multipathd to make it not do any of this. That would be wonderful, but it would need to go upstream first, and won't happen in the RHEL 5 timeframe.
2. Make sure that the multipath device doesn't lose its pointer to its hardware entry.

I'm planning to go with option 2 in RHEL 5. I fixed one place where this could happen in that package I had you try. I will look at your crash dump and see if I can find the other places where it is broken.

I'm pretty sure this always has to do with the path devices disappearing and then reappearing. As a workaround, could you please try deleting the following line from /etc/udev/rules.d/40-multipath.rules:
KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"
Deleting this line will stop a race between multipathd and multipath to add the path to the device. The only thing that will change with this line deleted is that your multipath devices will no longer autoconfigure without multipathd running, except on boot. In practice, this shouldn't be a noticeable change. This line has been removed in recent Fedora versions, and it will not be there in RHEL 6. If deleting this line fixes the problem, it will make it much easier to pinpoint where the bug is, so please let me know if it works for you.

I have tried the workaround that you have provided.
I commented out
KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"
in /etc/udev/rules.d/40-multipath.rules. This has fixed the problem. The daemon is not crashing any more.

Thanks! That narrows down where the problem could be happening. Like I mentioned in comment #8, this workaround shouldn't have any noticeable effect, so you should be fine with leaving that line commented out.

Err... There's still a chance that this was already fixed. I meant to include all of the packages in Comment #3:
http://people.redhat.com/bmarzins/device-mapper-multipath/device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
http://people.redhat.com/bmarzins/device-mapper-multipath/kpartx-0.4.7-23.el5_3.4.x86_64.rpm
http://people.redhat.com/bmarzins/device-mapper-multipath/device-mapper-multipath-debuginfo-0.4.7-23.el5_3.4.x86_64.rpm
instead of just the debuginfo package. Sorry. Can you please download all of the packages and see if you can recreate the issue? To verify this, you will need to uncomment the line in /etc/udev/rules.d/40-multipath.rules. Again, sorry about the confusion, and thanks.

We are not able to install the new device-mapper-multipath and kpartx packages. It is getting into a cyclic dependency. Even force installation is not working.

Can you post the errors that you are seeing?

When I try to do a force install I get the following messages:
rpm -Fvh kpartx-0.4.7-23.el5_3.4.x86_64.rpm
error: Failed dependencies:
kpartx = 0.4.7-23.el5 is needed by (installed) device-mapper-multipath-0.4.7-23.el5.x86_64
rpm -Fvh device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
error: Failed dependencies:
kpartx = 0.4.7-23.el5_3.4 is needed by device-mapper-multipath-0.4.7-23.el5_3.4.x86_64
When I try to just install:
rpm -ivh kpartx-0.4.7-23.el5_3.4.x86_64.rpm
Preparing... ########################################### [100%]
file /sbin/kpartx from install of kpartx-0.4.7-23.el5_3.4.x86_64 conflicts with file from package kpartx-0.4.7-23.el5.x86_64
file /sbin/kpartx.static from install of kpartx-0.4.7-23.el5_3.4.x86_64 conflicts with file from package kpartx-0.4.7-23.el5.x86_64
rpm -ivh device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
error: Failed dependencies:
kpartx = 0.4.7-23.el5_3.4 is needed by device-mapper-multipath-0.4.7-23.el5_3.4.x86_64
You need to install both packages using a single RPM transaction:
# rpm -Uvh kpartx-0.4.7-23.el5_3.4.x86_64.rpm device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
Preparing... ########################################### [100%]
1:kpartx ########################################### [ 50%]
2:device-mapper-multipath########################################### [100%]
# rpm -q kpartx device-mapper-multipath
kpartx-0.4.7-23.el5_3.4
device-mapper-multipath-0.4.7-23.el5_3.4

Created attachment 344574 [details]
core file with the multipath tools device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
Bryn,
I have tried with these new RPMs and I am still able to reproduce this issue. I have attached the new crash. However, the frequency of the crash has reduced: initially the daemon used to crash after just 2 to 3 reboots of the primary controller, but now the daemon crashed after about 15 reboots.
Any update on this?

The workaround from comment #9 is now in the latest 5.4 multipath build (in the fix for bz #497041). I'm closing this bug since you have already verified that this change solves the problem. If you see this problem again, you can reopen this bug.
Description of problem:
The multipath daemon crashes when running an I/O test with failover/failback.

Steps to Reproduce:
1. Start I/O using the dd command on the mpath device.
2. Restart the primary controller (the controller through which the I/O is passing).
Note: Sometimes Step 2 has to be repeated a couple of times to reproduce this crash.

Actual results:
The multipath daemon crashes.

Expected results:
I/O should continue through the other controller.

Additional info:
1. Kernel version: 2.6.18-128.el5
2. Device-mapper version: 1.02.28-2.el5
3. Device-mapper-multipath version: 0.4.7-23.el5
4. Attached the core dump file (i.e. core.8016).