Bug 497041 - Multipathd is dying when there is a failover/failback during IO.
Status: CLOSED NEXTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: device-mapper-multipath
Version: 5.3
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Ben Marzinski
QA Contact: Cluster QE
Reported: 2009-04-22 01:30 EDT by Senthil Kumar V
Modified: 2010-01-11 21:47 EST

Doc Type: Bug Fix
Last Closed: 2009-07-17 02:00:23 EDT

Attachments
multipath daemon core dump file (3.20 MB, application/octet-stream)
2009-04-22 01:35 EDT, Senthil Kumar V
new multipath daemon core dump file (2.93 MB, application/octet-stream)
2009-05-12 03:22 EDT, Senthil Kumar V
core file with the multipath tools device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm (3.41 MB, application/octet-stream)
2009-05-19 02:14 EDT, Senthil Kumar V

Description Senthil Kumar V 2009-04-22 01:30:00 EDT
Description of problem:
      The multipath daemon crashes when running an I/O test with failover/failback.

Steps to Reproduce:
1. Start I/O using the dd command on the mpath device.
2. Restart the primary controller (the controller through which the I/O is passing).

Note: Sometimes Step 2 has to be repeated a couple of times to reproduce the crash.
  
Actual results:
Multipath daemon crashes.

Expected results:
I/O should continue through the other controller.

Additional info:
1. Kernel version: 2.6.18-128.el5
2. Device-mapper version: 1.02.28-2.el5
3. Device-mapper-multipath version: 0.4.7-23.el5
4. Attached the core dump file (i.e. core.8016).
Comment 1 Senthil Kumar V 2009-04-22 01:35:04 EDT
Created attachment 340679
multipath daemon core dump file

Adding the multipath daemon core dump file.
Comment 2 Senthil Kumar V 2009-05-07 01:21:39 EDT
Any update on this?
Comment 3 Ben Marzinski 2009-05-08 16:26:05 EDT
The root of this problem is that the multipath device lost its pointer to its hardware entry (the structure that stores the config from the devices section of /etc/multipath.conf). This causes problems when the multipath device goes to reconfigure itself during your test.

A recently fixed bug was also caused by multipath not having its hardware entry pointer set correctly. From what I can see in the dump, that fix may also solve your problem. A package that has the fix for this issue is located at:

http://people.redhat.com/bmarzins/device-mapper-multipath/device-mapper-multipath-debuginfo-0.4.7-23.el5_3.4.x86_64.rpm

If you could try to reproduce the problem with that package and let me know whether it fixes the issue, that would be great.
Comment 4 Senthil Kumar V 2009-05-12 03:22:58 EDT
Created attachment 343547
new multipath daemon core dump file
Comment 5 Senthil Kumar V 2009-05-12 03:25:53 EDT
We tried the suggested package, device-mapper-multipath-debuginfo-0.4.7-23.el5_3.4.x86_64.rpm, but the multipath daemon still crashes while running an I/O test with failover/failback.
Comment 6 Senthil Kumar V 2009-05-12 03:28:52 EDT
The new multipathd core dump file, core.29890, is also attached.
Comment 7 Senthil Kumar V 2009-05-12 05:19:51 EDT
Also, please let us know where you think the issue is and what kind of fix you are looking at.
Comment 8 Ben Marzinski 2009-05-12 12:31:13 EDT
Here is what I noticed looking at the first dump file. multipathd crashed trying to free its selector string, because the string had already been freed. What's more, that selector string wasn't supposed to be freed by the multipath device at all. It belonged to the hardware entry.

Due to a rather sloppy design in multipathd, if a multipath device gets all of its parameters filled in from the kernel (this could happen if you manually created a multipath device via dmsetup), the multipath structure allocates its own copy of things like its hardware handler string, features string, and selector string. If multipathd creates the multipath device, the multipath structure doesn't allocate the string; it just points to an existing copy. If the multipath device took its selector string from its hardware entry (as in this case), it simply points to the string that was allocated by the hardware entry.

What that all means is this: sometimes when multipath is reconfiguring, it should free the string it has (if it had allocated the string). Sometimes it shouldn't (if it's simply pointing to a string that was allocated by the hardware entry). Multipath decides which to do by checking whether its selector string pointer points to the same location as the one in its hardware entry.

This works as long as multipath still remembers what its hardware entry is. Unfortunately, looking at the data in the crash dump, it appears that the multipath device's hwe pointer got cleared. After this happens, the next time multipath reconfigures, it will think that it allocated the selector string (since there is no hardware entry to match it against) and free it. The first device that does this will appear O.K. However, it will have freed the string pointed to by the hardware entry. The next device that does this will be trying to free memory that has already been freed, and quite possibly reallocated.
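
The ownership check described above can be modelled with a small C sketch. This is only an illustration under stated assumptions: `struct hw_entry`, `struct mpath`, `would_free_selector` and `frees_of` are hypothetical, simplified stand-ins and are not taken from the multipathd source. Nothing is actually freed here; the code only reports what the check would decide, so the double free can be observed safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not multipathd's real definitions. */
struct hw_entry {
    char *selector;        /* string allocated and owned by the hardware entry */
};

struct mpath {
    struct hw_entry *hwe;  /* NULL models the lost hardware entry pointer */
    char *selector;        /* borrowed from hwe, or a private copy */
};

/*
 * Models the decision made on reconfigure: returns 1 if the map would
 * free its selector string, 0 if it would leave the string alone.
 */
int would_free_selector(const struct mpath *mpp)
{
    assert(mpp != NULL);
    if (mpp->selector == NULL)
        return 0;                       /* nothing to free */
    if (mpp->hwe != NULL && mpp->selector == mpp->hwe->selector)
        return 0;                       /* borrowed from the hardware entry */
    return 1;                           /* no match, so assumed private */
}

/* Counts how many maps would free the same string; 2 or more means a double free. */
size_t frees_of(struct mpath *const *maps, size_t n, const char *str)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (maps[i]->selector == str && would_free_selector(maps[i]))
            count++;
    return count;
}
```

With two maps that borrow the same selector string from one hardware entry, `frees_of` reports 0 while both `hwe` pointers are intact. Once `hwe` is set to NULL on both maps it reports 2, which corresponds to the second device freeing memory that the first one already released.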

There are two ways to fix this:
1. Rewrite a large chunk of multipathd to make it not do any of this. That would be wonderful, but it would need to go upstream first, and won't happen in the RHEL 5 timeframe.

2. Make sure that the multipath device doesn't lose its pointer to its hardware entry.

I'm planning to go with option 2 in RHEL 5.
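
For contrast, here is a minimal sketch of what an unambiguous ownership rule could look like: every map always holds a private copy of its selector string, so freeing it is always safe and no comparison against the hardware entry is needed. This is purely illustrative and is not the design multipathd or the RHEL 5 fix uses; `struct mpath_owned` and `set_selector` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical map structure that always owns its selector string. */
struct mpath_owned {
    char *selector;   /* NULL or a private heap copy */
};

/*
 * Replaces the map's selector with a private copy of src (or NULL).
 * Returns 0 on success, -1 if the allocation fails; on failure the
 * old selector is left untouched.
 */
int set_selector(struct mpath_owned *mpp, const char *src)
{
    char *copy = NULL;

    assert(mpp != NULL);
    if (src != NULL) {
        size_t len = strlen(src) + 1;

        copy = malloc(len);
        if (copy == NULL)
            return -1;
        memcpy(copy, src, len);
    }
    free(mpp->selector);   /* always safe: the map is the only owner */
    mpp->selector = copy;
    return 0;
}
```

The trade-off is one extra allocation per map and per string, in exchange for never having to guess who allocated a pointer.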

I fixed one place where this could happen in the package I had you try. I will look at your crash dump and see if I can find the other places where it is broken. I'm pretty sure this always has to do with the path devices disappearing and then reappearing.

As a workaround, could you please try deleting the following line from /etc/udev/rules.d/40-multipath.rules:

KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"

Deleting this line will stop a race between multipathd and multipath to add the path to the device. The only thing that will change with this line deleted is that your multipath devices will no longer autoconfigure without multipathd running, except on boot. In practice, this shouldn't be a noticeable change. This line has been removed in recent Fedora versions, and it will not be there in RHEL 6.

If deleting this line fixes the problem, it will make it much easier to pinpoint where the bug is, so please let me know if it works for you.
Comment 9 Senthil Kumar V 2009-05-13 05:07:03 EDT
I have tried the workaround that you provided. I commented out the line

KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'", RUN+="/sbin/multipath -v0 %M:%m"

in /etc/udev/rules.d/40-multipath.rules. This fixed the problem. The daemon no longer crashes.
Comment 10 Ben Marzinski 2009-05-13 13:00:12 EDT
Thanks! That narrows down where the problem could be happening.

As I mentioned in comment #8, this workaround shouldn't have any noticeable effect, so you should be fine leaving that line commented out.
Comment 11 Ben Marzinski 2009-05-14 12:15:53 EDT
Err... There's still a chance that this was already fixed. I meant to include all of the packages in Comment #3.

http://people.redhat.com/bmarzins/device-mapper-multipath/device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
http://people.redhat.com/bmarzins/device-mapper-multipath/kpartx-0.4.7-23.el5_3.4.x86_64.rpm
http://people.redhat.com/bmarzins/device-mapper-multipath/device-mapper-multipath-debuginfo-0.4.7-23.el5_3.4.x86_64.rpm

instead of just the debuginfo package. Sorry.

Can you please download all of the packages and see if you can recreate the issue? To verify this, you will need to uncomment the line in

/etc/udev/rules.d/40-multipath.rules

Again, sorry about the confusion, and thanks.
Comment 12 Senthil Kumar V 2009-05-15 02:24:26 EDT
We are not able to install the new device-mapper-multipath and kpartx packages. The installation runs into a cyclic dependency. Even a force installation does not work.
Comment 13 Ben Marzinski 2009-05-15 11:52:13 EDT
Can you post the errors that you are seeing?
Comment 14 Senthil Kumar V 2009-05-18 07:04:34 EDT
When I try to do a force install, I get the following messages:

rpm -Fvh kpartx-0.4.7-23.el5_3.4.x86_64.rpm
error: Failed dependencies:
        kpartx = 0.4.7-23.el5 is needed by (installed) device-mapper-multipath-0.4.7-23.el5.x86_64


rpm -Fvh device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
error: Failed dependencies:
        kpartx = 0.4.7-23.el5_3.4 is needed by device-mapper-multipath-0.4.7-23.el5_3.4.x86_64

When I try to just install:

rpm -ivh kpartx-0.4.7-23.el5_3.4.x86_64.rpm
Preparing...                ########################################### [100%]
        file /sbin/kpartx from install of kpartx-0.4.7-23.el5_3.4.x86_64 conflicts with file from package kpartx-0.4.7-23.el5.x86_64
        file /sbin/kpartx.static from install of kpartx-0.4.7-23.el5_3.4.x86_64 conflicts with file from package kpartx-0.4.7-23.el5.x86_64

rpm -ivh device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm
error: Failed dependencies:
        kpartx = 0.4.7-23.el5_3.4 is needed by device-mapper-multipath-0.4.7-23.el5_3.4.x86_64
Comment 15 Bryn M. Reeves 2009-05-18 08:43:12 EDT
You need to install both packages using a single RPM transaction:

# rpm -Uvh kpartx-0.4.7-23.el5_3.4.x86_64.rpm device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm 
Preparing...                ########################################### [100%]
   1:kpartx                 ########################################### [ 50%]
   2:device-mapper-multipath########################################### [100%]

# rpm -q kpartx device-mapper-multipath
kpartx-0.4.7-23.el5_3.4
device-mapper-multipath-0.4.7-23.el5_3.4
Comment 16 Senthil Kumar V 2009-05-19 02:14:31 EDT
Created attachment 344574
core file with the multipath tools device-mapper-multipath-0.4.7-23.el5_3.4.x86_64.rpm 

Bryn,

I have tried the new RPMs and I am still able to reproduce the issue. I have attached the new core file. However, the crash is now less frequent: initially the daemon crashed after just 2 to 3 reboots of the primary controller, but now it crashed after about 15 reboots.
Comment 17 Senthil Kumar V 2009-07-16 04:27:36 EDT
Any update on this?
Comment 18 Ben Marzinski 2009-07-17 02:00:23 EDT
The workaround from comment #9 is now in the latest 5.4 multipath build (in the fix for bz #497041). I'm closing this bug since you have already verified that this change solves the problem.  If you see this problem again, you can reopen this bug.
