Bug 1049637 - multipathd encounters memory corruption
Summary: multipathd encounters memory corruption
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 6.6
Assignee: Ben Marzinski
QA Contact: yanfu,wang
URL:
Whiteboard:
Duplicates: 1012672
Depends On:
Blocks: 1022765
 
Reported: 2014-01-07 22:13 UTC by Karandeep Chahal
Modified: 2014-10-14 07:42 UTC
CC List: 14 users

Fixed In Version: device-mapper-multipath-0.4.9-75.el6
Doc Type: Bug Fix
Doc Text:
Cause: Multipathd's sysfs device handling code could free a device structure while other data still pointed to it. Consequence: Multipathd could occasionally experience use-after-free memory corruption, causing it to crash. Fix: The sysfs device handling code for multipathd has been significantly rewritten to deal with a number of issues; as part of this, multipathd will no longer free sysfs device memory while it is still in use. Result: Multipathd no longer crashes due to use-after-free memory corruption.
Clone Of:
Environment:
Last Closed: 2014-10-14 07:42:26 UTC
Target Upstream Version:
Embargoed:


Attachments
var/spool/abrt folder (3.79 MB, application/x-gzip)
2014-01-07 22:13 UTC, Karandeep Chahal
multipath.conf (8.12 KB, text/plain)
2014-01-10 14:56 UTC, Karandeep Chahal
RHEL 5.9 multipathd segfault (190.72 KB, application/x-gzip)
2014-01-31 18:40 UTC, Karandeep Chahal


Links
Red Hat Product Errata RHBA-2014:1555 (normal, SHIPPED_LIVE): device-mapper-multipath bug fix and enhancement update (last updated 2014-10-14 01:27:56 UTC)

Description Karandeep Chahal 2014-01-07 22:13:09 UTC
Created attachment 846858 [details]
var/spool/abrt folder

Description of problem:
Multipathd crashes when an FC switch is suddenly enabled and the port count is greater than or equal to 32.


Version-Release number of selected component (if applicable):
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64

Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel \r on an \m
Linux co-test-d106.colorado.datadirectnet.com 2.6.32-279.19.1.el6.x86_64 #1 SMP Sat Nov 24 14:35:28 EST 2012 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
In this configuration there are 16 Linux initiators, each with 8 FC ports connected to a switch. On the target side is a DDN storage array. The problem happens when the 32 storage ports are suddenly enabled on the switch fabric. Multipathd then dies with what appears to be memory corruption.


Steps to Reproduce:
1. Connect at least 16 initiators, each with 8 FC ports, to an FC switch.
2. Zone all initiators to a storage array with 32 ports (in this test we used DDN storage).
3. Suddenly enable the 32 switch ports connected to the storage.

Actual results:
multipathd crashed

Expected results:
multipathd must not crash

Additional info:
Happens on many different versions of RHEL.
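
(For reference, path and map state while the ports come up can be watched live with the standard multipathd interactive shell:

# multipathd -k
multipathd> show paths
multipathd> show maps

This is stock tooling on these releases; it just makes the path churn during the test visible.)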

Comment 2 Ben Marzinski 2014-01-09 01:10:33 UTC
Well, the coredump certainly tells me that the memory has gotten corrupted, but it would be helpful to have /var/log/messages from a run where this happens, to see what is going on immediately before the crash.
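
(Assuming default syslog settings, something like this pulls the relevant window:

# grep multipathd /var/log/messages | tail -n 200
)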

Comment 3 Ben Marzinski 2014-01-09 01:12:50 UTC
Also, have you tried this with recent versions of device-mapper-multipath? While this doesn't look exactly like any issue I remember, there have been memory issues fixed since RHEL-6.3.

Comment 4 Ben Marzinski 2014-01-09 05:00:07 UTC
Well, the corrupt memory looks like this:

$49 = "0000000f188b10030\000\000\000\000\000\000\000$\n\000\000\000\000\000\000\320\020\016X\325\177\000\000\000/\016X\325\177\000\000\020/\016X\325\177\000\000/devices/pci0000:00/0000:00:09.0/0000:10:00.0/0000:11:03.0/0000:13:00.0/host15/rport-15:0-17/target15:0:12/15:0:12:62/block/sdct", '\000' <repeats 315 times>

Which should help locating the problem.
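
(For anyone else digging in the coredump: the suspect chunk can be examined with standard gdb commands along these lines; the addresses here are placeholders, not the real ones. The first dumps the raw chunk bytes, the second reads the string that overwrote the free-list pointer, and the third checks whether a stored pointer maps to any known code or data.

(gdb) x/48xb 0x7fd5580e2000
(gdb) x/s 0x7fd5580e2010
(gdb) info symbol 0x7fd5580e2fd0
)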

Comment 5 Karandeep Chahal 2014-01-09 14:43:43 UTC
(In reply to Ben Marzinski from comment #2)
> Well the coredump certainly tells me that the memory has gotten corrupted,
> but it would be helpful to have /var/log/messages from a run where this
> happens to see what is going on immediately before the crash.

I will get someone to reproduce this and gather /var/log/messages. Would you like anything else? Perhaps enhanced multipathd logging? Which versions of RHEL and multipath would you like us to run? This happens fairly consistently on all versions of RHEL we tested.

Comment 6 Ben Marzinski 2014-01-09 21:49:37 UTC
Well, looking at the corrupt memory itself doesn't appear horribly helpful.  Actually, the only bit of data that's left in the corrupted malloc chunk (besides the data that overwrote the linked list pointer with 0x0) is "0000000f188b1003".  That's pretty definitely part of the WWID from one of your paths, judging from the complete ones I found in the memory dump.

If this was a use-after-free corruption, then something would have to be writing
0x0 to the first 8 bytes of that wwid after it was freed.  Looking through the code I can't find any obvious place where this could be happening. However, I only looked at variables that definitely store the wwid and get freed.  If you aren't using user_friendly_names, then the multipath's alias is also set to the wwid.  So, is user_friendly_names set in your /etc/multipath.conf?

Also, I didn't look at the code for the multipaths section of /etc/multipath.conf.  The multipath entries there are identified by wwid. If you have a multipaths section in your /etc/multipath.conf, and during your test you reload the multipathd configuration using

# service multipathd reload

or

# multipathd reconfigure

then these would also be freed, and it would be worth checking there.

So, I should probably take a look at your /etc/multipath.conf file. Could you please attach that to the bugzilla?
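
For reference, the two configuration pieces in question look roughly like this (a sketch with placeholder values, not the reporter's actual file):

defaults {
        user_friendly_names yes
}

multipaths {
        multipath {
                wwid  36000000000000000000000000000abcd
                alias example_lun
        }
}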

Another possibility is that the previous malloc chunk overwrote its allotted memory and corrupted this entry.  But the previous entry is a path structure, and it looks fine, so this isn't likely. Also, the beginning of the malloc structure is fine.  The only part that got overwritten is the part used by the actual data when the chunk is in use.

Right now my best guess is that this is a use-after-free issue, but whatever old memory is being used has been freed for a while, and the space has been already reused to store the wwid.
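
To make that sequence concrete, here is a minimal, self-contained C sketch of this kind of use-after-free (hypothetical names, not multipathd's actual code): a structure is freed while a stale pointer survives, the allocator reuses the chunk for a wwid string, and a later write through the stale pointer zeroes the wwid's first bytes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure standing in for the freed-too-early object. */
struct dev_ctx {
    long flags;
    char scratch[24];
};

int main(void)
{
    struct dev_ctx *ctx = malloc(sizeof(*ctx));
    struct dev_ctx *stale = ctx;   /* a reference that outlives the free */
    free(ctx);

    /* With glibc, the freed 32-byte chunk is typically handed right back,
       so an unrelated allocation (here, a wwid string) reuses it. */
    char *wwid = malloc(sizeof(*ctx));
    strcpy(wwid, "0123456789abcdef");

    stale->flags = 0;   /* stale write: zeroes the first 8 bytes of the
                           wwid; undefined behavior, and exactly the kind
                           of clobbering described above */
    printf("wwid now reads: \"%s\"\n", wwid);
    free(wwid);
    return 0;
}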

However, there are some weirdnesses I've noticed.

When I start up gdb, I get the following message:

Missing separate debuginfo for /lib64/multipath/libpriosfa.so

We don't distribute a libpriosfa.so DSO. I found the hardware entry for your paths in the memory dump, and they are using this prio function. Is this something you wrote? If it is, it's possible that the corruption is in there, and I would have no way to track that down without the source code.

Another weirdness is that it seems pretty certain that the device-mapper code is requesting a 40 byte piece of memory.  In the backtrace, I see

#5  0x000000348847a911 in __libc_malloc (bytes=40) at malloc.c:3664
#6  0x0000003489421656 in dm_zalloc_aux (s=40, file=<value optimized out>,
    line=<value optimized out>) at mm/dbg_malloc.c:274

However, the chunk that the malloc code is checking is only 48 bytes, and that includes the 16 bytes of malloc overhead, so it could only be returning a 32 byte chunk to the caller. I'm not sure what to make of this. I rarely go digging through the malloc internals to try to track down a bug, so I could just be misunderstanding something, but examining the structures in GDB certainly makes it appear that malloc overhead is 16 bytes on an x86_64 system (it's 8 on an i686 system).  At any rate, this isn't what caused multipath to crash.  The overwritten pointer is 16 bytes in, and the earlier part of the malloc structure is fine (the malloc chunk structure uses 48 bytes for freed chunks). Also, I have a lot of problems believing that malloc could be returning the wrong size memory without causing problems much sooner.  I just thought I'd mention it.
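
(Side note for anyone reproducing this analysis: glibc's malloc_usable_size() lets you probe the chunk bookkeeping directly, without digging through the internals. A minimal glibc-specific probe:

#include <malloc.h>   /* malloc_usable_size() is a glibc extension */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Print how many usable bytes glibc actually hands back for small
       requests; the gap between requested and usable sizes reflects
       per-chunk overhead and rounding. */
    for (size_t req = 24; req <= 48; req += 8) {
        void *p = malloc(req);
        printf("requested %2zu, usable %2zu\n", req, malloc_usable_size(p));
        free(p);
    }
    return 0;
}

On typical glibc x86_64 builds this shows a 40-byte request satisfied from a 48-byte chunk with 40 usable bytes, i.e. 8 bytes of in-use overhead.)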

The version of the tools I'd like you to test is available here:

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/1012672/

This has a fix that hasn't made it into a release yet, for a problem that can cause memory corruption.  If you never remove any devices before multipathd crashes in your tests, then I don't think that specific fix will help you, but it is the most up-to-date code available.
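
(If it helps, a typical way to pull down and install the test packages; the exact filenames in that directory will differ:

# wget -r -np -nd http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/1012672/
# rpm -Uvh device-mapper-multipath-*.rpm device-mapper-multipath-libs-*.rpm
# service multipathd restart
)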

Comment 7 Karandeep Chahal 2014-01-10 14:56:35 UTC
Created attachment 848235 [details]
multipath.conf

Comment 8 Karandeep Chahal 2014-01-10 15:02:46 UTC
(In reply to Ben Marzinski from comment #6)

User friendly names are not set. I have attached the multipath.conf to this bug report. 

I am sure at some point we must have issued 'service multipathd restart', but that must have happened before the test started.

libpriosfa.so is something we wrote. However, to rule out memory corruption caused by it, we reproduced the same problem with the prioritizer set to "alua". Also, we can provide the source code for libpriosfa if you like.

We will try this test out with the latest version as you specified in your comment.

Thanks for looking at this so quickly!

Comment 9 Karandeep Chahal 2014-01-13 17:56:35 UTC
(In reply to Ben Marzinski from comment #6)

Ben, your RHEL6 rpms seem to be helping the situation. Do you have RHEL5 rpms as well that we could test?

Comment 10 Ben Marzinski 2014-01-15 18:54:07 UTC
Have you seen this issue on RHEL 5?  There are some significant differences in the code. Without figuring out which change appears to have fixed your issue, I can't really know if the affected code even exists in RHEL 5.  What is the most recent RHEL 6 package you tried that failed?

Comment 11 gener 2014-01-20 19:27:40 UTC
1. Were the test files you provided added to the released versions of
        device-mapper-multipath-libs-0.4.9-72...
        device-mapper-multipath-0.4.9-72...

2. If your patched files are in the actual 0.4.9-72 packages, what would be the anticipated release timeframe?

3. What is the future for this problem on RHEL 5? Do you need specific bits from the RHEL 5 failure?

thanks,

Comment 12 Ben Marzinski 2014-01-21 00:53:24 UTC
I accidentally made comment #10 private.  Sorry.

The patches from

http://people.redhat.com/bmarzins/device-mapper-multipath/rpms/RHEL6/1012672/

will be going into the RHEL 6.6 release. There will most likely be a RHEL 6.5 zstream release with the fix.  The decision on that is currently waiting on the customer to verify that the fix does resolve their issue. However, if these patches have fixed your memory corruption issue, then getting a zstream release shouldn't be a problem.

I don't have a good idea on the timeframe for a zstream release.  Since the patch is already written, my part of the whole process would take about an hour or so.  The bigger unknown is QA's schedule.

Like I mentioned in comment #10, I'd need to know which fix solved your issue.  If it was solved by the fix for bug 1012672, then the affected code is not in RHEL 5: none of the sysfs handling code that 1012672 fixed exists in RHEL 5 multipath.  Multipath there relies on libsysfs, and there is no analog to the RHEL 6 code that has the issue.

Comment 13 gener 2014-01-21 14:07:59 UTC
(In reply to Ben Marzinski from comment #10)
> Have you seen this issue on RHEL5?  There are some significant differences
> in the code. Without figuring out which change appears to have fixed your
> issue, I can't really know if the effected code even exists in RHEL5.  What
> is the most recent RHEL6 package you tried and had fail?

The RHEL 5 machines with the multipath versions noted below have shown symptoms of the multipathd service locking up, requiring a server reboot.  A service stop/start or restart is not effective in clearing the condition.
 
10.36.30.46  (RHEL 5.9):   device-mapper-multipath-0.4.7-54.el5_9.2
10.36.30.50  (RHEL 5.10):  device-mapper-multipath-0.4.7-59.el5
10.36.30.51  (RHEL 5.9):   device-mapper-multipath-0.4.7-54.el5_9.1
10.36.30.52  (RHEL 5.10):  device-mapper-multipath-0.4.7-59.el5
10.36.30.55  (RHEL 5.9):   device-mapper-multipath-0.4.7-54.el5_9.1
10.36.30.98  (RHEL 5.10):  device-mapper-multipath-0.4.7-59.el5

My RHEL 6 stations (6.3, 6.4, 6.5) with your test bits continue NOT to exhibit this symptom, but my remaining RHEL 6 stations running the distro-released multipath versions (6.3, 6.4) continue to consistently lock up.

Comment 14 Karandeep Chahal 2014-01-31 18:39:48 UTC
(In reply to Ben Marzinski from comment #12)

Hi Ben,

Attached is a core file from a RHEL 5.9 multipathd crash.


device-mapper-multipath-0.4.7-54.el5_9.2
device-mapper-multipath-debuginfo-0.4.7-59.el5
device-mapper-1.02.67-2.el5
device-mapper-multipath-0.4.7-54.el5_9.2
device-mapper-1.02.67-2.el5
device-mapper-debuginfo-1.02.67-2.el5
device-mapper-event-1.02.67-2.el5
device-mapper-multipath-debuginfo-0.4.7-59.el5

Do we have a fix for this?

Thanks
Karan

Comment 15 Karandeep Chahal 2014-01-31 18:40:25 UTC
Created attachment 857949 [details]
RHEL 5.9 multipathd segfault

Comment 16 Ben Marzinski 2014-02-11 23:53:25 UTC
I don't recall ever seeing this issue before, but I'll take a look through the bugs fixed since RHEL 5.9.  If I don't find anything, you should probably open a separate bugzilla, since this does not look like the issue fixed by the latest RHEL 6 code.

Comment 17 Karandeep Chahal 2014-02-12 14:44:13 UTC
Filed a new defect for the RHEL 5.9 issue: bug 1064409.

Comment 18 Ben Marzinski 2014-04-03 00:32:55 UTC
*** Bug 1012672 has been marked as a duplicate of this bug. ***

Comment 20 Karandeep Chahal 2014-04-07 13:51:44 UTC
Hi Wang,

Exactly what information do you need? We ran many days with the fix provided by Ben and did not encounter any problems. We ran on a big SAN with hundreds of FC initiators and storage. The initiators were mainly RHEL 6x machines.

Please let me know exactly what information is required.

Thanks
Karan

Comment 22 Ben Marzinski 2014-04-24 17:26:49 UTC
Added the fix from the test package.

Comment 26 errata-xmlrpc 2014-10-14 07:42:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1555.html

