Bug 740171
Summary: | multipath -l segfault with high number of LUNs | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Harald Klein <hklein> |
Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.9 | CC: | agk, bmarzins, coughlan, dwysocha, heinzm, jwest, mbroz, prajnoha, prockai, thornber |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-06-14 20:25:22 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Harald Klein 2011-09-21 08:06:27 UTC
```
Core was generated by `multipath -l'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000002a957f5c44 in memcpy () from /lib64/tls/libc.so.6
#1  0x000000000040672e in cache_load (pathvec=0x5304a0) at cache.c:59
#2  0x00000000004050a9 in configure () at main.c:1034
#3  0x0000000000405626 in main (argc=2, argv=0x7fbffff978) at main.c:1214
#4  0x0000002a9579f4cb in __libc_check_standard_fds () from /lib64/tls/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) print p
$2 = 0x0
(gdb) p reply
$9 = 0x0
(gdb) p len
$10 = 4315830
```

The failing code path in cache_load() is:

```
send_packet(fd, "dump pathvec", 13);
recv_packet(fd, &reply, &len);
```

cache_load() does not check whether the buffer pointer returned by recv_packet() (the reply to "dump pathvec") is valid before using it.

The problem I see with this bug is that the recv_packet bug doesn't really look like the root cause, although I agree that it needs to be fixed. The question I have is why recv_packet is failing. That len value doesn't seem like it could possibly be right, since it's not a multiple of the size of struct path. My worry is that multipathd crashed. But I see that there are multiple crash dumps. Were these all taken in a row? If so, I don't see how multipathd could have crashed, since in each case multipath successfully connected to the unix socket. Then why couldn't recv_packet even get the size of the output, and why didn't it just hang, waiting for data?

If this is reproducible, can you try running

```
# multipathd -k"dump pathvec"
```

It won't give you the right information, but I'd like to know if it segfaults. If it does, try running something else, like

```
# multipathd -k"help"
```

to see if that segfaults. I can easily fix this crash, but I'd like to make sure that multipathd is still working, and I'd like to make sure that the client can get the cache information.

It is reproducible; the three core dumps are pretty much identical. I'll gather the requested info asap.
It doesn't dump core or segfault (neither does the running daemon), but it dies on SIGPIPE:

```
root@zhlr418b:/var/crash# multipathd -k"dump pathvec"
(null)
root@zhlr418b:/var/crash# strace multipathd -k"dump pathvec"
[...]
socket(PF_FILE, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_FILE, path="/var/run/multipathd.sock"}, 110) = 0
write(3, "\r\0\0\0\0\0\0\0", 8)         = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++
```

For -k"help" it's similar. Could the open file limit (1024) be related to this issue?

br Hari

Yes, the core dumps were taken in a row:

```
root@host:~# multipath -ll
multipath[3019]: segfault at 0000000000000000 rip 0000002a957f5c44 rsp 0000007fbffffa78 error 4
Segmentation fault
root@host:~# multipath -l
multipath[3032]: segfault at 0000000000000000 rip 0000002a957f5c44 rsp 0000007fbffffa78 error 4
Segmentation fault
root@host:~# multipath
multipath[3174]: segfault at 0000000000000000 rip 0000002a957f5c44 rsp 0000007fbffffa88 error 4
Segmentation fault
```

Sorry I haven't had a chance to work on this; I've gotten sidetracked by other bugs. Could you try stopping multipathd, then running

```
# ulimit -n 10000
# multipathd
```

and then retrying the reproducer? This should tell us whether the number of open file descriptors is what's causing this.