Bug 740171
Summary: | multipath -l segfault with high number of LUNs | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Harald Klein <hklein> |
Component: | device-mapper-multipath | Assignee: | Ben Marzinski <bmarzins> |
Status: | CLOSED WONTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.9 | CC: | agk, bmarzins, coughlan, dwysocha, heinzm, jwest, mbroz, prajnoha, prockai, thornber |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-06-14 20:25:22 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Harald Klein 2011-09-21 08:06:27 UTC
```
Core was generated by `multipath -l'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000002a957f5c44 in memcpy () from /lib64/tls/libc.so.6
#1  0x000000000040672e in cache_load (pathvec=0x5304a0) at cache.c:59
#2  0x00000000004050a9 in configure () at main.c:1034
#3  0x0000000000405626 in main (argc=2, argv=0x7fbffff978) at main.c:1214
#4  0x0000002a9579f4cb in __libc_check_standard_fds () from /lib64/tls/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) print p
$2 = 0x0
(gdb) p reply
$9 = 0x0
(gdb) p len
$10 = 4315830
```

The failing code path in cache_load() is:

```
send_packet(fd, "dump pathvec", 13);
recv_packet(fd, &reply, &len);
```

cache_load() does not check whether the buffer pointer returned by recv_packet() (the reply to "dump pathvec") is valid before using it.

The problem I see with this bug is that the recv_packet bug doesn't really look like the root cause, although I agree that it needs to be fixed. The question I have is why recv_packet is failing. That len value doesn't seem like it could possibly be right, since it's not a multiple of the size of struct path. My worry is that multipathd crashed. But I see that there are multiple crash dumps. Were these all taken in a row? If so, I don't see how multipathd could have crashed, since in each case multipath successfully connected to the unix socket. Then why couldn't recv_packet even get the size of the output, and why didn't it just hang, waiting for data?

If this is reproducible, can you try running

```
# multipathd -k"dump pathvec"
```

It won't give you the right information, but I'd like to know if it segfaults. If it does, try running something else, like

```
# multipathd -k"help"
```

to see if that segfaults. I can easily fix this crash, but I'd like to make sure that multipathd is still working, and I'd like to make sure that the client can get the cache information.

It is reproducible; the three core dumps are pretty much identical. I'll gather the requested info asap.
It doesn't dump core or segfault (neither does the running daemon), but it dies on SIGPIPE:

```
root@zhlr418b:/var/crash# multipathd -k"dump pathvec"
(null)
root@zhlr418b:/var/crash# strace multipathd -k"dump pathvec"
[...]
socket(PF_FILE, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_FILE, path="/var/run/multipathd.sock"}, 110) = 0
write(3, "\r\0\0\0\0\0\0\0", 8)         = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
+++ killed by SIGPIPE +++
```

For -k"help" it's similar. Could the open file limit (1024) be related to this issue?

br Hari

Yes, the core dumps were taken in a row:

```
root@host:~# multipath -ll
multipath[3019]: segfault at 0000000000000000 rip 0000002a957f5c44 rsp 0000007fbffffa78 error 4
Segmentation fault
root@host:~# multipath -l
multipath[3032]: segfault at 0000000000000000 rip 0000002a957f5c44 rsp 0000007fbffffa78 error 4
Segmentation fault
root@host:~# multipath
multipath[3174]: segfault at 0000000000000000 rip 0000002a957f5c44 rsp 0000007fbffffa88 error 4
Segmentation fault
```

Sorry I haven't had a chance to work on this; I've gotten sidetracked by other bugs. Could you try stopping multipathd, then running

```
# ulimit -n 10000
# multipathd
```

and then retrying the reproducer? This should tell us whether the number of open file descriptors is what's causing this.