Bug 1713459

Summary: segfault in libforeign-nvme.so
Product: Red Hat Enterprise Linux 8
Component: device-mapper-multipath
Version: 8.1
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Target Release: 8.1
Reporter: Marco Patalano <mpatalan>
Assignee: Ben Marzinski <bmarzins>
QA Contact: Marco Patalano <mpatalan>
CC: agk, bmarzins, heinzm, msnitzer, prajnoha, rhandlin, zkabelac
Fixed In Version: device-mapper-multipath-0.8.0-5.el8
Doc Type: Bug Fix
Doc Text:
Cause: multipathd was deleting the wrong element of a vector when removing a native multipathing NVMe device.
Consequence: multipathd could segfault on systems with NVMe devices configured for native multipathing.
Fix: multipathd now deletes the correct element from the vector.
Result: multipathd no longer crashes on systems with NVMe devices configured for native multipathing.
Story Points: ---
Last Closed: 2019-11-05 22:18:16 UTC
Type: Bug

Description Marco Patalano 2019-05-23 18:49:11 UTC
Description of problem: On a system with NVMe native multipathing enabled, I had forgotten to disable/remove dm-multipath. Then, during an array failover test, I observed the following segfault in /var/log/messages:

May 23 10:38:09 storageqe-14 kernel: nvme nvme3: ANA group 1: optimized.
May 23 10:38:09 storageqe-14 kernel: nvme nvme2: ANA group 1: optimized.
May 23 10:38:12 storageqe-14 kernel: nvme nvme0: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
May 23 10:38:12 storageqe-14 kernel: nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
May 23 10:38:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:38:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:38:43 storageqe-14 kernel: rport-2:0-5: blocked FC remote port time out: removing rport
May 23 10:38:43 storageqe-14 kernel: rport-2:0-6: blocked FC remote port time out: removing rport
May 23 10:38:43 storageqe-14 kernel: rport-2:0-4: blocked FC remote port time out: removing rport
May 23 10:39:12 storageqe-14 kernel: nvme nvme1: NVME-FC{1}: dev_loss_tmo (60) expired while waiting for remoteport connectivity.
May 23 10:39:12 storageqe-14 kernel: nvme nvme0: NVME-FC{0}: dev_loss_tmo (60) expired while waiting for remoteport connectivity.
May 23 10:39:12 storageqe-14 kernel: nvme nvme1: Removing ctrl: NQN "nqn.1992-08.com.netapp:sn.e18bfca87d5e11e98c0800a098cbcac6:subsystem.st14_nvme_ss_1_1"
May 23 10:39:12 storageqe-14 kernel: nvme nvme0: Removing ctrl: NQN "nqn.1992-08.com.netapp:sn.e18bfca87d5e11e98c0800a098cbcac6:subsystem.st14_nvme_ss_1_1"
May 23 10:39:12 storageqe-14 multipathd[816]: nvme0c192n1: path already removed
May 23 10:39:13 storageqe-14 multipathd[816]: nvme0c256n1: path already removed
May 23 10:39:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:39:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:39:28 storageqe-14 multipathd[816]: path 2 not found in nvme0n1 any more
May 23 10:39:28 storageqe-14 multipathd[816]: path 1 not found in nvme0n1 any more
May 23 10:39:48 storageqe-14 kernel: multipathd[821]: segfault at 20 ip 00007f3faf6daf1f sp 00007f3fb3fc6920 error 6 in libforeign-nvme.so[7f3faf6d9000+6000]
May 23 10:39:48 storageqe-14 kernel: Code: 00 00 00 0f 0b f3 0f 1e fa 48 89 c3 eb 08 4c 89 f7 e8 f5 fc ff ff 48 8d bd 70 cf ff ff e8 29 fe ff ff 48 89 df e8 21 ff ff ff <c6> 04 25 20 00 00 00 00 0f 0b 4c 89 e7 e8 3f fe ff ff eb d8 4c 89
May 23 10:39:48 storageqe-14 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
May 23 10:39:48 storageqe-14 systemd[1]: Started Process Core Dump (PID 2464/UID 0).
May 23 10:39:48 storageqe-14 systemd-coredump[2465]: Process 816 (multipathd) of user 0 dumped core.

Stack trace of thread 821:
#0  0x00007f3faf6daf1f _find_controllers.cold.9 (libforeign-nvme.so)
#1  0x00007f3faf6dcc50 _check (libforeign-nvme.so)
#2  0x00007f3faf6dccba check (libforeign-nvme.so)
#3  0x00007f3fb3c0464c check_foreign (libmultipath.so.0)
#4  0x0000563f0e7e0bde checkerloop (multipathd)
#5  0x00007f3fb2f692de start_thread (libpthread.so.0)
#6  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 818:
#0  0x00007f3fb232b4c6 ppoll (libc.so.6)
#1  0x0000563f0e7e1779 uxsock_listen (multipathd)
#2  0x0000563f0e7dc944 uxlsnrloop (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 816:
#0  0x00007f3fb2f6f4dc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000563f0e7dad81 main (multipathd)
#2  0x00007f3fb225d813 __libc_start_main (libc.so.6)
#3  0x0000563f0e7db67e _start (multipathd)

Stack trace of thread 820:
#0  0x00007f3fb232b3d1 __poll (libc.so.6)
#1  0x00007f3fb3bf1f86 uevent_listen (libmultipath.so.0)
#2  0x0000563f0e7dc355 ueventloop (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 822:
#0  0x00007f3fb2f6f4dc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007f3fb3bf11aa uevent_dispatch (libmultipath.so.0)
#2  0x0000563f0e7dc3ac uevqloop (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 823:
#0  0x00007f3fb2330ead syscall (libc.so.6)
#1  0x00007f3fb3184872 call_rcu_thread (liburcu.so.6)
#2  0x00007f3fb2f692de start_thread (libpthread.so.0)
#3  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 819:
#0  0x00007f3fb232b3d1 __poll (libc.so.6)
#1  0x0000563f0e7e72f1 dmevent_loop (multipathd)
#2  0x0000563f0e7e7d1c wait_dmevents (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)
May 23 10:39:48 storageqe-14 systemd[1]: multipathd.service: Main process exited, code=killed, status=11/SEGV
May 23 10:39:48 storageqe-14 systemd[1]: multipathd.service: Failed with result 'signal'.
May 23 10:40:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:40:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:40:38 storageqe-14 kernel: perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
May 23 10:41:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:41:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:42:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:42:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:42:29 storageqe-14 systemd[1]: Starting Device-Mapper Multipath Device Controller...
May 23 10:42:29 storageqe-14 multipathd[2548]: --------start up--------
May 23 10:42:29 storageqe-14 multipathd[2548]: read /etc/multipath.conf
May 23 10:42:29 storageqe-14 multipathd[2548]: path checkers start up
May 23 10:42:29 storageqe-14 systemd[1]: Started Device-Mapper Multipath Device Controller.
May 23 10:50:21 storageqe-14 kernel: nvme nvme3: ANA group 1: inaccessible.
May 23 10:50:21 storageqe-14 kernel: nvme nvme2: ANA group 1: inaccessible.


Version-Release number of selected component (if applicable):
device-mapper-multipath-0.8.0-2.el8

How reproducible: Unknown - occurred once


Steps to Reproduce:
1. Enable NVMe native multipathing on the kernel command line
2. Verify that dm-multipath is enabled
3. Connect to an NVMe namespace on the array
4. Issue a controller failover on the array

Actual results: segfault


Expected results: no segfault should be observed


Additional info:

Comment 1 Ben Marzinski 2019-06-03 22:54:00 UTC
There was a bug in a multipath vector handling function that caused it to delete the wrong element. It should be fixed now.
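
The sketch below is not the multipathd source; it is a minimal, self-contained C illustration (with made-up names such as pathvec and vec_del_slot) of the failure mode described above: removing entries from a pointer vector while walking it and deleting the wrong slot, which drops a valid entry and leaves a stale one behind for a later pass to trip over.

/*
 * Illustrative only -- not the multipath-tools code.  A tiny pointer
 * "vector" of path names plus a purge routine, shown in a buggy and a
 * fixed variant, to demonstrate deleting the wrong slot while iterating.
 */
#include <stdio.h>
#include <string.h>

#define MAX_PATHS 8

struct pathvec {
    const char *slot[MAX_PATHS];
    int len;
};

/* Delete slot i and shift the remaining entries down. */
static void vec_del_slot(struct pathvec *v, int i)
{
    if (i < 0 || i >= v->len)
        return;
    memmove(&v->slot[i], &v->slot[i + 1],
            (v->len - i - 1) * sizeof(v->slot[0]));
    v->len--;
}

/* Remove every entry whose name matches 'gone'. */
static void purge_paths(struct pathvec *v, const char *gone, int buggy)
{
    for (int i = 0; i < v->len; i++) {
        if (strcmp(v->slot[i], gone) != 0)
            continue;
        if (buggy) {
            /* BUG: deletes the neighbouring slot instead of the matched
             * one, so a valid path is dropped and the stale entry
             * survives -- "deleting the wrong element of a vector". */
            vec_del_slot(v, i + 1);
        } else {
            /* Fix: delete the matched slot, then step the index back so
             * the entry that shifted into position i is examined on the
             * next pass. */
            vec_del_slot(v, i);
            i--;
        }
    }
}

int main(void)
{
    struct pathvec v = {
        .slot = { "nvme0c192n1", "nvme0c256n1", "nvme1c320n1" },
        .len = 3,
    };

    purge_paths(&v, "nvme0c256n1", 0 /* 0 = fixed variant */);

    for (int i = 0; i < v.len; i++)
        printf("remaining: %s\n", v.slot[i]);
    return 0;
}

In the fixed variant the index is stepped back after the deletion so the element that shifted into the freed slot is re-examined, which is the usual way to delete in place while iterating.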

Comment 3 Marco Patalano 2019-09-17 13:05:04 UTC
Reproduced with device-mapper-multipath-0.8.0-3.el8. Verified the fix with device-mapper-multipath-0.8.0-5.el8.

Comment 5 errata-xmlrpc 2019-11-05 22:18:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3578