Bug 1713459 - segfault in libforeign-nvme.so
Summary: segfault in libforeign-nvme.so
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: device-mapper-multipath
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.1
Assignee: Ben Marzinski
QA Contact: Marco Patalano
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-23 18:49 UTC by Marco Patalano
Modified: 2021-09-06 15:18 UTC
CC List: 7 users

Fixed In Version: device-mapper-multipath-0.8.0-5.el8
Doc Type: Bug Fix
Doc Text:
Cause: Multipathd was deleting the wrong element of a vector when removing a native-multipathing NVMe device.
Consequence: multipathd could segfault when running on systems with NVMe devices configured for native multipathing.
Fix: Multipathd now deletes the correct element from the vector.
Result: multipathd no longer crashes when run on systems with NVMe devices configured for native multipathing.
Clone Of:
Environment:
Last Closed: 2019-11-05 22:18:16 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Product Errata RHBA-2019:3578 (last updated 2019-11-05 22:18:26 UTC)

Description Marco Patalano 2019-05-23 18:49:11 UTC
Description of problem: On a system with NVMe native multipath enabled, I had forgotten to disable/remove dm-multipath. Then, during an array failover test, I observed the following segfault in /var/log/messages:

May 23 10:38:09 storageqe-14 kernel: nvme nvme3: ANA group 1: optimized.
May 23 10:38:09 storageqe-14 kernel: nvme nvme2: ANA group 1: optimized.
May 23 10:38:12 storageqe-14 kernel: nvme nvme0: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
May 23 10:38:12 storageqe-14 kernel: nvme nvme1: NVME-FC{1}: controller connectivity lost. Awaiting Reconnect
May 23 10:38:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:38:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:38:43 storageqe-14 kernel: rport-2:0-5: blocked FC remote port time out: removing rport
May 23 10:38:43 storageqe-14 kernel: rport-2:0-6: blocked FC remote port time out: removing rport
May 23 10:38:43 storageqe-14 kernel: rport-2:0-4: blocked FC remote port time out: removing rport
May 23 10:39:12 storageqe-14 kernel: nvme nvme1: NVME-FC{1}: dev_loss_tmo (60) expired while waiting for remoteport connectivity.
May 23 10:39:12 storageqe-14 kernel: nvme nvme0: NVME-FC{0}: dev_loss_tmo (60) expired while waiting for remoteport connectivity.
May 23 10:39:12 storageqe-14 kernel: nvme nvme1: Removing ctrl: NQN "nqn.1992-08.com.netapp:sn.e18bfca87d5e11e98c0800a098cbcac6:subsystem.st14_nvme_ss_1_1"
May 23 10:39:12 storageqe-14 kernel: nvme nvme0: Removing ctrl: NQN "nqn.1992-08.com.netapp:sn.e18bfca87d5e11e98c0800a098cbcac6:subsystem.st14_nvme_ss_1_1"
May 23 10:39:12 storageqe-14 multipathd[816]: nvme0c192n1: path already removed
May 23 10:39:13 storageqe-14 multipathd[816]: nvme0c256n1: path already removed
May 23 10:39:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:39:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:39:28 storageqe-14 multipathd[816]: path 2 not found in nvme0n1 any more
May 23 10:39:28 storageqe-14 multipathd[816]: path 1 not found in nvme0n1 any more
May 23 10:39:48 storageqe-14 kernel: multipathd[821]: segfault at 20 ip 00007f3faf6daf1f sp 00007f3fb3fc6920 error 6 in libforeign-nvme.so[7f3faf6d9000+6000]
May 23 10:39:48 storageqe-14 kernel: Code: 00 00 00 0f 0b f3 0f 1e fa 48 89 c3 eb 08 4c 89 f7 e8 f5 fc ff ff 48 8d bd 70 cf ff ff e8 29 fe ff ff 48 89 df e8 21 ff ff ff <c6> 04 25 20 00 00 00 00 0f 0b 4c 89 e7 e8 3f fe ff ff eb d8 4c 89
May 23 10:39:48 storageqe-14 systemd[1]: Created slice system-systemd\x2dcoredump.slice.
May 23 10:39:48 storageqe-14 systemd[1]: Started Process Core Dump (PID 2464/UID 0).
May 23 10:39:48 storageqe-14 systemd-coredump[2465]: Process 816 (multipathd) of user 0 dumped core.

Stack trace of thread 821:
#0  0x00007f3faf6daf1f _find_controllers.cold.9 (libforeign-nvme.so)
#1  0x00007f3faf6dcc50 _check (libforeign-nvme.so)
#2  0x00007f3faf6dccba check (libforeign-nvme.so)
#3  0x00007f3fb3c0464c check_foreign (libmultipath.so.0)
#4  0x0000563f0e7e0bde checkerloop (multipathd)
#5  0x00007f3fb2f692de start_thread (libpthread.so.0)
#6  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 818:
#0  0x00007f3fb232b4c6 ppoll (libc.so.6)
#1  0x0000563f0e7e1779 uxsock_listen (multipathd)
#2  0x0000563f0e7dc944 uxlsnrloop (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 816:
#0  0x00007f3fb2f6f4dc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000563f0e7dad81 main (multipathd)
#2  0x00007f3fb225d813 __libc_start_main (libc.so.6)
#3  0x0000563f0e7db67e _start (multipathd)

Stack trace of thread 820:
#0  0x00007f3fb232b3d1 __poll (libc.so.6)
#1  0x00007f3fb3bf1f86 uevent_listen (libmultipath.so.0)
#2  0x0000563f0e7dc355 ueventloop (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 822:
#0  0x00007f3fb2f6f4dc pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x00007f3fb3bf11aa uevent_dispatch (libmultipath.so.0)
#2  0x0000563f0e7dc3ac uevqloop (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 823:
#0  0x00007f3fb2330ead syscall (libc.so.6)
#1  0x00007f3fb3184872 call_rcu_thread (liburcu.so.6)
#2  0x00007f3fb2f692de start_thread (libpthread.so.0)
#3  0x00007f3fb2336653 __clone (libc.so.6)

Stack trace of thread 819:
#0  0x00007f3fb232b3d1 __poll (libc.so.6)
#1  0x0000563f0e7e72f1 dmevent_loop (multipathd)
#2  0x0000563f0e7e7d1c wait_dmevents (multipathd)
#3  0x00007f3fb2f692de start_thread (libpthread.so.0)
#4  0x00007f3fb2336653 __clone (libc.so.6)
May 23 10:39:48 storageqe-14 systemd[1]: multipathd.service: Main process exited, code=killed, status=11/SEGV
May 23 10:39:48 storageqe-14 systemd[1]: multipathd.service: Failed with result 'signal'.
May 23 10:40:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:40:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:40:38 storageqe-14 kernel: perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
May 23 10:41:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:41:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:42:22 storageqe-14 restraintd[1540]: *** Current Time: Thu May 23 10:42:22 2019 Localwatchdog at:  * Disabled! *
May 23 10:42:29 storageqe-14 systemd[1]: Starting Device-Mapper Multipath Device Controller...
May 23 10:42:29 storageqe-14 multipathd[2548]: --------start up--------
May 23 10:42:29 storageqe-14 multipathd[2548]: read /etc/multipath.conf
May 23 10:42:29 storageqe-14 multipathd[2548]: path checkers start up
May 23 10:42:29 storageqe-14 systemd[1]: Started Device-Mapper Multipath Device Controller.
May 23 10:50:21 storageqe-14 kernel: nvme nvme3: ANA group 1: inaccessible.
May 23 10:50:21 storageqe-14 kernel: nvme nvme2: ANA group 1: inaccessible.


Version-Release number of selected component (if applicable):
device-mapper-multipath-0.8.0-2.el8

How reproducible: Unknown - occurred once


Steps to Reproduce:
1. Enable NVMe native multipath on the kernel command line
2. Verify that dm-multipath is still enabled (a precondition check for steps 1-2 is sketched after this list)
3. Connect to NVMe namespace on array
4. Issue a controller failover on the array
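
A minimal, hypothetical precondition check for steps 1-2 (not part of the original report), assuming a RHEL 8 host with the nvme_core module loaded: it reads the nvme_core "multipath" module parameter from sysfs and asks systemd whether multipathd is active, which together reproduce the conflicting configuration described above.

/*
 * Hypothetical precondition check for steps 1-2 above (not part of the
 * original report).  Assumes a RHEL 8 host with the nvme_core module
 * loaded; the sysfs path and the systemctl call are the only externals.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char val[16] = "";
    FILE *f = fopen("/sys/module/nvme_core/parameters/multipath", "r");

    if (f) {
        if (fgets(val, sizeof(val), f))
            val[strcspn(val, "\n")] = '\0';
        fclose(f);
    }

    /* Step 1: native NVMe multipath is enabled when this reads "Y". */
    printf("nvme_core.multipath: %s\n", val[0] ? val : "(unknown)");

    /* Step 2: the crash also needs multipathd running alongside it. */
    int active = (system("systemctl is-active --quiet multipathd") == 0);
    printf("multipathd active:   %s\n", active ? "yes" : "no");

    return 0;
}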

Actual results: segfault


Expected results: no segfault should be observed


Additional info:

Comment 1 Ben Marzinski 2019-06-03 22:54:00 UTC
There was a bug in a multipath vector handling function that caused it to delete the wrong element. It should be fixed now.
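
To make that failure mode concrete, here is a minimal illustrative C sketch; the vec type and the drop_gone_path_* helpers are hypothetical names, not the actual libmultipath/libforeign-nvme.so source. It models a vector-delete helper being handed the wrong slot index, so a freed path object stays in the vector and a later scan dereferences it and crashes.

/*
 * Illustrative sketch only -- NOT the actual multipath-tools code.
 * Models the class of bug described above: a vector-delete helper is
 * handed the wrong slot index, so a freed path object remains in the
 * vector and a later pass dereferences the dangling pointer.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vec {
    void **slots;
    int used;
};

/* Remove the element at 'slot' by shifting the tail of the array down. */
void vec_del_slot(struct vec *v, int slot)
{
    if (slot < 0 || slot >= v->used)
        return;
    memmove(&v->slots[slot], &v->slots[slot + 1],
            (size_t)(v->used - slot - 1) * sizeof(void *));
    v->used--;
}

/* Buggy pattern: the path object at 'gone' is freed, but a different
 * slot is deleted (or none at all, at the end of the vector), so the
 * vector still holds the dangling pointer. */
void drop_gone_path_buggy(struct vec *v, int gone)
{
    free(v->slots[gone]);
    vec_del_slot(v, gone + 1);   /* wrong slot */
}

/* Fixed pattern: delete exactly the slot whose object was freed. */
void drop_gone_path_fixed(struct vec *v, int gone)
{
    free(v->slots[gone]);
    vec_del_slot(v, gone);       /* correct slot */
}

int main(void)
{
    struct vec v = { .slots = calloc(3, sizeof(void *)), .used = 3 };
    for (int i = 0; i < v.used; i++)
        v.slots[i] = strdup("nvme-path");

    drop_gone_path_fixed(&v, 1); /* the buggy variant would leave slot 1
                                    pointing at freed memory here */

    for (int i = 0; i < v.used; i++) {
        printf("slot %d: %s\n", i, (char *)v.slots[i]);
        free(v.slots[i]);
    }
    free(v.slots);
    return 0;
}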

Comment 3 Marco Patalano 2019-09-17 13:05:04 UTC
Reproduced with device-mapper-multipath-0.8.0-3.el8. Verified the fix with device-mapper-multipath-0.8.0-5.el8.

Comment 5 errata-xmlrpc 2019-11-05 22:18:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3578

