Bug 1459370

Summary: Segfault when reading all_devs device with no_path_retry 4
Product: Red Hat Enterprise Linux 7 Reporter: Nir Soffer <nsoffer>
Component: device-mapper-multipath Assignee: Ben Marzinski <bmarzins>
Status: CLOSED ERRATA QA Contact: Lin Li <lilin>
Severity: urgent Docs Contact: Marek Suchánek <msuchane>
Priority: urgent    
Version: 7.3 CC: agk, bmarzins, bmcclain, heinzm, jbrassow, lilin, lmiksik, loberman, mgandhi, msnitzer, prajnoha, rhandlin, vanhoof, ylavi
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: device-mapper-multipath-0.4.9-112.el7 Doc Type: Bug Fix
Doc Text:
DM Multipath no longer crashes when adding a feature to an empty string

Previously, the DM Multipath service terminated unexpectedly when it attempted to add a feature to the features string of a built-in device configuration that had no features string. With this update, DM Multipath first checks if the features string exists, and creates one if necessary. As a result, DM Multipath no longer crashes when trying to modify a nonexistent features string.
Story Points: ---
Clone Of:
: 1510837 (view as bug list) Environment:
Last Closed: 2018-04-10 16:10:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1298243, 1420851, 1469559, 1510837    
Attachments:
Description Flags
Output of "multipathd show config" with the scratch build none

Description Nir Soffer 2017-06-07 00:37:30 UTC
Description of problem:

Using this multipath.conf:

# cat /etc/multipath.conf
# VDSM REVISION 1.4

defaults {
    polling_interval            5
    no_path_retry               4
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}

devices {
    device {
        all_devs                yes
        no_path_retry           4
    }
}


# multipath -ll
Segmentation fault (core dumped)


# gdb multipath
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/multipath...Reading symbols from /usr/sbin/multipath...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install device-mapper-multipath-0.4.9-99.el7_3.1.x86_64
(gdb) run -ll
Starting program: /usr/sbin/multipath -ll
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff727b4ab in __strstr_sse42 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff727b4ab in __strstr_sse42 () from /lib64/libc.so.6
#1  0x00007ffff751ff1c in add_feature () from /lib64/libmultipath.so.0
#2  0x00007ffff751e2e6 in factorize_hwtable () from /lib64/libmultipath.so.0
#3  0x00007ffff751efc8 in load_config () from /lib64/libmultipath.so.0
#4  0x0000000000402009 in main ()

Version-Release number of selected component (if applicable):
# rpm -qa | grep device-mapper
device-mapper-libs-1.02.135-1.el7_3.3.x86_64
device-mapper-persistent-data-0.6.3-1.el7.x86_64
device-mapper-multipath-libs-0.4.9-99.el7_3.1.x86_64
device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64
device-mapper-1.02.135-1.el7_3.3.x86_64
device-mapper-event-1.02.135-1.el7_3.3.x86_64
device-mapper-multipath-0.4.9-99.el7_3.1.x86_64

How reproducible:
Always

When no_path_retry in the all_devs device is set to "fail" instead of 4, multipath works normally.

Comment 2 Ben Marzinski 2017-06-07 20:54:50 UTC
Can you try the rpms at

http://download-node-02.eng.bos.redhat.com/brewroot/scratch/bmarzins/task_13374131/

and see if they fix the issue?  There is a bug in add_feature() when trying to add a new feature to a device configuration that doesn't already have a features string.

Comment 4 Nir Soffer 2017-06-09 15:54:08 UTC
Created attachment 1286465
Output of "multipathd show config" with the scratch build

Comment 5 Ben Marzinski 2017-06-09 22:53:15 UTC
Great. We're in the blockers only phase of rhel-7.4, so how urgent is this bugfix for you?

Comment 6 Yaniv Lavi 2017-06-11 09:58:52 UTC
(In reply to Ben Marzinski from comment #5)
> Great. We're in the blockers only phase of rhel-7.4, so how urgent is this
> bugfix for you?

This can wait to 7.4.z.

Comment 7 Nir Soffer 2017-06-11 11:11:42 UTC
(In reply to Yaniv Lavi from comment #6)
> (In reply to Ben Marzinski from comment #5)
> > Great. We're in the blockers only phase of rhel-7.4, so how urgent is this
> > bugfix for you?
> 
> This can wait to 7.4.z.

This configuration was tested by RHV QE on 2016-07-28:
https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c31

We have been recommending the "no_path_retry 4" option for about a year on the users
mailing list:
http://lists.ovirt.org/pipermail/users/2016-August/041949.html

So this seems to be a regression in 7.3.

I don't know of any customer cases yet, but I don't think we should wait for them.

I would like this fix in 7.3.z.

Comment 8 Yaniv Lavi 2017-06-12 10:09:46 UTC
(In reply to Nir Soffer from comment #7)
> (In reply to Yaniv Lavi from comment #6)
> > (In reply to Ben Marzinski from comment #5)
> > > Great. We're in the blockers only phase of rhel-7.4, so how urgent is this
> > > bugfix for you?
> > 
> > This can wait to 7.4.z.
> 
> This configuration was tested by RHV QE on 2016-07-28:
> https://bugzilla.redhat.com/show_bug.cgi?id=1335176#c31
> 
> We are recommending the "no_path_retry 4" option for about a year in the
> users
> mailing list:
> http://lists.ovirt.org/pipermail/users/2016-August/041949.html
> 
> So this seems to be a regression in 7.3.
> 
> I don't know about customers cases yet, but I don't think we should wait for
> them.
> 
> I would like this fix in 7.3.z.

Nir is correct. My comment was under the assumption that this isn't a regression.
Bronce, can you mark as blocker?

Comment 9 Ben Marzinski 2017-06-12 21:56:47 UTC
For what it's worth, this isn't a regression from rhel-7.3.  It was broken there too. It was working in rhel-7.2, however.

Comment 10 Ben Marzinski 2017-06-16 16:55:49 UTC
*** Bug 1462134 has been marked as a duplicate of this bug. ***

Comment 14 Ben Marzinski 2017-09-20 00:07:56 UTC
multipath wasn't correctly adding features to a configuration if the current features string was NULL.  It now handles this correctly.
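
For illustration only, here is a minimal sketch of the kind of NULL handling described above. It is not the actual multipath-tools patch: the function name add_feature_sketch, its signature, and its return codes are hypothetical. It assumes the features string has the form "<count> <word> <word> ..." seen in the multipath -ll output in this report, and it uses a plain substring search for the duplicate check, which is a simplification.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, NOT the multipath-tools implementation.
 * Appends the single-word feature `feat` to the features string held in
 * *features ("<count> <word> ..."). *features may be NULL on entry.
 * Returns 0 on success, 1 on error.
 */
int add_feature_sketch(char **features, const char *feat)
{
    char *updated;
    char *rest = NULL;
    long count;
    size_t len;

    if (!features || !feat || !*feat || strchr(feat, ' '))
        return 1;

    /* No features string yet: create one instead of searching NULL. */
    if (*features == NULL) {
        len = strlen("1 ") + strlen(feat) + 1;
        updated = malloc(len);
        if (!updated)
            return 1;
        snprintf(updated, len, "1 %s", feat);
        *features = updated;
        return 0;
    }

    /* Only now is strstr() safe. Substring match is a simplification. */
    if (strstr(*features, feat))
        return 0;

    /* Parse the leading count; `rest` then points just past it. */
    count = strtol(*features, &rest, 10);
    if (rest == *features || count < 0)
        return 1;

    /* 21 bytes is enough for the digits of any long plus a sign. */
    len = 21 + strlen(rest) + 1 + strlen(feat) + 1;
    updated = malloc(len);
    if (!updated)
        return 1;
    snprintf(updated, len, "%ld%s %s", count + 1, rest, feat);

    free(*features);
    *features = updated;
    return 0;
}
```

Calling this with *features == NULL and "queue_if_no_path" yields "1 queue_if_no_path" rather than passing NULL to strstr(), which is where the backtrace in the description shows the crash.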

Comment 17 Lin Li 2017-11-14 12:49:26 UTC
Reproduced on device-mapper-multipath-0.4.9-111.el7
1, # rpm -qa | grep multipath
device-mapper-multipath-0.4.9-111.el7.x86_64
device-mapper-multipath-libs-0.4.9-111.el7.x86_64

2, edit /etc/multipath.conf
# cat /etc/multipath.conf
defaults {
    polling_interval            5
    no_path_retry               4
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}

devices {
    device {
        all_devs                yes
        no_path_retry           4
    }
}

3, # multipath -ll
Segmentation fault

4, # service multipathd reload
Redirecting to /bin/systemctl reload multipathd.service
Job for multipathd.service failed because a fatal signal was delivered to the control process. See "systemctl status multipathd.service" and "journalctl -xe" for details.

5, # systemctl status multipathd.service
● multipathd.service - Device-Mapper Multipath Device Controller
   Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2017-11-14 07:09:22 EST; 42s ago
  Process: 13257 ExecReload=/sbin/multipathd reconfigure (code=exited, status=0/SUCCESS)
 Main PID: 1189 (code=killed, signal=SEGV)

Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c56786f: stop event chec...84)
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c567871: stop event chec...16)
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c567873: stop event chec...48)
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c567875: stop event chec...80)
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[13257]: error receiving packet
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: multipathd.service: main process exited, code=killed, s...SEGV
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: PID 1189 read from file /run/multipathd/multipathd.pid ...bie.
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: Reload failed for Device-Mapper Multipath Device Controller.
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: Unit multipathd.service entered failed state.
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: multipathd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.


6, # journalctl -xe
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com polkitd[1521]: Registered Authentication Agent for unix-process:13241:655805
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: reconfigure (operator)
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c56786d: stop event checker thre
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c56786f: stop event checker thre
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c567871: stop event checker thre
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c567873: stop event checker thre
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[1189]: 360a98000324669436c2b45666c567875: stop event checker thre
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com kernel: multipathd[1222]: segfault at 0 ip 00007f6091eb2f5b sp 00007f6093431
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com multipathd[13257]: error receiving packet
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: multipathd.service: main process exited, code=killed, status=11/
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: PID 1189 read from file /run/multipathd/multipathd.pid does not 
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: Reload failed for Device-Mapper Multipath Device Controller.
-- Subject: Unit multipathd.service has finished reloading its configuration
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit multipathd.service has finished reloading its configuration
-- 
-- The result is failed.
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: Unit multipathd.service entered failed state.
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com systemd[1]: multipathd.service failed.
Nov 14 07:09:22 storageqe-06.rhts.eng.bos.redhat.com polkitd[1521]: Unregistered Authentication Agent for unix-process:13241:6558
Nov 14 07:09:36 storageqe-06.rhts.eng.bos.redhat.com kernel: multipath[13271]: segfault at 0 ip 00007fb9904d1f5b sp 00007ffde827d






Verified on device-mapper-multipath-0.4.9-116
1, # rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-116.el7.x86_64
device-mapper-multipath-devel-0.4.9-116.el7.x86_64
device-mapper-multipath-debuginfo-0.4.9-116.el7.x86_64
device-mapper-multipath-0.4.9-116.el7.x86_64
device-mapper-multipath-sysvinit-0.4.9-116.el7.x86_64

2, edit /etc/multipath.conf
# cat /etc/multipath.conf
defaults {
    polling_interval            5
    no_path_retry               4
    user_friendly_names         no
    flush_on_last_del           yes
    fast_io_fail_tmo            5
    dev_loss_tmo                30
    max_fds                     4096
}

devices {
    device {
        all_devs                yes
        no_path_retry           4
    }
}

3, # multipath -ll
360a98000324669436c2b45666c56786d dm-0 NETAPP  ,LUN             
size=20G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:0 sdg 8:96  active ready running
| `- 4:0:0:0 sdl 8:176 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:0 sdb 8:16  active ready running
  `- 4:0:1:0 sdq 65:0  active ready running
360a98000324669436c2b45666c567875 dm-8 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:4 sdk 8:160 active ready running
| `- 4:0:0:4 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:4 sdf 8:80  active ready running
  `- 4:0:1:4 sdu 65:64 active ready running
360a98000324669436c2b45666c567873 dm-7 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:3 sdj 8:144 active ready running
| `- 4:0:0:3 sdo 8:224 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:3 sde 8:64  active ready running
  `- 4:0:1:3 sdt 65:48 active ready running
360a98000324669436c2b45666c567871 dm-6 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:2 sdi 8:128 active ready running
| `- 4:0:0:2 sdn 8:208 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:2 sdd 8:48  active ready running
  `- 4:0:1:2 sds 65:32 active ready running
360a98000324669436c2b45666c56786f dm-5 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:1 sdh 8:112 active ready running
| `- 4:0:0:1 sdm 8:192 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:1 sdc 8:32  active ready running
  `- 4:0:1:1 sdr 65:16 active ready running


4, # service multipathd reload
Reloading multipathd configuration (via systemctl):  [  OK  ]

5, # multipath -ll
360a98000324669436c2b45666c56786d dm-0 NETAPP  ,LUN             
size=20G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:0 sdg 8:96  active ready running
| `- 4:0:0:0 sdl 8:176 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:0 sdb 8:16  active ready running
  `- 4:0:1:0 sdq 65:0  active ready running
360a98000324669436c2b45666c567875 dm-8 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:4 sdk 8:160 active ready running
| `- 4:0:0:4 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:4 sdf 8:80  active ready running
  `- 4:0:1:4 sdu 65:64 active ready running
360a98000324669436c2b45666c567873 dm-7 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:3 sdj 8:144 active ready running
| `- 4:0:0:3 sdo 8:224 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:3 sde 8:64  active ready running
  `- 4:0:1:3 sdt 65:48 active ready running
360a98000324669436c2b45666c567871 dm-6 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:2 sdi 8:128 active ready running
| `- 4:0:0:2 sdn 8:208 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:2 sdd 8:48  active ready running
  `- 4:0:1:2 sds 65:32 active ready running
360a98000324669436c2b45666c56786f dm-5 NETAPP  ,LUN             
size=2.0G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:1:1 sdh 8:112 active ready running
| `- 4:0:0:1 sdm 8:192 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:1 sdc 8:32  active ready running
  `- 4:0:1:1 sdr 65:16 active ready running



Test result:
multipath no longer crashes when trying to modify the features string of built-in device configurations with no feature string.

Comment 23 errata-xmlrpc 2018-04-10 16:10:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0884