1444193 – pvresize hangs, seemingly due to changes made by multipathd 'resize map' failure


Summary: pvresize hangs, seemingly due to changes made by multipathd 'resize map' failure

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	device-mapper-multipath
Sub Component:
Version:	6.9
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Ben Marzinski
QA Contact:	Lin Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1374441 1448970 1461138 1507140

Reported:	2017-04-20 19:29 UTC by John Pittman
Modified:	2021-09-09 12:15 UTC
CC List:	12 users
Fixed In Version:	device-mapper-multipath-0.4.9-103.el6
Doc Type:	Bug Fix
Doc Text:	Cause: If multipath failed to resize a multipath device, the device would be left in the SUSPENDED state. Consequence: multipath devices would become unusable if a resize on them fails. Fix: If a resize fails, multipath will now resume using the device with the previous size. Result: When resizing a multipath device fails, the device is still usable.
Clone Of:
Clones:	1448970
Environment:
Last Closed:	2018-06-19 05:17:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1893	0	None	None	None	2018-06-19 05:19:13 UTC

Description John Pittman 2017-04-20 19:29:34 UTC

Description of problem:

The pvresize command hangs after a 'multipathd -k resize map <map_name>' fails.

Version-Release number of selected component (if applicable):

device-mapper-multipath-0.4.9-100.el6.x86_64
lvm2-2.02.143-12.el6.x86_64
kernel-2.6.32-642.el6.x86_64

How reproducible:

mpatha (wwid_omitted) dm-2 NETAPP,LUN
size=16G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 2:0:0:0 sdb 8:16   active ready running
| |- 3:0:1:0 sde 8:64   active ready running
| |- 5:0:0:0 sdh 8:112  active ready running
| `- 4:0:0:0 sdf 8:80   active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 2:0:1:0 sdc 8:32   active ready running
  |- 3:0:0:0 sdd 8:48   active ready running
  |- 4:0:1:0 sdg 8:96   active ready running
  `- 5:0:1:0 sdi 8:128  active ready running

- Increase backend lun by 2G

- Scan new size in for all except sdi to emulate inconsistent sizes among the sd devices

[root@host]# for i in sdb sde sdh sdf sdc sdd sdg ; do echo 1 > /sys/block/$i/device/rescan ; done

- Attempt to resize map fails as expected

[root@host]# multipathd -k'resize map mpatha'
fail

- In this test case, 'pvresize /dev/mapper/mpatha' was run after the failed multipath resize.  The command hung with the backtrace below.

[root@host]# pvresize /dev/mapper/mpatha
^C^C^C^C

[ 2523.294575] INFO: task pvresize:19399 blocked for more than 120 seconds.
[ 2523.294714]       Not tainted 2.6.32-696.1.1.el6.x86_64 #1
[ 2523.294845] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2523.295088] pvresize      D 0000000000000004     0 19399  16600 0x00000080
[ 2523.295224]  ffff88081d0afa18 0000000000000086 ffff88081d0af978 ffffffff81278c54
[ 2523.295484]  ffff880818900e80 ffff880819211900 ffff88081d0af9e8 ffffffffa0004f4f
[ 2523.295748]  ffff88081d0af9d8 ffffffff811d628b ffff880818835ad8 ffff88081d0affd8
[ 2523.296004] Call Trace:
[ 2523.296129]  [<ffffffff81278c54>] ? blk_unplug+0x34/0x70
[ 2523.296266]  [<ffffffffa0004f4f>] ? dm_table_unplug_all+0x5f/0x100 [dm_mod]
[ 2523.296397]  [<ffffffff811d628b>] ? bio_alloc_bioset+0x5b/0xf0
[ 2523.296526]  [<ffffffff8154aed3>] io_schedule+0x73/0xc0
[ 2523.296661]  [<ffffffff811dab5d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270
[ 2523.296801]  [<ffffffff811d64d0>] ? blkdev_get_block+0x0/0x20
[ 2523.296930]  [<ffffffff811db2c7>] __blockdev_direct_IO+0x77/0xe0
[ 2523.297058]  [<ffffffff811d64d0>] ? blkdev_get_block+0x0/0x20
[ 2523.297187]  [<ffffffff811d7557>] blkdev_direct_IO+0x57/0x60
[ 2523.297316]  [<ffffffff811d64d0>] ? blkdev_get_block+0x0/0x20
[ 2523.297444]  [<ffffffff811303eb>] generic_file_aio_read+0x6bb/0x700
[ 2523.297574]  [<ffffffff811d80a0>] ? blkdev_get+0x10/0x20
[ 2523.297709]  [<ffffffff811d80b0>] ? blkdev_open+0x0/0xc0
[ 2523.297841]  [<ffffffff81196e67>] ? __dentry_open+0x257/0x380
[ 2523.297970]  [<ffffffff811d6a41>] blkdev_aio_read+0x51/0x80
[ 2523.298099]  [<ffffffff81199b8a>] do_sync_read+0xfa/0x140
[ 2523.298226]  [<ffffffff810a6840>] ? autoremove_wake_function+0x0/0x40
[ 2523.298351]  [<ffffffff811d686c>] ? block_ioctl+0x3c/0x40
[ 2523.298475]  [<ffffffff811af632>] ? vfs_ioctl+0x22/0xa0
[ 2523.298606]  [<ffffffff811af7d4>] ? do_vfs_ioctl+0x84/0x580
[ 2523.298736]  [<ffffffff8123ac26>] ? security_file_permission+0x16/0x20
[ 2523.298862]  [<ffffffff8119a485>] vfs_read+0xb5/0x1a0
[ 2523.298985]  [<ffffffff8119b236>] ? fget_light_pos+0x16/0x50
[ 2523.299109]  [<ffffffff8119a7d1>] sys_read+0x51/0xb0
[ 2523.299232]  [<ffffffff810ee3ce>] ? __audit_syscall_exit+0x25e/0x290
[ 2523.299358]  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

- To unblock pvresize, restart multipathd from a different terminal

[root@host]# service multipathd restart
ok
Stopping multipathd daemon:                                [  OK  ]
Starting multipathd daemon:                                [  OK  ]

- pvresize completes in original terminal

[root@host]# pvresize /dev/mapper/mpatha
  Physical volume "/dev/mapper/mpatha" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized

[root@host]# multipath -ll mpatha
mpatha (wwid_omitted) dm-2 NETAPP,LUN
size=16G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 2:0:0:0 sdb 8:16   active ready running
| |- 3:0:1:0 sde 8:64   active ready running
| |- 4:0:0:0 sdf 8:80   active ready running
| `- 5:0:0:0 sdh 8:112  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 2:0:1:0 sdc 8:32   active ready running
  |- 3:0:0:0 sdd 8:48   active ready running
  |- 4:0:1:0 sdg 8:96   active ready running
  `- 5:0:1:0 sdi 8:128  active ready running

[root@host]# pvs -v /dev/mapper/mpatha
  PV                 VG     Fmt  Attr PSize  PFree DevSize PV UUID                               
  /dev/mapper/mpatha testvg lvm2 a--u 16.00g 2.00g  16.00g Qn7g0o-n3Dh-4EZD-d6IC-aVce-XESt-foKRQm
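
The reproducer above depends on one path (sdi) still reporting the old size while the others report the new one. The following is a minimal sketch of how that mismatch could be confirmed before running 'resize map', by reading the per-path sector counts from sysfs. The function name paths_same_size and the SYSFS_BLOCK override are hypothetical and exist only for illustration and testing; they are not part of multipath-tools.

```shell
#!/bin/sh
# Sketch: compare the size (in 512-byte sectors) reported by sysfs for each
# path device. Prints "<dev> <sectors>" per path.
# Returns 0 if all sizes match, 1 if they differ, 2 if a size cannot be read.
# SYSFS_BLOCK is a hypothetical override so the sketch can be pointed at a
# test directory instead of /sys/block.
paths_same_size() {
    sysfs_block="${SYSFS_BLOCK:-/sys/block}"
    first=""
    rc=0
    for dev in "$@"; do
        size=$(cat "$sysfs_block/$dev/size") || return 2
        echo "$dev $size"
        if [ -z "$first" ]; then
            first=$size
        elif [ "$size" != "$first" ]; then
            rc=1
        fi
    done
    return $rc
}

# Example invocation with the path names from this report:
#   paths_same_size sdb sde sdh sdf sdc sdd sdg sdi || echo "path sizes differ"
```

In the scenario described here, sdi would be expected to show a smaller sector count than the seven rescanned paths, which would be consistent with the 'resize map' failure.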

Actual results:

The 'resize map' fails, which is expected and desired.  However, running the command still changes something, and I am unsure what.  If I do not run the 'resize map', the pvresize does not hang afterward. (It claims repeatedly that the PV is being resized; I am opening a separate bug on that.)  If whatever the 'resize map' command changes is normal and expected, we can check with the lvm team to see whether the pvresize hang is expected under those conditions.  If the 'resize map' is doing something it shouldn't, that is something for the multipath team to look at, which is why I filed the bug against device-mapper-multipath.  I'll be glad to provide any data needed, as we have a reproducer.

Expected results:

Ideally, the 'resize map' would fail, pvresize would not hang, and instead would show 0 devices resized.

Comment 5 Ben Marzinski 2017-04-27 22:26:12 UTC

I'm pretty sure that if you check with "dmsetup info <dev>" after the failed resize, you will see that the multipath table is in the SUSPENDED state. It looks like the resize code needs to check if the resize attempt failed during the resume call (which it did in this case), and if so, reset the size, and redo the resume.
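
A minimal sketch of the check suggested above, assuming "dmsetup info <dev>" prints a "State:" line in the form shown in the QA output elsewhere in this bug. The helper name dm_state is hypothetical; it only parses text on stdin and does not touch any device itself.

```shell
#!/bin/sh
# Sketch: print the first word after "State:" from `dmsetup info <dev>`
# output supplied on stdin (for example ACTIVE or SUSPENDED).
dm_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# Example invocation (needs root and an existing map):
#   dmsetup info mpatha | dm_state
```

If this prints SUSPENDED after a failed 'resize map', that would match the diagnosis in this comment.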

Comment 6 John Pittman 2017-04-28 12:29:58 UTC

Thanks Ben; you are right.

[root@host]# multipathd -k'resize map mpatha'
fail

[root@host]# dmsetup info -C | grep mpatha
mpatha           253   2 L-sw    1    1      1 mpath-<wwid_omitted>     

[root@host]# dmsetup resume mpatha
[root@host]# dmsetup info -C | grep mpatha
mpatha           253   2 L--w    1    1      1 mpath-<wwid_omitted>   

Manual resume is able to recover.
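
The manual recovery above could be scripted roughly as follows. This is only a sketch of a workaround for affected versions, not the fix that went into multipathd. It assumes the third character of the dmsetup attr column is "s" when the device is suspended, as the L-sw / L--w output above suggests, and it assumes "dmsetup info -c --noheadings -o attr" is available on the installed version. The function names is_suspended_attr and resume_if_suspended are hypothetical.

```shell
#!/bin/sh
# Sketch: return 0 if a dmsetup attr string (e.g. "L-sw") marks the device
# as suspended, i.e. its third character is "s".
is_suspended_attr() {
    case "$1" in
        ??s*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Sketch: resume a device-mapper device only if it is currently suspended.
# Returns 1 if the attr field could not be read.
resume_if_suspended() {
    map=$1
    # Strip whitespace in case the column output is padded.
    attr=$(dmsetup info -c --noheadings -o attr "$map" | tr -d '[:space:]')
    [ -n "$attr" ] || return 1
    if is_suspended_attr "$attr"; then
        dmsetup resume "$map"
    fi
}

# Example invocation after a failed resize (needs root):
#   multipathd -k'resize map mpatha'
#   resume_if_suspended mpatha
```

Resuming this way leaves the map at its previous size, so a later pvresize would be expected to find nothing to grow until the resize itself succeeds.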

Comment 7 Ben Marzinski 2018-01-18 18:54:01 UTC

Fixed this to automatically resume the device again with its original table, if the previous resume left the device in the suspended state.

Comment 11 Lin Li 2018-05-21 09:35:12 UTC

Reproduced on device-mapper-multipath-0.4.9-100.el6
1, [root@storageqe-20 wwids]# rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-100.el6.x86_64
device-mapper-multipath-0.4.9-100.el6.x86_64

2, [root@storageqe-20 wwids]# multipath -ll
mpathe (353333330000007d0) dm-0 Linux,scsi_debug
size=8.0M features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 13:0:0:0 sdr 65:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 14:0:0:0 sds 65:32 active ready running

3, [root@storageqe-20 wwids]# echo 1 > /sys/bus/pseudo/drivers/scsi_debug/virtual_gb

4, [root@storageqe-20 wwids]# echo 1 > /sys/block/sdr/device/rescan

5, [root@storageqe-20 wwids]# multipathd resize map mpathe
fail

6, [root@storageqe-20 wwids]# dmsetup info mpathe | grep "^State:"
State:             SUSPENDED   <-------------------------------------



Verified on device-mapper-multipath-0.4.9-106.el6
1, [root@storageqe-20 wwids]# rpm -qa | grep multipath
device-mapper-multipath-libs-0.4.9-106.el6.x86_64
device-mapper-multipath-debuginfo-0.4.9-106.el6.x86_64
device-mapper-multipath-0.4.9-106.el6.x86_64

2, [root@storageqe-20 wwids]# multipath -ll
mpathe (353333330000007d0) dm-0 Linux,scsi_debug
size=8.0M features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 13:0:0:0 sdr 65:16 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 14:0:0:0 sds 65:32 active ready running


3, [root@storageqe-20 wwids]# echo 1 > /sys/bus/pseudo/drivers/scsi_debug/virtual_gb

4, [root@storageqe-20 wwids]# echo 1 > /sys/block/sdr/device/rescan

5, [root@storageqe-20 wwids]# multipathd resize map mpathe
fail

6, [root@storageqe-20 wwids]#  dmsetup info mpathe | grep "^State:"
State:             ACTIVE   <--------------------------------


Test result: When resizing a multipath device fails, the device is still usable.

Comment 14 errata-xmlrpc 2018-06-19 05:17:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1893
