Bug 1576461 - Cannot mount dynamic vSphere (vmdk) disks on host because dm-multipath locks the device
Summary: Cannot mount dynamic vSphere (vmdk) disks on host because dm-multipath locks the device
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.7.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.7.z
Assignee: Hemant Kumar
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-09 13:53 UTC by Takeshi Larsson
Modified: 2018-05-24 12:36 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-24 12:36:06 UTC
Target Upstream Version:



Description Takeshi Larsson 2018-05-09 13:53:27 UTC
Description of problem:
After upgrading, we were not seeing any issues with mounting vmdk disks on the host.
After we ran the advanced Ansible installer (config.yml), it installed some new packages:

* device-mapper-multipath
* device-mapper-multipath-libs
From yum.log:
May 08 09:40:21 Installed: device-mapper-multipath-libs-0.4.9-111.el7_4.2.x86_64
May 08 09:40:21 Installed: device-mapper-multipath-0.4.9-111.el7_4.2.x86_64

TASK [openshift_node : Install iSCSI storage plugin dependencies] 
Tuesday 08 May 2018  09:30:54 +0200 (0:00:01.093)       0:21:15.677
ok: [redacted] => (item=iscsi-initiator-utils)
changed: [redacted] => (item=device-mapper-multipath)
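
For reference, something like the following confirms on a node that the installer pulled these packages in and that the multipath daemon is actually active; the commands are only illustrative and the output will differ per host.

# Confirm the multipath packages the installer added
rpm -q device-mapper-multipath device-mapper-multipath-libs

# Check whether the multipath daemon is enabled and running
systemctl is-enabled multipathd
systemctl status multipathd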

After these packages were installed, we restarted some infra nodes (the infra nodes run logging and metrics, both of which use block-device storage from vSphere).

It was then that we noticed the services were unable to come back up. Looking at the event log for the pod, we saw:

MountVolume.MountDevice failed for volume "pvc-4a340761-1ad1-11e8-9b04-005056821185" : failed to mount the volume as "xfs", it already contains mpath_member. Mount error: exit status 32
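
(The same event can also be pulled from the CLI; the pod and namespace names below are placeholders, not taken from our cluster.)

# Show the MountVolume failure event for the affected pod
oc describe pod <pod-name> -n <namespace>

# Or list recent events in the namespace
oc get events -n <namespace>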


It was then that we started investigating the host to see why it was unable to mount the disk.
Looking at the output of multipath -ll, we saw that the disks were listed there. After we flushed the map via multipath -f <dev>, the disk could be mounted by OpenShift into the pod.
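
Roughly what the investigation and workaround looked like on the node; the device and map names are placeholders for whatever multipath -ll shows on your host.

# See that dm-multipath has claimed the vmdk disk
multipath -ll

# The FSTYPE column typically shows "mpath_member" for a claimed path,
# which is why kubelet refuses to format/mount the disk
lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT

# Flush the stale multipath map so the plain block device is usable again
multipath -f <mpath-device>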

That's when we took another look at the installed packages in yum.log, and then at the Ansible log, and saw that the installer had pulled in new packages.

That's when I remembered that CNS requires the multipath packages for gluster block devices, that the Ansible installer previously did not install those required packages, and that there was a BZ about the missing dependencies which was supposed to ensure that those packages are installed by default.

Due to this we do not have HA pods: once a pod dies or is restarted, it cannot mount the disk. It therefore requires manual intervention on the ops side to start the pods again.

Version-Release number of selected component (if applicable):
3.7.42

How reproducible:
100%

Steps to Reproduce:
1. Install 3.7.42
2. Use dynamic vsphere provisioning
3. deploy pod with vsphere disk
4. Fail

Actual results:
Can not mount

Expected results:
should mount

Additional info:

Please test against the supported storage implementations when adding new kernel modules such as dm-multipath.


Comment 1 Hemant Kumar 2018-05-23 04:10:32 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1550271

@Takeshi, can you try the multipath.conf suggested in the linked BZ and see if that fixes the problem?

Can you also post the contents of /etc/multipath.conf?
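
(For reference, this is only a sketch of the kind of defaults section the linked BZ is about; the real file is templated by openshift-ansible, so treat the exact values below as an assumption rather than the shipped configuration.)

# Merge the missing settings into the defaults section of /etc/multipath.conf
# (merge by hand if a defaults {} block already exists)
cat >> /etc/multipath.conf <<'EOF'
defaults {
    user_friendly_names yes
    # With find_multipaths enabled, multipath only claims devices that
    # really have more than one path, so single-path vmdk disks are left alone.
    find_multipaths yes
}
EOF

# On RHEL, mpathconf --find_multipaths y makes an equivalent edit.
# Reload multipathd to pick up the change
systemctl reload multipathd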

Comment 2 Takeshi Larsson 2018-05-23 07:00:52 UTC
Hi,

Yes, we have updated the multipath configuration with the defaults that were missing from the OCP-managed multipath.conf file.

Now it works as expected. However, we also had to flush the device ID from the WWID database and flush the locks with -f/-F.
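
Roughly the per-device cleanup that was needed (sdX and mpathX here are placeholders):

# Flush a single stale multipath map, or all unused maps
multipath -f mpathX
multipath -F

# Drop the disk's WWID from the wwids database (/etc/multipath/wwids)
# so multipath stops re-claiming the device
multipath -w /dev/sdX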

Now we are just waiting for the PR that fixes this issue to be backported to 3.7.

//Takeshi

Comment 3 Hemant Kumar 2018-05-23 12:03:02 UTC
The multipath fix was already backported to openshift-ansible 3.7, https://github.com/openshift/openshift-ansible/pull/8152

If you were running the latest openshift-ansible 3.7 branch, then it should already have the correct configuration.

Comment 4 Takeshi Larsson 2018-05-23 13:07:45 UTC
Sure, but we are running the enterprise product, so installing from GitHub sources is not supported ;) We're waiting for a proper minor patch release.

Comment 5 Bradley Childs 2018-05-23 17:06:18 UTC
This change is in openshift-ansible-3.7.45-1, 46, 47, and 48. The latest released version is openshift-ansible-3.7.46-1.git.0.37f607e.el7
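
(For reference, something like the following on the installer host shows whether a build with the fix is installed:)

# Confirm the openshift-ansible build is 3.7.45-1 or later
rpm -q openshift-ansible openshift-ansible-playbooks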

Comment 6 Takeshi Larsson 2018-05-23 20:43:45 UTC
https://access.redhat.com/errata/RHBA-2018:1576
https://docs.openshift.com/container-platform/3.7/release_notes/ocp_3_7_release_notes.html#ocp-3-7-46-bug-fixes

Neither the errata nor the release notes mention this fix, so I was unsure. Thanks for confirming.

Comment 7 Scott Dodson 2018-05-24 12:36:06 UTC
Fix delivered in https://access.redhat.com/errata/RHBA-2018:1576

