Description of problem:
GCP cluster on 3.9 is getting an error trying to mount a PVC.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Warning  FailedMount  1h  kubelet, ci-chancez-chargeback-openshift-ig-n-zsn5
  MountVolume.MountDevice failed for volume "pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005" : failed to mount the volume as "ext4", it already contains mpath_member. Mount error: mount failed: exit status 32
  Mounting command: systemd-run
  Mounting arguments: --description=Kubernetes transient mount for /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 --scope -- mount -t ext4 -o defaults /dev/disk/by-id/google-kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005
  Output: Running scope as unit run-50422.scope.
  mount: /dev/sdb is already mounted or /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 busy

Warning  FailedMount  1h  kubelet, ci-chancez-chargeback-openshift-ig-n-zsn5
  MountVolume.MountDevice failed for volume "pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005" : failed to mount the volume as "ext4", it already contains mpath_member. Mount error: mount failed: exit status 32
  Mounting command: systemd-run
  Mounting arguments: --description=Kubernetes transient mount for /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 --scope -- mount -t ext4 -o defaults /dev/disk/by-id/google-kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005
  Output: Running scope as unit run-50436.scope.
  mount: /dev/sdb is already mounted or /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 busy

Warning  FailedMount  1h  kubelet, ci-chancez-chargeback-openshift-ig-n-zsn5
  MountVolume.MountDevice failed for volume "pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005" : failed to mount the volume as "ext4", it already contains mpath_member. Mount error: mount failed: exit status 32

---
Server https://internal-api.openshift.XXXXX.team.coreos.systems:8443
openshift v3.9.0-alpha.4+1f02cb5-492
kubernetes v1.9.1+a0ce1bc657
---

[chance@ci-chancez-chargeback-openshift-ig-m-0nl0 ~]$ uname -a
Linux ci-chancez-chargeback-openshift-ig-m-0nl0 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

[chance@ci-chancez-chargeback-openshift-ig-m-0nl0 ~]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
To do this I used https://github.com/openshift/release from commit 00731fe9d6a3e970aa1dc727041de471744d28b8, using the cluster/test-deploy Makefile/instructions for deployment. The following link is a diff of my vars-origin.yaml from the original: https://gist.github.com/chancez/1c0f28eb05d8f4ab4e66e9c261e3329a. Besides that, I've run Ansible a few times to make a couple of changes to the auth settings (adding GitHub auth), but my auth settings weren't working so I reverted those. The only major thing I can think of is re-running Ansible a few times to make changes, and then again to undo those changes.
"it already contains mpath_member" is odd, that device was somehow managed by multipathd. Is the instance still available? I've never seen a multipath pd.
Indeed, the device is managed by multipath:

[root@ci-chancez-chargeback-openshift-ig-n-zsn5 ~]# multipath -ll
0Google_PersistentDisk_kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e dm-0 Google  ,PersistentDisk
size=5.0G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 0:0:2:0 sdb 8:16 active ready running

# dmsetup ls --tree
0Google_PersistentDisk_kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 (253:0)
 └─ (8:16)

# ls -l /dev/disk/by-id/google-kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005
lrwxrwxrwx. 1 root root 9 Feb 28 20:31 /dev/disk/by-id/google-kubernetes-dynamic-pvc-4dc8bbab-1cc6-11e8-be9c-42010a8e0005 -> ../../sdb
Temporarily disabled GCE PD in multipath and verified the disk was no longer managed by multipathd.

Steps:

1. Blacklist 0Google_PersistentDisk:

# cat /etc/multipath.conf
# LIO iSCSI
# TODO: Add env variables for tweaking
devices {
        device {
                vendor "LIO-ORG"
                user_friendly_names "yes"
                path_grouping_policy "failover"
                path_selector "round-robin 0"
                failback immediate
                path_checker "tur"
                prio "const"
                no_path_retry 120
                rr_weight "uniform"
        }
}
blacklist {
        wwid 0Google_PersistentDisk
}
defaults {
}

2. systemctl restart multipathd

3. Verify:

# dmsetup ls --tree
No devices found

4. Format the disk:

# mkfs -t ext4 /dev/sdb
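A quick way to confirm the path device was actually released is to check the udev-reported filesystem type on the leg (a minimal sketch using standard udev/util-linux tools; /dev/sdb is the device from the output above):

# udevadm info --query=property --name=/dev/sdb | grep ID_FS_TYPE   # should no longer print mpath_member
# lsblk -o NAME,FSTYPE /dev/sdb                                     # FSTYPE column should show ext4 after the mkfs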
I have seen the same issue on one of our vSphere clusters too, on 3.9.
Also hit this once on Azure, but after I set up a new OCP cluster I was unable to reproduce it...
Hit the issue: hawkular-cassandra failed to start because it failed to mount the PV.

Version-Release number of selected component (if applicable):
openshift v3.9.1
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16
OS version: Red Hat Enterprise Linux Server release 7.5 Beta (Maipo)
kernel: 3.10.0-855.el7.x86_64

Steps:
1. Deploy hawkular metrics on GCP, then check the status:

[root@qe-dma-master-etcd-1 test]# oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                                   STORAGECLASS   REASON    AGE
pvc-2c2e2208-1de2-11e8-833e-42010af0001e   10Gi       RWO            Delete           Bound     openshift-infra/metrics-cassandra-1     standard                 7m
pvc-b1a3e901-1dc3-11e8-8e76-42010af0001e   1Gi        RWO            Delete           Bound     openshift-ansible-service-broker/etcd   standard                 3h

[root@qe-dma-master-etcd-1 test]# oc get pvc
NAME                  STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
metrics-cassandra-1   Bound     pvc-2c2e2208-1de2-11e8-833e-42010af0001e   10Gi       RWO            standard       7m

[root@qe-dma-master-etcd-1 test]# oc get po
NAME                         READY     STATUS              RESTARTS   AGE
hawkular-cassandra-1-k5tb6   0/1       ContainerCreating   0          6m
hawkular-metrics-r8rgr       0/1       Running             9          3h
heapster-lbtgx               0/1       Running             7          3h

[root@qe-dma-master-etcd-1 test]# oc describe po hawkular-cassandra-1-k5tb6
Name:           hawkular-cassandra-1-k5tb6
Namespace:      openshift-infra
Node:           qe-dma-node-registry-router-1/10.240.0.31
Start Time:     Fri, 02 Mar 2018 01:23:55 -0500
Labels:         metrics-infra=hawkular-cassandra
                name=hawkular-cassandra-1
                type=hawkular-cassandra
Annotations:    openshift.io/scc=restricted
Status:         Pending
IP:
Controlled By:  ReplicationController/hawkular-cassandra-1
Containers:
  hawkular-cassandra-1:
    Container ID:
    Image:       registry.reg-aws.openshift.com:443/openshift3/metrics-cassandra:v3.9
    Image ID:
    Ports:       9042/TCP, 9160/TCP, 7000/TCP, 7001/TCP
    Command:
      /opt/apache-cassandra/bin/cassandra-docker.sh
      --cluster_name=hawkular-metrics
      --data_volume=/cassandra_data
      --internode_encryption=all
      --require_node_auth=true
      --enable_client_encryption=true
      --require_client_auth=true
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2G
    Requests:
      memory:  1G
    Readiness:  exec [/opt/apache-cassandra/bin/cassandra-docker-ready.sh] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CASSANDRA_MASTER:               true
      CASSANDRA_DATA_VOLUME:          /cassandra_data
      JVM_OPTS:                       -Dcassandra.commitlog.ignorereplayerrors=true
      ENABLE_PROMETHEUS_ENDPOINT:     True
      TRUSTSTORE_NODES_AUTHORITIES:   /hawkular-cassandra-certs/tls.peer.truststore.crt
      TRUSTSTORE_CLIENT_AUTHORITIES:  /hawkular-cassandra-certs/tls.client.truststore.crt
      POD_NAMESPACE:                  openshift-infra (v1:metadata.namespace)
      MEMORY_LIMIT:                   2000000000 (limits.memory)
      CPU_LIMIT:                      node allocatable (limits.cpu)
    Mounts:
      /cassandra_data from cassandra-data (rw)
      /hawkular-cassandra-certs from hawkular-cassandra-certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cassandra-token-tsg9f (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  cassandra-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  metrics-cassandra-1
    ReadOnly:   false
  hawkular-cassandra-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hawkular-cassandra-certs
    Optional:    false
  cassandra-token-tsg9f:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cassandra-token-tsg9f
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  6m    default-scheduler  Successfully assigned
hawkular-cassandra-1-k5tb6 to qe-dma-node-registry-router-1
  Normal   SuccessfulMountVolume  6m                 kubelet, qe-dma-node-registry-router-1  MountVolume.SetUp succeeded for volume "cassandra-token-tsg9f"
  Normal   SuccessfulMountVolume  6m                 kubelet, qe-dma-node-registry-router-1  MountVolume.SetUp succeeded for volume "hawkular-cassandra-certs"
  Warning  FailedMount            28s (x11 over 6m)  kubelet, qe-dma-node-registry-router-1  MountVolume.MountDevice failed for volume "pvc-2c2e2208-1de2-11e8-833e-42010af0001e" : failed to mount the volume as "ext4", it already contains mpath_member. Mount error: exit status 32
  Warning  FailedMount            23s (x3 over 4m)   kubelet, qe-dma-node-registry-router-1  Unable to mount volumes for pod "hawkular-cassandra-1-k5tb6_openshift-infra(489e1083-1de2-11e8-833e-42010af0001e)": timeout expired waiting for volumes to attach/mount for pod "openshift-infra"/"hawkular-cassandra-1-k5tb6". list of unattached/unmounted volumes=[cassandra-data]
I did not meet this on GCP.

Events:
  Type    Reason                 Age   From                                  Message
  ----    ------                 ----  ----                                  -------
  Normal  Scheduled              16m   default-scheduler                     Successfully assigned asb-etcd-1-hshdc to qe-wmeng391ah-master-etcd-1
  Normal  SuccessfulMountVolume  16m   kubelet, qe-wmeng391ah-master-etcd-1  MountVolume.SetUp succeeded for volume "asb-token-jts4r"
  Normal  SuccessfulMountVolume  16m   kubelet, qe-wmeng391ah-master-etcd-1  MountVolume.SetUp succeeded for volume "etcd-tls"
  Normal  SuccessfulMountVolume  16m   kubelet, qe-wmeng391ah-master-etcd-1  MountVolume.SetUp succeeded for volume "etcd-auth"
  Normal  SuccessfulMountVolume  16m   kubelet, qe-wmeng391ah-master-etcd-1  MountVolume.SetUp succeeded for volume "pvc-ea9e92d8-1de6-11e8-a434-42010af00023"
  Normal  Pulled                 16m   kubelet, qe-wmeng391ah-master-etcd-1  Container image "registry.access.redhat.com/rhel7/etcd:latest" already present on machine
  Normal  Created                16m   kubelet, qe-wmeng391ah-master-etcd-1  Created container
  Normal  Started                16m   kubelet, qe-wmeng391ah-master-etcd-1  Started container

openshift v3.9.1
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16
Kernel Version: 3.10.0-855.el7.x86_64
Operating System: Red Hat Enterprise Linux Atomic Host 7.5.0
docker-1.13.1-55.rhel75.git774336d.el7.x86_64
Server Version: 1.13.1
Storage Driver: overlay2
If Huamin is correct (and it looks like he is), then each time "mpath_member" appears in the events, check the multipathd log as well. On the affected machine I can see this:

Feb 27 21:13:55 ci-chancez-chargeback-openshift-build-image-instance multipathd[288]: sda: spurious uevent, path already in pathvec
Feb 27 21:13:55 ci-chancez-chargeback-openshift-build-image-instance multipathd[288]: 0Google_PersistentDisk_persistent-disk-0: failed in domap for addition of new path sda
Feb 27 21:13:55 ci-chancez-chargeback-openshift-build-image-instance multipathd[288]: uevent trigger error

It would also mean this is not something we can fix in OpenShift (see Huamin's comment #4) -- multipathd has to be configured to ignore the GCE PD disks. Disabling multipathd altogether on machines where it's not needed should work too.

Also note: to reproduce this, multipathd must be installed and running on the system (which is not the case on Atomic Host, AFAIK). I will try to create a pod with several disks in GCE and check their WWIDs -- if there is a collision, we have the cause.
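For anyone else chasing this, a minimal sketch of how to pull those multipathd messages on a RHEL/CentOS 7 node (assuming multipathd runs as a systemd unit, as it does there):

# journalctl -u multipathd --since "1 hour ago"   # systemd journal for the multipathd unit
# grep multipathd /var/log/messages               # the same messages via syslog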
I was wrong: the workarounds I thought would work don't seem to help. Mount complains about mpath_member... This is the udev ID_FS_TYPE attribute being set by udev on the multipath "legs", and mount refuses to mount those (since it is the dm device that should be mounted instead). There might be a udev rule causing this attribute to be set for the GCE PD disks.
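To inspect the attribute in question on a suspect device (a sketch using standard udev/util-linux tools; substitute the affected /dev/sdX):

# udevadm info --query=property --name=/dev/sdb | grep ID_FS_TYPE   # prints ID_FS_TYPE=mpath_member on an affected leg
# lsblk -o NAME,FSTYPE /dev/sdb                                     # lsblk surfaces the same property in its FSTYPE column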
I've created a VM in GCE and "manually" attached a GCE PD (again, created in the console):

[root@tsmetana-mp-master-etcd-1 ~]# multipath -ll
0Google_PersistentDisk_multipath-bug-test-1 dm-16 Google  ,PersistentDisk
size=10G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 0:0:3:0 sdc 8:32 active ready running

[root@tsmetana-mp-master-etcd-1 ~]# udevadm info --name=/dev/sdc
P: /devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:3/0:0:3:0/block/sdc
N: sdc
S: disk/by-id/scsi-0Google_PersistentDisk_multipath-bug-test-1
S: disk/by-path/virtio-pci-0000:00:03.0-scsi-0:0:3:0
E: DEVLINKS=/dev/disk/by-id/scsi-0Google_PersistentDisk_multipath-bug-test-1 /dev/disk/by-path/virtio-pci-0000:00:03.0-scsi-0:0:3:0
E: DEVNAME=/dev/sdc
E: DEVPATH=/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:3/0:0:3:0/block/sdc
E: DEVTYPE=disk
E: DM_MULTIPATH_DEVICE_PATH=1
E: DM_MULTIPATH_TIMESTAMP=1520002717
E: DM_MULTIPATH_WIPE_PARTS=1
E: ID_BUS=scsi
E: ID_FS_TYPE=mpath_member
E: ID_MODEL=PersistentDisk
E: ID_MODEL_ENC=PersistentDisk\x20\x20
E: ID_PATH=virtio-pci-0000:00:03.0-scsi-0:0:3:0
E: ID_PATH_TAG=virtio-pci-0000_00_03_0-scsi-0_0_3_0
E: ID_REVISION=1
E: ID_SCSI=1
E: ID_SERIAL=0Google_PersistentDisk_multipath-bug-test-1
E: ID_SERIAL_SHORT=multipath-bug-test-1
E: ID_TYPE=disk
E: ID_VENDOR=Google
E: ID_VENDOR_ENC=Google\x20\x20
E: MAJOR=8
E: MINOR=32
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: SYSTEMD_READY=0
E: TAGS=:systemd:
E: USEC_INITIALIZED=26212172

I think the udev rules are not OK for GCE. Obviously, this disk can't be mounted:

[root@tsmetana-mp-master-etcd-1 ~]# mkdir /mnt/test
[root@tsmetana-mp-master-etcd-1 ~]# mount -t ext4 /dev/sdc /mnt/test
mount: /dev/sdc is already mounted or /mnt/test busy

OpenShift is not involved here.
Assigning to RHEL device-mapper-multipath, based on comment 7.
Where did this multipath.conf come from? If you create a default multipath.conf file by running

# mpathconf --enable

without an already existing multipath.conf file, it automatically sets

find_multipaths yes

in the defaults section. This makes multipath only claim devices when it sees that they have multiple paths, or if it has previously claimed them. If you add that find_multipaths line to /etc/multipath.conf and run

# multipath -w /dev/sdc

(or whatever devname the Google persistent disk has), that should fix your problem.

The real issue here is that a multipath.conf file without either find_multipaths or a manual blacklist will just claim all SCSI devices. Whoever or whatever created that multipath.conf file needs to do one or the other. Like I said, the default multipath setup uses find_multipaths.
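For reference, a minimal sketch of the fixed configuration (this mirrors what mpathconf --enable generates; the LIO device stanza from comment 4 can be kept alongside it):

defaults {
        find_multipaths yes     # only claim a device once multiple paths are actually seen
        user_friendly_names yes
}

Then, on a node where the PD was already claimed, forget the WWID and restart the daemon:

# multipath -w /dev/sdc          # drop the device's WWID from /etc/multipath/wwids
# systemctl restart multipathd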
Thank you Ben, that explains the mystery. The config in question comes from this commit https://github.com/openshift/openshift-ansible/commit/2573825c06e9d3a5601b6c1492f71fd0b70b2578
ansible fix at https://github.com/openshift/openshift-ansible/pull/7367
for 3.9: https://github.com/openshift/openshift-ansible/pull/7368
Tested with the updated multipath.conf and the problem is no longer reproducible. I'll test the openshift-ansible change and use a different cloud as a regression test tomorrow. I can verify this bug now.
Faced this issue on Azure. And it does not matter whether `find_multipaths yes` is set in the config or not; multipath always claims Azure disks as multipath devices. So I need to blacklist them or disable multipath at the system level. Neither workaround is acceptable, since they will not survive a single Ansible run...
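If blacklisting turns out to be the only option on Azure, a hypothetical stanza could match the virtual-disk hardware strings (assumption: the Azure/Hyper-V disks report vendor "Msft" and product "Virtual Disk" -- verify with multipath -ll or udevadm info on the affected node before relying on this):

blacklist {
        device {
                vendor  "Msft"           # assumed Hyper-V/Azure vendor string
                product "Virtual Disk"   # assumed product string
        }
}

To survive Ansible runs this would of course have to live in whatever template openshift-ansible deploys, which is the real problem being pointed at here.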
Alex, can you post more info as in Comments 11 and 14?
3.7 fix is proposed at https://github.com/openshift/openshift-ansible/pull/8152
3.6 fix is proposed at https://github.com/openshift/openshift-ansible/pull/8151
actually 3.6 already has the fix
I have a customer who is facing this issue. multipathd is installed on the customer's nodes, and the OpenShift playbook always enables this service even though they have disabled it. Is this related to this issue? Or should I file another bug for the playbook?
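To check whether the playbook re-enabled the service on a node, the standard systemd queries should suffice (a small sketch):

# systemctl is-enabled multipathd   # prints enabled/disabled -- shows whether it will start at boot
# systemctl is-active multipathd    # prints active/inactive -- shows whether it is running now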
Hello, I am facing the same issue on OCP 3.7 deployed on GCP. I already tried the workaround Ben proposed above but still hit issues.

Log from oc describe pod:

Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-099654ed-5da2-11e8-9e4c-42010a840007 --scope -- mount -t xfs -o defaults /dev/disk/by-id/google-kubernetes-dynamic-pvc-099654ed-5da2-11e8-9e4c-42010a840007 /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-099654ed-5da2-11e8-9e4c-42010a840007
Output: Running scope as unit run-88207.scope.
mount: /dev/sde is already mounted or /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-099654ed-5da2-11e8-9e4c-42010a840007 busy

$ oc get pods -o wide
NAME                                         READY     STATUS              RESTARTS   AGE       IP             NODE
ssp-kafka-0                                  1/1       Running             0          1h        172.16.5.134   ocp-a1-node-zgzf
ssp-kafka-1                                  1/1       Running             0          1h        172.16.24.13   ocp-a1-node-rn5r
ssp-kafka-2                                  1/1       Running             0          1h        172.16.22.25   ocp-a1-node-dtjv
ssp-kafka-3                                  0/1       ContainerCreating   0          15m       <none>         ocp-a1-node-qngj
ssp-kafka-4                                  1/1       Running             0          1h        172.16.12.65   ocp-a1-node-dd0n
ssp-topic-controller-3558947362-7ljx4        1/1       Running             0          7d        172.16.18.7    ocp-a1-node-qngj
ssp-zookeeper-0                              0/1       ContainerCreating   0          1h        <none>         ocp-a1-node-qngj
ssp-zookeeper-1                              1/1       Running             0          1h        172.16.24.12   ocp-a1-node-rn5r
ssp-zookeeper-2                              1/1       Running             0          1h        172.16.22.24   ocp-a1-node-dtjv
strimzi-cluster-controller-969217113-qdrkw   1/1       Running             0          1h        172.16.10.23   ocp-a1-node-l6dt

[root@ocp-a1-node-qngj ~]# df -h | grep sde
[root@ocp-a1-node-qngj ~]# lsblk
NAME                                                                                  MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
sda                                                                                     8:0    0  25G  0 disk
└─sda1                                                                                  8:1    0  25G  0 part  /
sdb                                                                                     8:16   0  25G  0 disk  /var/lib/docker
sdc                                                                                     8:32   0  50G  0 disk  /var/lib/origin/openshift.local.volumes
sdd                                                                                     8:48   0   1G  0 disk
└─0Google_PersistentDisk_kubernetes-dynamic-pvc-3ef72dcd-449d-11e8-97d2-42010a84000a  253:0    0   1G  0 mpath
sde                                                                                     8:64   0  10G  0 disk
└─0Google_PersistentDisk_kubernetes-dynamic-pvc-099654ed-5da2-11e8-9e4c-42010a840007  253:1    0  10G  0 mpath

Any ideas?

Cheers,
/JM
Have you tried multipath -w /dev/sdd (see comment 14)?
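For completeness, a hedged sketch of the full recovery sequence on an affected node, combining comment 14 with the earlier findings (map and device names are taken from the lsblk output above; adjust to match your node):

# 1. Ensure find_multipaths yes (or a blacklist) is present in /etc/multipath.conf.
# 2. Flush the maps that already claimed the PDs:
multipath -f 0Google_PersistentDisk_kubernetes-dynamic-pvc-3ef72dcd-449d-11e8-97d2-42010a84000a
multipath -f 0Google_PersistentDisk_kubernetes-dynamic-pvc-099654ed-5da2-11e8-9e4c-42010a840007
# 3. Forget the WWIDs so the paths are not re-claimed:
multipath -w /dev/sdd
multipath -w /dev/sde
# 4. Restart the daemon and verify the maps are gone:
systemctl restart multipathd
dmsetup ls --tree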