Bug 1627456
| Summary: | Pod failed due to timeout expired waiting for volumes to attach/mount | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jmselmi <jmselmi> |
| Component: | Storage | Assignee: | Matthew Wong <mawong> |
| Status: | CLOSED WORKSFORME | QA Contact: | Liang Xia <lxia> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.7.0 | CC: | aos-bugs, aos-storage-staff, bchilds, hekumar, jmselmi, jokerman, mmccomas |
| Target Milestone: | --- | | |
| Target Release: | 3.7.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-22 17:46:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (jmselmi, 2018-09-10 15:45:26 UTC)
Please provide "oc get pod -o yaml" for the pod in question and the "journalctl -u atomic-openshift-node" logs from the node where the pod is trying to start.

Created attachment 1482223 [details]: atomic-openshift-node.log

Created attachment 1482224 [details]: oc get pods -o yaml

Here are the requested files. The pod seems to be running:
```yaml
containerStatuses:
- containerID: docker://685fbb1d8195a4185c62f5455c95b136f7b92adb2211385f10de87b7a0f7e352
  image: docker.io/strimzi/kafka:0.6.0
  ..
  restartCount: 79
  state:
    running:
      startedAt: 2018-09-10T14:57:46Z
- containerID: docker://6f97bdc446dbb748f117b2f8fb89a65c289ea5ae4e329743f596bb956704c577
  image: docker.io/strimzi/kafka-stunnel:0.6.0
  ...
  restartCount: 5
  state:
    running:
      startedAt: 2018-09-10T14:45:48Z
hostIP: 10.44.90.9
phase: Running
```
The restart count is high, likely due to the liveness probe failing.
Something is wrong with the attached node logs.
Please also include "oc describe pod" for the pod in question.
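One way to check the liveness-probe hypothesis behind the high restart count is to filter the pod's events for probe failures. The sketch below is illustrative, not taken from this bug: in practice you would pipe `oc describe pod <pod>` into the filter, and the sample event lines here are made up.

```shell
# Filter event lines for liveness-probe failures. On a live cluster:
#   oc describe pod kafka-0 | grep -iE 'unhealthy|liveness probe failed'
# Here the filter is a small function fed hypothetical sample lines so
# the sketch is self-contained.
filter_probe_failures() {
    grep -iE 'unhealthy|liveness probe failed'
}

printf '%s\n' \
  'Warning  Unhealthy  Liveness probe failed: connection refused' \
  'Normal   Pulled     Container image already present on machine' \
  | filter_probe_failures
# Only the Warning/Unhealthy line passes the filter.
```

If the filtered events show repeated `Unhealthy` warnings, the restarts come from the probe rather than from the container crashing on its own.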
Hello Seth, do you need any additional information? Cheers.

Maybe you can't read private comments on bz. See comment 5. Just made it public. Thanks.

They are in a loop of running and creating containers. Attached the describe files.

Created attachment 1482690 [details]: oc describe pod zk (zk-2)

Created attachment 1482691 [details]: oc describe pod kafka
Sending to Storage to figure out why the volume manager in the kubelet is not detecting that the GCE PD is attached and the volume is mounted. This is blocking pod startup.

This sounds the same as https://bugzilla.redhat.com/show_bug.cgi?id=1627547. Can we get the output of `tree /dev/disk/by-id` when this happens? Does the attached device show up there at all? Typically, when a new device is attached to a node, the udev rules on the node create the appropriate symlinks in `/dev/disk/by-id`, which the GCE-PD driver uses. There is a known race/bug in udev that can cause some iSCSI events to be missed, but that was supposed to be rare, and additionally the GCE-PD driver calls udev trigger to make sure we process them. That has a bug too: https://github.com/kubernetes/kubernetes/issues/68528

```
$ tree /dev/disk/by-id
/dev/disk/by-id
├── dm-name-docker-8:16-33575008-074947cd333b9e6dbb160cb74ce7691c69cc148df2d1b40c62ec754035bbae6a -> ../../dm-5
├── dm-name-docker-8:16-33575008-1c4dde946843a8a45ecaacf590dbdc8112e9140c062ad5c1463e3c20a8aeabf5 -> ../../dm-13
├── dm-name-docker-8:16-33575008-2d910a71093935aaf326c7d5497d307c70f80abbad35a50542c3d420f09dbce7 -> ../../dm-21
├── dm-name-docker-8:16-33575008-48341b2a16fafa4db61f456ba128d89f626fcba597cea75e0d7a5a5f222fb18d -> ../../dm-18
├── dm-name-docker-8:16-33575008-4b41f1996a76c48a08f35752ced114cab6c5bd3df7dacc5a401362e57c320bee -> ../../dm-14
├── dm-name-docker-8:16-33575008-5ebd827825b1b7f550c43258aaf21500566a8bdb23a55bfc8151ee445c85b081 -> ../../dm-12
├── dm-name-docker-8:16-33575008-66316eed0560126eb38896f4953648f8e302df205cac78d803b1902c52344a84 -> ../../dm-20
├── dm-name-docker-8:16-33575008-6e00caaee46c2f66358d5863c8a27cc8703241345dc751272b24a688076e22e9 -> ../../dm-22
├── dm-name-docker-8:16-33575008-77ce86217fafdfdc5e565d56beff6d8656866ae89bc1f13c0fe67ae84e8e5b1a -> ../../dm-16
├── dm-name-docker-8:16-33575008-7f0b6b8892d352fcb84e3d2e66c3db82963995d4846dd7352788f4431262dada -> ../../dm-3
├── dm-name-docker-8:16-33575008-a18ebe17dc1e76952a293f5ff496fddd665de6861d77c31c1c0cc811bac6e905 -> ../../dm-11
├── dm-name-docker-8:16-33575008-a458c12f3fe02643d57a8357d07a7b4d07ac14e7cf0554431c9213bf7d7e2264 -> ../../dm-4
├── dm-name-docker-8:16-33575008-ad3a2360687fe77ab265d9f8a99475f65dbd6947705a9cde5e37ba09db3c0f96 -> ../../dm-8
├── dm-name-docker-8:16-33575008-afc2a35a1681150f54690449795b15ad40aa3b54b32cdeeb851e06e77d69fe16 -> ../../dm-15
├── dm-name-docker-8:16-33575008-beb625376915925f6d80a162d008706195397aef51b940f2e514b944c3e3a984 -> ../../dm-2
├── dm-name-docker-8:16-33575008-c93274a84844acc995691505cb65df0bd716f9d21e697e7044344e6fd6587c3c -> ../../dm-6
├── dm-name-docker-8:16-33575008-d0190aaccf7d1f1d30be8d5978c43338be850fe5b34263b2e36a8b2b00b70fb2 -> ../../dm-10
├── dm-name-docker-8:16-33575008-d7056ff3e05f3813213214f51278ea6cfb9a025b085dc17d294c519f5d29fb29 -> ../../dm-7
├── dm-name-docker-8:16-33575008-f7358b9d66dad8d7f7891bb9e192e83bd148ed6a021c3586275a6a92473b2a18 -> ../../dm-1
├── dm-name-docker-8:16-33575008-f766bf31a84b8332c9bf41ba17332943146bdc85d402811abee8dfe67dc4a0c6 -> ../../dm-9
├── dm-name-docker-8:16-33575008-f8c3f66ffb16935c6b8346cbae7eb3d4ac2354bb9c8f2bbe7c55a274f21e780c -> ../../dm-19
├── dm-name-docker-8:16-33575008-fc08547e88a4725d7b40aeec18113178a3ae382f1ad902cc7de9b6c95362e8e4 -> ../../dm-17
├── google-kubernetes-dynamic-pvc-00a22368-a7b5-11e8-bbde-42010a2c5a05 -> ../../sde
├── google-kubernetes-dynamic-pvc-6673b6cb-9bbe-11e8-8b2e-42010a2c5a03 -> ../../sdg
├── google-kubernetes-dynamic-pvc-8974f445-9bbe-11e8-8b2e-42010a2c5a03 -> ../../sdd
├── google-kubernetes-dynamic-pvc-bf8672a7-a7b7-11e8-b631-42010a2c5a03 -> ../../sdf
├── google-ocp-app-0 -> ../../sda
├── google-ocp-app-0-containers -> ../../sdb
├── google-ocp-app-0-local -> ../../sdc
├── google-ocp-app-0-part1 -> ../../sda1
├── scsi-0Google_PersistentDisk_kubernetes-dynamic-pvc-00a22368-a7b5-11e8-bbde-42010a2c5a05 -> ../../sde
├── scsi-0Google_PersistentDisk_kubernetes-dynamic-pvc-6673b6cb-9bbe-11e8-8b2e-42010a2c5a03 -> ../../sdg
├── scsi-0Google_PersistentDisk_kubernetes-dynamic-pvc-8974f445-9bbe-11e8-8b2e-42010a2c5a03 -> ../../sdd
├── scsi-0Google_PersistentDisk_kubernetes-dynamic-pvc-bf8672a7-a7b7-11e8-b631-42010a2c5a03 -> ../../sdf
├── scsi-0Google_PersistentDisk_ocp-app-0 -> ../../sda
├── scsi-0Google_PersistentDisk_ocp-app-0-containers -> ../../sdb
├── scsi-0Google_PersistentDisk_ocp-app-0-local -> ../../sdc
└── scsi-0Google_PersistentDisk_ocp-app-0-part1 -> ../../sda1
```

Yes, the disks are attached, and the volumes are mounted:

```
$ df -h | grep 1014
/dev/sdd  1014M  63M  952M  7%  /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-8974f445-9bbe-11e8-8b2e-42010a2c5a03
/dev/sde  1014M  36M  979M  4%  /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-00a22368-a7b5-11e8-bbde-42010a2c5a05
/dev/sdf  1014M  33M  982M  4%  /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-bf8672a7-a7b7-11e8-b631-42010a2c5a03
/dev/sdg  1014M  88M  927M  9%  /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/gce-pd/mounts/kubernetes-dynamic-pvc-6673b6cb-9bbe-11e8-8b2e-42010a2c5a03
```

Cheers.

Okay, this appears to be a different issue. Can you please upload proper node logs? The log files you shared appear to be empty.

Created attachment 1482952 [details]: node logs.

Please find the file attached.
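For reference, the lookup the GCE-PD driver performs against those udev-created symlinks can be sketched as follows. This is an illustration of the mechanism discussed above, not the actual kubelet code; the `find_pd_device` helper and its second argument are invented here so the sketch can be exercised against a test directory rather than a live node's `/dev/disk/by-id`.

```shell
# Resolve a PD name to its local block device via the udev symlinks,
# the same information "tree /dev/disk/by-id" shows on the node.
# find_pd_device is a hypothetical helper for this sketch; on a real
# node the directory defaults to /dev/disk/by-id.
find_pd_device() {
    pd_name=$1
    by_id_dir=${2:-/dev/disk/by-id}
    for prefix in google- scsi-0Google_PersistentDisk_; do
        link="$by_id_dir/$prefix$pd_name"
        if [ -L "$link" ]; then
            # Symlink targets look like ../../sde; keep the device name.
            basename "$(readlink "$link")"
            return 0
        fi
    done
    # If udev missed the attach event, no symlink exists even though
    # the disk is attached: the race described in kubernetes#68528.
    echo "not-attached"
    return 1
}
```

If a disk is attached (visible in `df`) but this lookup reports `not-attached`, the udev symlink is missing, which would match the missed-event race; `udevadm trigger` normally forces the rules to re-run and recreate the links.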
The logs don't cover the period when the mount failures occurred, so if you have access to logs from when the issue occurred (Sep 10th?), or if you can reproduce the issue, please attach new logs. Thank you.

Created attachment 1486426 [details]: node.logs
I reproduced the case. Please find the logs attached.

Closing; please reopen if the issue reoccurs, and include node logs from the same time. Unfortunately, the attached logs don't show the mount error.