Description of problem:

When creating a DaemonSet across 3 worker nodes, one of them could not mount the Manila share:

$ oc get pvc
NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
mypvc-rwx-ds   Bound    pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da   2Gi        RWX            csi-manila-ceph   95m

$ oc get pod -o wide
NAME       READY   STATUS              RESTARTS   AGE   IP            NODE                              NOMINATED NODE   READINESS GATES
ds-92ltz   1/1     Running             0          66m   10.131.0.39   piqin-0326-txhx4-worker-0-gvcp4   <none>           <none>
ds-h76hn   1/1     Running             0          66m   10.128.2.28   piqin-0326-txhx4-worker-0-69lkl   <none>           <none>
ds-zh5wt   0/1     ContainerCreating   0          66m   <none>        piqin-0326-txhx4-worker-0-8n2t5   <none>           <none>

Must-gather logs: http://virt-openshift-05.lab.eng.nay.redhat.com/wduan/logs/must-gather.local.1550437577609321556.tar.gz

$ oc describe pod ds-zh5wt
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---  ----               -------
  Warning  FailedScheduling  83m  default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  83m  default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled    83m                       default-scheduler  Successfully assigned wduan/ds-zh5wt to piqin-0326-txhx4-worker-0-8n2t5
  Warning  FailedMount  35m (x6 over 79m)         kubelet            Unable to attach or mount volumes: unmounted volumes=[pvol], unattached volumes=[default-token-vx8pd pvol]: timed out waiting for the condition
  Warning  FailedMount  31m (x17 over 81m)        kubelet            MountVolume.SetUp failed for volume "pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  <invalid> (x29 over 81m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[pvol], unattached volumes=[pvol default-token-vx8pd]: timed out waiting for the condition

From the kubelet log on node piqin-0326-txhx4-worker-0-8n2t5:

Mar 26 06:38:21 piqin-0326-txhx4-worker-0-8n2t5 hyperkube[1825]: E0326 06:38:21.951030 1825 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/manila.csi.openstack.org^edf919b2-8465-4d90-bbcb-d860b15900f6 podName: nodeName:}" failed. No retries permitted until 2021-03-26 06:38:23.950949238 +0000 UTC m=+13155.497727038 (durationBeforeRetry 2s).
Error: "MountVolume.SetUp failed for volume \"pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da\" (UniqueName: \"kubernetes.io/csi/manila.csi.openstack.org^edf919b2-8465-4d90-bbcb-d860b15900f6\") pod \"ds-zh5wt\" (UID: \"67c5b63b-8d16-41f1-a915-e569005725d3\") : rpc error: code = DeadlineExceeded desc = context deadline exceeded"

From the csi-nodeplugin-nfsplugin csi-driver log:

E0326 07:33:23.111171 1 utils.go:50] GRPC error: rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 172.16.32.1:/volumes/_nogroup/2c296eb7-b86f-4f53-aef4-0050bdee4fb2 /var/lib/kubelet/pods/67c5b63b-8d16-41f1-a915-e569005725d3/volumes/kubernetes.io~csi/pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da/mount
Output: mount.nfs: Connection timed out

The openstack-manila-csi-nodeplugin csi-driver log is full of the following:

I0326 07:32:48.775599 1 builder.go:44] [ID:57] FWD GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0326 07:32:48.775697 1 driver.go:313] [ID:112] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0326 07:35:37.882531 1 builder.go:44] [ID:58] FWD GRPC error: rpc error: code = Canceled desc = context canceled
E0326 07:35:37.882597 1 driver.go:313] [ID:114] GRPC error: rpc error: code = Canceled desc = context canceled

$ oc -n openshift-manila-csi-driver logs csi-nodeplugin-nfsplugin-xxbjz | grep pvc-3 -A 6
--
Mounting arguments: -t nfs 172.16.32.1:/volumes/_nogroup/2c296eb7-b86f-4f53-aef4-0050bdee4fb2 /var/lib/kubelet/pods/67c5b63b-8d16-41f1-a915-e569005725d3/volumes/kubernetes.io~csi/pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da/mount
Output: mount.nfs: Connection timed out
E0326 08:48:28.711089 1 utils.go:50] GRPC error: rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 172.16.32.1:/volumes/_nogroup/2c296eb7-b86f-4f53-aef4-0050bdee4fb2
/var/lib/kubelet/pods/67c5b63b-8d16-41f1-a915-e569005725d3/volumes/kubernetes.io~csi/pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da/mount
Output: mount.nfs: Connection timed out
E0326 08:51:28.937989 1 mount_linux.go:139] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 172.16.32.1:/volumes/_nogroup/1e92fc8a-ced3-476a-851f-1436b2a20cc7 /var/lib/kubelet/pods/d6dd8f60-55b6-4e8e-8fbe-4cc7b923e287/volumes/kubernetes.io~csi/pvc-b16d9426-132e-4156-9ad9-43390d088711/mount
Output: mount.nfs: Connection timed out

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-25-225737   True        False         4h44m   Cluster version is 4.7.0-0.nightly-2021-03-25-225737

How reproducible:
Always when a pod is scheduled on that node

Steps to Reproduce:
1. Install an OSP cluster with the Manila CSI driver
2. Create a DaemonSet and a PVC backed by a Manila share
3. Check the pod status

Actual results:
The pod on piqin-0326-txhx4-worker-0-8n2t5 is not running because the mount failed. Another pod created and scheduled to this node also failed to mount.

Expected results:
All pods should be in "Running" status

Master Log:

Node Log (of failed PODs):

PV Dump:
$ oc get pv pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: manila.csi.openstack.org
  creationTimestamp: "2021-03-26T06:32:15Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da
  resourceVersion: "97389"
  selfLink: /api/v1/persistentvolumes/pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da
  uid: ce834a5d-9942-438c-be84-da3f1e2dd4e0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 2Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mypvc-rwx-ds
    namespace: wduan
    resourceVersion: "97347"
    uid:
      3f226ef2-1e08-4409-b4f0-e0f9a84e84da
  csi:
    driver: manila.csi.openstack.org
    nodePublishSecretRef:
      name: csi-manila-secrets
      namespace: openshift-manila-csi-driver
    nodeStageSecretRef:
      name: csi-manila-secrets
      namespace: openshift-manila-csi-driver
    volumeAttributes:
      cephfs-mounter: fuse
      shareAccessID: 1e3e39cb-cebe-48aa-b275-bdaa405a9a8f
      shareID: edf919b2-8465-4d90-bbcb-d860b15900f6
      storage.kubernetes.io/csiProvisionerIdentity: 1616728874191-8081-manila.csi.openstack.org
    volumeHandle: edf919b2-8465-4d90-bbcb-d860b15900f6
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-manila-ceph
  volumeMode: Filesystem
status:
  phase: Bound

PVC Dump:
$ oc get pvc mypvc-rwx-ds -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: manila.csi.openstack.org
  creationTimestamp: "2021-03-26T06:32:11Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: mypvc-rwx-ds
  namespace: wduan
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  storageClassName: csi-manila-ceph
  volumeMode: Filesystem
  volumeName: pvc-3f226ef2-1e08-4409-b4f0-e0f9a84e84da
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 2Gi
  phase: Bound

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

1.
Network config for the piqin-0326-txhx4-worker-0-8n2t5 node:

sh-4.4# ip addr | grep 172
    inet 172.16.34.116/20 brd 172.16.47.255 scope global dynamic noprefixroute ens4
172: veth65d6dca8@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default

sh-4.4# route
Kernel IP routing table
Destination      Gateway          Genmask          Flags  Metric  Ref  Use  Iface
default          host-192-168-0-  0.0.0.0          UG     100     0    0    ens3
10.128.0.0       0.0.0.0          255.252.0.0      U      0       0    0    tun0
169.254.169.254  host-192-168-0-  255.255.255.255  UGH    100     0    0    ens3
169.254.169.254  172.16.34.1      255.255.255.255  UGH    101     0    0    ens4
172.16.32.0      0.0.0.0          255.255.240.0    U      101     0    0    ens4
172.30.0.0       0.0.0.0          255.255.0.0      U      0       0    0    tun0
192.168.0.0      0.0.0.0          255.255.192.0    U      100     0    0    ens3

2. DaemonSet & PVC used:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds
spec:
  selector:
    matchLabels:
      app: dpod
  template:
    metadata:
      name: dpod
      labels:
        app: dpod
    spec:
      containers:
      - name: myfrontend
        image: quay.io/openshifttest/storage@sha256:a05b96d373be86f46e76817487027a7f5b8b5f87c0ac18a246b018df11529b40
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
          name: http-server
        volumeMounts:
        - mountPath: "/mnt/local"
          name: pvol
      volumes:
      - name: pvol
        persistentVolumeClaim:
          claimName: mypvc-rwx-ds
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc-rwx-ds
spec:
  accessModes:
  #- ReadWriteOnce
  - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  storageClassName: "csi-manila-ceph"
---
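As a quick sanity check on the routing table above (illustrative only, not part of the original report): the NFS endpoint 172.16.32.1 seen in the mount logs falls inside the on-link 172.16.32.0/255.255.240.0 route on ens4, so the node has a route to the Manila share network even though the mount times out. Python's ipaddress module can confirm the containment:

```python
import ipaddress

# On-link route from the node's routing table (ens4, no gateway).
route = ipaddress.ip_network("172.16.32.0/255.255.240.0")

# NFS server address taken from the failing mount arguments.
nfs_server = ipaddress.ip_address("172.16.32.1")

# True: the route covers the NFS server, so routing itself looks sane.
print(nfs_server in route)
```

This points the suspicion away from a missing route and toward filtering or an unstable network path between the node and the NFS backend.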
Hit this issue in 4.6.0-0.nightly-2021-03-25-230637 too. One of the pods is stuck in "ContainerCreating" status.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-03-25-230637   True        False         75m     Cluster version is 4.6.0-0.nightly-2021-03-25-230637

$ oc get pod
NAME         READY   STATUS              RESTARTS   AGE
ds-5-4jrjf   1/1     Running             0          3m27s
ds-5-6qmd5   0/1     ContainerCreating   0          3m27s
ds-5-m74xf   1/1     Running             0          3m27s

$ oc describe pod ds-5-6qmd5
<skip>
Events:
  Type     Reason       Age        From               Message
  ----     ------       ---        ----               -------
  Normal   Scheduled    35s        default-scheduler  Successfully assigned default/ds-5-6qmd5 to piqin-0326-1-nbxx6-worker-0-st7m5
  Warning  FailedMount  <invalid>  kubelet            MountVolume.SetUp failed for volume "pvc-c32cdf6e-7803-45ce-bf46-a00eb013a5f2" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  <invalid>  kubelet            Unable to attach or mount volumes: unmounted volumes=[local], unattached volumes=[local default-token-slcrx]: timed out waiting for the condition
The first must-gather doesn't contain the Manila CSI driver logs, so it's hard to understand what happened there. The Manila operator didn't report any errors... The second one from Qin Ping has the required logs, and there is only one error message:

2021-03-26T11:50:44.707506583Z Mounting command: mount
2021-03-26T11:50:44.707506583Z Mounting arguments: -t nfs 172.16.32.1:/volumes/_nogroup/93174795-380d-4331-9437-e18de9014c86 /var/lib/kubelet/pods/cb38e6bd-2f6e-4472-aac4-a87c1d5d9297/volumes/kubernetes.io~csi/pvc-c32cdf6e-7803-45ce-bf46-a00eb013a5f2/mount
2021-03-26T11:50:44.707506583Z Output: mount.nfs: Connection timed out

Could it be a network issue?
Maybe. We tried the "mount -t nfs" command manually on the problematic worker node, and it returned the same error. We checked the network config of the problematic worker node (see "Additional info" above) and it looks fine. Could it be an issue with the PSI cluster itself?
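The manual mount test described above can be approximated with a minimal TCP reachability probe (a hypothetical helper, not part of the driver; it assumes the share is served over NFSv4 on the default port 2049/tcp). A connection timeout from this probe would match the "mount.nfs: Connection timed out" output while taking the whole CSI stack out of the picture:

```python
import socket

def nfs_port_reachable(host: str, port: int = 2049, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result (timeout or refusal) reproduces the same class of failure
    that mount.nfs reported, pointing at the network rather than the driver.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage against the Manila NFS endpoint from the logs (run on the affected node):
# nfs_port_reachable("172.16.32.1")
```

If this returns False on the one broken worker but True on its healthy siblings, the problem is isolated to that node's path to the share network, consistent with the unstable-environment conclusion below.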
I tried to reproduce it several times on PSI but I couldn't, so I think this issue was caused by an unstable environment. I'm going to close this bz now. Please reopen if the issue happens again.