Bug 1943496
Summary: | [Manila CSI driver] could not mount volume in one node while other nodes work fine |  |  |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Wei Duan <wduan> |
Component: | Storage | Assignee: | Mike Fedosin <mfedosin> |
Storage sub component: | OpenStack CSI Drivers | QA Contact: | Jon Uriarte <juriarte> |
Status: | CLOSED WORKSFORME | Docs Contact: |  |
Severity: | high |  |  |
Priority: | medium | CC: | adduarte, aos-bugs, emacchi, mbooth, mfedosin, piqin, pprinett, tbarron |
Version: | 4.7 |  |  |
Target Milestone: | --- |  |  |
Target Release: | 4.8.0 |  |  |
Hardware: | Unspecified |  |  |
OS: | Unspecified |  |  |
Whiteboard: |  |  |  |
Fixed In Version: |  | Doc Type: | If docs needed, set a value |
Doc Text: |  | Story Points: | --- |
Clone Of: |  | Environment: |  |
Last Closed: | 2021-05-06 11:22:20 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: |  |
Verified Versions: |  | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |  |
Cloudforms Team: | --- | Target Upstream Version: |  |
Embargoed: |  |  |  |
Description Wei Duan 2021-03-26 09:07:31 UTC
Hit this issue in 4.6.0-0.nightly-2021-03-25-230637 too. One of the pods is stuck in "ContainerCreating" status.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-03-25-230637   True        False         75m     Cluster version is 4.6.0-0.nightly-2021-03-25-230637

$ oc get pod
NAME         READY   STATUS              RESTARTS   AGE
ds-5-4jrjf   1/1     Running             0          3m27s
ds-5-6qmd5   0/1     ContainerCreating   0          3m27s
ds-5-m74xf   1/1     Running             0          3m27s

$ oc describe pod ds-5-6qmd5
<skip>
Events:
  Type     Reason       Age        From               Message
  ----     ------       ----       ----               -------
  Normal   Scheduled    35s        default-scheduler  Successfully assigned default/ds-5-6qmd5 to piqin-0326-1-nbxx6-worker-0-st7m5
  Warning  FailedMount  <invalid>  kubelet            MountVolume.SetUp failed for volume "pvc-c32cdf6e-7803-45ce-bf46-a00eb013a5f2" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  <invalid>  kubelet            Unable to attach or mount volumes: unmounted volumes=[local], unattached volumes=[local default-token-slcrx]: timed out waiting for the condition

The first must-gather doesn't contain the Manila CSI driver logs, so it's hard to understand what happened there. The Manila operator didn't report any errors... The second must-gather, from Qin Ping, has the required logs, and there is only one error message:

2021-03-26T11:50:44.707506583Z Mounting command: mount
2021-03-26T11:50:44.707506583Z Mounting arguments: -t nfs 172.16.32.1:/volumes/_nogroup/93174795-380d-4331-9437-e18de9014c86 /var/lib/kubelet/pods/cb38e6bd-2f6e-4472-aac4-a87c1d5d9297/volumes/kubernetes.io~csi/pvc-c32cdf6e-7803-45ce-bf46-a00eb013a5f2/mount
2021-03-26T11:50:44.707506583Z Output: mount.nfs: Connection timed out

Could it be a network issue? Maybe. We ran the same "mount -t nfs" command on the problematic worker node and it returned the same error. We checked the network config of the problematic worker node and it looks fine. Maybe it is an issue with the PSI cluster?

Additional info:

1. Network config for the piqin-0326-txhx4-worker-0-8n2t5 node

sh-4.4# ip addr | grep 172
    inet 172.16.34.116/20 brd 172.16.47.255 scope global dynamic noprefixroute ens4
172: veth65d6dca8@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default

sh-4.4# route
Kernel IP routing table
Destination      Gateway          Genmask          Flags  Metric  Ref  Use  Iface
default          host-192-168-0-  0.0.0.0          UG     100     0    0    ens3
10.128.0.0       0.0.0.0          255.252.0.0      U      0       0    0    tun0
169.254.169.254  host-192-168-0-  255.255.255.255  UGH    100     0    0    ens3
169.254.169.254  172.16.34.1      255.255.255.255  UGH    101     0    0    ens4
172.16.32.0      0.0.0.0          255.255.240.0    U      101     0    0    ens4
172.30.0.0       0.0.0.0          255.255.0.0      U      0       0    0    tun0
192.168.0.0      0.0.0.0          255.255.192.0    U      100     0    0    ens3

I tried to reproduce it several times on PSI but I couldn't, so I think this issue was caused by an unstable environment. I'm going to close this bz now. Please reopen if the issue happens again.
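For reference, a minimal sketch of the manual check described above, i.e. mounting the Manila NFS share directly from the affected worker. The node name and export path are copied from the logs in this bug; the temporary mount point /tmp/manila-test is an arbitrary choice and not part of the original report.

$ oc debug node/piqin-0326-1-nbxx6-worker-0-st7m5    (open a debug shell on the affected worker)
sh-4.4# chroot /host                                 (use the host's own mount utilities)
sh-4.4# mkdir -p /tmp/manila-test
sh-4.4# mount -t nfs 172.16.32.1:/volumes/_nogroup/93174795-380d-4331-9437-e18de9014c86 /tmp/manila-test
sh-4.4# ls /tmp/manila-test                          (verify the share is readable)
sh-4.4# umount /tmp/manila-test                      (clean up)

If this mount also times out, as it did here, the failure is between the node and the NFS backend rather than in the CSI driver itself.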