During upgrade to 4.6, cluster-storage-operator (CSO) should adopt the Manila CSI driver installed from OLM.

Steps to Reproduce:
1. Install the Manila operator from OLM and create a CR for it.
2. Run an application (Deployment/DaemonSet) that uses Manila PVs.
3. Upgrade to 4.6.

Actual results:
The old OLM-based CSI driver is running.

Expected results:
- During upgrade, the operator switches from the OLM one to the CVO / CSO one.
- The new CSO-based CSI driver is running.
- The application still works.
- The user can delete the application (incl. PVCs); the volumes are unmounted and the PVs are deleted.
- The user can start a new application, and PVs are provisioned and mounted.
Looking around the cluster, the CSI driver on node piqin-0807-zlcd6-worker-xtvlr is not working:

$ oc -n openshift-manila-csi-driver logs csi-nodeplugin-nfsplugin-2kj9l
I0807 06:07:09.462191 1 nfs.go:49] Driver: nfs.csi.k8s.io version: 2.0.0
I0807 06:07:09.462292 1 nfs.go:99] Enabling volume access mode: SINGLE_NODE_WRITER
I0807 06:07:09.462297 1 nfs.go:99] Enabling volume access mode: SINGLE_NODE_READER_ONLY
I0807 06:07:09.462300 1 nfs.go:99] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0807 06:07:09.462302 1 nfs.go:99] Enabling volume access mode: MULTI_NODE_SINGLE_WRITER
I0807 06:07:09.462305 1 nfs.go:99] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0807 06:07:09.462312 1 nfs.go:110] Enabling controller service capability: UNKNOWN
I0807 06:07:09.466682 1 server.go:92] Listening for connections on address: &net.UnixAddr{Name:"/plugin/csi.sock", Net:"unix"}
E0807 07:04:55.834619 1 utils.go:50] GRPC error: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/6c9305ff-4f2d-4151-af88-a615317ed519/volumes/kubernetes.io~csi/pvc-0caf4758-7215-4893-8e8e-4554d562a4b9/mount: input/output error
E0807 07:04:55.835217 1 utils.go:50] GRPC error: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/6c9305ff-4f2d-4151-af88-a615317ed519/volumes/kubernetes.io~csi/pvc-0caf4758-7215-4893-8e8e-4554d562a4b9/mount: input/output error
[the error is repeated forever]

$ oc -n openshift-manila-csi-driver logs openstack-manila-csi-nodeplugin-7q2dh csi-driver
I0807 06:07:28.361733 1 driver.go:124] Driver: manila.csi.openstack.org
I0807 06:07:28.361835 1 driver.go:125] Driver version: 0.9.0@
I0807 06:07:28.361839 1 driver.go:126] CSI spec version: 1.2.0
I0807 06:07:28.361843 1 driver.go:129] Operating on NFS shares
I0807 06:07:28.361848 1 driver.go:134] Topology awareness disabled
I0807 06:07:28.361858 1 driver.go:197] Enabling controller service capability: CREATE_DELETE_VOLUME
I0807 06:07:28.361861 1 driver.go:197] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0807 06:07:28.361865 1 driver.go:216] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0807 06:07:28.361868 1 driver.go:216] Enabling volume access mode: MULTI_NODE_SINGLE_WRITER
I0807 06:07:28.361871 1 driver.go:216] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0807 06:07:28.361873 1 driver.go:216] Enabling volume access mode: SINGLE_NODE_WRITER
I0807 06:07:28.361875 1 driver.go:216] Enabling volume access mode: SINGLE_NODE_READER_ONLY
I0807 06:07:28.363694 1 connection.go:261] Probing CSI driver for readiness
I0807 06:07:28.366036 1 driver.go:262] proxying CSI driver nfs.csi.k8s.io version 2.0.0
I0807 06:07:28.366578 1 driver.go:227] Enabling node service capability: UNKNOWN
I0807 06:07:28.366926 1 driver.go:326] listening for connections on &net.UnixAddr{Name:"/var/lib/kubelet/plugins/manila.csi.openstack.org/csi.sock", Net:"unix"}
I0807 06:32:54.562756 1 builder.go:44] [ID:4] FWD GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0807 06:32:54.563194 1 driver.go:313] [ID:18] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0807 06:34:55.082383 1 builder.go:44] [ID:5] FWD GRPC error: rpc error: code = Canceled desc = context canceled
E0807 06:34:55.082837 1 driver.go:313] [ID:20] GRPC error: rpc error: code = Canceled desc = context canceled
[Again, the error is repeated forever]
What's interesting about the node is that it reports Kubernetes 1.18 (= OCP 4.5) and not 4.6:

$ oc get node
piqin-0807-zlcd6-master-0       Ready                      master   7h11m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-master-1       Ready                      master   7h11m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-master-2       Ready                      master   7h11m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-h4xs4   Ready                      worker   7h1m    v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-kxhzx   Ready                      worker   6h59m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-xtvlr   Ready,SchedulingDisabled   worker   6h57m   v1.18.3+08c38ef
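To spot such a node without reading the table by eye, the `oc get node` output above can be filtered by the kubelet version column. This is only a minimal sketch: the function name `find_old_kubelets` and the default prefix `v1.18.` are illustrative assumptions, and it relies on the version being the last whitespace-separated column, as in the output quoted above.

```python
from typing import List, Tuple

# Node listing copied from the `oc get node` output in this report.
SAMPLE_NODES = """\
piqin-0807-zlcd6-master-0 Ready master 7h11m v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-master-1 Ready master 7h11m v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-master-2 Ready master 7h11m v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-h4xs4 Ready worker 7h1m v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-kxhzx Ready worker 6h59m v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-xtvlr Ready,SchedulingDisabled worker 6h57m v1.18.3+08c38ef
"""


def find_old_kubelets(oc_output: str, old_prefix: str = "v1.18.") -> List[Tuple[str, str, str]]:
    """Return (name, status, version) for nodes still on the old kubelet.

    Hypothetical helper: it assumes NAME is the first column, STATUS the
    second, and VERSION the last, and skips a header row if one is present.
    """
    old = []
    for line in oc_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # blank line, short line, or header row
        name, status, version = fields[0], fields[1], fields[-1]
        if version.startswith(old_prefix):
            old.append((name, status, version))
    return old


if __name__ == "__main__":
    for name, status, version in find_old_kubelets(SAMPLE_NODES):
        print(f"{name}: {status}, kubelet {version}")
```

A node that is both on the old version and `SchedulingDisabled`, like the worker here, is one the upgrade has cordoned but not yet finished with.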
It looks like `clientaddr=10.129.2.6` is the IP address of the NFS plugin pod. During the upgrade, the Manila CSI operator is deleted and the NFS plugin pods are deleted with it, so the umount hangs.
It seems that the NFS driver tries to unmount the volume, but it times out. On the node, dmesg says:

[26637.477459] nfs: server 172.16.32.1 not responding, still trying
[26655.396484] nfs: server 172.16.32.1 not responding, timed out
[26667.172003] nfs: server 172.16.32.1 not responding, timed out
[26689.187229] nfs: server 172.16.32.1 not responding, timed out

It could be related to `clientaddr=10.129.2.6`, which the 4.5 version of the driver uses in its mount options: the driver pod with this IP address no longer exists.
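One way to test the `clientaddr` theory on a node is to compare the `clientaddr=` option of each NFS mount against the pod IPs that still exist. The following is a rough sketch under stated assumptions, not part of the driver: it expects mount lines in the `/proc/mounts` layout (device, mount point, filesystem type, options), the helper name `stale_clientaddr_mounts` is made up, and the export path and the other mount options in the sample line are invented for illustration. Only the server address, the client address, and the mount point come from this report.

```python
from typing import Iterable, List, Tuple

# Sample line in /proc/mounts layout. The export path "/example/share" and
# the options other than clientaddr/addr are hypothetical.
SAMPLE_MOUNTS = (
    "172.16.32.1:/example/share "
    "/var/lib/kubelet/pods/6c9305ff-4f2d-4151-af88-a615317ed519/volumes/"
    "kubernetes.io~csi/pvc-0caf4758-7215-4893-8e8e-4554d562a4b9/mount "
    "nfs4 rw,relatime,vers=4.1,clientaddr=10.129.2.6,addr=172.16.32.1 0 0\n"
)


def stale_clientaddr_mounts(mounts_text: str, live_ips: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (mount_point, clientaddr) for NFS mounts whose clientaddr
    is not among the currently live IP addresses."""
    live = set(live_ips)
    stale = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue  # not an NFS mount line
        # Split "key=value" options; flags without "=" map to "".
        options = dict(opt.partition("=")[::2] for opt in fields[3].split(","))
        addr = options.get("clientaddr")
        if addr is not None and addr not in live:
            stale.append((fields[1], addr))
    return stale


if __name__ == "__main__":
    # live_ips would come from the current pod IPs; this value is made up.
    for mount_point, addr in stale_clientaddr_mounts(SAMPLE_MOUNTS, {"10.129.2.9"}):
        print(f"{mount_point} still uses clientaddr={addr}")
```

The sketch reads text only and never calls `stat` on the mount point, which matters here because touching a hung NFS mount can block the caller.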
I filed 1867152 for the unmount bug, because it also affects 4.5 even without an upgrade to 4.6, and we may need to fix it there too. I'm not sure what the right status of *this* bug is, though... Do you agree that the *operator* was correctly removed from OLM and adopted by CVO? The upgrade failed, but we track that in bug #1867152.
Verified with: 4.5.9 -> 4.6.0-0.nightly-2020-09-14-225526
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196