1862523 – Implement migration of Manila operator from OLM to CSO during 4.5->4.6 upgrade

Summary: Implement migration of Manila operator from OLM to CSO during 4.5->4.6 upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Jan Safranek
QA Contact:	Qin Ping
Docs Contact:
URL:
Whiteboard:
Depends On:	1867152
Blocks:

Reported:	2020-07-31 15:52 UTC by Jan Safranek
Modified:	2020-10-27 16:22 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Release Note
Doc Text:	A release note already exists for migrating the Manila driver from OLM to CVO.
Clone Of:
Environment:
Last Closed:	2020-10-27 16:21:54 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-storage-operator pull 69	0	None	closed	Bug 1862523: Add migration controller	2020-12-10 07:40:06 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:22:21 UTC

Description Jan Safranek 2020-07-31 15:52:33 UTC

During the upgrade to 4.6, cluster-storage-operator (CSO) should adopt the Manila CSI driver that was installed from OLM.

Steps to Reproduce:
1. Install the Manila operator from OLM and create a CR for it
2. Run an application (Deployment/DaemonSet) that uses Manila PVs
3. Upgrade to 4.6

Actual results:
The old OLM-based CSI driver is running.

Expected results:
- During the upgrade, the operator switches from the OLM-managed one to the CVO / CSO-managed one.
- The new CSO-based CSI driver is running.
- The application still works.
- The user can delete the application (incl. PVCs); the volumes are unmounted and the PVs are deleted.
- The user can start a new application; PVs are provisioned and mounted.

Comment 6 Jan Safranek 2020-08-07 10:11:26 UTC

Looking around the cluster, the CSI driver on node piqin-0807-zlcd6-worker-xtvlr is not working:

$ oc -n openshift-manila-csi-driver logs csi-nodeplugin-nfsplugin-2kj9l
I0807 06:07:09.462191       1 nfs.go:49] Driver: nfs.csi.k8s.io version: 2.0.0
I0807 06:07:09.462292       1 nfs.go:99] Enabling volume access mode: SINGLE_NODE_WRITER
I0807 06:07:09.462297       1 nfs.go:99] Enabling volume access mode: SINGLE_NODE_READER_ONLY
I0807 06:07:09.462300       1 nfs.go:99] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0807 06:07:09.462302       1 nfs.go:99] Enabling volume access mode: MULTI_NODE_SINGLE_WRITER
I0807 06:07:09.462305       1 nfs.go:99] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0807 06:07:09.462312       1 nfs.go:110] Enabling controller service capability: UNKNOWN
I0807 06:07:09.466682       1 server.go:92] Listening for connections on address: &net.UnixAddr{Name:"/plugin/csi.sock", Net:"unix"}
E0807 07:04:55.834619       1 utils.go:50] GRPC error: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/6c9305ff-4f2d-4151-af88-a615317ed519/volumes/kubernetes.io~csi/pvc-0caf4758-7215-4893-8e8e-4554d562a4b9/mount: input/output error
E0807 07:04:55.835217       1 utils.go:50] GRPC error: rpc error: code = Internal desc = stat /var/lib/kubelet/pods/6c9305ff-4f2d-4151-af88-a615317ed519/volumes/kubernetes.io~csi/pvc-0caf4758-7215-4893-8e8e-4554d562a4b9/mount: input/output error

[the error is repeated forever]

$ oc -n openshift-manila-csi-driver logs openstack-manila-csi-nodeplugin-7q2dh csi-driver
I0807 06:07:28.361733       1 driver.go:124] Driver: manila.csi.openstack.org
I0807 06:07:28.361835       1 driver.go:125] Driver version: 0.9.0@
I0807 06:07:28.361839       1 driver.go:126] CSI spec version: 1.2.0
I0807 06:07:28.361843       1 driver.go:129] Operating on NFS shares
I0807 06:07:28.361848       1 driver.go:134] Topology awareness disabled
I0807 06:07:28.361858       1 driver.go:197] Enabling controller service capability: CREATE_DELETE_VOLUME
I0807 06:07:28.361861       1 driver.go:197] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0807 06:07:28.361865       1 driver.go:216] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0807 06:07:28.361868       1 driver.go:216] Enabling volume access mode: MULTI_NODE_SINGLE_WRITER
I0807 06:07:28.361871       1 driver.go:216] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0807 06:07:28.361873       1 driver.go:216] Enabling volume access mode: SINGLE_NODE_WRITER
I0807 06:07:28.361875       1 driver.go:216] Enabling volume access mode: SINGLE_NODE_READER_ONLY
I0807 06:07:28.363694       1 connection.go:261] Probing CSI driver for readiness
I0807 06:07:28.366036       1 driver.go:262] proxying CSI driver nfs.csi.k8s.io version 2.0.0
I0807 06:07:28.366578       1 driver.go:227] Enabling node service capability: UNKNOWN
I0807 06:07:28.366926       1 driver.go:326] listening for connections on &net.UnixAddr{Name:"/var/lib/kubelet/plugins/manila.csi.openstack.org/csi.sock", Net:"unix"}
I0807 06:32:54.562756       1 builder.go:44] [ID:4] FWD GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E0807 06:32:54.563194       1 driver.go:313] [ID:18] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0807 06:34:55.082383       1 builder.go:44] [ID:5] FWD GRPC error: rpc error: code = Canceled desc = context canceled
E0807 06:34:55.082837       1 driver.go:313] [ID:20] GRPC error: rpc error: code = Canceled desc = context canceled

[Again, the error is repeated forever]
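To see which errors dominate logs like the two above, the repeated lines can be grouped by source location and message. The following is only a minimal sketch: it assumes the klog line layout visible in the logs above (severity letter, MMDD date, time, PID, file:line, message), and the function name `summarize_errors` is made up for this illustration, not part of any driver or tool.

```python
import re
from collections import Counter

# klog header as seen in the logs above:
# severity letter, MMDD, time, PID, file:line, then "] message".
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\s\]]+)\] (?P<msg>.*)$"
)


def summarize_errors(log_text):
    """Count error-level (E) messages, keyed by (file:line, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = KLOG_RE.match(line.strip())
        if m is None or m.group("sev") != "E":
            continue
        # Drop the per-request "[ID:n] " prefix so repeats group together.
        msg = re.sub(r"^\[ID:\d+\] ", "", m.group("msg"))
        counts[(m.group("src"), msg)] += 1
    return counts
```

Feeding it the output of `oc logs` and calling `.most_common()` on the result lists the most frequent error first.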

Comment 7 Jan Safranek 2020-08-07 10:19:19 UTC

What's interesting about the node is that it reports Kubernetes 1.18 (=OCP 4.5) and not 4.6:

$ oc get node
piqin-0807-zlcd6-master-0       Ready                      master   7h11m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-master-1       Ready                      master   7h11m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-master-2       Ready                      master   7h11m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-h4xs4   Ready                      worker   7h1m    v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-kxhzx   Ready                      worker   6h59m   v0.0.0-master+$Format:%h$
piqin-0807-zlcd6-worker-xtvlr   Ready,SchedulingDisabled   worker   6h57m   v1.18.3+08c38ef
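Spotting the node that is still on the old kubelet can be automated by filtering the VERSION column of the listing above. This is a rough sketch that assumes the default five-column `oc get node` layout (NAME, STATUS, ROLES, AGE, VERSION); the helper name `nodes_on_old_kubelet` and the default `v1.18` prefix are choices made for this example only.

```python
def nodes_on_old_kubelet(oc_get_node_output, old_prefix="v1.18"):
    """Return names of nodes whose VERSION column starts with old_prefix."""
    stale = []
    for line in oc_get_node_output.splitlines():
        fields = line.split()
        # Expect NAME STATUS ROLES AGE VERSION; skip the header and short lines.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, version = fields[0], fields[-1]
        if version.startswith(old_prefix):
            stale.append(name)
    return stale
```

A node that shows up here while also being `SchedulingDisabled` is likely stuck mid-drain, which matches what the listing above shows for the third worker.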

Comment 8 Qin Ping 2020-08-07 11:24:03 UTC

It looks like `clientaddr=10.129.2.6` is the IP of the NFS plugin pod. During the upgrade, the Manila CSI operator is deleted, and the NFS plugin pods are deleted with it.
So the umount hangs there.

Comment 9 Jan Safranek 2020-08-07 12:18:16 UTC

It seems that the NFS driver tries to unmount the volume, but it times out. On the node, dmesg says:

[26637.477459] nfs: server 172.16.32.1 not responding, still trying
[26655.396484] nfs: server 172.16.32.1 not responding, timed out
[26667.172003] nfs: server 172.16.32.1 not responding, timed out
[26689.187229] nfs: server 172.16.32.1 not responding, timed out

It could be related to `clientaddr=10.129.2.6` used in the mount options of the 4.5 version of the driver: the driver pod with this IP address no longer exists.
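One way to test this suspicion on a node is to list NFS mounts whose `clientaddr` option no longer matches any running pod IP. The sketch below assumes input in `/proc/mounts` format (device, mount point, filesystem type, comma-separated options) and a set of live IPs collected separately; the function name `stale_nfs_mounts` is hypothetical. It only flags candidates and does not prove that a stale `clientaddr` causes the hang.

```python
def stale_nfs_mounts(mounts_text, live_ips):
    """Return (mount_point, clientaddr) pairs whose clientaddr is not in live_ips."""
    stale = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts layout: device mountpoint fstype options dump pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        # Keep only key=value options; bare flags such as "rw" are ignored.
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        addr = opts.get("clientaddr")
        if addr is not None and addr not in live_ips:
            stale.append((fields[1], addr))
    return stale
```

On the affected node, the input would be the contents of `/proc/mounts`, and `live_ips` would be the IPs of the pods currently running there.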

Comment 10 Jan Safranek 2020-08-07 14:02:22 UTC

I filed 1867152 for the unmount bug, because it also affects 4.5 without an upgrade to 4.6, and we may need to fix it there too.

I'm not sure what the right status of *this* bug is, though... Do you agree that the *operator* was correctly removed from OLM and adopted by CVO? The upgrade failed, but we track that in bug #1867152.

Comment 11 Qin Ping 2020-09-15 06:42:11 UTC

Verified with: 4.5.9 -> 4.6.0-0.nightly-2020-09-14-225526

Comment 13 errata-xmlrpc 2020-10-27 16:21:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
