Bug 1882394
Summary: | CSO stuck on message "the cluster operator storage has not yet successfully rolled out" while downgrading from 4.6 -> 4.5 | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Fabio Bertinatto <fbertina> |
Component: | Storage | Assignee: | Fabio Bertinatto <fbertina> |
Storage sub component: | Operators | QA Contact: | Wei Duan <wduan> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | medium | CC: | aos-bugs, eparis, fbertina, jsafrane, piqin, pmali, sdodson, tsze, wduan, wking |
Version: | 4.5 | Keywords: | Reopened, TestBlocker |
Target Milestone: | --- | ||
Target Release: | 4.5.z | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | 1877316 | Environment: | |
Last Closed: | 2020-10-26 15:11:50 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1877316 | ||
Bug Blocks: |
Comment 3
Fabio Bertinatto
2020-09-29 07:17:24 UTC
Hi Fabio,

I performed a downgrade from 4.6.0-rc.0 to 4.5.0-0.nightly-2020-10-07-231808 (which should contain the fix), but reproduced this problem. The storage CO did not roll out:

```
[wduan@MINT 01_general]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.0   True        True          40m     Unable to apply 4.5.0-0.nightly-2020-10-07-231808: the cluster operator storage has not yet successfully rolled out

[wduan@MINT verification-tests]$ oc get co storage
NAME      VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.6.0-rc.0   True        False         False      3h34m
```

From the cluster-storage-operator log, the ConfigMap lock looks like it was deleted (?), but the operator was not able to become the leader:

```
[wduan@MINT 01_general]$ oc -n openshift-cluster-storage-operator logs pod/cluster-storage-operator-86d6fbc996-7l8z7
{"level":"info","ts":1602214988.450045,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1602214988.450071,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1602214988.4500754,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0"}
{"level":"info","ts":1602214988.897236,"logger":"cmd","msg":"Found ConfigMap lock without metadata.ownerReferences, deleting"}
{"level":"info","ts":1602214988.950875,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1602214989.1881082,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1602214990.3181906,"logger":"leader","msg":"Not the leader. Waiting."}
...
{"level":"info","ts":1602226999.4777808,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1602227016.1669917,"logger":"leader","msg":"Not the leader. Waiting."}
```

The cluster-storage-operator-lock ConfigMap:

```
[wduan@MINT verification-tests]$ oc -n openshift-cluster-storage-operator get cm cluster-storage-operator-lock -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"","leaseDurationSeconds":0,"acquireTime":null,"renewTime":null,"leaderTransitions":0}'
  creationTimestamp: "2020-10-09T03:43:08Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:control-plane.alpha.kubernetes.io/leader: {}
    manager: cluster-storage-operator
    operation: Update
    time: "2020-10-09T03:43:09Z"
  name: cluster-storage-operator-lock
  namespace: openshift-cluster-storage-operator
  resourceVersion: "54649"
  selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/configmaps/cluster-storage-operator-lock
  uid: 59894bf2-5b37-48be-9006-6cf08b427e2c
```

CSO 4.5 deleted the ConfigMap here:
> {"level":"info","ts":1602214988.897236,"logger":"cmd","msg":"Found ConfigMap lock without metadata.ownerReferences, deleting"}
This epoch translates to Friday, October 9, 2020, 3:43:08 AM (UTC).
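For reference, a minimal Go snippet to double-check that conversion; the timestamp is taken from the "Found ConfigMap lock ... deleting" log entry quoted above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unix timestamp from the CSO 4.5 log line that deleted the lock ConfigMap.
	const deletedAt = 1602214988
	fmt.Println(time.Unix(deletedAt, 0).UTC()) // 2020-10-09 03:43:08 +0000 UTC
}
```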
However, the ConfigMap from the description was created on "2020-10-09T03:43:08Z".
This means that something else (CSO 4.6) created the ConfigMap right after it was deleted by CSO 4.5.
Apparently both CSOs were running at the same time for a short period. This is possible because they use different leader-election approaches.
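For context: the 4.5 operator uses the operator-sdk "leader for life" pattern, where leadership is acquired by creating the lock ConfigMap with an ownerReference pointing at the operator's own Pod, and followers keep retrying ("Not the leader. Waiting.") until that ConfigMap is gone. The 4.6 operator instead records its leadership in the control-plane.alpha.kubernetes.io/leader annotation, which is what appears on the ConfigMap above. Below is a rough Go sketch of the 4.5-style acquisition loop; the names, retry interval, and error handling are illustrative only, not the operator's actual code:

```go
// Rough sketch of a "leader for life" acquisition loop: whoever creates the
// lock ConfigMap is the leader, and the lock is garbage-collected together
// with its owning Pod.
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// becomeLeader blocks until this instance owns the lock ConfigMap.
func becomeLeader(ctx context.Context, ns, lockName string, ownPod metav1.OwnerReference) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      lockName,
			Namespace: ns,
			// Owned by this Pod, so the lock disappears when the Pod does.
			OwnerReferences: []metav1.OwnerReference{ownPod},
		},
	}

	for {
		_, err := client.CoreV1().ConfigMaps(ns).Create(ctx, lock, metav1.CreateOptions{})
		switch {
		case err == nil:
			return nil // we created the lock, so we are the leader
		case apierrors.IsAlreadyExists(err):
			// Someone else holds the lock: "Not the leader. Waiting."
			time.Sleep(time.Second)
		default:
			return err
		}
	}
}

func main() {
	// Hypothetical owner reference for this operator Pod (values illustrative).
	self := metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       "Pod",
		Name:       "cluster-storage-operator-example",
		UID:        "00000000-0000-0000-0000-000000000000",
	}
	if err := becomeLeader(context.Background(),
		"openshift-cluster-storage-operator", "cluster-storage-operator-lock", self); err != nil {
		panic(err)
	}
}
```

A lock created by the annotation-based approach has no ownerReferences, so an operator waiting in a loop like the one above never sees it go away, which matches the repeated "Not the leader. Waiting." messages in the log.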
Hi Fabio,

With https://github.com/openshift/cluster-storage-operator/pull/96, we verified that the upgrade/downgrade path 4.5 -> 4.6 -> 4.5 passes.

*** Bug 1877899 has been marked as a duplicate of this bug. ***

Verified pass. Performed upgrade/downgrade successfully for 4.5 <-> 4.5 and 4.5 <-> 4.6, and also checked that the CI upgrade from 4.4 succeeds, so changing the status to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5.16 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4268