Bug 1882394 - CSO stuck on message "the cluster operator storage has not yet successfully rolled out" while downgrading from 4.6 -> 4.5
Summary: CSO stuck on message "the cluster operator storage has not yet successfully rolled out" while downgrading from 4.6 -> 4.5
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.5
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.z
Assignee: Fabio Bertinatto
QA Contact: Wei Duan
URL:
Whiteboard:
Duplicates: 1877899
Depends On: 1877316
Blocks:
 
Reported: 2020-09-24 13:02 UTC by Fabio Bertinatto
Modified: 2021-01-08 18:08 UTC
CC: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1877316
Environment:
Last Closed: 2020-10-26 15:11:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-storage-operator pull 91 0 None closed Bug 1882394: Delete lock created by CSO 4.6 2021-01-26 14:57:59 UTC
Red Hat Product Errata RHBA-2020:4268 0 None None None 2020-10-26 15:12:17 UTC

Comment 3 Fabio Bertinatto 2020-09-29 07:17:24 UTC
PR [1] has been approved; waiting for the patch manager to tag it.

[1] https://github.com/openshift/cluster-storage-operator/pull/91

Comment 5 Wei Duan 2020-10-09 07:07:17 UTC
Hi Fabio, I performed a downgrade from 4.6.0-rc.0 to 4.5.0-0.nightly-2020-10-07-231808 (which should contain the fix) but reproduced this problem.

Storage co did not roll out: 
    [wduan@MINT 01_general]$ oc get clusterversion
    NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.6.0-rc.0   True        True          40m     Unable to apply 4.5.0-0.nightly-2020-10-07-231808: the cluster operator storage has not yet successfully rolled out

    [wduan@MINT verification-tests]$ oc get co storage
    NAME      VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
    storage   4.6.0-rc.0   True        False         False      3h34m

From the cluster-storage-operator log, the ConfigMap lock appears to have been deleted, but the operator was not able to become the leader:
    [wduan@MINT 01_general]$ oc -n openshift-cluster-storage-operator logs pod/cluster-storage-operator-86d6fbc996-7l8z7
    {"level":"info","ts":1602214988.450045,"logger":"cmd","msg":"Go Version: go1.13.4"}
    {"level":"info","ts":1602214988.450071,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
    {"level":"info","ts":1602214988.4500754,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0"}
    {"level":"info","ts":1602214988.897236,"logger":"cmd","msg":"Found ConfigMap lock without metadata.ownerReferences, deleting"}
    {"level":"info","ts":1602214988.950875,"logger":"leader","msg":"Trying to become the leader."}
    {"level":"info","ts":1602214989.1881082,"logger":"leader","msg":"Not the leader. Waiting."}
    {"level":"info","ts":1602214990.3181906,"logger":"leader","msg":"Not the leader. Waiting."}
    ...
    {"level":"info","ts":1602226999.4777808,"logger":"leader","msg":"Not the leader. Waiting."}
    {"level":"info","ts":1602227016.1669917,"logger":"leader","msg":"Not the leader. Waiting."}

cluster-storage-operator-lock CM:
    [wduan@MINT verification-tests]$ oc -n openshift-cluster-storage-operator get cm cluster-storage-operator-lock -o yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      annotations:
        control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"","leaseDurationSeconds":0,"acquireTime":null,"renewTime":null,"leaderTransitions":0}'
      creationTimestamp: "2020-10-09T03:43:08Z"
      managedFields:
      - apiVersion: v1
        fieldsType: FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .: {}
              f:control-plane.alpha.kubernetes.io/leader: {}
        manager: cluster-storage-operator
        operation: Update
        time: "2020-10-09T03:43:09Z"
      name: cluster-storage-operator-lock
      namespace: openshift-cluster-storage-operator
      resourceVersion: "54649"
      selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/configmaps/cluster-storage-operator-lock
      uid: 59894bf2-5b37-48be-9006-6cf08b427e2c
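
The control-plane.alpha.kubernetes.io/leader annotation above holds a JSON leader-election record. A minimal Go sketch (the struct below is illustrative, not taken from the CSO codebase) that decodes the value and shows that no holder is recorded even though the lock ConfigMap itself still exists:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // leaderRecord mirrors the JSON stored in the
    // control-plane.alpha.kubernetes.io/leader annotation shown above.
    type leaderRecord struct {
        HolderIdentity       string  `json:"holderIdentity"`
        LeaseDurationSeconds int     `json:"leaseDurationSeconds"`
        AcquireTime          *string `json:"acquireTime"`
        RenewTime            *string `json:"renewTime"`
        LeaderTransitions    int     `json:"leaderTransitions"`
    }

    func main() {
        // Annotation value copied verbatim from the ConfigMap above.
        raw := `{"holderIdentity":"","leaseDurationSeconds":0,"acquireTime":null,"renewTime":null,"leaderTransitions":0}`

        var rec leaderRecord
        if err := json.Unmarshal([]byte(raw), &rec); err != nil {
            panic(err)
        }
        // holderIdentity is empty: no leader is recorded in the lock, yet the
        // ConfigMap still exists, so the 4.5 operator keeps waiting on it.
        fmt.Printf("holder=%q transitions=%d\n", rec.HolderIdentity, rec.LeaderTransitions)
    }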

Comment 6 Fabio Bertinatto 2020-10-09 09:39:36 UTC
CSO 4.5 deleted the ConfigMap here:

> {"level":"info","ts":1602214988.897236,"logger":"cmd","msg":"Found ConfigMap lock without metadata.ownerReferences, deleting"}

This epoch translates to Friday, October 9, 2020 3:43:08.897 AM UTC.
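
A quick way to double-check the conversion (plain Go time package; the value is the ts field from the "deleting" log line quoted above):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // ts=1602214988.897236 from the "deleting" log line quoted above.
        t := time.Unix(1602214988, 897236000).UTC()
        fmt.Println(t) // 2020-10-09 03:43:08.897236 +0000 UTC
    }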

However, the ConfigMap shown in comment 5 has creationTimestamp "2020-10-09T03:43:08Z".

This means that something else (CSO 4.6) created the ConfigMap right after it was deleted by CSO 4.5.

Apparently both CSOs were running at the same time for a short period. This is possible because they use different leader election approaches, so one does not block the other from starting.
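
For illustration, a minimal client-go sketch of the kind of cleanup involved here; this is not the actual code from PR 91 or PR 96, and the names and retry policy are assumptions. The point is that deleting a lock ConfigMap without ownerReferences only once is not enough: a still-running 4.6 CSO can recreate it right after the delete, so the check has to be repeated until the stale lock stays gone:

    package main

    import (
        "context"
        "log"
        "time"

        "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    const (
        lockNamespace = "openshift-cluster-storage-operator"
        lockName      = "cluster-storage-operator-lock"
    )

    // waitForStaleLockCleanup keeps removing a lock ConfigMap that has no
    // ownerReferences (i.e. one created by the 4.6 CSO) until none is left,
    // so the operator-sdk "leader for life" election can proceed.
    func waitForStaleLockCleanup(ctx context.Context, client kubernetes.Interface) error {
        for {
            cm, err := client.CoreV1().ConfigMaps(lockNamespace).Get(ctx, lockName, metav1.GetOptions{})
            if errors.IsNotFound(err) {
                return nil // no lock left behind, safe to try to become the leader
            }
            if err != nil {
                return err
            }
            if len(cm.OwnerReferences) == 0 {
                log.Println("Found ConfigMap lock without metadata.ownerReferences, deleting")
                if err := client.CoreV1().ConfigMaps(lockNamespace).Delete(ctx, lockName, metav1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) {
                    return err
                }
                // Re-check: the 4.6 operator may still be running and may have
                // recreated the lock right after the delete (the race seen here).
                continue
            }
            // The lock is owned by another pod of this operator; keep waiting.
            log.Println("Not the leader. Waiting.")
            time.Sleep(5 * time.Second)
        }
    }

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)
        if err := waitForStaleLockCleanup(context.Background(), client); err != nil {
            log.Fatal(err)
        }
        // ...normal operator-sdk leader election would follow here...
    }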

Comment 8 Wei Duan 2020-10-13 08:13:44 UTC
Hi Fabio, 

With https://github.com/openshift/cluster-storage-operator/pull/96, we verified that the upgrade/downgrade path 4.5 -> 4.6 -> 4.5 passes.

Comment 10 To Hung Sze 2020-10-15 17:30:42 UTC
*** Bug 1877899 has been marked as a duplicate of this bug. ***

Comment 11 Wei Duan 2020-10-16 00:37:37 UTC
Verified pass.
Performed the upgrade/downgrade successfully for 4.5 <-> 4.5 and 4.5 <-> 4.6, and also checked that the CI upgrade from 4.4 succeeds, so changing the status to VERIFIED.

Comment 14 errata-xmlrpc 2020-10-26 15:11:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.16 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4268

Comment 15 Lalatendu Mohanty 2021-01-08 18:08:23 UTC
*** Bug 1877899 has been marked as a duplicate of this bug. ***

