Bug 1688582

Summary: [Upgrade] cluster-storage-operator incorrectly reporting status and version after the upgrade
Product: OpenShift Container Platform Reporter: Wenqi He <wehe>
Component: StorageAssignee: Bradley Childs <bchilds>
Status: CLOSED ERRATA QA Contact: Wenqi He <wehe>
Severity: medium Docs Contact:
Priority: high    
Version: 4.1.0CC: aos-bugs, aos-storage-staff, bbennett, bchilds, chaoyang, eparis, jokerman, mmccomas, sponnaga, wsun
Target Milestone: ---Keywords: BetaBlocker
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:45:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Wenqi He 2019-03-14 03:16:41 UTC
Description of problem:
After upgrade OCP 4.0, the cluster-storage-operator doesn't show the right change of its conditions including time and status

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         21m     Cluster version is 4.0.0-0.nightly-2019-03-04-234414


How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.0 next-gen on AWS
2. Upgrade it to new version
3. Check the status of cluster-storage-operator

Actual results:
After upgrade the cluster-storage-operator does not show correctly:
$ oc describe clusteroperator cluster-storage-operator
Name:         cluster-storage-operator
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-13T06:43:15Z
  Generation:          1
  Resource Version:    9384
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/cluster-storage-operator
  UID:                 47858015-455b-11e9-b02f-02e5db6b6df6
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-13T06:43:15Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-13T06:43:15Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-03-13T06:43:15Z
    Status:                False
    Type:                  Failing
  Extension:               <nil>
  Related Objects:         <nil>
  Versions:
    Name:     operator
    Version:  0.0.1
Events:       <none>

But Actually the pods has updated along with the upgrade:
$ oc get pods cluster-storage-operator-54d9899ccc-588bs -oyaml -n openshift-cluster-storage-operator
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "ips": [
              "10.128.0.70"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  creationTimestamp: 2019-03-13T08:31:41Z
  generateName: cluster-storage-operator-54d9899ccc-
  labels:
    name: cluster-storage-operator
    pod-template-hash: 54d9899ccc
  name: cluster-storage-operator-54d9899ccc-588bs
  namespace: openshift-cluster-storage-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: cluster-storage-operator-54d9899ccc
    uid: 6d70c2e8-456a-11e9-9148-06295a0e4212
  resourceVersion: "79560"
  selfLink: /api/v1/namespaces/openshift-cluster-storage-operator/pods/cluster-storage-operator-54d9899ccc-588bs
  uid: 6d7584cb-456a-11e9-9148-06295a0e4212
spec:
  containers:
  - command:
    - cluster-storage-operator
    env:
    - name: WATCH_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: OPERATOR_NAME
      value: cluster-storage-operator
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c925fe57d348c88842b9bf12fa546e3ec101e6b080bcb3bf224750e7f373946c
    imagePullPolicy: IfNotPresent
    name: cluster-storage-operator
    ports:
    - containerPort: 60000
      name: metrics
      protocol: TCP
    resources: {}
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      procMount: Default
      runAsUser: 1000340000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: cluster-storage-operator-token-jbvzv
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: cluster-storage-operator-dockercfg-p7898
  nodeName: ip-172-31-128-77.us-east-2.compute.internal
  nodeSelector:
    node-role.kubernetes.io/master: ""
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000340000
    seLinuxOptions:
      level: s0:c18,c17
  serviceAccount: cluster-storage-operator
  serviceAccountName: cluster-storage-operator
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  volumes:
  - name: cluster-storage-operator-token-jbvzv
    secret:
      defaultMode: 420
      secretName: cluster-storage-operator-token-jbvzv
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-03-13T08:31:41Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-03-13T08:31:57Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2019-03-13T08:31:57Z
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: 2019-03-13T08:31:41Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://922e12ec2e38d980755c6a3c42a353657ee3c7a838e33f6ce1fecb0e190d01d9
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c925fe57d348c88842b9bf12fa546e3ec101e6b080bcb3bf224750e7f373946c
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c925fe57d348c88842b9bf12fa546e3ec101e6b080bcb3bf224750e7f373946c
    lastState: {}
    name: cluster-storage-operator
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2019-03-13T08:31:54Z
  hostIP: 172.31.128.77
  phase: Running
  podIP: 10.128.0.70
  qosClass: BestEffort
  startTime: 2019-03-13T08:31:41Z


Expected results:
clusteroperator's condition should also be updated

Additional info:
Related to bug#1686121

Comment 1 Matthew Wong 2019-03-14 16:38:22 UTC
https://github.com/openshift/cluster-storage-operator/pull/16 should fix this, it's not in the installer yet (currently 0.14)

Comment 2 Bradley Childs 2019-03-14 21:20:57 UTC
On second thought, moving this over to the installer component in case there's some work to be done there.

Comment 3 Eric Paris 2019-03-14 22:33:33 UTC
Putting this back on storage. This a bug in your code. Seems like it sits modified until QE tests a nightly.

Comment 4 Wenqi He 2019-03-19 08:31:07 UTC
I have tested with the upgrade between below two versions, still not changed after upgrade:
Before upgrade:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-18-200009   True        False         4h55m   Cluster version is 4.0.0-0.nightly-2019-03-18-200009

$ oc get pods -oyaml -n openshift-cluster-storage-operator | grep image
      image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27e821eabac565c10d0f8833dc812a26d5803c5847b12f2d5c1951d4b257d96f
$ oc image info quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:27e821eabac565c10d0f8833dc812a26d5803c5847b12f2d5c1951d4b257d96f | grep io.openshift.build.commit.id
             io.openshift.build.commit.id=4cdc1e782067eacd0eed79cc886b023868498194

Which above shows the image contains the fix already.

$ oc get pods -oyaml -n openshift-cluster-storage-operator | grep conditions -A 25
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T02:52:35Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T02:52:46Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T02:52:46Z
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T02:52:35Z
      status: "True"
      type: PodScheduled

$ oc describe clusteroperator storage | grep conditions
  Conditions:
    Last Transition Time:  2019-03-19T02:52:47Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-19T02:52:47Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-03-19T02:52:47Z
    Status:                False
    Type:                  Failing

After upgrade:
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-18-223058   True        False         7m55s   Cluster version is 4.0.0-0.nightly-2019-03-18-223058

$ oc get pods -oyaml -n openshift-cluster-storage-operator | grep conditions -A 25
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T08:15:23Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T08:15:34Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T08:15:34Z
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: 2019-03-19T08:15:23Z
      status: "True"
      type: PodScheduled
$ oc get clusteroperator
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
service-catalog-apiserver             4.0.0-0.nightly-2019-03-18-223058   True        False         False     12m
service-catalog-controller-manager    4.0.0-0.nightly-2019-03-18-223058   True        False         False     69s
storage                               4.0.0-0.nightly-2019-03-18-223058   True        False         False     5h35m

$ oc describe clusteroperator storage | grep Conditions -A 15
  Conditions:
    Last Transition Time:  2019-03-19T02:52:47Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-19T02:52:47Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-03-19T02:52:47Z
    Status:                False
    Type:                  Failing

Comment 5 Matthew Wong 2019-03-19 16:26:03 UTC
I don't understand, the clusteroperator has Available True and is reporting version 4.0.0-0.nightly-2019-03-18-223058. Is this not expected? What is the expected behaviour?

Comment 6 Matthew Wong 2019-03-19 20:39:14 UTC
Is it because Last Transition Time hasn't changed? It is bit tricky because unlike other operators the storage operator's operand is just a StorageClass at the moment, the class is either created or not, there is no in-between state. But it might make sense to make the status Progressing for the x milliseconds during which the version does not match.

Comment 7 Wenqi He 2019-03-20 01:56:41 UTC
Hi Matthew,
Yes, you are right, the bug reported because of the Last Transition Time has not changed, and when we `$ oc get clusteroperator` it still shows SINCE 5h35m ago while other operators shows just created after the upgrade. BTW, I am requested by manager to report this bug and it also related to the bug#1686121 which I mentioned in Description at the beginning. It said each operator owner should be responsible to each operator. Sorry let you confuse.

Comment 8 Matthew Wong 2019-03-20 18:31:38 UTC
No problem, I did not read carefully. I've opened a PR: https://github.com/openshift/cluster-storage-operator/pull/22

Comment 9 Chao Yang 2019-03-25 10:08:06 UTC
Hi mawong,

I tried to check this bug, but the last Transition Time still not update

Before upgrade:
oc get pods -oyaml -n openshift-cluster-storage-operator | grep conditions -A 25
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2019-03-25T03:16:14Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2019-03-25T03:16:24Z
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2019-03-25T03:16:24Z
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: 2019-03-25T03:16:14Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://45039e6910f5a7f4a166852d9fdc390a07a8894ff38213e9559730cbaa280a7d
      image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dd43925dad987bc527cb625dc0316e1ddbf92c45c1fbb3198b344ecbb028f541
      imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dd43925dad987bc527cb625dc0316e1ddbf92c45c1fbb3198b344ecbb028f541
      lastState: {}
      name: cluster-storage-operator
      ready: true
      restartCount: 0
      state:

After upgrade:
oc describe clusteroperator storage
Name:         storage
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-25T03:16:24Z
  Generation:          1
  Resource Version:    107915
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/storage
  UID:                 5ef48151-4eac-11e9-b15a-0227bde3ac64
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-25T03:16:24Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-03-25T03:16:24Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-03-25T03:16:24Z
    Status:                False
    Type:                  Failing
  Extension:               <nil>
  Related Objects:         <nil>
  Versions:
    Name:     operator
    Version:  4.0.0-0.nightly-2019-03-23-183709
Events:       <none>

Comment 10 Matthew Wong 2019-03-25 16:51:22 UTC
Please try on nightly 4.0.0-0.alpha-2019-03-24-171037 or later, it worked for me (I am using libvirt, I bump the release version & the clusterstorage goes Available(2019-03-25T16:42:43Z)->Progressing(2019-03-25T16:47:39Z)->Available(2019-03-25T16:47:39Z))

  - lastTransitionTime: 2019-03-25T16:47:39Z
    message: Unsupported platform for storageclass creation
    status: "True"
    type: Available
  - lastTransitionTime: 2019-03-25T16:42:43Z
    status: "False"
    type: Failing
  - lastTransitionTime: 2019-03-25T16:47:39Z
    status: "False"
    type: Progressing

Comment 11 Matthew Wong 2019-03-25 17:58:59 UTC
The change is present in 4.0.0-0.alpha-2019-03-22-235916 https://origin-release.svc.ci.openshift.org/releasestream/4.0.0-0.alpha/release/4.0.0-0.alpha-2019-03-25-172405?from=4.0.0-0.alpha-2019-03-22-235916, so it should work if the pre-upgrade version is at least 4.0.0-0.alpha-2019-03-22-235916

Comment 13 Chao Yang 2019-03-27 04:25:06 UTC
It is passed when upgrade from 4.0.0-0.nightly-2019-03-25-180911 to 4.0.0-0.nightly-2019-03-26-034754

 oc describe clusteroperator storage
Name:         storage
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2019-03-26T06:01:36Z
  Generation:          1
  Resource Version:    545806
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/storage
  UID:                 9d6d30bd-4f8c-11e9-8526-0ab0e92a085a
Spec:
Status:
  Conditions:
    Last Transition Time:  2019-03-27T04:05:40Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2019-03-26T06:01:36Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2019-03-27T04:05:40Z
    Status:                False
    Type:                  Progressing
  Extension:               <nil>
  Related Objects:         <nil>
  Versions:
    Name:     operator
    Version:  4.0.0-0.nightly-2019-03-26-034754
Events:       <none>

Comment 14 Wenqi He 2019-03-27 05:36:52 UTC
@chaoyang, Thanks to help me drive here!

Comment 17 errata-xmlrpc 2019-06-04 10:45:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758