Bug 1905133 - operator conditions special-resource-operator
Summary: operator conditions special-resource-operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-07 15:58 UTC by Seth Jennings
Modified: 2021-02-24 15:40 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
operator conditions special-resource-operator
Last Closed: 2021-02-24 15:40:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:40:49 UTC

Description Seth Jennings 2020-12-07 15:58:01 UTC
test:
operator conditions special-resource-operator 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=operator+conditions+special-resource-operator

level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator special-resource-operator is still updating 

This fails on all platforms

https://sippy.ci.openshift.org/testdetails?release=4.7&test=operator+conditions+special-resource-operator

4.7 promotions are completely blocked

https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.7.0-0.nightly
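
The installer's "still updating" fatal error corresponds to the ClusterOperator's conditions not settling (Available not yet True, or Progressing still True). A minimal sketch of that check; the helper name and the Type=Status input format are illustrative, not installer code:

```shell
# co_still_updating: treat a ClusterOperator as "still updating" when its
# condition summary shows Available=False or Progressing=True -- the state
# the installer reports for special-resource-operator above.
# (Helper name and input format are illustrative, not installer code.)
co_still_updating() {
  # $1: space-separated "Type=Status" pairs, e.g. "Available=False Progressing=True"
  case " $1 " in
    *" Available=False "*|*" Progressing=True "*) return 0 ;;
    *) return 1 ;;
  esac
}

# On a live cluster (logged-in oc session) the pairs can be produced with:
#   oc get clusteroperator special-resource-operator \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status} {end}'
co_still_updating "Available=False Progressing=True Degraded=False" && echo "still updating"
```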

Comment 1 jamo luhrsen 2020-12-07 17:35:13 UTC
I think this is the PR that addresses this issue:
https://github.com/openshift/ocp-build-data/pull/786

Comment 4 Walid A. 2021-01-26 07:13:02 UTC
Attempted to deploy SRO from https://github.com/openshift-psap/special-resource-operator.git on an entitled OCP 4.7 cluster with `SPECIALRESOURCE=nvidia-gpu make`, after deploying the Node Feature Discovery Operator from OperatorHub and adding a new machineset for a new node of a g4dn.xlarge instance with a GPU resource. The nvidia-gpu-driver-container-rhel8-bltsm pod is stuck in CrashLoopBackOff and keeps restarting, but I can still see "special resource operator" listed among the other cluster operators, and its version says "0.0.1-snapshot":

[root@ip-172-31-45-145 special-resource-operator]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      7h25m
baremetal                                  4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
cloud-credential                           4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
cluster-autoscaler                         4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
config-operator                            4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
console                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
csi-snapshot-controller                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
dns                                        4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
etcd                                       4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
image-registry                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
ingress                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
kube-apiserver                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
kube-controller-manager                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
kube-scheduler                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
kube-storage-version-migrator              4.7.0-0.nightly-2021-01-22-134922   True        False         False      7h23m
machine-api                                4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
machine-approver                           4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
machine-config                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
marketplace                                4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
monitoring                                 4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
network                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
node-tuning                                4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
openshift-apiserver                        4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
openshift-controller-manager               4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
openshift-samples                          4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
operator-lifecycle-manager                 4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
service-ca                                 4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
special-resource-operator                  0.0.1-snapshot                      True        False         False      2m59s
storage                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      14h
[root@ip-172-31-45-145 special-resource-operator]# 
[root@ip-172-31-45-145 special-resource-operator]# 
[root@ip-172-31-45-145 special-resource-operator]# oc get pods -n nvidia-gpu
NAME                                      READY   STATUS             RESTARTS   AGE
nvidia-gpu-driver-build-1-build           0/1     Completed          0          27m
nvidia-gpu-driver-container-rhel8-bltsm   0/1     CrashLoopBackOff   6          27m
[root@ip-172-31-45-145 special-resource-operator]# 
[root@ip-172-31-45-145 special-resource-operator]# oc describe co special-resource-operator
Name:         special-resource-operator
Namespace:    
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-01-26T06:10:10Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:include.release.openshift.io/ibm-cloud-managed:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2021-01-26T06:10:10Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:extension:
        f:versions:
    Manager:         manager
    Operation:       Update
    Time:            2021-01-26T06:10:23Z
  Resource Version:  289793
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/special-resource-operator
  UID:               fc95853f-7937-4e5f-9e5b-a828a54abf1a
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-01-26T06:43:21Z
    Message:               Reconciling nvidia-gpu
    Reason:                Reconciling
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-01-26T06:43:21Z
    Message:               Reconciling nvidia-gpu
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-01-26T06:43:21Z
    Message:               Special Resource Operator reconciling special resources
    Reason:                Reconciled
    Status:                False
    Type:                  Degraded
  Extension:               <nil>
  Versions:
    Name:     operator
    Version:  0.0.1-snapshot
Events:       <none>
[root@ip-172-31-45-145 special-resource-operator]#
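
The crash-looping pod above can be picked out of captured `oc get pods` output mechanically; a small sketch (the helper name is illustrative, and the pod data is copied from the output above):

```shell
# crashlooping_pods: filter captured `oc get pods` output down to the names
# of pods whose STATUS column (field 3) is CrashLoopBackOff.
# (Helper name is illustrative.)
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# On the live cluster, the next triage step would be something like:
#   oc logs -n nvidia-gpu nvidia-gpu-driver-container-rhel8-bltsm --previous
printf '%s\n' \
  'nvidia-gpu-driver-build-1-build           0/1     Completed          0          27m' \
  'nvidia-gpu-driver-container-rhel8-bltsm   0/1     CrashLoopBackOff   6          27m' \
  | crashlooping_pods
```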

Comment 5 Walid A. 2021-01-26 16:03:00 UTC
While the SRO operator was reconciling, verified that this operator's conditions and statuses (the AVAILABLE, PROGRESSING, and SINCE fields) were updating in the `oc get co` output:

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      7h48m
baremetal                                  4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
cloud-credential                           4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
cluster-autoscaler                         4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
config-operator                            4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
console                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
csi-snapshot-controller                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
dns                                        4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
etcd                                       4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
image-registry                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
ingress                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
kube-apiserver                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
kube-controller-manager                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
kube-scheduler                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
kube-storage-version-migrator              4.7.0-0.nightly-2021-01-22-134922   True        False         False      16h
machine-api                                4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
machine-approver                           4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
machine-config                             4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
marketplace                                4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
monitoring                                 4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
network                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
node-tuning                                4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
openshift-apiserver                        4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
openshift-controller-manager               4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
openshift-samples                          4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
operator-lifecycle-manager                 4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
service-ca                                 4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h
special-resource-operator                  0.0.1-snapshot                      True        False         False      20s
storage                                    4.7.0-0.nightly-2021-01-22-134922   True        False         False      23h

# oc get co | grep special
special-resource-operator                  0.0.1-snapshot                      True        False         False      26s

# oc get co | grep special
special-resource-operator                  0.0.1-snapshot                      True        False         False      3m18s

# oc get co | grep special
special-resource-operator                  0.0.1-snapshot                      False       True          False      31s

# oc get pods -n nvidia-gpu
NAME                                      READY   STATUS             RESTARTS   AGE
nvidia-gpu-driver-build-1-build           0/1     Completed          0          9h
nvidia-gpu-driver-container-rhel8-bltsm   0/1     CrashLoopBackOff   83         9h

# oc get pods -n driver-container-base
NAME                    READY   STATUS      RESTARTS   AGE
driver-container-base   0/1     Completed   0          9h

# oc get pods -n openshift-special-resource-operator
NAME                                                  READY   STATUS    RESTARTS   AGE
special-resource-controller-manager-757fc45f5-4w9gh   2/2     Running   0          9h
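
The Available/Progressing flapping seen in the repeated `oc get co | grep special` calls above can also be extracted from captured output; a sketch (the helper name is illustrative; on a live cluster `oc get clusteroperator special-resource-operator -w` would stream the same transitions):

```shell
# sro_summary: reduce a captured `oc get co` line for the SRO to its
# Available/Progressing pair (columns 3 and 4 of the standard output).
# (Helper name is illustrative.)
sro_summary() {
  awk '$1 == "special-resource-operator" { print "Available="$3, "Progressing="$4 }'
}

# Line captured above while the operator was mid-reconcile:
printf '%s\n' \
  'special-resource-operator                  0.0.1-snapshot                      False       True          False      31s' \
  | sro_summary
```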

Comment 7 errata-xmlrpc 2021-02-24 15:40:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

