Bug 1906129 - OCP 4.7: Node Feature Discovery (NFD) Operator in CrashLoopBackOff when deployed from OperatorHub
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Feature Discovery Operator
Version: 4.7
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Carlos Eduardo Arango Gutierrez
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-09 17:46 UTC by Walid A.
Modified: 2021-02-24 15:02 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:01:39 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-nfd-operator pull 117 | 0 | None | closed | Bug 1906129: fix manifest files after upstream/downstream sync event | 2021-01-27 18:15:06 UTC
Github | openshift node-feature-discovery pull 27 | 0 | None | closed | Bug 1906129: fix dockerfile | 2021-01-27 18:14:23 UTC
Red Hat Product Errata | RHSA-2020:5635 | 0 | None | None | None | 2021-02-24 15:02:57 UTC

Description Walid A. 2020-12-09 17:46:16 UTC
Description of problem:
This is happening on an OCP 4.7.0-0.nightly-2020-12-04-013308 IPI-installed cluster on AWS.  When you deploy the Node Feature Discovery (NFD) operator from OperatorHub in a custom namespace, the operator goes into the CrashLoopBackOff state and fails to deploy.  Our automated install (Flexy) creates a new CatalogSource, qe-app-registry, for the latest NFD images.  The operator image it was trying to deploy was 4.7.0-202012082225.p0.



# oc get pods -n test-nfd
NAME                          READY   STATUS             RESTARTS   AGE
nfd-operator-fbc5d5dc-hdvrg   0/1     CrashLoopBackOff   20         82m


# oc describe -n test-nfd pod/nfd-operator-fbc5d5dc-hdvrg
Name:         nfd-operator-fbc5d5dc-hdvrg
Namespace:    test-nfd
Priority:     0
Node:         ip-10-0-186-113.us-east-2.compute.internal/10.0.186.113
Start Time:   Wed, 09 Dec 2020 13:50:15 +0000
Labels:       name=nfd-operator
              pod-template-hash=fbc5d5dc
Annotations:  alm-examples:
                [
                  {
                    "apiVersion": "nfd.openshift.io/v1alpha1",
                    "kind": "NodeFeatureDiscovery",
                    "metadata": {
                      "name": "nfd-master-server"
                    },
                    "spec": {
                      "namespace": "openshift-nfd"
                    }
                  }
                ]
              capabilities: Basic Install
              categories: Database
              certified: false
              containerImage: 
              createdAt: 2019-05-30T00:00:00Z
              description:
                This software enables node feature discovery for Kubernetes. It detects hardware features available on each node in a Kubernetes cluster, ...
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.129.0.78"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.129.0.78"
                    ],
                    "default": true,
                    "dns": {}
                }]
              olm.operatorGroup: test-nfd-7l7ls
              olm.operatorNamespace: test-nfd
              olm.skipRange: >=4.2.0 <4.8.0
              olm.targetNamespaces: test-nfd
              openshift.io/scc: anyuid
              operatorframework.io/properties:
                {"properties":[{"type":"olm.gvk","value":{"group":"nfd.openshift.io","kind":"NodeFeatureDiscovery","version":"v1alpha1"}},{"type":"olm.pac...
              provider: Red Hat
              repository: https://github.com/openshift/cluster-nfd-operator
              support: Red Hat
Status:       Running
IP:           10.129.0.78
IPs:
  IP:           10.129.0.78
Controlled By:  ReplicaSet/nfd-operator-fbc5d5dc
Containers:
  nfd-operator:
    Container ID:  cri-o://a944b5addb6ebd2b3b72bc4aebb7fc747db1d4c2a9c8d961755e02be116fa7d9
    Image:         registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:9561a4097697224b326a7e604d6eec80fcae14c1c3d57c14280e3ecee8f37741
    Image ID:      registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:40f4a5acdccde5f58884d1e319e7fa7ece13c2f66b42671b1667e77096f5619e
    Port:          60000/TCP
    Host Port:     0/TCP
    Command:
      cluster-nfd-operator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 09 Dec 2020 15:11:52 +0000
      Finished:     Wed, 09 Dec 2020 15:11:58 +0000
    Ready:          False
    Restart Count:  20
    Readiness:      exec [stat /tmp/operator-sdk-ready] delay=4s timeout=1s period=10s #success=1 #failure=1
    Environment:
      WATCH_NAMESPACE:                (v1:metadata.annotations['olm.targetNamespaces'])
      POD_NAME:                      nfd-operator-fbc5d5dc-hdvrg (v1:metadata.name)
      OPERATOR_NAME:                 cluster-nfd-operator
      NODE_FEATURE_DISCOVERY_IMAGE:  registry.redhat.io/openshift4/ose-node-feature-discovery@sha256:ebfaae813ecf890ced0a517649c02759454ca82b992e4e07111af464c3b3f30c
    Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nfd-operator-token-qs9wb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  nfd-operator-token-qs9wb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nfd-operator-token-qs9wb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason          Age                    From                                                 Message
  ----     ------          ----                   ----                                                 -------
  Normal   Scheduled       <unknown>                                                                   Successfully assigned test-nfd/nfd-operator-fbc5d5dc-hdvrg to ip-10-0-186-113.us-east-2.compute.internal
  Normal   AddedInterface  84m                    multus                                               Add eth0 [10.129.0.78/23]
  Normal   Pulled          84m                    kubelet, ip-10-0-186-113.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:9561a4097697224b326a7e604d6eec80fcae14c1c3d57c14280e3ecee8f37741" in 16.593631214s
  Normal   Pulled          84m                    kubelet, ip-10-0-186-113.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:9561a4097697224b326a7e604d6eec80fcae14c1c3d57c14280e3ecee8f37741" in 4.07115016s
  Normal   Pulled          83m                    kubelet, ip-10-0-186-113.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:9561a4097697224b326a7e604d6eec80fcae14c1c3d57c14280e3ecee8f37741" in 3.892592921s
  Normal   Started         83m (x4 over 84m)      kubelet, ip-10-0-186-113.us-east-2.compute.internal  Started container nfd-operator
  Normal   Pulled          83m                    kubelet, ip-10-0-186-113.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:9561a4097697224b326a7e604d6eec80fcae14c1c3d57c14280e3ecee8f37741" in 4.400690076s
  Normal   Pulled          82m                    kubelet, ip-10-0-186-113.us-east-2.compute.internal  Successfully pulled image "registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:9561a4097697224b326a7e604d6eec80fcae14c1c3d57c14280e3ecee8f37741" in 4.069757743s
  Normal   Created         82m (x5 over 84m)      kubelet, ip-10-0-186-113.us-east-2.compute.internal  Created container nfd-operator
  Normal   Pulling         24m (x17 over 84m)     kubelet, ip-10-0-186-113.us-east-2.compute.internal  Pulling image "registry.redhat.io/openshift4/ose-cluster-nfd-operator@sha256:9561a4097697224b326a7e604d6eec80fcae14c1c3d57c14280e3ecee8f37741"
  Warning  BackOff         4m26s (x350 over 83m)  kubelet, ip-10-0-186-113.us-east-2.compute.internal  Back-off restarting failed container


# oc logs -n test-nfd nfd-operator-fbc5d5dc-hdvrg
{"level":"info","ts":1607522234.9432366,"logger":"cmd","msg":"Go Version: go1.15.2"}
{"level":"info","ts":1607522234.9432628,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1607522234.9432683,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0+git"}
{"level":"info","ts":1607522234.943933,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1607522235.0183256,"logger":"leader","msg":"Found existing lock with my name. I was likely restarted."}
{"level":"info","ts":1607522235.0199258,"logger":"leader","msg":"Continuing as the leader."}
I1209 13:57:16.070410       1 request.go:645] Throttling request took 1.044642767s, request: GET:https://172.30.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
{"level":"info","ts":1607522237.5251663,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1607522237.5255368,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1607522237.5260978,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1607522237.5262556,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1607522237.5263515,"logger":"controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"error","ts":1607522241.5229845,"logger":"controller-runtime.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":"NodeFeatureDiscovery.nfd.openshift.io","error":"no matches for kind \"NodeFeatureDiscovery\" in version \"nfd.openshift.io/v1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:117\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:143\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:184\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:661"}
{"level":"error","ts":1607522241.5231156,"logger":"cmd","msg":"Manager exited non-zero","error":"no matches for kind \"NodeFeatureDiscovery\" in version \"nfd.openshift.io/v1\"","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132\nmain.main\n\t/go/src/github.com/openshift/cluster-nfd-operator/cmd/manager/main.go:106\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:204"}
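The log above mixes JSON records with plain klog lines, and the two error records both report that kind "NodeFeatureDiscovery" has no match in "nfd.openshift.io/v1", while the alm-examples annotation in the pod description uses "nfd.openshift.io/v1alpha1". A minimal Python sketch for pulling that detail out of a saved log follows. The function names and the trimmed sample are illustrative assumptions, not part of the operator or of `oc`.

```python
import json
import re

# Matches the API-discovery failure text seen in the "error" field of the log records.
NO_MATCH = re.compile(
    r'no matches for kind "(?P<kind>[^"]+)" in version "(?P<version>[^"]+)"'
)


def error_entries(log_text):
    """Return the JSON log records whose level is "error", skipping non-JSON lines."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain klog lines such as the "Throttling request" message
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise malformed record
        if record.get("level") == "error":
            entries.append(record)
    return entries


def missing_kinds(log_text):
    """Return the set of (kind, apiVersion) pairs the operator could not resolve."""
    found = set()
    for record in error_entries(log_text):
        match = NO_MATCH.search(record.get("error", ""))
        if match:
            found.add((match.group("kind"), match.group("version")))
    return found


# Trimmed sample modelled on the log lines above (stack traces omitted).
sample = "\n".join([
    '{"level":"info","ts":1607522237.5260978,"logger":"cmd","msg":"Starting the Cmd."}',
    'I1209 13:57:16.070410       1 request.go:645] Throttling request took 1.044642767s',
    '{"level":"error","ts":1607522241.5231156,"logger":"cmd","msg":"Manager exited non-zero",'
    '"error":"no matches for kind \\"NodeFeatureDiscovery\\" in version \\"nfd.openshift.io/v1\\""}',
])
print(missing_kinds(sample))
```

To use it on a real log, the output of `oc logs -n test-nfd <pod>` could be saved to a file and its contents passed to `missing_kinds`.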


# oc get catalogsource -A
NAMESPACE               NAME                  DISPLAY                TYPE   PUBLISHER      AGE
openshift-marketplace   certified-operators   Certified Operators    grpc   Red Hat        68m
openshift-marketplace   community-operators   Community Operators    grpc   Red Hat        68m
openshift-marketplace   qe-app-registry       Production Operators   grpc   OpenShift QE   11h
openshift-marketplace   redhat-marketplace    Red Hat Marketplace    grpc   Red Hat        68m
openshift-marketplace   redhat-operators      Red Hat Operators      grpc   Red Hat        68m


# oc get packagemanifest -l catalog=qe-app-registry
NAME                                CATALOG                AGE
cluster-kube-descheduler-operator   Production Operators   11h
local-storage-operator              Production Operators   11h
compliance-operator                 Production Operators   11h
ptp-operator                        Production Operators   11h
elasticsearch-operator              Production Operators   11h
amq-streams                         Production Operators   11h
cluster-logging                     Production Operators   11h
nfd                                 Production Operators   11h
metering-ocp                        Production Operators   11h
sriov-network-operator              Production Operators   11h



Version-Release number of selected component (if applicable):
Server Version: 4.7.0-0.nightly-2020-12-04-013308
Kubernetes Version: v1.19.2+ad738ba

How reproducible:
Every time.

Steps to Reproduce:
1. Create an IPI OCP 4.7 cluster on AWS with 3 master and 3 worker nodes.  Use our Flexy automation, which creates a CatalogSource for qe-app-registry to pull the latest NFD images.
2. From the Console, create a new project called "test-nfd".
3. From the Console, go to Operators -> OperatorHub, search for the NFD operator, and install it in the test-nfd namespace just created.
  - Update Channel: 4.7, Approval Strategy: Automatic
4. Wait for the operator status to show Succeeded.
5. `oc get pods -n test-nfd` should show the nfd operator Running.
6. Normally you would create an instance of Node Feature Discovery next, but since the operator failed to deploy, I had to stop here.
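The check in step 5 could be automated along these lines. This is a rough sketch that assumes the default table layout of `oc get pods -n <namespace>` (NAME, READY, STATUS, RESTARTS, AGE columns, as shown in this report). The function name `unhealthy_pods` is hypothetical.

```python
def unhealthy_pods(table_text):
    """Return (name, status) for pods that are not Running with all containers ready.

    Expects the default whitespace-separated table printed by `oc get pods`.
    """
    rows = [line.split() for line in table_text.splitlines() if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    name_i, ready_i, status_i = (header.index(c) for c in ("NAME", "READY", "STATUS"))
    bad = []
    for cols in body:
        ready, total = cols[ready_i].split("/")  # READY is printed as "ready/total"
        if cols[status_i] != "Running" or ready != total:
            bad.append((cols[name_i], cols[status_i]))
    return bad


# Sample taken from the `oc get pods -n test-nfd` output in this report.
sample_table = """\
NAME                          READY   STATUS             RESTARTS   AGE
nfd-operator-fbc5d5dc-hdvrg   0/1     CrashLoopBackOff   20         82m
"""
print(unhealthy_pods(sample_table))
```

An empty result would correspond to the expected outcome of step 5. For this report the function returns the crashing operator pod.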

Actual results:
nfd-operator-fbc5d5dc-hdvrg   0/1     CrashLoopBackOff   20         82m

Expected results:
nfd operator running, and Status on console should show Succeeded

Additional info:

Comment 3 Walid A. 2020-12-18 20:33:22 UTC
The NFD operator can now be deployed successfully in the openshift-nfd namespace from an image built from the master branch of https://github.com/openshift/cluster-nfd-operator.git:

OCP version:  4.7.0-0.nightly-2020-12-14-165231

git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator

ORG=wabouham PULLPOLICY=Always make local-image
ORG=wabouham PULLPOLICY=Always make local-image-push
ORG=wabouham PULLPOLICY=Always make deploy

Waiting on https://bugzilla.redhat.com/show_bug.cgi?id=1908492 to get resolved before we can test deploying it from OperatorHub

Comment 5 errata-xmlrpc 2021-02-24 15:01:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 extras and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5635

