Description of problem:

When attempting to update the NFD Operator, it remains in the Replacing phase and never completes: the previous version continues to run while the new version fails to start, causing the update to fail.

$ oc get csv
NAME                            DISPLAY                  VERSION   REPLACES                    PHASE
gpu-operator-certified.v1.3.1   NVIDIA GPU Operator      1.3.1                                 Succeeded
nfd.4.5.0-202012050338.p0       Node Feature Discovery   4.5.0                                 Replacing
nfd.4.5.0-202101090338.p0       Node Feature Discovery   4.5.0     nfd.4.5.0-202012050338.p0   Failed

Checking the logs of the NFD Operator, we can see the following:

$ tail nfd-operator-85fc65c8d-5x48v.log
{"level":"info","ts":1611920855.1498537,"logger":"controller_nodefeaturediscovery","msg":"Looking for","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1499653,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.159881,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ServiceAccount":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1599567,"logger":"controller_nodefeaturediscovery","msg":"Found, skpping update","ServiceAccount":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1599681,"logger":"controller_nodefeaturediscovery","msg":"Looking for","Role":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1599805,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Role":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1653535,"logger":"controller_nodefeaturediscovery","msg":"Looking for","RoleBinding":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1653917,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","RoleBinding":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1744335,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ConfigMap":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1744826,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ConfigMap":"nfd-worker","Namespace":"openshift-operators"}

$ tail nfd-operator-f4c5c845b-tf97w.log
{"level":"info","ts":1611920705.7186546,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920724.9109416,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920741.9398313,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920759.353784,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920778.2364805,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920796.8049188,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920812.8300483,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920831.6205456,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920850.1580396,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920868.707701,"logger":"leader","msg":"Not the leader. Waiting."}

Pod nfd-operator-f4c5c845b-tf97w belongs to nfd.4.5.0-202101090338.p0, and pod nfd-operator-85fc65c8d-5x48v to nfd.4.5.0-202012050338.p0.

Based on additional data (attached), it seems that nfd-operator-f4c5c845b-tf97w cannot become ready because nfd-operator-85fc65c8d-5x48v is still running and holding the leader lock. But as far as I understand, nfd-operator-85fc65c8d-5x48v is only stopped once nfd-operator-f4c5c845b-tf97w is ready, to allow a smooth transition. Is this expected, and if so, what action is required to fix it? If it is not expected, how can we resolve it and prevent it from happening again?
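The "Not the leader. Waiting." messages suggest a "leader-for-life" style election, where the lock is released only when the pod holding it is deleted. If that is the case here, the deadlock can be illustrated with a minimal, hypothetical Python sketch (the names `Pod` and `try_become_leader` are illustrative only, not operator-sdk APIs):

```python
# Hypothetical sketch of "leader-for-life" election, to illustrate why
# the update can get stuck. This is NOT the operator's actual code.

class Pod:
    def __init__(self, name):
        self.name = name
        self.ready = False    # readiness probe result
        self.deleted = False  # whether the pod has been torn down

def try_become_leader(lock, pod):
    """The lock is released only when its holder is deleted ("leader for
    life"); a waiting pod cannot acquire it while the holder still exists."""
    if lock["holder"] is None or lock["holder"].deleted:
        lock["holder"] = pod
        pod.ready = True   # the pod turns Ready once it is the leader
        return True
    return False           # corresponds to logging "Not the leader. Waiting."

# Old CSV's pod holds the lock; OLM starts the new pod alongside it.
old = Pod("nfd-operator-85fc65c8d-5x48v")
new = Pod("nfd-operator-f4c5c845b-tf97w")
lock = {"holder": old}
old.ready = True

# The new pod retries forever: the old pod is still running, so it keeps
# waiting, while OLM only deletes the old pod once the new one is Ready --
# which it never becomes. Hence the CSV stuck in the Replacing phase.
assert try_become_leader(lock, new) is False
assert new.ready is False
```

Under this model, deleting the old pod (releasing the lock) would let the new pod become leader and turn Ready, which matches the manual-intervention question raised above.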
Version-Release number of selected component (if applicable):
- nfd.4.5.0-202101090338.p0

How reproducible:
- N/A

Steps to Reproduce:
1. Update from nfd.4.5.0-202012050338.p0 to nfd.4.5.0-202101090338.p0 using OLM (with the subscription set to automatic updates)

Actual results:
$ oc get csv
NAME                            DISPLAY                  VERSION   REPLACES                    PHASE
gpu-operator-certified.v1.3.1   NVIDIA GPU Operator      1.3.1                                 Succeeded
nfd.4.5.0-202012050338.p0       Node Feature Discovery   4.5.0                                 Replacing
nfd.4.5.0-202101090338.p0       Node Feature Discovery   4.5.0     nfd.4.5.0-202012050338.p0   Failed

Expected results:
The update succeeds and nfd.4.5.0-202101090338.p0 becomes healthy, replacing nfd.4.5.0-202012050338.p0.

Additional info:
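For reference, an update driven by OLM with automatic approval would come from a Subscription along these lines. This is a hedged sketch: the metadata name, channel, and source values are assumptions, not taken from the affected cluster (the namespace matches the one in the logs).

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd                          # assumed subscription name
  namespace: openshift-operators
spec:
  channel: "4.5"                     # assumed channel
  name: nfd
  source: redhat-operators           # assumed CatalogSource
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic     # OLM applies updates without manual approval
```

With installPlanApproval set to Automatic, OLM creates and approves the InstallPlan for nfd.4.5.0-202101090338.p0 on its own, which is how the cluster ended up mid-replacement without operator intervention.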
Verified that we can update the subscription channel from 4.5 to 4.6, to 4.7, with this catalogsource:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: my-test-catalog
  namespace: openshift-nfd
spec:
  sourceType: grpc
  pullPolicy: Always
  image: quay.io/eduardoarango/catalog:nfd47

For the 4.7 channel, we had to manually delete the NodeFeatureDiscovery instance on 4.6 and recreate it for the 4.7 subscription.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 extras and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5635
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.