Bug 1923998 - NFD Operator is failing to update and remains in Replacing state
Summary: NFD Operator is failing to update and remains in Replacing state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Feature Discovery Operator
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Carlos Eduardo Arango Gutierrez
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks: 1924232
Reported: 2021-02-02 12:08 UTC by Simon Reber
Modified: 2024-06-14 00:07 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1924232
Environment:
Last Closed: 2021-02-24 15:01:39 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-nfd-operator pull 130 | 0 | None | open | [release-4.5] Bug 1924236: Remove readines probe from old operator sdk | 2021-02-08 16:23:27 UTC
Github | openshift cluster-nfd-operator pull 132 | 0 | None | closed | Bug 1923998: Remove readines probe from old operator sdk | 2021-02-08 16:23:27 UTC
Github | openshift cluster-nfd-operator pull 134 | 0 | None | closed | Bug 1923998: Remove readines probe from CSV | 2021-02-09 12:53:10 UTC
Red Hat Knowledge Base (Solution) | 5770341 | 0 | None | None | None | 2021-02-03 09:16:17 UTC
Red Hat Product Errata | RHSA-2020:5635 | 0 | None | None | None | 2021-02-24 15:02:57 UTC

Description Simon Reber 2021-02-02 12:08:49 UTC
Description of problem:

When attempting to update the NFD Operator, the update remains in the Replacing state and never finishes. The previous version continues to run while the new version fails to start, so the update fails.

$ oc get csv
NAME                                           DISPLAY                                VERSION                 REPLACES                                       PHASE
gpu-operator-certified.v1.3.1                  NVIDIA GPU Operator                    1.3.1                                                                  Succeeded
nfd.4.5.0-202012050338.p0                      Node Feature Discovery                 4.5.0                                                                  Replacing
nfd.4.5.0-202101090338.p0                      Node Feature Discovery                 4.5.0                   nfd.4.5.0-202012050338.p0                      Failed
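
A minimal Python sketch for spotting this state from the text output above. It reads only the first and last whitespace-separated columns (NAME and PHASE). The function name stuck_csvs is hypothetical, and a CSV that is briefly in Replacing during a healthy update would also be listed, so treat the result as a hint rather than a diagnosis.

```python
from typing import List, Tuple

# Phases taken from the output in this report; not an exhaustive list.
SUSPECT_PHASES = {"Replacing", "Failed"}


def stuck_csvs(table: str) -> List[Tuple[str, str]]:
    """Return (name, phase) pairs for rows whose PHASE looks unhealthy.

    `table` is the plain-text output of `oc get csv`, header included.
    """
    result = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[-1] in SUSPECT_PHASES:
            result.append((fields[0], fields[-1]))
    return result


SAMPLE = """\
NAME                           DISPLAY                 VERSION  REPLACES                   PHASE
gpu-operator-certified.v1.3.1  NVIDIA GPU Operator     1.3.1                               Succeeded
nfd.4.5.0-202012050338.p0      Node Feature Discovery  4.5.0                               Replacing
nfd.4.5.0-202101090338.p0      Node Feature Discovery  4.5.0    nfd.4.5.0-202012050338.p0  Failed
"""

if __name__ == "__main__":
    for name, phase in stuck_csvs(SAMPLE):
        print(f"{name}: {phase}")
```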

Checking logs for the NFD Operator we can see the following:

$ tail nfd-operator-85fc65c8d-5x48v.log
{"level":"info","ts":1611920855.1498537,"logger":"controller_nodefeaturediscovery","msg":"Looking for","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1499653,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.159881,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ServiceAccount":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1599567,"logger":"controller_nodefeaturediscovery","msg":"Found, skpping update","ServiceAccount":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1599681,"logger":"controller_nodefeaturediscovery","msg":"Looking for","Role":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1599805,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Role":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1653535,"logger":"controller_nodefeaturediscovery","msg":"Looking for","RoleBinding":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1653917,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","RoleBinding":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1744335,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ConfigMap":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1611920855.1744826,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ConfigMap":"nfd-worker","Namespace":"openshift-operators"}

$ tail nfd-operator-f4c5c845b-tf97w.log
{"level":"info","ts":1611920705.7186546,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920724.9109416,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920741.9398313,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920759.353784,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920778.2364805,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920796.8049188,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920812.8300483,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920831.6205456,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920850.1580396,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1611920868.707701,"logger":"leader","msg":"Not the leader. Waiting."}

Pod nfd-operator-f4c5c845b-tf97w belongs to nfd.4.5.0-202101090338.p0 and pod nfd-operator-85fc65c8d-5x48v to nfd.4.5.0-202012050338.p0.

Based on additional data (attached), it seems that nfd-operator-f4c5c845b-tf97w is unable to become ready because nfd-operator-85fc65c8d-5x48v is still running and holding the lock. But from my understanding, nfd-operator-85fc65c8d-5x48v can only stop once nfd-operator-f4c5c845b-tf97w is ready, to allow a smooth transition.
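
The circular wait suspected above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the operator's actual code. It assumes that the old pod is stopped only after the new pod reports ready, that the lock is released only when its holder stops, and that readiness is tied to holding the lock. The Pod class, the rollout function and the readiness_gated_on_lock flag are hypothetical names. The ungated variant reflects the direction suggested by the linked pull request titles (removing the readiness probe).

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    running: bool = True
    holds_lock: bool = False
    ready: bool = False


def rollout(old: Pod, new: Pod, readiness_gated_on_lock: bool,
            max_steps: int = 10) -> bool:
    """Return True if the new pod ends up as leader with the old pod stopped."""
    for _ in range(max_steps):
        # Assumption: the lock is only released once its holder has stopped.
        if old.holds_lock and not old.running:
            old.holds_lock = False
        if not old.holds_lock and not new.holds_lock:
            new.holds_lock = True

        # Readiness is either tied to leadership or just "the process is up".
        new.ready = new.holds_lock if readiness_gated_on_lock else new.running

        # Assumption: the old pod is stopped only after the new one is ready.
        if new.ready and old.running:
            old.running = False

        if new.holds_lock and not old.running:
            return True
    return False  # no progress within max_steps: the rollout is stuck


if __name__ == "__main__":
    gated = rollout(Pod("old", holds_lock=True), Pod("new"), True)
    ungated = rollout(Pod("old", holds_lock=True), Pod("new"), False)
    print("readiness gated on lock, rollout completes:", gated)
    print("readiness not gated, rollout completes:", ungated)
```

With readiness gated on the lock, neither pod can make the first move and the loop gives up. Without the gate, the new pod reports ready, the old pod is stopped, the lock is freed, and the new pod takes over.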

Is this expected, and if so, what action is required to fix it? If it is not expected, how can we solve it and prevent it from happening again?

Version-Release number of selected component (if applicable):

 - nfd.4.5.0-202101090338.p0

How reproducible:

 - N/A

Steps to Reproduce:
1. Update from nfd.4.5.0-202012050338.p0 to nfd.4.5.0-202101090338.p0 using OLM (with subscription set to automatic updates)

Actual results:

$ oc get csv
NAME                                           DISPLAY                                VERSION                 REPLACES                                       PHASE
gpu-operator-certified.v1.3.1                  NVIDIA GPU Operator                    1.3.1                                                                  Succeeded
nfd.4.5.0-202012050338.p0                      Node Feature Discovery                 4.5.0                                                                  Replacing
nfd.4.5.0-202101090338.p0                      Node Feature Discovery                 4.5.0                   nfd.4.5.0-202012050338.p0                      Failed

Expected results:

The update completes, and nfd.4.5.0-202101090338.p0 runs successfully and replaces nfd.4.5.0-202012050338.p0.

Additional info:

Comment 4 Walid A. 2021-02-09 19:50:11 UTC
Verified that we can update the subscription channel from 4.5 to 4.6 and then to 4.7 with the following CatalogSource:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: my-test-catalog
  namespace: openshift-nfd
spec:
  sourceType: grpc
  pullPolicy: Always
  image: quay.io/eduardoarango/catalog:nfd47

For the 4.7 channel, we had to manually delete the NodeFeatureDiscovery instance created on 4.6 and recreate it for the 4.7 subscription.

Comment 9 errata-xmlrpc 2021-02-24 15:01:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 extras and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5635

Comment 10 Red Hat Bugzilla 2023-09-15 01:00:18 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

