Bug 1949061

Summary: [assisted operator][nmstate] Continuous attempts to reconcile InstallEnv in the case of invalid NMStateConfig
Product: OpenShift Container Platform Reporter: nshidlin <nshidlin>
Component: assisted-installer Assignee: Nir Magnezi <nmagnezi>
assisted-installer sub component: stand-alone QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: alazar, aos-bugs, mfilanov, ohochman, yshnaidm
Version: 4.8 Keywords: Triaged
Target Milestone: --- Flags: nmagnezi: needinfo-
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AI-Team-Hive KNI-EDGE-4.8
Fixed In Version: OCP-Metal-v1.0.21.1 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:00:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
InstallEnv and NMStateConfig CRDS none

Description nshidlin 2021-04-13 11:12:58 UTC
Created attachment 1771602
InstallEnv and NMStateConfig CRDS

Description of problem:
When an InstallEnv references an invalid NMStateConfig, reconciliation of the InstallEnv is attempted continuously, even though the NMStateConfig has not changed.
In this case, InstallEnv reconciliation and ISO generation should be re-attempted only once the NMStateConfig is changed.

Version-Release number of selected component (if applicable):
assisted-service image:
quay.io/ocpmetal/assisted-service@sha256:c65af18f741660660a04e4a3b155c10a6668527bb790de06a9708f6bec17479b

Steps to Reproduce:
1. Create ClusterDeployment
2. Create invalid NMStateConfig
3. Create InstallEnv referencing invalid NMStateConfig

Actual results:
InstallEnv reconcile is continually attempted with no change made to NMStateConfig  

Expected results:
InstallEnv reconcile should only be attempted if the NMStateConfig is changed
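
The expected behavior can be illustrated with a small sketch. It is written in Python purely for illustration and is not the assisted-service implementation; `ReconcileGate` and `config_digest` are hypothetical names, and the idea of comparing a digest of the config is an assumption about one way to detect "no change", not something stated in this report.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def config_digest(config: Dict[str, Any]) -> str:
    """Return a stable SHA-256 digest of a config (hypothetical helper)."""
    # sort_keys makes the digest independent of dict key ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReconcileGate:
    """Skips reconciliation while the last config that failed is unchanged."""

    def __init__(self) -> None:
        self._last_failed_digest: Optional[str] = None

    def should_reconcile(self, config: Dict[str, Any]) -> bool:
        # Reconcile unless this exact config has already been rejected.
        return config_digest(config) != self._last_failed_digest

    def record_failure(self, config: Dict[str, Any]) -> None:
        # Remember the rejected config so it is not retried as-is.
        self._last_failed_digest = config_digest(config)

    def record_success(self) -> None:
        # A successful reconcile clears the remembered failure.
        self._last_failed_digest = None
```

With a gate like this, an invalid config is attempted once, skipped while it stays the same, and attempted again as soon as its content changes.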

Comment 1 Nir Magnezi 2021-05-09 15:26:41 UTC
This bug is a duplicate of https://issues.redhat.com/browse/MGMT-4695

In short, for an invalid nmstate config we get the wrong status code, which makes it hard to determine whether or not we should requeue.
For an invalid config, we would expect HTTP StatusBadRequest (code 400), while we get HTTP StatusInternalServerError (code 500) here.

I have added some debug prints and reproduced this issue here (added prints are marked with 'ZZZ'): https://gist.github.com/nmagnezi/cd4e21691e8c64647bd00d32b0a60b30
Note that we initially get a 500, followed by many 409s for requests that arrived less than 10 seconds apart.
For the latter (code 409), I will try to extend the requeue time to something longer than 10 seconds, though that will fix only part of the issue.
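
The requeue decision discussed in this comment can be sketched as follows. This is a minimal illustration in Python, not the assisted-service code (which is written in Go). `decide_requeue`, `RequeueDecision`, and the 20-second delay are hypothetical names and values chosen for the example, and the sketch assumes the service returns 400 for an invalid config, as the comment says it should.

```python
from dataclasses import dataclass
from http import HTTPStatus
from typing import Optional


@dataclass(frozen=True)
class RequeueDecision:
    requeue: bool
    # Fixed delay before the retry; None means "use the default backoff".
    after_seconds: Optional[float] = None


# Hypothetical value: the comment only says "longer than 10 seconds".
CONFLICT_REQUEUE_SECONDS = 20.0


def decide_requeue(status_code: int) -> RequeueDecision:
    """Map an HTTP status code to a requeue decision (illustrative only)."""
    if status_code == HTTPStatus.BAD_REQUEST:
        # Invalid config: retrying cannot succeed until the user edits it.
        return RequeueDecision(requeue=False)
    if status_code == HTTPStatus.CONFLICT:
        # Request arrived too soon after the previous one: wait, then retry.
        return RequeueDecision(requeue=True, after_seconds=CONFLICT_REQUEUE_SECONDS)
    if status_code >= 500:
        # Server-side error: may be transient, so retry with default backoff.
        return RequeueDecision(requeue=True)
    # Success and any other status: nothing to retry.
    return RequeueDecision(requeue=False)
```

The sketch also shows why the wrong status code matters: if an invalid config is reported as 500 instead of 400, it falls into the retry branch and is requeued indefinitely.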

Yevgeny, any plans for https://issues.redhat.com/browse/MGMT-MGMT-4696 ?

Comment 2 Nir Magnezi 2021-05-09 15:28:39 UTC
Yevgeny, see the question in comment 1.

Comment 3 Nir Magnezi 2021-05-27 06:10:29 UTC
Fix merged to master.

QE Verification:
================

You may verify the fix using the YAMLs referenced in: https://github.com/openshift/assisted-service/pull/1696#issuecomment-848670736

Comment 4 nshidlin 2021-06-02 05:52:11 UTC
Verified:

The infraenv is reconciled twice with the invalid nmstate config, and then reconciled again only when an nmstateconfig matching the label is changed.

quay.io/ocpmetal/assisted-service@sha256:434617dd691c2f5f1a410ffd9866908fc0e9c72e0c3b26ced3d0d8578180fc3a

Comment 7 errata-xmlrpc 2021-07-27 23:00:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438