Bug 1971654

Summary: [4.8.0] InfraEnv controller should always requeue for backend response HTTP StatusConflict (code 409)
Product: OpenShift Container Platform Reporter: Michael Filanov <mfilanov>
Component: assisted-installerAssignee: Nir Magnezi <nmagnezi>
assisted-installer sub component: assisted-service QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, mfilanov, nmagnezi, nshidlin
Version: 4.8Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AI-Team-Hive KNI-EDGE-4.8
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1971640 Environment:
Last Closed: 2021-07-27 23:12:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1971640    
Bug Blocks:    

Description Michael Filanov 2021-06-14 13:56:02 UTC
+++ This bug was initially created as a clone of Bug #1971640 +++

Description of problem:
=======================
InfraEnv controller should always requeue for backend response HTTP StatusConflict (code 409)

The InfraEnv controller is expected to get HTTP StatusConflict (code 409) from the backend for requests sent in a gap smaller than 10 seconds[1].

The backend generates that response without looking at the Generate ISO request content (which may either be valid or not). 
The image should always be on par with the spec. For that, each request must be served by the backend.

Currently, the controller will not requeue in a case of 409, if the conditions indicate an error[2].

The controller should always requeue, and set RequeueAfter to something larger than 10 seconds to avoid repeated 409s.

[1] https://github.com/openshift/assisted-service/blob/52dd09d9b935b35a5a92d146581bd142e3339940/internal/bminventory/inventory.go#L897-L908
[2] https://github.com/openshift/assisted-service/blob/01e814cbe3ddb273197c70b7767bceffe275a87a/internal/controller/controllers/infraenv_controller.go#L344-L346

Comment 1 Nir Magnezi 2021-06-27 08:44:49 UTC
Merged to ocm-2.3 https://github.com/openshift/assisted-service/pull/2073

Comment 5 nshidlin 2021-07-11 08:59:25 UTC
Verified with ACM: 2.3.0-DOWNSTREAM-2021-07-09-20-21-59

Steps to verify:
1. trigger infraenv reconcile by patching the ssh key: 
oc patch infraenvs.agent-install.openshift.io sno-0 --type merge -p '{"spec":{"sshAuthorizedKey":"ssh-rsa <ssh_key_1>"}}'

2. sleep 3 seconds

3. trigger a second infraenv reconcile by patching a different ssh keyoc patch infraenvs.agent-install.openshift.io sno-0 --type merge -p '{"spec":{"sshAuthorizedKey":"ssh-rsa <ssh_key_2>"}}'

The first attempt to patch the second ssh key fails as this is less than 10 seconds since the last request to generate the ISO:
time="2021-07-11T08:28:25Z" level=info msg="InfraEnv Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:81" go-id=3237 infra_env=sno-0 infra_env_namespace=sno-0 request_id=6aa1667d-b300-4fbb-9cc9-d5eead1230af
time="2021-07-11T08:28:25Z" level=info msg="prepare image for cluster daeee389-0377-4bc4-a078-f65a81bd488c" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).GenerateClusterISOInternal" file="/remote-source/assisted-service/app/internal/bminventory/inventory.go:850" go-id=3237 pkg=Inventory request_id=6aa1667d-b300-4fbb-9cc9-d5eead1230af
time="2021-07-11T08:28:25Z" level=error msg="request came too soon after previous request" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).GenerateClusterISOInternal" file="/remote-source/assisted-service/app/internal/bminventory/inventory.go:904" go-id=3237 pkg=Inventory request_id=6aa1667d-b300-4fbb-9cc9-d5eead1230af
time="2021-07-11T08:28:25Z" level=error msg="generate cluster ISO failed" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).GenerateClusterISOInternal.func1" file="/remote-source/assisted-service/app/internal/bminventory/inventory.go:875" go-id=3237 pkg=Inventory request_id=6aa1667d-b300-4fbb-9cc9-d5eead1230af
time="2021-07-11T08:28:25Z" level=info msg="Image sno-0 being prepared for cluster sno-0" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).handleEnsureISOErrors" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:360" go-id=3237 infra_env=sno-0 infra_env_namespace=sno-0 request_id=6aa1667d-b300-4fbb-9cc9-d5eead1230af
time="2021-07-11T08:28:25Z" level=info msg="InfraEnv Reconcile ended" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).Reconcile.func1" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:78" go-id=3237 infra_env=sno-0 infra_env_namespace=sno-0 request_id=6aa1667d-b300-4fbb-9cc9-d5eead1230af

The infraenv reconcile is re-queued ~20 secs later:
time="2021-07-11T08:28:45Z" level=info msg="InfraEnv Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:81" go-id=3237 infra_env=sno-0 infra_env_namespace=sno-0 request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57
time="2021-07-11T08:28:45Z" level=info msg="prepare image for cluster daeee389-0377-4bc4-a078-f65a81bd488c" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).GenerateClusterISOInternal" file="/remote-source/assisted-service/app/internal/bminventory/inventory.go:850" go-id=3237 pkg=Inventory request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57
time="2021-07-11T08:28:45Z" level=info msg="Successfully uploaded file daeee389-0377-4bc4-a078-f65a81bd488c/discovery.ign" func="github.com/openshift/assisted-service/pkg/s3wrapper.(*FSClient).Upload" file="/remote-source/assisted-service/app/pkg/s3wrapper/filesystem.go:76" go-id=3237 request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57
time="2021-07-11T08:28:45Z" level=info msg="Creating minimal ISO for cluster daeee389-0377-4bc4-a078-f65a81bd488c" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).generateClusterMinimalISO.func1" file="/remote-source/assisted-service/app/internal/bminventory/inventory.go:1087" go-id=3237 pkg=Inventory request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57
time="2021-07-11T08:28:45Z" level=info msg="Uploading minimal ISO for cluster daeee389-0377-4bc4-a078-f65a81bd488c" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).generateClusterMinimalISO" file="/remote-source/assisted-service/app/internal/bminventory/inventory.go:1103" go-id=3237 pkg=Inventory request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57
time="2021-07-11T08:28:45Z" level=info msg="Successfully uploaded file discovery-image-daeee389-0377-4bc4-a078-f65a81bd488c.iso" func="github.com/openshift/assisted-service/pkg/s3wrapper.(*FSClient).UploadStream" file="/remote-source/assisted-service/app/pkg/s3wrapper/filesystem.go:159" go-id=3237 request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57
...
time="2021-07-11T08:28:45Z" level=info msg="ISODownloadURL changed from https://assisted-service-rhacm.apps.seal12-cluster-assisted-0.qe.lab.redhat.com/api/assisted-install/v1/clusters/daeee389-0377-4bc4-a078-f65a81bd488c/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiZGFlZWUzODktMDM3Ny00YmM0LWEwNzgtZjY1YTgxYmQ0ODhjIn0.GWj9Fa5clVTS7oh5NxSPVgKYg1sdtNR6eliqhz4cO4WMMYxQDcbRknbr5Kvc2QXiyCP3DFh8lHTM28MZiGBYVw to https://assisted-service-rhacm.apps.seal12-cluster-assisted-0.qe.lab.redhat.com/api/assisted-install/v1/clusters/daeee389-0377-4bc4-a078-f65a81bd488c/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiZGFlZWUzODktMDM3Ny00YmM0LWEwNzgtZjY1YTgxYmQ0ODhjIn0.WeRl73aq14ocUbhsohBunrYdEP2mEqSQz7jcL-lKKrB0YzCZ5-6EKX0eJZr9UcuWRXU6KQuI1H-hVPZie7gpYg" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).updateEnsureISOSuccess" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:328" go-id=3237 infra_env=sno-0 infra_env_namespace=sno-0 request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57
time="2021-07-11T08:28:45Z" level=info msg="InfraEnv Reconcile ended" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).Reconcile.func1" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:78" go-id=3237 infra_env=sno-0 infra_env_namespace=sno-0 request_id=ed5e60b3-edc5-4eea-83da-fe03c1213f57

Comment 8 errata-xmlrpc 2021-07-27 23:12:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438