Bug 2009530

Summary: Deployment upgrade is failing availability check
Product: OpenShift Container Platform
Component: Special Resource Operator
Version: 4.9
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: unspecified
Status: CLOSED ERRATA
Reporter: dagray
Assignee: Pablo Acevedo <pacevedo>
QA Contact: Walid A. <wabouham>
CC: aos-bugs, dagray, pacevedo, wabouham
Clone Of: 2009424
Last Closed: 2021-10-18 17:51:57 UTC
Bug Depends On: 2009424    

Description dagray 2021-09-30 21:21:06 UTC
+++ This bug was initially created as a clone of Bug #2009424 +++

Description of problem:

Github issue: https://github.com/openshift-psap/special-resource-operator/issues/94

When the image in a Deployment is updated, a new ReplicaSet is created for it. The resource availability check for the Deployment then gets stuck, because it keeps referring to the old ReplicaSet's available replica count.

NAME                                             DESIRED   CURRENT   READY   AGE
infoscale-vtas-licensing-controller-68649d5564   0         0         0       22h
infoscale-vtas-licensing-controller-78c4c47dfd   1         1         1       117m

2021-08-24T12:00:04.553Z INFO wait Checking ReplicaSet {"name": "infoscale-vtas-licensing-controller-68649d5564"}
2021-08-24T12:00:04.553Z INFO wait Waiting for availability of {"Kind": "Deployment: infoscale-vtas/infoscale-vtas-licensing-controller"}

2021-08-24T12:00:04.578Z INFO infoscale-vtas RECONCILE REQUEUE: Could not reconcile chart {"error": "Cannot reconcile hardware states: Failed to create state: templates/1000-license-container.yaml: After CRUD hooks failed: Could not wait for resource: Waiting too long for resource: timed out waiting for the condition"}
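
The stall above comes from the wait logic evaluating the old, scaled-down ReplicaSet (the ...-68649d5564 one) rather than the ReplicaSet created for the new revision. A minimal sketch of one way to select the right ReplicaSet, assuming controller-runtime and the standard deployment.kubernetes.io/revision annotation; the package and function names are made up here, and this is not necessarily the fix that was merged:

package wait // hypothetical package, for illustration only

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// currentReplicaSet returns the ReplicaSet backing the Deployment's latest
// revision. Both objects carry the deployment.kubernetes.io/revision
// annotation, which the Deployment controller keeps in sync, so older,
// scaled-down ReplicaSets (like ...-68649d5564 above) are skipped.
func currentReplicaSet(ctx context.Context, c client.Client, d *appsv1.Deployment) (*appsv1.ReplicaSet, error) {
	var list appsv1.ReplicaSetList
	if err := c.List(ctx, &list, client.InNamespace(d.Namespace)); err != nil {
		return nil, err
	}
	want := d.Annotations["deployment.kubernetes.io/revision"]
	for i := range list.Items {
		rs := &list.Items[i]
		if metav1.IsControlledBy(rs, d) && rs.Annotations["deployment.kubernetes.io/revision"] == want {
			return rs, nil
		}
	}
	return nil, nil // new ReplicaSet not created yet; caller should retry
}

// replicaSetAvailable reports whether that ReplicaSet has all the replicas
// its Deployment asks for.
func replicaSetAvailable(rs *appsv1.ReplicaSet, want int32) bool {
	return rs != nil && rs.Status.AvailableReplicas >= want
}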


Version-Release number of selected component (if applicable): 4.9


How reproducible:


Steps to Reproduce:
1. Create a CR that produces a Deployment.
2. Update the CR to use an upgraded Helm chart, for example one that changes the image (the effect of that change is sketched below).
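
As an aside, the trigger in step 2 is simply a pod template change: once the image differs, the Deployment controller rolls out a new ReplicaSet and scales the old one down to 0. A rough controller-runtime sketch of that trigger (bumpImage and its arguments are made up for this illustration; the verification in comment 2 uses oc edit instead):

package main // illustrative only

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// bumpImage updates the first container's image in a Deployment. Any change
// to the pod template has the same effect as an upgraded chart: a new
// ReplicaSet is created and the old one is scaled down to 0.
func bumpImage(ctx context.Context, c client.Client, key types.NamespacedName, image string) error {
	var d appsv1.Deployment
	if err := c.Get(ctx, key, &d); err != nil {
		return err
	}
	d.Spec.Template.Spec.Containers[0].Image = image
	return c.Update(ctx, &d)
}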

Actual results:

The CR is deployed but is never fully reconciled. The operator keeps looping on:
2021-08-24T12:00:04.578Z INFO infoscale-vtas RECONCILE REQUEUE: Could not reconcile chart {"error": "Cannot reconcile hardware states: Failed to create state: templates/1000-license-container.yaml: After CRUD hooks failed: Could not wait for resource: Waiting too long for resource: timed out waiting for the condition"}


Expected results:
Reconcile loop finishes gracefully.
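
Put differently, the wait should converge once the rolled-out Deployment reports itself available. A minimal sketch of such a check, assuming it can rely on the Deployment's own status rather than on any particular ReplicaSet (again an illustration, not necessarily the merged fix):

package main // illustrative only

import appsv1 "k8s.io/api/apps/v1"

// deploymentAvailable reports whether a Deployment's rollout has completed:
// the controller has observed the latest spec, every replica belongs to the
// updated ReplicaSet, and all of them are available.
func deploymentAvailable(d *appsv1.Deployment) bool {
	want := int32(1) // Kubernetes default when spec.replicas is unset
	if d.Spec.Replicas != nil {
		want = *d.Spec.Replicas
	}
	return d.Status.ObservedGeneration >= d.Generation &&
		d.Status.UpdatedReplicas == want &&
		d.Status.AvailableReplicas == want &&
		d.Status.Replicas == want
}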


Additional info:

Comment 2 Walid A. 2021-10-07 17:49:46 UTC
Verified on OCP 4.9 rc5 with the latest SRO image from the release-4.9 GitHub branch (https://github.com/openshift/special-resource-operator.git).

Deployed SRO from the GitHub repo:

# TAG=release-4.9 make deploy
# oc get pods -n openshift-special-resource-operator
# VERSION=0.0.1 REPO=example SPECIALRESOURCE=ping-pong make
# oc get pods -n ping-pong
NAME                                READY   STATUS    RESTARTS   AGE
ping-pong-client-7fd9cc6848-vbbq9   1/1     Running   0          11m
ping-pong-server-7b8b5c98c4-2wxpb   1/1     Running   0          11m

# oc get deployment -n ping-pong
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
ping-pong-client   1/1     1            1           12m
ping-pong-server   1/1     1            1           12m

## Change image for client and server in each deployment
# oc edit deployment -n ping-pong ping-pong-client
deployment.apps/ping-pong-client edited

# oc edit deployment -n ping-pong ping-pong-server
deployment.apps/ping-pong-server edited

## Pods recreated with the new image
# oc get pods -n ping-pong
NAME                                READY   STATUS    RESTARTS      AGE
ping-pong-client-c5bff68d7-tmnt2    1/1     Running   2 (25s ago)   65s
ping-pong-server-7cbd48d69c-z9rph   1/1     Running   0             34s



## check SRO manager logs for reconcile success:
# oc logs -n openshift-special-resource-operator special-resource-controller-manager-7b4898899d-kcbxr -c manager | grep RECONCILE
2021-10-07T00:36:35.975Z	INFO	status  	RECONCILE SUCCESS: Reconcile
2021-10-07T00:37:23.187Z	INFO	cert-manager  	RECONCILE REQUEUE: Dependency creation failed 	{"error": "Created new SpecialResource we need to Reconcile"}
2021-10-07T00:38:20.372Z	INFO	cert-manager  	RECONCILE REQUEUE: Could not reconcile chart	{"error": "Cannot reconcile hardware states: failed post-install: hook execution failed cert-manager-startupapicheck cert-manager/templates/startupapicheck-job.yaml: After CRUD hooks failed: Could not wait for resource: Waiting too long for resource: timed out waiting for the condition"}
2021-10-07T00:40:14.893Z	INFO	ping-pong  	RECONCILE SUCCESS: All resources done
2021-10-07T00:40:14.905Z	INFO	status  	RECONCILE SUCCESS: Reconcile
2021-10-07T00:40:28.757Z	INFO	cert-manager  	RECONCILE SUCCESS: All resources done
2021-10-07T00:40:28.771Z	INFO	status  	RECONCILE SUCCESS: Reconcile
2021-10-07T00:41:51.989Z	INFO	ping-pong  	RECONCILE SUCCESS: All resources done
2021-10-07T00:41:52.002Z	INFO	status  	RECONCILE SUCCESS: Reconcile
2021-10-07T00:57:48.497Z	INFO	ping-pong  	RECONCILE SUCCESS: All resources done
2021-10-07T00:57:48.509Z	INFO	status  	RECONCILE SUCCESS: Reconcile
2021-10-07T00:59:12.105Z	INFO	ping-pong  	RECONCILE SUCCESS: All resources done
2021-10-07T00:59:12.116Z	INFO	status  	RECONCILE SUCCESS: Reconcile

Comment 5 errata-xmlrpc 2021-10-18 17:51:57 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759