Bug 2009424

Summary: Deployment upgrade is failing availability check
Product: OpenShift Container Platform
Component: Special Resource Operator
Version: 4.9
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Pablo Acevedo <pacevedo>
Assignee: Pablo Acevedo <pacevedo>
QA Contact: Walid A. <wabouham>
CC: aos-bugs, bthurber, dagray
Clones: 2009530
Last Closed: 2022-03-10 16:14:44 UTC
Type: Bug
Bug Blocks: 2009530    

Description Pablo Acevedo 2021-09-30 15:46:12 UTC
Description of problem:

Github issue: https://github.com/openshift-psap/special-resource-operator/issues/94

When the image in a Deployment is updated, a new ReplicaSet is created for it. While checking resource availability for the Deployment, the operator gets stuck because it keeps referring to the old ReplicaSet's available replica count.

infoscale-vtas-licensing-controller-68649d5564 0 0 0 22h
infoscale-vtas-licensing-controller-78c4c47dfd 1 1 1 117m

2021-08-24T12:00:04.553Z INFO wait Checking ReplicaSet {"name": "infoscale-vtas-licensing-controller-68649d5564"}
2021-08-24T12:00:04.553Z INFO wait Waiting for availability of {"Kind": "Deployment: infoscale-vtas/infoscale-vtas-licensing-controller"}

2021-08-24T12:00:04.578Z INFO infoscale-vtas RECONCILE REQUEUE: Could not reconcile chart {"error": "Cannot reconcile hardware states: Failed to create state: templates/1000-license-container.yaml: After CRUD hooks failed: Could not wait for resource: Waiting too long for resource: timed out waiting for the condition"}
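To make the failure mode concrete, here is a minimal Go sketch of this kind of availability check; the function name deploymentAvailable, its inputs, and the log wording are assumptions for illustration, not the operator's actual code. The key point is that a ReplicaSet already scaled to zero (the old generation left behind by the image update) must be skipped rather than waited on, otherwise the check can never succeed:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// deploymentAvailable reports whether the Deployment can be considered
// available. ReplicaSets scaled down to zero (the pre-update generation)
// are skipped instead of being waited on forever, which is the hang
// described in this bug.
func deploymentAvailable(dep *appsv1.Deployment, owned []appsv1.ReplicaSet) bool {
	for _, rs := range owned {
		if rs.Spec.Replicas != nil && *rs.Spec.Replicas == 0 {
			fmt.Printf("ReplicaSet scheduled for termination: %s\n", rs.Name)
			continue
		}
		if rs.Status.Replicas > 0 && rs.Status.AvailableReplicas == rs.Status.Replicas {
			fmt.Printf("Resource available: Deployment %s/%s\n", dep.Namespace, dep.Name)
			return true
		}
	}
	return false
}

func main() {
	dep := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{
		Namespace: "infoscale-vtas", Name: "infoscale-vtas-licensing-controller"}}
	// Sample data mirroring the ReplicaSet listing above: the old set is
	// scaled to zero, the new one has all replicas available.
	oldRS := appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{Name: "infoscale-vtas-licensing-controller-68649d5564"},
		Spec:       appsv1.ReplicaSetSpec{Replicas: int32Ptr(0)},
	}
	newRS := appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{Name: "infoscale-vtas-licensing-controller-78c4c47dfd"},
		Spec:       appsv1.ReplicaSetSpec{Replicas: int32Ptr(1)},
		Status:     appsv1.ReplicaSetStatus{Replicas: 1, AvailableReplicas: 1},
	}
	fmt.Println(deploymentAvailable(dep, []appsv1.ReplicaSet{oldRS, newRS}))
}

With the two ReplicaSets from the listing above, waiting on infoscale-vtas-licensing-controller-68649d5564 (0 available) blocks indefinitely, while judging availability by infoscale-vtas-licensing-controller-78c4c47dfd lets the wait complete and the reconcile loop finish.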


Version-Release number of selected component (if applicable): 4.9


How reproducible:


Steps to Reproduce:
1. Create a CR that produces a Deployment.
2. Update the CR to use an upgraded Helm chart, for example one that changes the image.

Actual results:

The CR is deployed but never fully reconciled. It loops indefinitely on:
2021-08-24T12:00:04.578Z INFO infoscale-vtas RECONCILE REQUEUE: Could not reconcile chart {"error": "Cannot reconcile hardware states: Failed to create state: templates/1000-license-container.yaml: After CRUD hooks failed: Could not wait for resource: Waiting too long for resource: timed out waiting for the condition"}


Expected results:
Reconcile loop finishes gracefully.


Additional info:

Comment 2 Walid A. 2021-10-08 00:35:04 UTC
Verified on OCP 4.10.0-0.nightly-2021-10-06-093151

1.  git clone master branch of: https://github.com/openshift/special-resource-operator.git
2.  cd special-resource-operator
3.  untar `ping-pong-0.0.2.tgz`, which contains a new chart version, in the charts/example dir
4.  Build local image of SRO:
    make local-image-build local-image-push deploy
5.  tag and push new SRO image to your local quay.io account
6.  IMAGE=quay.io/<your_accountname>/special-resource-operator:master make deploy
7.  oc apply -f charts/example/ping-pong-0.0.1
8.  oc get all -n ping-pong
NAME                                    READY   STATUS    RESTARTS   AGE
pod/ping-pong-client-7fd9cc6848-m6r5b   1/1     Running   0          7m1s
pod/ping-pong-server-7b8b5c98c4-jr2qn   1/1     Running   0          7m12s

NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/ping-pong-service   ClusterIP   172.30.45.234   <none>        12021/TCP   7m12s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ping-pong-client   1/1     1            1           7m1s
deployment.apps/ping-pong-server   1/1     1            1           7m12s

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/ping-pong-client-7fd9cc6848   1         1         1       7m1s
replicaset.apps/ping-pong-server-7b8b5c98c4   1         1         1       7m12s

9.  oc apply -f charts/example/ping-pong-0.0.2
This will cause the ping-pong pods to redeploy

10. oc logs -n openshift-special-resource-operator special-resource-controller-manager-7867485ccd-lzh75 -c manager
.
.
.
2021-10-07T21:21:08.103Z	INFO	ping-pong  	NODE	{"Setting Label ": "specialresource.openshift.io/state-ping-pong-0004", "on ": "ip-10-0-153-25.us-east-2.compute.internal"}
2021-10-07T21:21:08.112Z	INFO	ping-pong  	NODE	{"Setting Label ": "specialresource.openshift.io/state-ping-pong-0004", "on ": "ip-10-0-178-128.us-east-2.compute.internal"}
2021-10-07T21:21:08.121Z	INFO	ping-pong  	NODE	{"Setting Label ": "specialresource.openshift.io/state-ping-pong-0004", "on ": "ip-10-0-192-153.us-east-2.compute.internal"}
2021-10-07T21:21:08.121Z	INFO	ping-pong  	Executing	{"State": "templates/0005_client.yaml"}
2021-10-07T21:21:09.520Z	INFO	warning  	OnError: release: already exists  
2021-10-07T21:21:09.520Z	INFO	helmer  	Release pre-install hooks
2021-10-07T21:21:09.525Z	INFO	helmer  	Hooks	{"pre-install": "Ready (Get)"}
2021-10-07T21:21:09.525Z	INFO	helmer  	Release manifests
2021-10-07T21:21:09.530Z	INFO	resource  	Namespace empty settting	{"namespace": "ping-pong"}
2021-10-07T21:21:09.535Z	INFO	resource  	Found, not updating, hash the same: Deployment/ping-pong-client	{"Kind": "Deployment: ping-pong/ping-pong-client"}
2021-10-07T21:21:09.535Z	INFO	resource  	specialresource.openshift.io/wait
2021-10-07T21:21:09.535Z	INFO	wait  	ForResource	{"Kind": "Deployment"}
2021-10-07T21:21:19.552Z	INFO	wait  	Checking ReplicaSet	{"name": "ping-pong-client-7fd9cc6848"}
2021-10-07T21:21:19.552Z	INFO	wait  	ReplicaSet scheduled for termination	{"name": "ping-pong-client-7fd9cc6848"}
2021-10-07T21:21:19.552Z	INFO	wait  	Checking ReplicaSet	{"name": "ping-pong-client-c5bff68d7"}
2021-10-07T21:21:19.552Z	INFO	wait  	Status	{"AvailableReplicas": 1, "Replicas": 1}
2021-10-07T21:21:19.552Z	INFO	wait  	Checking ReplicaSet	{"name": "ping-pong-server-7b8b5c98c4"}
2021-10-07T21:21:19.552Z	INFO	wait  	ReplicaSet scheduled for termination	{"name": "ping-pong-server-7b8b5c98c4"}
2021-10-07T21:21:19.552Z	INFO	wait  	Checking ReplicaSet	{"name": "ping-pong-server-7cbd48d69c"}
2021-10-07T21:21:19.552Z	INFO	wait  	Status	{"AvailableReplicas": 1, "Replicas": 1}
2021-10-07T21:21:19.552Z	INFO	wait  	Resource available 	{"Kind": "Deployment: ping-pong/ping-pong-client"}
2021-10-07T21:21:19.552Z	INFO	helmer  	Release post-install hooks
2021-10-07T21:21:19.556Z	INFO	helmer  	Hooks	{"post-install": "Ready (Get)"}
2021-10-07T21:21:19.587Z	INFO	cache  	Nodes cached	{"name": "ip-10-0-153-25.us-east-2.compute.internal"}
2021-10-07T21:21:19.587Z	INFO	cache  	Nodes cached	{"name": "ip-10-0-178-128.us-east-2.compute.internal"}
2021-10-07T21:21:19.587Z	INFO	cache  	Nodes cached	{"name": "ip-10-0-192-153.us-east-2.compute.internal"}
2021-10-07T21:21:19.587Z	INFO	cache  	Node list:	{"length": 3}
2021-10-07T21:21:19.587Z	INFO	cache  	Nodes	{"num": 3}
2021-10-07T21:21:19.596Z	INFO	ping-pong  	NODE	{"Setting Label ": "specialresource.openshift.io/state-ping-pong-0005", "on ": "ip-10-0-153-25.us-east-2.compute.internal"}
2021-10-07T21:21:19.607Z	INFO	ping-pong  	NODE	{"Setting Label ": "specialresource.openshift.io/state-ping-pong-0005", "on ": "ip-10-0-178-128.us-east-2.compute.internal"}
2021-10-07T21:21:19.616Z	INFO	ping-pong  	NODE	{"Setting Label ": "specialresource.openshift.io/state-ping-pong-0005", "on ": "ip-10-0-192-153.us-east-2.compute.internal"}
2021-10-07T21:21:20.707Z	INFO	warning  	OnError: release: already exists  
2021-10-07T21:21:20.707Z	INFO	helmer  	Release pre-install hooks
2021-10-07T21:21:20.711Z	INFO	helmer  	Hooks	{"pre-install": "Ready (Get)"}
2021-10-07T21:21:20.711Z	INFO	helmer  	Release manifests
2021-10-07T21:21:20.716Z	INFO	helmer  	Release post-install hooks
2021-10-07T21:21:20.725Z	INFO	helmer  	Hooks	{"post-install": "Ready (Get)"}
2021-10-07T21:21:20.734Z	INFO	ping-pong  	RECONCILE SUCCESS: All resources done
2021-10-07T21:21:20.739Z	INFO	status  	Reconciling ClusterOperator
2021-10-07T21:21:20.747Z	INFO	status  	Adding to relatedObjects	{"namespace": "ping-pong"}
2021-10-07T21:21:20.747Z	INFO	status  	Adding to relatedObjects	{"namespace": "cert-manager"}
2021-10-07T21:21:20.747Z	INFO	status  	Adding to relatedObjects	{"namespace": "preamble"}
2021-10-07T21:21:20.754Z	INFO	status  	RECONCILE SUCCESS: Reconcile


11. # oc get all -n ping-pong
NAME                                    READY   STATUS    RESTARTS        AGE
pod/ping-pong-client-c5bff68d7-hgbhn    1/1     Running   1 (6m19s ago)   6m21s
pod/ping-pong-server-7cbd48d69c-mq59t   1/1     Running   0               6m32s

NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/ping-pong-service   ClusterIP   172.30.45.234   <none>        12021/TCP   15m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/ping-pong-client   1/1     1            1           14m
deployment.apps/ping-pong-server   1/1     1            1           15m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/ping-pong-client-7fd9cc6848   0         0         0       14m
replicaset.apps/ping-pong-client-c5bff68d7    1         1         1       6m21s
replicaset.apps/ping-pong-server-7b8b5c98c4   0         0         0       15m
replicaset.apps/ping-pong-server-7cbd48d69c   1         1         1       6m32s

Comment 5 errata-xmlrpc 2022-03-10 16:14:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056