Bug 1859221
| Summary: | Machine API Operator should go degraded if any pod controller is crash looping | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alberto <agarcial> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | | |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:16:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alberto
2020-07-21 13:34:55 UTC
Hi @Alberto, do these steps look good? I will update them in Polarion as a test case if they are OK.
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-21-200036   True        False         30h     Cluster version is 4.6.0-0.nightly-2020-07-21-200036
Steps:
1. [miyadav@miyadav ~]$ oc edit deployment machine-api-operator -n openshift-machine-api
deployment.apps/machine-api-operator edited (changed the manifest to induce a failure)
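As a non-interactive alternative for step 1, something like the following should also induce the image pull failure (a sketch only; the container name machine-api-operator and the bogus image reference are assumptions, adjust them to your cluster):
# Point the operator container at an image that cannot be pulled.
oc -n openshift-machine-api set image deployment/machine-api-operator machine-api-operator=quay.io/openshift/does-not-exist:bad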
2. [miyadav@miyadav ~]$ oc get pods -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-7cf8c8c6b5-f4fz9 2/2 Running 0 30h
machine-api-controllers-5684bf77d-zmmj7 7/7 Running 0 3h46m
machine-api-operator-5568ccb96d-dsjdm 1/2 ImagePullBackOff 0 5s
machine-api-operator-7b8c5c5454-vmvj8 2/2 Running 0 3h17m
[miyadav@miyadav ~]$
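To confirm why the new pod is failing, the waiting reason can be read straight from its status (a sketch; the pod name is taken from the listing above):
# Prints the waiting reason of each container in the pod.
oc -n openshift-machine-api get pod machine-api-operator-5568ccb96d-dsjdm -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
This should print ImagePullBackOff (or ErrImagePull) for the broken container.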
3. Check the machine-api-operator logs:
[miyadav@miyadav ~]$ oc logs -f machine-api-operator-7b8c5c5454-vmvj8 -c machine-api-operator
I0723 07:05:35.875011 1 start.go:58] Version: 4.6.0-202007201802.p0-dirty
I0723 07:05:35.876417 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-api/machine-api-operator...
I0723 07:07:33.696580 1 leaderelection.go:252] successfully acquired lease openshift-machine-api/machine-api-operator
I0723 07:07:33.701627 1 operator.go:145] Starting Machine API Operator
I0723 07:07:33.801696 1 operator.go:157] Synced up caches
I0723 07:07:33.801696 1 start.go:104] Synced up machine api informer caches
I0723 07:07:33.819131 1 status.go:68] Syncing status: re-syncing
I0723 07:07:33.843605 1 sync.go:67] Synced up all machine API webhook configurations
I0723 07:07:33.863759 1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244fee85ec88fa dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DeploymentUpdated,Message:Updated Deployment.apps/machine-api-controllers -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,LastTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0723 07:07:34.889616 1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244feec311a459 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DaemonSetUpdated,Message:Updated DaemonSet.apps/machine-api-termination-handler -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,LastTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
.
.
.
.
Actual results:
[miyadav@miyadav ~]$ oc get clusteroperator machine-api -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
Expected results:
The machine-api clusteroperator should go Degraded when a pod is stuck in ImagePullBackOff.
Additional info:
This is not tested yet; I just wanted to confirm the steps. I will test once the payload is available.
Hey Milind, the machine-api-operator is managed by the cluster version operator (CVO). The pod that needs to be corrupted for this bug is machine-api-controllers-*. Also, it should be done without updating the deployment, as we already know it behaves as expected in that scenario. For this test case you could constantly delete the machine-api-controllers-* pods (e.g. via automation, or by manually running kubectl delete) and watch the clusterOperator eventually go degraded. Thanks!
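A minimal automation sketch of this suggestion (editor-added; the grep pattern mirrors the verified steps below, and the sleep interval is arbitrary):
# Keep deleting the machine-api-controllers-* pods until the
# machine-api ClusterOperator reports Degraded=True.
until [ "$(oc get clusteroperator machine-api -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')" = "True" ]; do
  oc -n openshift-machine-api delete $(oc -n openshift-machine-api get pods -o name | grep machine-api-controllers) --wait=false
  sleep 5
done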
Thanks Alberto.
VERIFIED on:
[miyadav@miyadav ~]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2020-07-24-002417 True False 105m Cluster version is 4.6.0-0.nightly-2020-07-24-002417
Steps:
1. Navigate to the openshift-machine-api project:
oc project openshift-machine-api
2. Run the loop below to keep killing the controller pod (it might take a random number of iterations):
for i in {1..20}; do oc delete po $(oc get po | awk '{print $1}' | grep machine-api-controller); done
.
.
pod "machine-api-controllers-7f6dc5f8db-pmt82" deleted
pod "machine-api-controllers-7f6dc5f8db-c4z8d" deleted
pod "machine-api-controllers-7f6dc5f8db-s5482" deleted
pod "machine-api-controllers-7f6dc5f8db-p6wqq" deleted
pod "machine-api-controllers-7f6dc5f8db-vwm22" deleted
pod "machine-api-controllers-7f6dc5f8db-5zr6c" deleted
pod "machine-api-controllers-7f6dc5f8db-vxlbj" deleted
pod "machine-api-controllers-7f6dc5f8db-xwxjq" deleted
pod "machine-api-controllers-7f6dc5f8db-qhp8r" deleted
pod "machine-api-controllers-7f6dc5f8db-lpsx6" deleted
pod "machine-api-controllers-7f6dc5f8db-lt8b8" deleted
pod "machine-api-controllers-7f6dc5f8db-zhlgk" deleted
.
.
3. In parallel with step 2, keep watching the clusteroperator status:
[miyadav@miyadav ~]$ oc get clusteroperators machine-api -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-api 4.6.0-0.nightly-2020-07-24-002417 True False False 84m
machine-api 4.6.0-0.nightly-2020-07-24-002417 True False False 85m
machine-api 4.6.0-0.nightly-2020-07-24-002417 True False True 90m
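Instead of watching, the Degraded condition and its message can also be queried directly (a sketch using standard jsonpath; the exact message is cluster-dependent):
# Prints "<status> <message>" for the Degraded condition.
oc get clusteroperator machine-api -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}{" "}{.message}{"\n"}{end}'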
Expected & Actual:
The operator status became Degraded for a while.
Additional info:
Test case created, removing label.
Moved to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196