Bug 1859221 - Machine API Operator should go degraded if any pod controller is crash looping
Summary: Machine API Operator should go degraded if any pod controller is crash looping
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alberto
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-21 13:34 UTC by Alberto
Modified: 2020-10-27 16:16 UTC
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:16:18 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 651 0 None closed BUG 1859221: Wait for resources to roll out on every sync 2020-12-14 10:34:59 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:16:46 UTC

Description Alberto 2020-07-21 13:34:55 UTC
Description of problem:
Currently the mao only validates that the expected pods are available when there happens to be a deployment resource update: https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L110-L116.

This prevents the operator from going degraded in any scenario where the expected pods are not available but "ApplyDeployment" does not return "updated=true", e.g. an induced pod crash loop.

This was noticed via https://bugzilla.redhat.com/show_bug.cgi?id=1856597.
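
For reference, the eventual fix (PR 651, "Wait for resources to roll out on every sync") makes the operator check operand availability on every sync instead of only when ApplyDeployment reports an update. Below is a minimal sketch of that kind of rollout check, using only the upstream apps/v1 types; the helper name and the exact conditions are illustrative, not the operator's actual code.

package operator

import (
	appsv1 "k8s.io/api/apps/v1"
)

// deploymentRolledOut reports whether the Deployment controller has observed
// the latest spec and all desired replicas are updated and available. Calling
// a check like this on every sync, and marking the operator Degraded when it
// stays false past a timeout, catches crash-looping pods even when the
// Deployment object itself never changes.
func deploymentRolledOut(d *appsv1.Deployment) bool {
	if d.Status.ObservedGeneration < d.Generation {
		return false // latest spec not yet observed by the controller
	}
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return d.Status.UpdatedReplicas == desired &&
		d.Status.AvailableReplicas == desired &&
		d.Status.UnavailableReplicas == 0
}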


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Consistently kill a pod owned by the mao.
2. Watch the mao logs: the operator does not wait for the rollout. Wait 5 minutes.
3. Watch the operatorStatus: it never goes degraded.

Actual results:


Expected results:


Additional info:

Comment 3 Milind Yadav 2020-07-23 10:34:50 UTC
Hi @Alberto, do these steps look good? I will add them to Polarion as a test case if they are OK.

version   4.6.0-0.nightly-2020-07-21-200036   True        False         30h     Cluster version is 4.6.0-0.nightly-2020-07-21-200036


Steps :
1. [miyadav@miyadav ~]$ oc edit deployment machine-api-operator -n openshift-machine-api

deployment.apps/machine-api-operator edited  (changed the manifest to induce a failure)

2. [miyadav@miyadav ~]$ oc get pods  -n openshift-machine-api
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-7cf8c8c6b5-f4fz9   2/2     Running            0          30h
machine-api-controllers-5684bf77d-zmmj7        7/7     Running            0          3h46m
machine-api-operator-5568ccb96d-dsjdm          1/2     ImagePullBackOff   0          5s
machine-api-operator-7b8c5c5454-vmvj8          2/2     Running            0          3h17m
[miyadav@miyadav ~]$ 

3. Check the machine-api-operator logs:
[miyadav@miyadav ~]$ oc logs -f machine-api-operator-7b8c5c5454-vmvj8 -c machine-api-operator
I0723 07:05:35.875011       1 start.go:58] Version: 4.6.0-202007201802.p0-dirty
I0723 07:05:35.876417       1 leaderelection.go:242] attempting to acquire leader lease  openshift-machine-api/machine-api-operator...
I0723 07:07:33.696580       1 leaderelection.go:252] successfully acquired lease openshift-machine-api/machine-api-operator
I0723 07:07:33.701627       1 operator.go:145] Starting Machine API Operator
I0723 07:07:33.801696       1 operator.go:157] Synced up caches
I0723 07:07:33.801696       1 start.go:104] Synced up machine api informer caches
I0723 07:07:33.819131       1 status.go:68] Syncing status: re-syncing
I0723 07:07:33.843605       1 sync.go:67] Synced up all machine API webhook configurations
I0723 07:07:33.863759       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244fee85ec88fa  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DeploymentUpdated,Message:Updated Deployment.apps/machine-api-controllers -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,LastTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0723 07:07:34.889616       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244feec311a459  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DaemonSetUpdated,Message:Updated DaemonSet.apps/machine-api-termination-handler -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,LastTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
.
.
.
.

Actual results:
[miyadav@miyadav ~]$ oc get clusteroperator machine-api -w
NAME          VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h



Expected results:
The machine-api operator should go Degraded while the pod is in ImagePullBackOff.
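
For context, "going Degraded" here means the operator sets Degraded=True among its clusterOperator status conditions. A minimal sketch of building such a condition with the openshift/api config/v1 types follows; the reason and message wording are illustrative, not necessarily what the operator actually emits.

package operator

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// degradedCondition builds the Degraded=True condition an operator would set
// on its clusterOperator status when its operands fail to become available.
// The Reason and Message values are illustrative only.
func degradedCondition(message string) configv1.ClusterOperatorStatusCondition {
	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorDegraded,
		Status:             configv1.ConditionTrue,
		Reason:             "SyncingFailed",
		Message:            message,
		LastTransitionTime: metav1.Now(),
	}
}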

Additional info:
This is not tested; I just wanted to make sure of the steps. I will test once the payload is available.

Comment 4 Alberto 2020-07-23 10:47:46 UTC
Hey Milind,
The machine-api-operator is managed by the cluster version operator (CVO).

The pod that needs to be corrupted for this bug is machine-api-controllers-*.
Also, it should be done without updating the deployment, as we already know the operator behaves as expected in that scenario.
For this test case you could constantly delete the machine-api-controllers-* pods (e.g. via automation, or by manually running kubectl delete) and watch the clusterOperator eventually go degraded.
Thanks!
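
For the "via automation" option, a minimal client-go sketch that repeatedly deletes the machine-api-controllers pods is below (assuming client-go v0.18+ signatures); the label selector, loop count, and sleep interval are assumptions, and the shell loop in the next comment is equivalent.

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the same kubeconfig that oc/kubectl use.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "openshift-machine-api"
	// The label selector is an assumption; adjust it to whatever labels the
	// machine-api-controllers pods carry in your cluster.
	selector := "k8s-app=controller"

	for i := 0; i < 20; i++ {
		pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			panic(err)
		}
		for _, p := range pods.Items {
			fmt.Println("deleting", p.Name)
			_ = client.CoreV1().Pods(ns).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
		}
		time.Sleep(10 * time.Second)
	}
}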

Comment 5 Milind Yadav 2020-07-24 05:20:15 UTC
Thanks Alberto,

VERIFIED on:
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-24-002417   True        False         105m    Cluster version is 4.6.0-0.nightly-2020-07-24-002417


Steps:

1. Switch to the openshift-machine-api project:
oc project openshift-machine-api

2. Run the loop below to keep killing the controller pod (it might take a random number of iterations):
for i in {1..20}; do oc delete po $(oc get po | awk '{print $1}' | grep machine-api-controller); done
.
.
pod "machine-api-controllers-7f6dc5f8db-pmt82" deleted
pod "machine-api-controllers-7f6dc5f8db-c4z8d" deleted
pod "machine-api-controllers-7f6dc5f8db-s5482" deleted
pod "machine-api-controllers-7f6dc5f8db-p6wqq" deleted
pod "machine-api-controllers-7f6dc5f8db-vwm22" deleted
pod "machine-api-controllers-7f6dc5f8db-5zr6c" deleted
pod "machine-api-controllers-7f6dc5f8db-vxlbj" deleted
pod "machine-api-controllers-7f6dc5f8db-xwxjq" deleted
pod "machine-api-controllers-7f6dc5f8db-qhp8r" deleted
pod "machine-api-controllers-7f6dc5f8db-lpsx6" deleted
pod "machine-api-controllers-7f6dc5f8db-lt8b8" deleted
pod "machine-api-controllers-7f6dc5f8db-zhlgk" deleted
.
.

3. In parallel with step 2, keep watching the clusteroperator status:
[miyadav@miyadav ~]$ oc get clusteroperators machine-api -w
NAME          VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-api   4.6.0-0.nightly-2020-07-24-002417   True        False         False      84m
machine-api   4.6.0-0.nightly-2020-07-24-002417   True        False         False      85m
machine-api   4.6.0-0.nightly-2020-07-24-002417   True        False         True       90m


Expected & Actual:

The operator status went Degraded for a while.


Additional info:
Test case created, removing label.
Moved to VERIFIED

Comment 7 errata-xmlrpc 2020-10-27 16:16:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

