Description of problem:
Currently the MAO only validates that the expected pods are available if there happens to be a deployment resource update (https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L110-L116). This prevents the operator from going degraded in any scenario where the expected pods are not available but "ApplyDeployment" is not returning "updated=true", e.g. an induced pod crash loop. This was surfaced by https://bugzilla.redhat.com/show_bug.cgi?id=1856597

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Consistently kill a pod owned by the MAO.
2. Watch the MAO logs not waiting for the rollout. Wait 5 minutes.
3. Watch the operatorStatus not going degraded.

Actual results:

Expected results:

Additional info:
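For illustration only, below is a minimal, self-contained Go sketch of the pattern described above. The names (applyFunc, waitFunc, buggySync, fixedSync) and the toy stubs are hypothetical stand-ins, not the actual helpers in pkg/operator/sync.go; the point is that the rollout check only runs when the apply reports updated=true, so an unchanged Deployment with unavailable pods never fails the sync.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// applyFunc and waitFunc stand in for the apply/rollout helpers referenced
// in pkg/operator/sync.go; the real signatures differ.
type applyFunc func(*appsv1.Deployment) (updated bool, err error)
type waitFunc func(*appsv1.Deployment) error

// buggySync mirrors the pattern at sync.go#L110-L116: pod availability is
// only verified when the apply reported an update to the Deployment object.
func buggySync(apply applyFunc, wait waitFunc, d *appsv1.Deployment) error {
	updated, err := apply(d)
	if err != nil {
		return err
	}
	if updated {
		// Skipped when updated == false, e.g. a crash-looping pod with an
		// unchanged Deployment, so the operator never goes Degraded.
		return wait(d)
	}
	return nil
}

// fixedSync always verifies the expected pods are available, so an induced
// crash loop surfaces as a sync error and can drive the Degraded condition.
func fixedSync(apply applyFunc, wait waitFunc, d *appsv1.Deployment) error {
	if _, err := apply(d); err != nil {
		return err
	}
	return wait(d)
}

func main() {
	// Toy run: the Deployment is unchanged (updated == false) but its pods
	// are not available; only the fixed variant reports the failure.
	apply := func(*appsv1.Deployment) (bool, error) { return false, nil }
	wait := func(*appsv1.Deployment) error { return fmt.Errorf("expected pods are not available") }
	fmt.Println("buggy sync error:", buggySync(apply, wait, nil))
	fmt.Println("fixed sync error:", fixedSync(apply, wait, nil))
}

Running this prints an error only for the fixed variant, which is the behavior change this bug asks for.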
Hi @Alberto, do these steps look good? I will add them to Polarion as a test case if they are OK.

version   4.6.0-0.nightly-2020-07-21-200036   True   False   30h   Cluster version is 4.6.0-0.nightly-2020-07-21-200036

Steps:

1. Edit the machine-api-operator deployment:
[miyadav@miyadav ~]$ oc edit deployment machine-api-operator -n openshift-machine-api
deployment.apps/machine-api-operator edited
(changed the manifest to induce an image pull failure)

2. Check the pods:
[miyadav@miyadav ~]$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-7cf8c8c6b5-f4fz9   2/2     Running            0          30h
machine-api-controllers-5684bf77d-zmmj7        7/7     Running            0          3h46m
machine-api-operator-5568ccb96d-dsjdm          1/2     ImagePullBackOff   0          5s
machine-api-operator-7b8c5c5454-vmvj8          2/2     Running            0          3h17m

3. Check the machine-api-operator logs:
[miyadav@miyadav ~]$ oc logs -f machine-api-operator-7b8c5c5454-vmvj8 -c machine-api-operator
I0723 07:05:35.875011       1 start.go:58] Version: 4.6.0-202007201802.p0-dirty
I0723 07:05:35.876417       1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-api/machine-api-operator...
I0723 07:07:33.696580       1 leaderelection.go:252] successfully acquired lease openshift-machine-api/machine-api-operator
I0723 07:07:33.701627       1 operator.go:145] Starting Machine API Operator
I0723 07:07:33.801696       1 operator.go:157] Synced up caches
I0723 07:07:33.801696       1 start.go:104] Synced up machine api informer caches
I0723 07:07:33.819131       1 status.go:68] Syncing status: re-syncing
I0723 07:07:33.843605       1 sync.go:67] Synced up all machine API webhook configurations
I0723 07:07:33.863759       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244fee85ec88fa dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DeploymentUpdated,Message:Updated Deployment.apps/machine-api-controllers -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,LastTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0723 07:07:34.889616       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244feec311a459 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DaemonSetUpdated,Message:Updated DaemonSet.apps/machine-api-termination-handler -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,LastTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
...
Actual results:
[miyadav@miyadav ~]$ oc get clusteroperator machine-api -w
NAME          VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h
machine-api   4.6.0-0.nightly-2020-07-21-200036   True        False         False      30h

Expected results:
The machine-api operator should go Degraded while the pod is in ImagePullBackOff.

Additional info:
This is not tested yet; I just wanted to confirm the steps and will test once the payload is available.
Hey Milind,

The machine-api-operator is managed by the cluster version operator (CVO). The pod that needs to be corrupted for this bug is machine-api-controllers-*. It should also be done without updating the deployment, since we already know the operator behaves as expected in that scenario. For this test case you could constantly delete the machine-api-controllers-* pods (e.g. via automation, or by manually running kubectl delete) and watch the clusterOperator eventually go degraded. Thanks!
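For the "via automation" option, a minimal client-go sketch is below. It is hypothetical and not part of any existing test suite: the name-prefix filter, the 20 iterations, and the 15-second interval are arbitrary choices that simply mirror the manual oc delete loop; kubeconfig is assumed to be at the default location.

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a kubeconfig with cluster-admin access at ~/.kube/config.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	const ns = "openshift-machine-api"
	for i := 0; i < 20; i++ {
		pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			panic(err)
		}
		for _, p := range pods.Items {
			// Mirror the `grep machine-api-controller` filter from the manual loop.
			if strings.HasPrefix(p.Name, "machine-api-controllers") {
				fmt.Println("deleting", p.Name)
				_ = client.CoreV1().Pods(ns).Delete(context.TODO(), p.Name, metav1.DeleteOptions{})
			}
		}
		time.Sleep(15 * time.Second)
	}
}

While this runs, watch the clusterOperator in another terminal with `oc get clusteroperator machine-api -w` and wait for DEGRADED to turn True.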
Thanks Alberto, VERIFIED on:

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-24-002417   True        False         105m    Cluster version is 4.6.0-0.nightly-2020-07-24-002417

Steps:

1. Navigate to the openshift-machine-api project:
oc project openshift-machine-api

2. Run the below to keep killing the controller pod (it might take a random number of iterations):
for i in {1..20}; do oc delete po $(oc get po | awk '{print $1}' | grep machine-api-controller); done
...
pod "machine-api-controllers-7f6dc5f8db-pmt82" deleted
pod "machine-api-controllers-7f6dc5f8db-c4z8d" deleted
pod "machine-api-controllers-7f6dc5f8db-s5482" deleted
pod "machine-api-controllers-7f6dc5f8db-p6wqq" deleted
pod "machine-api-controllers-7f6dc5f8db-vwm22" deleted
pod "machine-api-controllers-7f6dc5f8db-5zr6c" deleted
pod "machine-api-controllers-7f6dc5f8db-vxlbj" deleted
pod "machine-api-controllers-7f6dc5f8db-xwxjq" deleted
pod "machine-api-controllers-7f6dc5f8db-qhp8r" deleted
pod "machine-api-controllers-7f6dc5f8db-lpsx6" deleted
pod "machine-api-controllers-7f6dc5f8db-lt8b8" deleted
pod "machine-api-controllers-7f6dc5f8db-zhlgk" deleted
...

3. In parallel with step 2, keep watching the clusteroperator status:
[miyadav@miyadav ~]$ oc get clusteroperators machine-api -w
NAME          VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-api   4.6.0-0.nightly-2020-07-24-002417   True        False         False      84m
machine-api   4.6.0-0.nightly-2020-07-24-002417   True        False         False      85m
machine-api   4.6.0-0.nightly-2020-07-24-002417   True        False         True       90m

Expected & Actual:
The operator status became degraded for a while.

Additional info:
Test case created, removing the label. Moved to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196