Bug 1859221
| Summary: | Machine API Operator should go degraded if any pod controller is crash looping | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alberto <agarcial> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | | |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 16:16:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alberto
2020-07-21 13:34:55 UTC
Hi @Alberto, do these steps look good? I will update them in Polarion as a test case if they are OK.
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-21-200036   True        False         30h     Cluster version is 4.6.0-0.nightly-2020-07-21-200036
Steps:
1. [miyadav@miyadav ~]$ oc edit deployment machine-api-operator -n openshift-machine-api
deployment.apps/machine-api-operator edited (changed the manifest to induce a failure)
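As a non-interactive alternative for step 1, something like the following should also induce the image pull failure (a sketch only; the container name machine-api-operator and the bogus image reference are assumptions, adjust them to your cluster):
# Point the operator container at an image that cannot be pulled.
oc -n openshift-machine-api set image deployment/machine-api-operator machine-api-operator=quay.io/openshift/does-not-exist:bad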
2. [miyadav@miyadav ~]$ oc get pods -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-7cf8c8c6b5-f4fz9 2/2 Running 0 30h
machine-api-controllers-5684bf77d-zmmj7 7/7 Running 0 3h46m
machine-api-operator-5568ccb96d-dsjdm 1/2 ImagePullBackOff 0 5s
machine-api-operator-7b8c5c5454-vmvj8 2/2 Running 0 3h17m
[miyadav@miyadav ~]$
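To confirm why the new pod is failing, the waiting reason can be read straight from its status (a sketch; the pod name is taken from the listing above):
# Prints the waiting reason of each container in the pod.
oc -n openshift-machine-api get pod machine-api-operator-5568ccb96d-dsjdm -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
This should print ImagePullBackOff (or ErrImagePull) for the broken container.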
3. Check the machine-api-operator logs:
[miyadav@miyadav ~]$ oc logs -f machine-api-operator-7b8c5c5454-vmvj8 -c machine-api-operator
I0723 07:05:35.875011 1 start.go:58] Version: 4.6.0-202007201802.p0-dirty
I0723 07:05:35.876417 1 leaderelection.go:242] attempting to acquire leader lease openshift-machine-api/machine-api-operator...
I0723 07:07:33.696580 1 leaderelection.go:252] successfully acquired lease openshift-machine-api/machine-api-operator
I0723 07:07:33.701627 1 operator.go:145] Starting Machine API Operator
I0723 07:07:33.801696 1 operator.go:157] Synced up caches
I0723 07:07:33.801696 1 start.go:104] Synced up machine api informer caches
I0723 07:07:33.819131 1 status.go:68] Syncing status: re-syncing
I0723 07:07:33.843605 1 sync.go:67] Synced up all machine API webhook configurations
I0723 07:07:33.863759 1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244fee85ec88fa dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DeploymentUpdated,Message:Updated Deployment.apps/machine-api-controllers -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,LastTimestamp:2020-07-23 07:07:33.86368025 +0000 UTC m=+118.024322782,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0723 07:07:34.889616 1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16244feec311a459 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:DaemonSetUpdated,Message:Updated DaemonSet.apps/machine-api-termination-handler -n openshift-machine-api because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,LastTimestamp:2020-07-23 07:07:34.889522265 +0000 UTC m=+119.050164798,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
.
.
.
.
Actual results:
[miyadav@miyadav ~]$ oc get clusteroperator machine-api -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
machine-api 4.6.0-0.nightly-2020-07-21-200036 True False False 30h
Expected results:
The machine-api clusteroperator should go Degraded when a pod is stuck in ImagePullBackOff.
Additional info:
This is not tested yet; I just wanted to confirm the steps. I will test once the payload is available.
Hey Milind, the machine-api-operator is managed by the cluster version operator (CVO). The pod that needs to be corrupted for this bug is machine-api-controllers-*. Also, it should be done without updating the deployment, as we already know it behaves as expected in that scenario. For this test case you could constantly delete the machine-api-controllers-* pods (e.g. via automation, or by manually running kubectl delete) and watch the clusterOperator eventually go degraded. Thanks!
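A minimal automation sketch of this suggestion (editor-added; the grep pattern mirrors the verified steps below, and the sleep interval is arbitrary):
# Keep deleting the machine-api-controllers-* pods until the
# machine-api ClusterOperator reports Degraded=True.
until [ "$(oc get clusteroperator machine-api -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')" = "True" ]; do
  oc -n openshift-machine-api delete $(oc -n openshift-machine-api get pods -o name | grep machine-api-controllers) --wait=false
  sleep 5
done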
Thanks Alberto.
VERIFIED on:
[miyadav@miyadav ~]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2020-07-24-002417 True False 105m Cluster version is 4.6.0-0.nightly-2020-07-24-002417
Steps:
1. Navigate to the openshift-machine-api project:
oc project openshift-machine-api
2. Run the loop below to keep killing the controller pod (it might take a random number of iterations):
for i in {1..20}; do oc delete po $(oc get po | awk '{print $1}' | grep machine-api-controller); done
.
.
pod "machine-api-controllers-7f6dc5f8db-pmt82" deleted
pod "machine-api-controllers-7f6dc5f8db-c4z8d" deleted
pod "machine-api-controllers-7f6dc5f8db-s5482" deleted
pod "machine-api-controllers-7f6dc5f8db-p6wqq" deleted
pod "machine-api-controllers-7f6dc5f8db-vwm22" deleted
pod "machine-api-controllers-7f6dc5f8db-5zr6c" deleted
pod "machine-api-controllers-7f6dc5f8db-vxlbj" deleted
pod "machine-api-controllers-7f6dc5f8db-xwxjq" deleted
pod "machine-api-controllers-7f6dc5f8db-qhp8r" deleted
pod "machine-api-controllers-7f6dc5f8db-lpsx6" deleted
pod "machine-api-controllers-7f6dc5f8db-lt8b8" deleted
pod "machine-api-controllers-7f6dc5f8db-zhlgk" deleted
.
.
3. In parallel with step 2, keep watching the clusteroperator status:
[miyadav@miyadav ~]$ oc get clusteroperators machine-api -w
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-api 4.6.0-0.nightly-2020-07-24-002417 True False False 84m
machine-api 4.6.0-0.nightly-2020-07-24-002417 True False False 85m
machine-api 4.6.0-0.nightly-2020-07-24-002417 True False True 90m
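Instead of watching, the Degraded condition and its message can also be queried directly (a sketch using standard jsonpath; the exact message is cluster-dependent):
# Prints "<status> <message>" for the Degraded condition.
oc get clusteroperator machine-api -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}{" "}{.message}{"\n"}{end}'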
Expected & Actual:
The operator status became Degraded for a while.
Additional info:
Test case created, removing label.
Moved to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196