Bug 1997905
| Field | Value |
|---|---|
| Summary | Cluster failed to come up after applying the machine config resource |
| Product | OpenShift Container Platform |
| Component | Machine Config Operator |
| Sub component | Machine Config Operator |
| Version | 4.9 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | unspecified |
| Reporter | Praveen Kumar <prkumar> |
| Assignee | MCO Team <team-mco> |
| QA Contact | Rio Liu <rioliu> |
| CC | aos-bugs, cfergeau, jkyros, mkrejci, rfreiman, skumari |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Regression | --- |
| Last Closed | 2021-09-13 05:05:16 UTC |
Description

Praveen Kumar, 2021-08-26 03:05:35 UTC
I can reproduce this reliably with a reboot alone (without having to apply any machine config). It seems to happen almost every time on reboot with single node + this nightly:

- Create a single-node cluster (I used cluster bot)
- Reboot it (`oc debug node`; `chroot /host`; `shutdown -r 1`)

My results are similar to Praveen's:

    NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.9.0-0.nightly-2021-08-24-235829   False       False         True       95m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-ln-pgbd8zb-d5d6b.origin-ci-int-aws.dev.rhcloud.com/healthz": EOF
    baremetal                                  4.9.0-0.nightly-2021-08-24-235829   True        False         False      128m
    cloud-controller-manager                   4.9.0-0.nightly-2021-08-24-235829   True        False         False      132m
    cloud-credential                           4.9.0-0.nightly-2021-08-24-235829   True        False         False      132m
    cluster-autoscaler                         4.9.0-0.nightly-2021-08-24-235829   True        False         False      129m
    config-operator                            4.9.0-0.nightly-2021-08-24-235829   True        False         False      130m
    console                                    4.9.0-0.nightly-2021-08-24-235829   False       True          False      36m     DeploymentAvailable: 0 pods available for console deployment RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-ln-pgbd8zb-d5d6b.origin-ci-int-aws.dev.rhcloud.com): Get "https://console-openshift-console.apps.ci-ln-pgbd8zb-d5d6b.origin-ci-int-aws.dev.rhcloud.com": EOF
    csi-snapshot-controller                    4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    dns                                        4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    etcd                                       4.9.0-0.nightly-2021-08-24-235829   True        False         False      127m
    image-registry                             4.9.0-0.nightly-2021-08-24-235829   False       True          True       36m     Available: The deployment does not have available replicas NodeCADaemonAvailable: The daemon set node-ca has available replicas ImagePrunerAvailable: Pruner CronJob has been created
    ingress                                    4.9.0-0.nightly-2021-08-24-235829   False       True          True       36m     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
    insights                                   4.9.0-0.nightly-2021-08-24-235829   True        False         True       123m    Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
    kube-apiserver                             4.9.0-0.nightly-2021-08-24-235829   True        False         False      127m
    kube-controller-manager                    4.9.0-0.nightly-2021-08-24-235829   True        False         False      127m
    kube-scheduler                             4.9.0-0.nightly-2021-08-24-235829   True        False         False      128m
    kube-storage-version-migrator              4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    machine-api                                4.9.0-0.nightly-2021-08-24-235829   True        False         False      126m
    machine-approver                           4.9.0-0.nightly-2021-08-24-235829   True        False         False      129m
    machine-config                             4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    marketplace                                4.9.0-0.nightly-2021-08-24-235829   True        False         False      129m
    monitoring                                 4.9.0-0.nightly-2021-08-24-235829   True        False         False      35m
    network                                    4.9.0-0.nightly-2021-08-24-235829   True        False         False      131m
    node-tuning                                4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    openshift-apiserver                        4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    openshift-controller-manager               4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    openshift-samples                          4.9.0-0.nightly-2021-08-24-235829   True        False         False      126m
    operator-lifecycle-manager                 4.9.0-0.nightly-2021-08-24-235829   True        False         False      130m
    operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-08-24-235829   True        False         False      130m
    operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-24-235829   True        False         False      37m
    service-ca                                 4.9.0-0.nightly-2021-08-24-235829   True        False         False      130m
    storage                                    4.9.0-0.nightly-2021-08-24-235829   False       True          False      96m     AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment

I think, like John shows, this isn't necessarily an MCO issue. A potential way to debug is to start at the first failing operator and see what caused that. I'm also not sure what the state of SNO CI is. Maybe we should ask if this is a known SNO issue? What do you think, @Praveen? I'm not sure where this should live, but I don't think this should be on the MCO board.

@Yu I am able to reproduce what @John pointed to, and I have also asked the SNO team to take a look. We can remove the MCO component, but I am not sure which component to target it at :(

Shouldn't be a release blocker, since this bug is not known to impact regular OCP or Single Node OpenShift clusters.

@Rom I have to test this with rc0 or the latest nightly, because this is something John observed on the SNO side, and I will update this bug.

I tested this with the latest nightly of 4.9 and didn't encounter this issue anymore. Closing it; I will reopen if it appears again.
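The reboot reproduction and the "start at the first failing operator" debugging suggestion above can be sketched as a command transcript. This is a sketch, not part of the original report: it assumes a live single-node cluster reachable via `oc` with cluster-admin credentials, and `<node-name>` is a placeholder for the cluster's only node.

```
# Reboot the single node (the reproduction steps from the comment above)
oc debug node/<node-name> -- chroot /host shutdown -r 1

# After the node comes back, check which operators are unhealthy
oc get clusteroperators

# Dig into the first failing operator; in this report that was
# "authentication" (Available=False, Degraded=True)
oc describe clusteroperator authentication
```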