Created attachment 1741552 [details]
oc installer log

Description of problem:

The OCP 4.6.9 installer failed with:

level=error msg="Cluster operator cloud-credential Degraded is True with CredentialsFailing: 6 of 6 credentials requests are failing to sync."
level=info msg="Cluster operator cloud-credential Progressing is True with Reconciling: 0 of 6 credentials requests provisioned, 6 reporting errors."
level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []"
level=info msg="Cluster operator console Available is Unknown with NoData: "
level=info msg="Cluster operator image-registry Available is False with DeploymentNotFound: Available: The deployment does not exist\nImagePrunerAvailable: Pruner CronJob has been created"
level=info msg="Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials \"openshift-image-registry/installer-cloud-credentials\": secret \"installer-cloud-credentials\" not found"
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller \"default\" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod \"router-default-55c5467df9-xzkqn\" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Pod \"router-default-55c5467df9-2pls4\" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), DNSReady=False (NoZones: The record isn't present in any zones.)"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=info msg="Cluster operator storage Progressing is True with AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying: AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods"
level=info msg="Cluster operator storage Available is False with AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying: AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring, storage"

Version-Release number of selected component (if applicable):
OCP 4.6.9

How reproducible:
A few times already

Steps to Reproduce:
1. Install OCP 4.6.9 on AWS

Actual results:

Expected results:

Additional info:
Created attachment 1741588 [details] Another installer log (this time with 4.6.8) + Machines info
I don't think this is an installer problem, and it certainly isn't an 'oc' problem, but the machine-API should be setting a status on the machines. From the must-gather:

$ grep nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f namespaces/openshift-machine-api/pods/machine-api-controllers-6f9d4f74bd-56qzf/nodelink-controller/nodelink-controller/logs/current.log | tail -n4
2020-12-23T14:45:38.890241771Z W1223 14:45:38.890065 1 nodelink_controller.go:353] Machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" has no providerID
2020-12-23T14:45:38.890241771Z I1223 14:45:38.890072 1 nodelink_controller.go:375] Finding node from machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" by IP
2020-12-23T14:45:38.890241771Z W1223 14:45:38.890079 1 nodelink_controller.go:386] not found internal IP for machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f"
2020-12-23T14:45:38.890241771Z I1223 14:45:38.890090 1 nodelink_controller.go:328] No-op: Node for machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" not found

But from the Machine itself:

$ yaml2json <namespaces/openshift-machine-api/machine.openshift.io/machines/nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f.yaml | jq -c keys
["apiVersion","kind","metadata","spec"]

There is no status at all. I'm assigning this to machine-API about getting something about "no providerID", or whatever folks want to say about why the Machine is failing to provision, reflected in the Machine's status.
$ grep nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f namespaces/openshift-machine-api/pods/machine-api-controllers-6f9d4f74bd-56qzf/machine-controller/machine-controller/logs/current.log | head -n5
2020-12-23T14:28:38.935301681Z I1223 14:28:38.935280 1 controller.go:170] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: reconciling Machine
2020-12-23T14:28:38.944035228Z I1223 14:28:38.944003 1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machine_controller" "name"="nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" "namespace"="openshift-machine-api"
2020-12-23T14:28:44.048570540Z I1223 14:28:44.048528 1 controller.go:170] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: reconciling Machine
2020-12-23T14:28:44.048570540Z I1223 14:28:44.048554 1 actuator.go:100] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: actuator checking if machine exists
2020-12-23T14:28:44.048913846Z E1223 14:28:44.048899 1 controller.go:273] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: failed to check if machine exists: nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: failed to create scope for machine: failed to create aws client: aws credentials secret openshift-machine-api/aws-cloud-credentials: Secret "aws-cloud-credentials" not found not found

So possibly the status on the Machine should say "I can't look for or create this machine without creds in the requested cloud".
And, peripherally for the purpose of this machine-API bug, the cloud-cred operator was bumping up against throttling:

$ yaml2json <namespaces/openshift-cloud-credential-operator/cloudcredential.openshift.io/credentialsrequests/openshift-machine-api-aws.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-12-23T14:23:28Z CredentialsProvisionFailure=True CredentialsProvisionFailure: failed to grant creds: error syncing creds in mint-mode: AWS Error: Throttling: Rate exceeded status code: 400, request id: ...
I've spun the cred-operator throttling issue out into bug 1910396.
Reporting the lack of cloud creds in Machine status (this bug) would be nice, but doesn't seem like it's important enough to be 'high' priority, because while it makes root-causing the issue easier, it will not actually _fix_ the issue. The fix in this case (not this bug) lies somewhere between bug 1910396, the provider's throttle settings, and parallel load on the throttled provider API endpoints.
In QE's CI testing, we also saw a similar error on GCP with 4.7.0-0.nightly-2020-12-21-131655:

level=info msg=Cluster operator machine-config Progressing is True with : Working towards 4.7.0-0.nightly-2020-12-21-131655
level=error msg=Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.7.0-0.nightly-2020-12-21-131655: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
level=info msg=Cluster operator machine-config Available is False with : Cluster not available for 4.7.0-0.nightly-2020-12-21-131655
level=info msg=Cluster operator network ManagementStateDegraded is False with :
level=info msg=Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
level=info msg=Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating
Ignore my comment 8; it is another issue. I will track it in Bug 1910581.
Within the team, late last year, we had been discussing adding conditions to the Machine object that could encapsulate information like this. For example, one of them could be MachineExists which shows whether the VM instance on the cloud provider exists or not. If we can't look this up (which is one of the first things we do, and the first to use cloud credentials), we could set this to false and add a reason that says "Invalid cloud credentials" or "Cloud credentials do not exist" based on whatever is happening. How does that sound as a solution to this particular issue?
(In reply to Joel Speed from comment #10)
> Within the team, late last year, we had been discussing adding conditions to
> the Machine object that could encapsulate information like this. For
> example, one of them could be MachineExists which shows whether the VM
> instance on the cloud provider exists or not. If we can't look this up
> (which is one of the first things we do, and the first to use cloud
> credentials), we could set this to false and add a reason that says "Invalid
> cloud credentials" or "Cloud credentials do not exist" based on whatever is
> happening.
>
> How does that sound as a solution to this particular issue?

Sounds good. Please add the whole message, such as the msg in #c5: "failed to grant creds: error syncing creds in mint-mode: AWS Error: Throttling: Rate exceeded"

"Invalid cloud credentials" might be confusing in this case, since the credentials I used were correct. The root cause is "Rate exceeded" - it should be clear to the user.
> Please add the whole message

We will do! The normal format is to have a "reason" which is a short summary (eg InvalidCredentials) and then a longer "message", which would in this case be "failed to grant creds: error syncing creds in mint-mode: AWS Error: Throttling: Rate exceeded"
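As a rough illustration of that reason/message convention (this is a sketch only, not the actual machine-api types; the MachineExists condition type is the proposal from comment #10 and the field names mirror the usual Kubernetes condition shape):

```python
# Hypothetical sketch of the convention described above -- not machine-api
# code. A condition carries a short CamelCase "reason" for machines to
# match on, plus the full provider error text as the longer "message".
condition = {
    "type": "MachineExists",         # hypothetical condition type from comment #10
    "status": "False",               # the instance lookup failed
    "reason": "InvalidCredentials",  # short machine-readable summary
    "message": "failed to grant creds: error syncing creds in mint-mode: "
               "AWS Error: Throttling: Rate exceeded",
}

# A consumer can branch on the short reason but surface the full message.
print(f'{condition["reason"]}: {condition["message"]}')
```

The point of the split is exactly the concern raised in the previous comment: tooling keys off the terse reason, while the verbatim cloud error ("Rate exceeded") stays visible to the user in the message.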
Master is now open for 4.8, I think we should be able to get a fix in during the 4.8 cycle
Joel Speed, I can't find "conditions" in the machine status.

$ oc get machine zhsun312-pnqnc-worker-us-east-2b-lsdrn -o yaml
status:
  addresses:
  - address: 10.0.171.66
    type: InternalIP
  - address: ip-10-0-171-66.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-171-66.us-east-2.compute.internal
    type: Hostname
  errorMessage: Can't find created instance.
  lastUpdated: "2021-03-12T05:42:16Z"
  nodeRef:
    kind: Node
    name: ip-10-0-171-66.us-east-2.compute.internal
    uid: f09ccb27-11e2-43f5-b22b-01d3b343ec4e
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2021-03-12T01:32:52Z"
      lastTransitionTime: "2021-03-12T01:32:52Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0a2262db355b33d91
    instanceState: Unknown

clusterversion: 4.8.0-0.nightly-2021-03-10-142839
Sorry, need to update the vendoring for the machine controllers. Will attach those now.
I'm aware this isn't working correctly, going to need further investigation
*** Bug 1941105 has been marked as a duplicate of this bug. ***
Tested the below scenarios on AWS and GCP, clusterversion 4.8.0-0.nightly-2021-03-26-054333:

1. Check one normal running machine status
2. Stop the instance from the GUI, check machine status
3. Delete the instance from the GUI, check machine status
4. Create a new machineset with an invalid field

There is only one point of doubt: when deleting the instance from the GUI, condition.status=true and condition.severity=Warning, and I'm not sure if this is correct. According to the CRD yaml file this seems inconsistent: "The Severity field MUST be set only when Status=False". And given "status: description: Status of the condition, one of True, False, Unknown.", I'm not sure if this should be Unknown. If not, when is it Unknown?

status:
  addresses:
  - address: 10.0.128.3
    type: InternalIP
  - address: zhsungcp-zk5jv-worker-b-89s8d.us-central1-b.c.openshift-qe.internal
    type: InternalDNS
  - address: zhsungcp-zk5jv-worker-b-89s8d.c.openshift-qe.internal
    type: InternalDNS
  - address: zhsungcp-zk5jv-worker-b-89s8d
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2021-03-29T03:40:51Z"
    message: Instance not found on provider
    reason: InstanceMissing
    severity: Warning
    status: "True"
    type: InstanceExists
  errorMessage: Can't find created instance.
  lastUpdated: "2021-03-29T10:35:18Z"
  nodeRef:
    kind: Node
    name: zhsungcp-zk5jv-worker-b-89s8d.c.openshift-qe.internal
    uid: abec0c18-1cb7-427e-9ae3-4b3d12f79c3f
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2021-03-29T03:40:30Z"
      lastTransitionTime: "2021-03-29T03:40:30Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    instanceId: zhsungcp-zk5jv-worker-b-89s8d
    instanceState: Unknown
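For reference, the CRD rule quoted above ("The Severity field MUST be set only when Status=False") can be expressed as a small consistency check. The helper below is hypothetical, not part of machine-api; it just encodes that rule to show why the observed condition (status "True" with severity "Warning") looks wrong:

```python
# Hypothetical helper encoding the quoted CRD rule: a condition may only
# carry a Severity when its Status is "False". Not machine-api code.
def severity_is_consistent(condition: dict) -> bool:
    if condition.get("severity"):
        return condition.get("status") == "False"
    # No severity set: nothing to violate.
    return True

# The condition observed during testing: Status "True" with Severity
# "Warning" -- inconsistent under the quoted rule.
observed = {"type": "InstanceExists", "status": "True", "severity": "Warning"}
print(severity_is_consistent(observed))  # prints: False
```

Under this reading, either the status should flip to "False" when the instance is missing, or the severity should be dropped while the status remains "True"/"Unknown".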
This looks wrong, will need to investigate why the update isn't changing the status to false when the instance is not found on the provider
We need to revendor before this fix is picked up
the bug won't be attached to an errata unless ART's sweeper moves it from MODIFIED to ON_QA.
Verified clusterversion: 4.8.0-0.nightly-2021-05-19-123944

Delete the instance from the GUI; condition.status is now false:

conditions:
- lastTransitionTime: "2021-05-20T04:15:25Z"
  message: Instance not found on provider
  reason: InstanceMissing
  severity: Warning
  status: "False"
  type: InstanceExists
errorMessage: Can't find created instance.
lastUpdated: "2021-05-20T04:15:26Z"
nodeRef:
  kind: Node
  name: ip-10-0-204-158.us-east-2.compute.internal
  uid: 0ef18f3d-ee47-46cd-86c7-b7a3c6f59549
phase: Failed
providerStatus:
  conditions:
  - lastProbeTime: "2021-05-20T04:05:47Z"
    lastTransitionTime: "2021-05-20T04:05:47Z"
    message: Machine successfully created
    reason: MachineCreationSucceeded
    status: "True"
    type: MachineCreation
  instanceId: i-03f0444046da27f4e
  instanceState: Unknown
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438