Bug 1910318 - OC 4.6.9 Installer failed: Some pods are not scheduled: 3 node(s) didn't match node selector: AWS compute machines without status
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Duplicates: 1941105
Depends On:
Blocks:
 
Reported: 2020-12-23 12:32 UTC by Noam Manos
Modified: 2021-07-27 22:36 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:35:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
oc installer log (96.37 KB, text/plain)
2020-12-23 12:32 UTC, Noam Manos
Another installer log (this time with 4.6.8) + Machines info (119.61 KB, text/plain)
2020-12-23 16:25 UTC, Noam Manos


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-aws pull 392 0 None closed Bug 1910318: Add condition to show actuator exists condition on machine 2021-03-23 12:20:55 UTC
Github openshift cluster-api-provider-aws pull 396 0 None closed Bug 1910318: Ensure original conditions aren't mutated during reconcile 2021-05-19 16:10:15 UTC
Github openshift cluster-api-provider-azure pull 207 0 None closed Bug 1910318: Add condition to show actuator exists condition on machine 2021-03-23 12:20:56 UTC
Github openshift cluster-api-provider-azure pull 211 0 None closed Bug 1910318: Ensure original conditions aren't mutated during reconcile 2021-05-19 16:10:17 UTC
Github openshift cluster-api-provider-gcp pull 152 0 None closed Bug 1910318: Add condition to show actuator exists condition on machine 2021-03-23 12:20:57 UTC
Github openshift cluster-api-provider-gcp pull 155 0 None closed Bug 1910318: Ensure original conditions aren't mutated during reconcile 2021-05-19 16:10:06 UTC
Github openshift machine-api-operator pull 810 0 None closed Bug 1910318: [OCPCLOUD-931] Add condition to show actuator exists output on machine status 2021-05-19 16:10:03 UTC
Github openshift machine-api-operator pull 829 0 None closed Bug 1910318: Ensure original conditions aren't mutated during reconcile 2021-03-23 12:20:59 UTC
Github openshift machine-api-operator pull 849 0 None closed Bug 1910318: Ensure conditions are correctly copied before annotations are patched 2021-05-19 16:09:58 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:36:03 UTC

Description Noam Manos 2020-12-23 12:32:44 UTC
Created attachment 1741552 [details]
oc installer log

Description of problem:
OC 4.6.9 Installer failed on:

level=error msg="Cluster operator cloud-credential Degraded is True with CredentialsFailing: 6 of 6 credentials requests are failing to sync."
level=info msg="Cluster operator cloud-credential Progressing is True with Reconciling: 0 of 6 credentials requests provisioned, 6 reporting errors."
level=info msg="Cluster operator console Progressing is True with DefaultRouteSync_FailedAdmitDefaultRoute::OAuthClientSync_FailedHost: DefaultRouteSyncProgressing: route \"console\" is not available at canonical host []\nOAuthClientSyncProgressing: route \"console\" is not available at canonical host []"
level=info msg="Cluster operator console Available is Unknown with NoData: "
level=info msg="Cluster operator image-registry Available is False with DeploymentNotFound: Available: The deployment does not exist\nImagePrunerAvailable: Pruner CronJob has been created"
level=info msg="Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials \"openshift-image-registry/installer-cloud-credentials\": secret \"installer-cloud-credentials\" not found"
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: ingresscontroller \"default\" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod \"router-default-55c5467df9-xzkqn\" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Pod \"router-default-55c5467df9-2pls4\" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1), DNSReady=False (NoZones: The record isn't present in any zones.)"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=info msg="Cluster operator storage Progressing is True with AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying: AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods"
level=info msg="Cluster operator storage Available is False with AWSEBSCSIDriverOperatorCR_AWSEBSDriverControllerServiceController_Deploying: AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring, storage"


Version-Release number of selected component (if applicable):
OCP 4.6.9

How reproducible:
A few times already.

Steps to Reproduce:
1. Install OCP 4.6.9 on AWS


Actual results:


Expected results:


Additional info:

Comment 1 Noam Manos 2020-12-23 16:25:55 UTC
Created attachment 1741588 [details]
Another installer log (this time with 4.6.8) + Machines info

Comment 3 W. Trevor King 2020-12-23 17:44:58 UTC
I don't think this is an installer problem, and certainly isn't an 'oc' problem, but the machine-API should be setting a status on the machines.  From the must-gather:

$ grep nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f namespaces/openshift-machine-api/pods/machine-api-controllers-6f9d4f74bd-56qzf/nodelink-controller/nodelink-controller/logs/current.log | tail -n4
2020-12-23T14:45:38.890241771Z W1223 14:45:38.890065       1 nodelink_controller.go:353] Machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" has no providerID
2020-12-23T14:45:38.890241771Z I1223 14:45:38.890072       1 nodelink_controller.go:375] Finding node from machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" by IP
2020-12-23T14:45:38.890241771Z W1223 14:45:38.890079       1 nodelink_controller.go:386] not found internal IP for machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f"
2020-12-23T14:45:38.890241771Z I1223 14:45:38.890090       1 nodelink_controller.go:328] No-op: Node for machine "nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" not found

But from the Machine itself:

$ yaml2json <namespaces/openshift-machine-api/machine.openshift.io/machines/nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f.yaml | jq -c keys
["apiVersion","kind","metadata","spec"]

I'm assigning this to machine-API to get something about "no providerID" (or whatever folks want to say about why the Machine is failing to provision) reflected in the Machine's status.
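
For anyone hitting the same symptom, a quick way to spot Machines that never had any status written at all (a minimal sketch, assuming cluster access with 'oc' plus 'jq', rather than a must-gather):

$ oc -n openshift-machine-api get machines.machine.openshift.io -o json \
    | jq -r '.items[] | "\(.metadata.name)\t\(.status // {} | keys)"'

Machines the actuator never reconciled print an empty key list, matching the empty status seen in the must-gather above.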

Comment 4 W. Trevor King 2020-12-23 17:50:33 UTC
$ grep nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f namespaces/openshift-machine-api/pods/machine-api-controllers-6f9d4f74bd-56qzf/machine-controller/machine-controller/logs/current.log | head -n5
2020-12-23T14:28:38.935301681Z I1223 14:28:38.935280       1 controller.go:170] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: reconciling Machine
2020-12-23T14:28:38.944035228Z I1223 14:28:38.944003       1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machine_controller" "name"="nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f" "namespace"="openshift-machine-api" 
2020-12-23T14:28:44.048570540Z I1223 14:28:44.048528       1 controller.go:170] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: reconciling Machine
2020-12-23T14:28:44.048570540Z I1223 14:28:44.048554       1 actuator.go:100] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: actuator checking if machine exists
2020-12-23T14:28:44.048913846Z E1223 14:28:44.048899       1 controller.go:273] nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: failed to check if machine exists: nmanos-cluster-a-ktbpp-worker-us-west-1a-bjv4f: failed to create scope for machine: failed to create aws client: aws credentials secret openshift-machine-api/aws-cloud-credentials: Secret "aws-cloud-credentials" not found not found

So possibly the status on the machine should say "I can't look for or create this machine without creds in the requested cloud".
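
A corresponding one-line check (a sketch, not taken from the must-gather; the secret name comes straight from the error above):

$ oc -n openshift-machine-api get secret aws-cloud-credentials

If this returns NotFound, the actuator cannot build an AWS client, so it cannot even check whether the instance exists.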

Comment 5 W. Trevor King 2020-12-23 17:56:42 UTC
And, peripherally for the purpose of this machine-API bug, the cloud-cred operator was bumping up against throttling:

$ yaml2json <namespaces/openshift-cloud-credential-operator/cloudcredential.openshift.io/credentialsrequests/openshift-machine-api-aws.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-12-23T14:23:28Z CredentialsProvisionFailure=True CredentialsProvisionFailure: failed to grant creds: error syncing creds in mint-mode: AWS Error: Throttling: Rate exceeded
        status code: 400, request id: ...

Comment 6 W. Trevor King 2020-12-23 18:04:44 UTC
I've spun the cred-operator throttling issue out into bug 1910396.

Comment 7 W. Trevor King 2020-12-23 21:49:33 UTC
Reporting the lack of cloud creds in Machine status (this bug) would be nice, but doesn't seem like it's important enough to be 'high' priority, because while it makes root-causing the issue easier, it will not actually _fix_ the issue.  The fix in this case (not this bug) lies somewhere between bug 1910396, the provider's throttle settings, and parallel load on the throttled provider API endpoints.

Comment 8 Johnny Liu 2020-12-24 06:52:04 UTC
In QE's CI testing, we also saw a similar error on GCP with 4.7.0-0.nightly-2020-12-21-131655.

level=info msg=Cluster operator machine-config Progressing is True with : Working towards 4.7.0-0.nightly-2020-12-21-131655
level=error msg=Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.7.0-0.nightly-2020-12-21-131655: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
level=info msg=Cluster operator machine-config Available is False with : Cluster not available for 4.7.0-0.nightly-2020-12-21-131655
level=info msg=Cluster operator network ManagementStateDegraded is False with : 
level=info msg=Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
level=info msg=Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service
level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
level=fatal msg=failed to initialize the cluster: Cluster operator machine-config is still updating

Comment 9 Johnny Liu 2020-12-24 09:50:04 UTC
Ignore my comment 8; it is another issue and will be tracked in Bug 1910581.

Comment 10 Joel Speed 2021-01-04 13:13:48 UTC
Within the team, late last year, we had been discussing adding conditions to the Machine object that could encapsulate information like this. For example, one of them could be MachineExists which shows whether the VM instance on the cloud provider exists or not. If we can't look this up (which is one of the first things we do, and the first to use cloud credentials), we could set this to false and add a reason that says "Invalid cloud credentials" or "Cloud credentials do not exist" based on whatever is happening.

How does that sound as a solution to this particular issue?

Comment 13 Noam Manos 2021-01-05 16:40:31 UTC
(In reply to Joel Speed from comment #10)
> Within the team, late last year, we had been discussing adding conditions to
> the Machine object that could encapsulate information like this. For
> example, one of them could be MachineExists which shows whether the VM
> instance on the cloud provider exists or not. If we can't look this up
> (which is one of the first things we do, and the first to use cloud
> credentials), we could set this to false and add a reason that says "Invalid
> cloud credentials" or "Cloud credentials do not exist" based on whatever is
> happening.
> 
> How does that sound as a solution to this particular issue?

Sounds good. 
Please add the whole message, such as the msg in #c5:
"failed to grant creds: error syncing creds in mint-mode: AWS Error: Throttling: Rate exceeded"

"Invalid cloud credentials" might be confusing in this case, since the credentials I used were correct.
The root cause is "Rate exceeded" - it should be clear to user.

Comment 14 Joel Speed 2021-01-05 16:56:16 UTC
> Please add the whole message

We will do! The normal format is to have a "reason" which is a short summary (eg InvalidCredentials) and then a longer "message" which would in this case be "failed to grant creds: error syncing creds in mint-mode: AWS Error: Throttling: Rate exceeded"
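
Once such a condition lands, both fields should be readable straight off the Machine, e.g. (a sketch; <machine-name> is a placeholder):

$ oc -n openshift-machine-api get machine <machine-name> \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'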

Comment 15 Joel Speed 2021-02-08 10:22:02 UTC
Master is now open for 4.8; I think we should be able to get a fix in during the 4.8 cycle.

Comment 17 sunzhaohua 2021-03-12 07:32:02 UTC
Joel Speed, I can't find "conditions" in machine status. 

$ oc get machine zhsun312-pnqnc-worker-us-east-2b-lsdrn -o yaml
status:
  addresses:
  - address: 10.0.171.66
    type: InternalIP
  - address: ip-10-0-171-66.us-east-2.compute.internal
    type: InternalDNS
  - address: ip-10-0-171-66.us-east-2.compute.internal
    type: Hostname
  errorMessage: Can't find created instance.
  lastUpdated: "2021-03-12T05:42:16Z"
  nodeRef:
    kind: Node
    name: ip-10-0-171-66.us-east-2.compute.internal
    uid: f09ccb27-11e2-43f5-b22b-01d3b343ec4e
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2021-03-12T01:32:52Z"
      lastTransitionTime: "2021-03-12T01:32:52Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0a2262db355b33d91
    instanceState: Unknown

clusterversion: 4.8.0-0.nightly-2021-03-10-142839

Comment 18 Joel Speed 2021-03-12 11:41:22 UTC
Sorry, need to update the vendoring for the machine controllers. Will attach those now.

Comment 19 Joel Speed 2021-03-18 12:09:54 UTC
I'm aware this isn't working correctly; it's going to need further investigation.

Comment 20 Joel Speed 2021-03-22 10:18:20 UTC
*** Bug 1941105 has been marked as a duplicate of this bug. ***

Comment 24 sunzhaohua 2021-04-06 03:39:32 UTC
Tested the scenarios below on AWS and GCP, clusterversion 4.8.0-0.nightly-2021-03-26-054333:
1. Check the status of one normally running machine.
2. Stop the instance from the GUI and check the machine status.
3. Delete the instance from the GUI and check the machine status.
4. Create a new machineset with an invalid field.

There is only one point of doubt: when deleting the instance from the GUI, condition.status=true and condition.severity=Warning; I'm not sure this is correct.
According to the CRD YAML this seems inconsistent: "The Severity field MUST be set only when Status=False".
And according to "status: description: Status of the condition, one of True, False, Unknown.", I'm not sure whether this should be Unknown. If not, when is it Unknown? (A way to scan all Machines for this is sketched after the YAML below.)

status:
  addresses:
  - address: 10.0.128.3
    type: InternalIP
  - address: zhsungcp-zk5jv-worker-b-89s8d.us-central1-b.c.openshift-qe.internal
    type: InternalDNS
  - address: zhsungcp-zk5jv-worker-b-89s8d.c.openshift-qe.internal
    type: InternalDNS
  - address: zhsungcp-zk5jv-worker-b-89s8d
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2021-03-29T03:40:51Z"
    message: Instance not found on provider
    reason: InstanceMissing
    severity: Warning
    status: "True"
    type: InstanceExists
  errorMessage: Can't find created instance.
  lastUpdated: "2021-03-29T10:35:18Z"
  nodeRef:
    kind: Node
    name: zhsungcp-zk5jv-worker-b-89s8d.c.openshift-qe.internal
    uid: abec0c18-1cb7-427e-9ae3-4b3d12f79c3f
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2021-03-29T03:40:30Z"
      lastTransitionTime: "2021-03-29T03:40:30Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    instanceId: zhsungcp-zk5jv-worker-b-89s8d
    instanceState: Unknown
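
A quick way to scan every Machine for the inconsistency described above (a sketch, assuming 'oc' and 'jq'; it flags any condition that carries a severity while its status is not "False"):

$ oc -n openshift-machine-api get machines.machine.openshift.io -o json \
    | jq -r '.items[] | .metadata.name as $m | (.status.conditions // [])[]
        | select(.severity != null and .status != "False")
        | "\($m)\t\(.type)\tstatus=\(.status)\tseverity=\(.severity)"'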

Comment 25 Joel Speed 2021-04-06 10:32:31 UTC
This looks wrong; we will need to investigate why the update isn't changing the status to false when the instance is not found on the provider.

Comment 27 Joel Speed 2021-04-19 09:26:26 UTC
We need to revendor before this fix is picked up

Comment 30 W. Trevor King 2021-05-19 17:00:26 UTC
The bug won't be attached to an errata unless ART's sweeper moves it from MODIFIED to ON_QA.

Comment 32 sunzhaohua 2021-05-20 04:21:23 UTC
Verified
clusterversion: 4.8.0-0.nightly-2021-05-19-123944

Deleted the instance from the GUI; condition.status=false.

  conditions:
  - lastTransitionTime: "2021-05-20T04:15:25Z"
    message: Instance not found on provider
    reason: InstanceMissing
    severity: Warning
    status: "False"
    type: InstanceExists
  errorMessage: Can't find created instance.
  lastUpdated: "2021-05-20T04:15:26Z"
  nodeRef:
    kind: Node
    name: ip-10-0-204-158.us-east-2.compute.internal
    uid: 0ef18f3d-ee47-46cd-86c7-b7a3c6f59549
  phase: Failed
  providerStatus:
    conditions:
    - lastProbeTime: "2021-05-20T04:05:47Z"
      lastTransitionTime: "2021-05-20T04:05:47Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-03f0444046da27f4e
    instanceState: Unknown

Comment 35 errata-xmlrpc 2021-07-27 22:35:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

