Description of problem:
I attempted to create a multi-node spoke cluster following the AgentClusterInstall example here: https://github.com/openshift/assisted-service/blob/master/docs/crds/agentClusterInstall.yaml I also included two worker nodes in the installation, but the installation never began; the cluster simply remained in the "ready for installation" state.

How reproducible: 100%

Steps to Reproduce:
1. Create a cluster using the AgentClusterInstall from the link above
2. Include two additional nodes annotated with the worker role

Actual results: Cluster remains in "ready for installation" but the installation never starts

Expected results: Either the cluster installation begins, or validations/events are generated explaining why it has not
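For context, this is roughly what the relevant stanza of the AgentClusterInstall looked like — a sketch only; the resource name and namespace are placeholders, and as I understand the CRD, `provisionRequirements` is what controls how many agents must be bound before installation starts:

```shell
# Sketch of the AgentClusterInstall used (names/namespace are placeholders).
# Only 3 controlPlaneAgents were requested; the 2 extra workers were
# registered anyway, which is what triggers the behavior in this report.
oc apply -f - <<'EOF'
apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: ipv4-multinode-agent-cluster-install
  namespace: assisted-installer
spec:
  provisionRequirements:
    controlPlaneAgents: 3
    # workerAgents omitted; I believe it is treated as 0 in that case
EOF
```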
Trey, would it be possible to have more information here? It's unclear to me what really happened.
- What did the resources look like?
- Are there any logs we can look at?
- Any other information about the cluster that could help us understand what happened?
Hey Flavio, here are the details. Following the example on GitHub, I created a cluster specifying only 3 controlPlaneAgents, but I also created and registered 2 workers. Both the AgentClusterInstall validations and the host validations are passing:

# oc get agentclusterinstall/ipv4-multinode-agent-cluster-install -o json | jq -r '.status.conditions'
[
  {"lastProbeTime": "2021-06-02T13:14:38Z", "lastTransitionTime": "2021-06-02T13:14:38Z", "message": "The Spec has been successfully applied", "reason": "SyncOK", "status": "True", "type": "SpecSynced"},
  {"lastProbeTime": "2021-06-02T13:24:09Z", "lastTransitionTime": "2021-06-02T13:24:09Z", "message": "The cluster's validations are passing", "reason": "ValidationsPassing", "status": "True", "type": "Validated"},
  {"lastProbeTime": "2021-06-02T13:24:09Z", "lastTransitionTime": "2021-06-02T13:24:09Z", "message": "The cluster is ready to begin the installation", "reason": "ClusterIsReady", "status": "True", "type": "RequirementsMet"},
  {"lastProbeTime": "2021-06-02T13:14:38Z", "lastTransitionTime": "2021-06-02T13:14:38Z", "message": "The installation has not yet started", "reason": "InstallationNotStarted", "status": "False", "type": "Completed"},
  {"lastProbeTime": "2021-06-02T13:14:38Z", "lastTransitionTime": "2021-06-02T13:14:38Z", "message": "The installation has not failed", "reason": "InstallationNotFailed", "status": "False", "type": "Failed"},
  {"lastProbeTime": "2021-06-02T13:14:38Z", "lastTransitionTime": "2021-06-02T13:14:38Z", "message": "The installation is waiting to start or in progress", "reason": "InstallationNotStopped", "status": "False", "type": "Stopped"}
]

# oc get agent -o json | jq -r '.items[].status.conditions'
[
  {"lastTransitionTime": "2021-06-02T13:22:37Z", "message": "The Spec has been successfully applied", "reason": "SyncOK", "status": "True", "type": "SpecSynced"},
  {"lastTransitionTime": "2021-06-02T13:22:37Z", "message": "The agent's connection to the installation service is unimpaired", "reason": "AgentIsConnected", "status": "True", "type": "Connected"},
  {"lastTransitionTime": "2021-06-02T13:23:51Z", "message": "The agent is ready to begin the installation", "reason": "AgentIsReady", "status": "True", "type": "ReadyForInstallation"},
  {"lastTransitionTime": "2021-06-02T13:23:51Z", "message": "The agent's validations are passing", "reason": "ValidationsPassing", "status": "True", "type": "Validated"},
  {"lastTransitionTime": "2021-06-02T13:22:37Z", "message": "The installation has not yet started", "reason": "InstallationNotStarted", "status": "False", "type": "Installed"}
]
[
  {"lastTransitionTime": "2021-06-02T13:22:35Z", "message": "The Spec has been successfully applied", "reason": "SyncOK", "status": "True", "type": "SpecSynced"},
  {"lastTransitionTime": "2021-06-02T13:22:35Z", "message": "The agent's connection to the installation service is unimpaired", "reason": "AgentIsConnected", "status": "True", "type": "Connected"},
  {"lastTransitionTime": "2021-06-02T13:23:51Z", "message": "The agent is ready to begin the installation", "reason": "AgentIsReady", "status": "True", "type": "ReadyForInstallation"},
  {"lastTransitionTime": "2021-06-02T13:23:51Z", "message": "The agent's validations are passing", "reason": "ValidationsPassing", "status": "True", "type": "Validated"},
  {"lastTransitionTime": "2021-06-02T13:22:35Z", "message": "The installation has not yet started", "reason": "InstallationNotStarted", "status": "False", "type": "Installed"}
]
[
  {"lastTransitionTime": "2021-06-02T13:22:34Z", "message": "The Spec has been successfully applied", "reason": "SyncOK", "status": "True", "type": "SpecSynced"},
  {"lastTransitionTime": "2021-06-02T13:22:34Z", "message": "The agent's connection to the installation service is unimpaired", "reason": "AgentIsConnected", "status": "True", "type": "Connected"},
  {"lastTransitionTime": "2021-06-02T13:23:51Z", "message": "The agent is ready to begin the installation", "reason": "AgentIsReady", "status": "True", "type": "ReadyForInstallation"},
  {"lastTransitionTime": "2021-06-02T13:23:51Z", "message": "The agent's validations are passing", "reason": "ValidationsPassing", "status": "True", "type": "Validated"},
  {"lastTransitionTime": "2021-06-02T13:22:34Z", "message": "The installation has not yet started", "reason": "InstallationNotStarted", "status": "False", "type": "Installed"}
]
[
  {"lastTransitionTime": "2021-06-02T13:22:40Z", "message": "The Spec has been successfully applied", "reason": "SyncOK", "status": "True", "type": "SpecSynced"},
  {"lastTransitionTime": "2021-06-02T13:22:40Z", "message": "The agent's connection to the installation service is unimpaired", "reason": "AgentIsConnected", "status": "True", "type": "Connected"},
  {"lastTransitionTime": "2021-06-02T13:24:07Z", "message": "The agent is ready to begin the installation", "reason": "AgentIsReady", "status": "True", "type": "ReadyForInstallation"},
  {"lastTransitionTime": "2021-06-02T13:24:07Z", "message": "The agent's validations are passing", "reason": "ValidationsPassing", "status": "True", "type": "Validated"},
  {"lastTransitionTime": "2021-06-02T13:22:40Z", "message": "The installation has not yet started", "reason": "InstallationNotStarted", "status": "False", "type": "Installed"}
]
[
  {"lastTransitionTime": "2021-06-02T13:22:44Z", "message": "The Spec has been successfully applied", "reason": "SyncOK", "status": "True", "type": "SpecSynced"},
  {"lastTransitionTime": "2021-06-02T13:22:44Z", "message": "The agent's connection to the installation service is unimpaired", "reason": "AgentIsConnected", "status": "True", "type": "Connected"},
  {"lastTransitionTime": "2021-06-02T13:24:07Z", "message": "The agent is ready to begin the installation", "reason": "AgentIsReady", "status": "True", "type": "ReadyForInstallation"},
  {"lastTransitionTime": "2021-06-02T13:24:07Z", "message": "The agent's validations are passing", "reason": "ValidationsPassing", "status": "True", "type": "Validated"},
  {"lastTransitionTime": "2021-06-02T13:22:44Z", "message": "The installation has not yet started", "reason": "InstallationNotStarted", "status": "False", "type": "Installed"}
]

However, the installation never begins and nothing indicates why. IMO this is confusing and difficult for end users to troubleshoot.
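One way to surface the mismatch from the CLI today (a hypothetical diagnostic one-liner, not something the operator emits) is to compare the number of registered Agent resources against the total required by the spec. Sketched here against canned JSON standing in for the live `oc get` output:

```shell
# Canned stand-ins for `oc get agent -o json` (5 agents registered) and the
# AgentClusterInstall spec (3 masters + 0 workers required). In my cluster
# these counts differ, which matches why installation never starts.
agents='{"items":[{},{},{},{},{}]}'
spec='{"spec":{"provisionRequirements":{"controlPlaneAgents":3,"workerAgents":0}}}'
registered=$(echo "$agents" | jq '.items | length')
required=$(echo "$spec" | jq '.spec.provisionRequirements | .controlPlaneAgents + (.workerAgents // 0)')
echo "registered=$registered required=$required"
# prints: registered=5 required=3
```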
Handing this over to Richard since he's looked into day 2 a bit more.
I think during the initial cluster install, only 3 masters can be specified. Workers can't be added to the cluster until the control plane is up. After the cluster has been installed, the agentclusterinstall will show this status: 'The installation has completed: cluster is adding hosts to existing'. At which point, you can add the worker nodes by creating their BMH.
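For reference, a day-2 worker addition along those lines might look like the following. This is a sketch under assumptions: the `infraenvs.agent-install.openshift.io` label, names, namespace, and BMC details are all placeholders for illustration, not taken from this report.

```shell
# Hypothetical day-2 sketch: once the control plane is up, add a worker by
# creating its BareMetalHost. All names and addresses below are placeholders.
oc apply -f - <<'EOF'
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0
  namespace: assisted-installer
  labels:
    infraenvs.agent-install.openshift.io: my-infraenv
spec:
  online: true
  bootMACAddress: 00:00:00:00:00:00
  bmc:
    address: redfish-virtualmedia://192.0.2.1/redfish/v1/Systems/1
    credentialsName: worker-0-bmc-secret
EOF
```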
That is understandable. However, in my environment, if I add the workers before the installation begins, it never starts and there are no messages indicating why; the conditions I pasted above make it seem like the installation should have started. I am wondering if there is a clearer way to indicate why the installation hasn't begun in that situation.
It seems the installation will start only if the number of registered agents exactly matches the required number: https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/clusterdeployments_controller.go#L497 This should be reflected in the conditions. Currently we only cover the case where the number is lower: https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/clusterdeployments_controller.go#L1196 We need to decide what the user should do in this case. Delete agents? Update the required number of agents? @mfilanov FYI
I remember we had a conversation about this with @mhrivnak, but I don't remember the conclusion. On one hand, we can say that if the user adds extra workers, that's fine by us, and we can treat the numbers in the spec as a "minimal requirement". On the other hand, if the user specified an exact number of masters/workers, then we need to give them an indication that something is wrong. So @mhrivnak, how should we interpret the spec in this case: exact match or minimal requirement? Another point to pay attention to: if it's a minimal requirement, then we need to differentiate between SNO and multi-node clusters.
If the controller is deciding not to proceed with installation, a reason should always be provided in the status. The number of control plane hosts assigned to the cluster should exactly match the number specified, since the control plane itself must have a specific size. For workers, I could see it either way. I lean toward: if there are *at least* as many hosts assigned as are specified in the cluster requirements, then proceed with installing all of them. Someone or something assigned each of those hosts to the cluster, so why not include them all?
Enabling the *at least* approach can lead to a problematic outcome. Consider the following scenario: the spec requires 3 masters and 2 workers; 7 hosts are registered, but only 5 are approved. In this case, the install would proceed and install the non-approved hosts as well. I suggest that for 4.8 we keep the strict approach, while adding a message to the condition saying that too many hosts are registered. The user can then either increase the spec or delete agents.
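The registered-vs-approved gap in that scenario is easy to check from the CLI. A sketch against canned JSON standing in for `oc get agent -o json` (the `.spec.approved` field is the Agent CRD's approval gate, as I understand it):

```shell
# 7 registered agents, only 5 approved -- the scenario described above.
agents='{"items":[
  {"spec":{"approved":true}},{"spec":{"approved":true}},
  {"spec":{"approved":true}},{"spec":{"approved":true}},
  {"spec":{"approved":true}},{"spec":{"approved":false}},
  {"spec":{"approved":false}}]}'
registered=$(echo "$agents" | jq '.items | length')
approved=$(echo "$agents" | jq '[.items[] | select(.spec.approved == true)] | length')
echo "registered=$registered approved=$approved"
# prints: registered=7 approved=5
```

With the *at least* interpretation and no approval check, the two unapproved hosts would be swept into the install, which is the risk being raised here.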
I assume that we will never provision an unapproved host. Approval is a security gate to prevent unwelcome actors on the network from joining a cluster. The service should ignore unapproved hosts that were assigned to a cluster, and/or prevent assignment (via an admission webhook) of unapproved hosts.
That sounds like a big change for 4.8. The backend isn't aware of whether hosts are approved or not; hosts just register to a cluster and have states. In addition, we don't plan to add webhooks in 4.8. We could always disable hosts that are not approved, but I'm not sure whether that would complicate things with late binding, and we also don't let users enable/disable hosts.
Which change is too big for 4.8? Are you saying the code as-is will provision unapproved hosts?
No, because we have exact-match validations that check that all the hosts are valid and approved. But if we decide to install extra hosts, then the condition may need to say that there are extra hosts that are not approved, and the user needs to approve or remove them.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759