Bug 1964471 - [master] Confusing behavior when multi-node spoke workers present when only controlPlaneAgents specified
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Fred Rolland
QA Contact: bjacot
URL:
Whiteboard: AI-Team-Hive KNI-EDGE-4.8
Depends On:
Blocks: 1975404
 
Reported: 2021-05-25 14:49 UTC by Trey West
Modified: 2021-10-18 17:32 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1975404 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:31:44 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/assisted-service pull 2089 (open): Bug 1964471: Add AdditionalAgent Reason to condition (last updated 2021-06-24 15:09:41 UTC)
Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:32:03 UTC)

Internal Links: 1975404

Description Trey West 2021-05-25 14:49:02 UTC
Description of problem:

I attempted to create a multi-node spoke cluster following the AgentClusterInstall example here:

https://github.com/openshift/assisted-service/blob/master/docs/crds/agentClusterInstall.yaml

I also included two worker nodes in the installation, but the installation never began; the cluster just remained in the ready-for-installation state.

How reproducible:
100%


Steps to Reproduce:
1. Create a cluster using the AgentClusterInstall from the link above
2. Include two additional nodes annotated with the worker role

Actual results:

Cluster remains in the ready-for-installation state, but the installation never starts


Expected results:

Either the cluster installation begins, or validations/events are generated explaining why it has not

Comment 1 Flavio Percoco 2021-06-01 09:51:27 UTC
Trey, would it be possible to have more information here? It's unclear to me what really happened.

- What did the resources look like?
- Are there any logs we can look at?
- Any other information about the cluster that could help us understand what happened?

Comment 2 Trey West 2021-06-02 13:35:33 UTC
Hey Flavio, here are the details.

Following the example on GitHub, I created a cluster specifying only 3 controlPlaneAgents; however, I also created and registered 2 worker agents. The AgentClusterInstall validations and the host validations are passing:

# oc get agentclusterinstall/ipv4-multinode-agent-cluster-install -o json | jq -r '.status.conditions'
[
  {
    "lastProbeTime": "2021-06-02T13:14:38Z",
    "lastTransitionTime": "2021-06-02T13:14:38Z",
    "message": "The Spec has been successfully applied",
    "reason": "SyncOK",
    "status": "True",
    "type": "SpecSynced"
  },
  {
    "lastProbeTime": "2021-06-02T13:24:09Z",
    "lastTransitionTime": "2021-06-02T13:24:09Z",
    "message": "The cluster's validations are passing",
    "reason": "ValidationsPassing",
    "status": "True",
    "type": "Validated"
  },
  {
    "lastProbeTime": "2021-06-02T13:24:09Z",
    "lastTransitionTime": "2021-06-02T13:24:09Z",
    "message": "The cluster is ready to begin the installation",
    "reason": "ClusterIsReady",
    "status": "True",
    "type": "RequirementsMet"
  },
  {
    "lastProbeTime": "2021-06-02T13:14:38Z",
    "lastTransitionTime": "2021-06-02T13:14:38Z",
    "message": "The installation has not yet started",
    "reason": "InstallationNotStarted",
    "status": "False",
    "type": "Completed"
  },
  {
    "lastProbeTime": "2021-06-02T13:14:38Z",
    "lastTransitionTime": "2021-06-02T13:14:38Z",
    "message": "The installation has not failed",
    "reason": "InstallationNotFailed",
    "status": "False",
    "type": "Failed"
  },
  {
    "lastProbeTime": "2021-06-02T13:14:38Z",
    "lastTransitionTime": "2021-06-02T13:14:38Z",
    "message": "The installation is waiting to start or in progress",
    "reason": "InstallationNotStopped",
    "status": "False",
    "type": "Stopped"
  }
]

# oc get agent -o json | jq -r '.items[].status.conditions'
[
  {
    "lastTransitionTime": "2021-06-02T13:22:37Z",
    "message": "The Spec has been successfully applied",
    "reason": "SyncOK",
    "status": "True",
    "type": "SpecSynced"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:37Z",
    "message": "The agent's connection to the installation service is unimpaired",
    "reason": "AgentIsConnected",
    "status": "True",
    "type": "Connected"
  },
  {
    "lastTransitionTime": "2021-06-02T13:23:51Z",
    "message": "The agent is ready to begin the installation",
    "reason": "AgentIsReady",
    "status": "True",
    "type": "ReadyForInstallation"
  },
  {
    "lastTransitionTime": "2021-06-02T13:23:51Z",
    "message": "The agent's validations are passing",
    "reason": "ValidationsPassing",
    "status": "True",
    "type": "Validated"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:37Z",
    "message": "The installation has not yet started",
    "reason": "InstallationNotStarted",
    "status": "False",
    "type": "Installed"
  }
]
[
  {
    "lastTransitionTime": "2021-06-02T13:22:35Z",
    "message": "The Spec has been successfully applied",
    "reason": "SyncOK",
    "status": "True",
    "type": "SpecSynced"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:35Z",
    "message": "The agent's connection to the installation service is unimpaired",
    "reason": "AgentIsConnected",
    "status": "True",
    "type": "Connected"
  },
  {
    "lastTransitionTime": "2021-06-02T13:23:51Z",
    "message": "The agent is ready to begin the installation",
    "reason": "AgentIsReady",
    "status": "True",
    "type": "ReadyForInstallation"
  },
  {
    "lastTransitionTime": "2021-06-02T13:23:51Z",
    "message": "The agent's validations are passing",
    "reason": "ValidationsPassing",
    "status": "True",
    "type": "Validated"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:35Z",
    "message": "The installation has not yet started",
    "reason": "InstallationNotStarted",
    "status": "False",
    "type": "Installed"
  }
]
[
  {
    "lastTransitionTime": "2021-06-02T13:22:34Z",
    "message": "The Spec has been successfully applied",
    "reason": "SyncOK",
    "status": "True",
    "type": "SpecSynced"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:34Z",
    "message": "The agent's connection to the installation service is unimpaired",
    "reason": "AgentIsConnected",
    "status": "True",
    "type": "Connected"
  },
  {
    "lastTransitionTime": "2021-06-02T13:23:51Z",
    "message": "The agent is ready to begin the installation",
    "reason": "AgentIsReady",
    "status": "True",
    "type": "ReadyForInstallation"
  },
  {
    "lastTransitionTime": "2021-06-02T13:23:51Z",
    "message": "The agent's validations are passing",
    "reason": "ValidationsPassing",
    "status": "True",
    "type": "Validated"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:34Z",
    "message": "The installation has not yet started",
    "reason": "InstallationNotStarted",
    "status": "False",
    "type": "Installed"
  }
]
[
  {
    "lastTransitionTime": "2021-06-02T13:22:40Z",
    "message": "The Spec has been successfully applied",
    "reason": "SyncOK",
    "status": "True",
    "type": "SpecSynced"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:40Z",
    "message": "The agent's connection to the installation service is unimpaired",
    "reason": "AgentIsConnected",
    "status": "True",
    "type": "Connected"
  },
  {
    "lastTransitionTime": "2021-06-02T13:24:07Z",
    "message": "The agent is ready to begin the installation",
    "reason": "AgentIsReady",
    "status": "True",
    "type": "ReadyForInstallation"
  },
  {
    "lastTransitionTime": "2021-06-02T13:24:07Z",
    "message": "The agent's validations are passing",
    "reason": "ValidationsPassing",
    "status": "True",
    "type": "Validated"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:40Z",
    "message": "The installation has not yet started",
    "reason": "InstallationNotStarted",
    "status": "False",
    "type": "Installed"
  }
]
[
  {
    "lastTransitionTime": "2021-06-02T13:22:44Z",
    "message": "The Spec has been successfully applied",
    "reason": "SyncOK",
    "status": "True",
    "type": "SpecSynced"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:44Z",
    "message": "The agent's connection to the installation service is unimpaired",
    "reason": "AgentIsConnected",
    "status": "True",
    "type": "Connected"
  },
  {
    "lastTransitionTime": "2021-06-02T13:24:07Z",
    "message": "The agent is ready to begin the installation",
    "reason": "AgentIsReady",
    "status": "True",
    "type": "ReadyForInstallation"
  },
  {
    "lastTransitionTime": "2021-06-02T13:24:07Z",
    "message": "The agent's validations are passing",
    "reason": "ValidationsPassing",
    "status": "True",
    "type": "Validated"
  },
  {
    "lastTransitionTime": "2021-06-02T13:22:44Z",
    "message": "The installation has not yet started",
    "reason": "InstallationNotStarted",
    "status": "False",
    "type": "Installed"
  }
]


However, the installation never begins and nothing indicates why. IMO this may be confusing and difficult for end users to troubleshoot.

Comment 3 Flavio Percoco 2021-06-17 15:43:58 UTC
Handing this over to Richard since he's looked into day 2 a bit more.

Comment 4 Richard Su 2021-06-21 16:42:07 UTC
I think that during the initial cluster install, only the 3 masters can be specified. Workers can't be added to the cluster until the control plane is up. After the cluster has been installed, the agentclusterinstall will show this status: 'The installation has completed: cluster is adding hosts to existing'. At that point you can add the worker nodes by creating their BareMetalHost (BMH) resources.

Comment 5 Trey West 2021-06-21 17:24:35 UTC
That is understandable. However, in my environment, if I add the workers before the installation begins, it never begins, there are no messages indicating why, and the conditions I pasted above make it seem like the installation should have started. I am wondering if there is a clearer way to indicate why the installation hasn't begun in that situation.

Comment 6 Fred Rolland 2021-06-22 06:16:36 UTC
It seems that the installation starts only when the number of registered agents exactly matches the required number:

https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/clusterdeployments_controller.go#L497

This should be reflected in the conditions. Currently we only report the case where the number is lower:
https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/clusterdeployments_controller.go#L1196
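
To make that gap concrete, here is a minimal sketch (assumed names, not the actual clusterdeployments_controller.go code) of a RequirementsMet reason check that reports the "too many agents" case alongside the existing "too few" case, in the spirit of the attached PR ("Add AdditionalAgent Reason to condition"):

package main

import "fmt"

// requirementsMetReason is an illustrative stand-in for the condition logic;
// the reason strings other than "ClusterIsReady" are hypothetical.
func requirementsMetReason(registeredAgents, expectedAgents int) string {
	switch {
	case registeredAgents < expectedAgents:
		// Today only this "not enough agents" case is reported.
		return "InsufficientAgents"
	case registeredAgents > expectedAgents:
		// The missing case: extra agents block the exact-match check,
		// so the condition should say so.
		return "AdditionalAgents"
	default:
		return "ClusterIsReady"
	}
}

func main() {
	// The scenario from this bug: the spec asks for 3 control-plane agents,
	// but 3 masters + 2 workers = 5 agents are registered.
	fmt.Println(requirementsMetReason(5, 3)) // prints "AdditionalAgents"
}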

We need to decide what the user should do in this case: delete agents, or update the required number of agents?

@mfilanov FYI

Comment 7 Michael Filanov 2021-06-22 10:59:06 UTC
I remember that we had a conversation about this with @mhrivnak, but I don't remember the conclusion.
On one hand, we can say that if the user adds extra workers then it's fine by us, and we treat the numbers in the spec as a "minimal requirement".
On the other hand, if the user specified an exact number of masters/workers, then we need to give an indication that something is wrong.

So @mhrivnak, how should we interpret the spec in this case: exact match or minimal requirement?

Another point to pay attention to: if it's a minimum requirement, then we need to differentiate between SNO and multi-node clusters.

Comment 9 Michael Hrivnak 2021-06-23 12:56:01 UTC
If the controller is deciding not to proceed with installation, a reason should always be provided in the status.

The number of control plane hosts assigned to the cluster should exactly match the number specified, since the control plane itself must have a specific size.

For workers, I could see it either way. I lean toward: if there are *at least* as many hosts assigned as are specified in the cluster requirements, then proceed with installing all of them. Someone or something assigned each of those hosts to the cluster, so why not include them all?

Comment 10 Fred Rolland 2021-06-24 13:02:41 UTC
Enabling the *at least* approach can lead to a problematic outcome.
Consider the following scenario:
Requirement: 3 masters, 2 workers.
7 hosts registered, only 5 hosts approved.

In this case, the install would proceed and also install the non-approved hosts.

I suggest that for 4.8 we keep the strict approach, while adding a message to the condition that too many hosts are registered. The user can then either increase the numbers in the spec or delete agents.
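
As a rough, self-contained illustration of that concern (hypothetical host names and helper, not service code), an "at least" rule that only counts registered agents would start the install and sweep in the unapproved hosts:

package main

import "fmt"

type agent struct {
	name     string
	approved bool
}

// startInstallAtLeast models the "at least" rule discussed above: proceed as
// soon as the number of registered agents reaches the requirement, and
// install all of them, never consulting the approval flag.
func startInstallAtLeast(agents []agent, required int) (bool, []string) {
	if len(agents) < required {
		return false, nil
	}
	var installed []string
	for _, a := range agents {
		installed = append(installed, a.name) // approval is not checked here
	}
	return true, installed
}

func main() {
	// Scenario from this comment: 3 masters + 2 workers required,
	// 7 hosts registered, only 5 of them approved.
	agents := []agent{
		{"master-0", true}, {"master-1", true}, {"master-2", true},
		{"worker-0", true}, {"worker-1", true},
		{"worker-2", false}, {"worker-3", false}, // never approved
	}
	ok, hosts := startInstallAtLeast(agents, 5)
	fmt.Println(ok, hosts) // true, and the list includes the two unapproved workers
}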

Comment 11 Michael Hrivnak 2021-06-24 15:17:43 UTC
I assume that we will never provision an unapproved host. Approval is a security gate to prevent unwelcome actors on the network from joining a cluster. The service should ignore unapproved hosts that were assigned to a cluster, and/or prevent assignment (via an admission webhook) of unapproved hosts.

Comment 12 Michael Filanov 2021-06-24 15:30:04 UTC
That sounds like a big change for 4.8; the backend isn't aware of whether hosts are approved or not, hosts just register to a cluster and have states. In addition, we don't plan to add webhooks in 4.8.
We could always disable hosts that are not approved, but I'm not sure whether that would complicate things with late binding; in addition, we don't let users enable/disable hosts.

Comment 13 Michael Hrivnak 2021-06-24 15:41:01 UTC
Which change is too big for 4.8?

Are you saying the code as-is will provision unapproved hosts?

Comment 14 Michael Filanov 2021-06-24 15:52:26 UTC
No because we have exact match validations that check that all the hosts are valid and approved. 
But if we decide to install extra hosts, then the condition may need to say that there are extra hosts that are not approved and that the user needs to approve or remove them.

Comment 19 errata-xmlrpc 2021-10-18 17:31:44 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

