Created attachment 1673272 [details]
Description of problem:
Image used - quay.io/openshift-release-dev/ocp-release:4.4.0-rc.2-x86_64
The install-config.yaml file specifies 3 master and 2 worker hosts.
The installation script finishes without errors and the logs show no errors, yet one of the nodes is not deployed.
Steps to Reproduce:
1. Provision hosts
2. Generate install-config.yaml with 3 masters and 2 workers
3. Run openshift-baremetal-install
The installation finished without errors, but a sanity check on the cluster shows:
[kni@provisionhost-0 ~]$ oc get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
master-0.ocp-edge-cluster.qe.lab.redhat.com   Ready    master   40h   v1.17.1
master-1.ocp-edge-cluster.qe.lab.redhat.com   Ready    master   40h   v1.17.1
master-2.ocp-edge-cluster.qe.lab.redhat.com   Ready    master   40h   v1.17.1
worker-1.ocp-edge-cluster.qe.lab.redhat.com   Ready    worker   39h   v1.17.1
[kni@provisionhost-0 ~]$ oc get baremetalhost -n openshift-machine-api
NAME                 STATUS   PROVISIONING STATUS      CONSUMER                          BMC                                                                                             HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ocp-edge-cluster-master-0         redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/809abc4e-efd2-4958-9025-58d4d07007a6                      true
openshift-master-1   OK       externally provisioned   ocp-edge-cluster-master-1         redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/bb1d1da7-48d6-46f3-98a5-9668603e45c0                      true
openshift-master-2   OK       externally provisioned   ocp-edge-cluster-master-2         redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/4ce82172-547a-4dc5-8ade-50cd628a66a8                      true
openshift-worker-0   error    inspecting               ocp-edge-cluster-worker-0-5j5bc   redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/59636f73-b26f-41d0-8ec0-14c83670b1df                      true     Introspection timeout
openshift-worker-1   OK       provisioned              ocp-edge-cluster-worker-0-sbmmd   redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/c04a6eb3-eb56-460f-bb55-ba47f7c863d8   unknown            true
Expected results:
The installation should fail with an informative error, such as "Introspection timeout".
Created attachment 1673273 [details]
install-config.yaml without ssh keys section
It looks like the installer should actually error if the number of requested worker replicas in the install-config wasn't met, but it's not doing that.
(In reply to Stephen Benjamin from comment #2)
> It looks like the installer should actually error if the number of requested
> worker replicas in the install-config wasn't met, but it's not doing that.
The machine-api operator should error or go degraded if some of the Machine objects are failing to join the cluster. It's not the job of the installer to check for these.
Secondly, if no operators are unhappy, I don't think we should fail. This can easily be fixed day-2 or surfaced as a day-2 alert.
Moving to machine-api to add the degraded error; if the team feels this is not something they should be doing, feel free to close this as WONTFIX.
Compute errors are generally recoverable, and meeting the expected capacity at a point in time is orthogonal to the product as a whole being able to provide service as expected, so I don't think the installer should fail based on this.
There are multiple transient errors that could prevent a machine from joining the cluster (e.g. cloud API unavailability, cloud rate limits/constraints, temporary connectivity issues, or bad spec input) while the operator is still operating well (i.e. the CRDs and all controllers are functional); therefore we don't consider this to qualify as degraded.
For automated recovery/remediation of unhealthy machines, a MachineHealthCheck (MHC) resource can be created.
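A minimal sketch of such an MHC resource, assuming the standard openshift-machine-api namespace and the usual worker role label; the name, timeouts, and maxUnhealthy threshold are illustrative, not prescriptive:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck          # illustrative name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s                   # remediate if Node stays NotReady this long
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 40%                 # stop remediating if too many workers are unhealthy
```

With this in place, a worker whose Node never becomes Ready is deleted and re-created by its MachineSet rather than staying silently stuck.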
For any scenario like this there are alerts in place out of the box: if any machine is stuck in a phase other than Running (i.e. it has no nodeRef), an alert fires. If there are more pending CSRs than expected for the number of machines in the cluster, an alert fires as well.
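The "machine with no nodeRef" condition described above can also be checked directly. A hedged sketch, assuming `oc` access to the cluster and `jq` installed; the field path follows the machine.openshift.io/v1beta1 Machine schema:

```shell
# List Machines that have no backing Node (status.nodeRef unset).
# These are the machines the out-of-the-box alert would fire on.
oc get machines -n openshift-machine-api -o json \
  | jq -r '.items[] | select(.status.nodeRef == null) | .metadata.name'
```

An empty result means every Machine has joined the cluster as a Node.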
We might consider letting the operator go degraded and blocking upgrades if, for example, a given threshold of machines with no node is met. However, I'd prefer to consider something like this only if there is more evidence of real scenarios where it proves useful in addition to the alerts. Please reopen if still relevant.
Not sure why this was closed as WONTFIX. Usually WONTFIX means that introducing a fix might impact the product in a destructive way, or that the product is at a stage where it's not accepting changes.
I've hit this problem on a newly built system.
While I understand the reason for closure, it would help a lot if we could take another look at this.
Maybe provide a recovery procedure for it.
We should consider the customer who hits this: what do they do next?
Moving to ASSIGNED for a second thought.
[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          156m    Unable to apply 4.5.0-0.nightly-2020-05-26-063751: the cluster operator machine-config has not yet successfully rolled out
[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS   CONSUMER                                BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0                                 ocp-edge-cluster-cdv-0-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/02bd4f6a-ba56-431c-ab25-ab1a6f99c0ba                      true
openshift-machine-api   openshift-master-0-1                                 ocp-edge-cluster-cdv-0-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/2d504635-5e44-402f-862f-666e45586400                      true
openshift-machine-api   openshift-master-0-2                                 ocp-edge-cluster-cdv-0-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/d12a33c2-d60c-48e4-bd50-cc7a9dadfe0e                      true
openshift-machine-api   openshift-worker-0-0   OK       provisioned           ocp-edge-cluster-cdv-0-worker-0-xhz76   redfish://192.168.123.1:8000/redfish/v1/Systems/814285d1-7aab-4c39-a656-a5a7004be254   unknown            true
openshift-machine-api   openshift-worker-0-1   error    inspecting                                                    redfish://192.168.123.1:8000/redfish/v1/Systems/6cff79a3-f8ad-4fff-88cf-5ea0574d2cfb                      false    Introspection timeout
Reproduced with 4.5.0-0.nightly-2020-06-30-104148.
We've previously posted an enhancement for improving the debuggability of problems during the baremetal installation process. One of the things covered there was the lack of any indication of worker deployment failure. It manifests in two ways:
1) The minimum number of workers isn't deployed: this results in confusing error messages from a number of operators that can't run on control-plane hosts, with no explicit error that the workers didn't deploy.
2) Some workers beyond the minimum failed to deploy: the cluster installs successfully as the operators roll out, but there is little indication of a problem to the user.
None of this is really baremetal specific, but on-premise is more likely to have unrecoverable fatal errors (e.g. bad BMC credentials) than cloud-based platforms.
The installer team has suggested we could do something there, possibly warning the user that the requested compute replicas haven't finished rolling out.
I still think something should be happening in MAO or the machine controllers. Alerts should be fired, or the operator should go degraded, but so far no one agrees on that.
For this bug, we could possibly at least try the installer route as a first pass at improving the situation.
>I still think something should be happening in MAO or the machine controllers. Alerts should be fired, or the operator should go degraded, but so far no one agrees on that.
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1816904#c4. This already exists.
The machine API operator triggers alerts for any machine that has no backing node at any time.
In terms of "improving the debuggability", we should probably consider bubbling those errors up as machine providerStatus conditions, if that isn't done yet. We do that for every provider; see https://github.com/openshift/cluster-api-provider-aws/blob/master/pkg/apis/awsprovider/v1beta1/awsproviderstatus_types.go#L40
There are countless transient scenarios for "failure to deploy a worker". This makes it impractical to put a reasonable generic semantic on top of it, and so it makes it pointless to let the overall operator go degraded in such a heterogeneous scenario.
Instead, I believe the right boundary for signalling the details of these errors is the individual machine resource conditions and any lower-level resources, just like we do for any other provider: https://github.com/openshift/cluster-api-provider-aws/blob/master/pkg/apis/awsprovider/v1beta1/awsproviderstatus_types.go#L40
Then, to communicate "failure to deploy a worker", we already trigger alerts any time a machine has no node, regardless of the failure details and regardless of the provider. The details of each failure can then be analysed in the format described above.
Beyond all of the above, regardless of the failure details and based on the overall health of the cluster (e.g. 99 out of 100 machines have no node), we might decide on criteria for a semantic that represents a permanent global issue and choose to let the MAO go degraded in that case. But that's a separately scoped discussion.
Additionally, FWIW, in clouds the errors that happen to be transient and recoverable are mitigated by MachineHealthChecks.
*** Bug 1859644 has been marked as a duplicate of this bug. ***
*** Bug 1883564 has been marked as a duplicate of this bug. ***
I am moving this back to the core installer team for review. We have tried to fix this ourselves through several different avenues; most recently, in https://github.com/openshift/installer/pull/4071, we tried a baremetal-only fix. I do not believe this should be a baremetal-only fix.
Comment #13 explains the failure cases. The UX of this is terrible, especially if the user ends up without enough workers to get routers and other operators running. All the installer tells them is that they have a half dozen failing operators, with no indication why (there aren't enough workers).
If a user requests X replicas for workers and doesn't get them, the installer should make it clear that this didn't happen; however, that viewpoint from the baremetal installer team seems to be in the minority.
If this won't be fixed, then please close out the bug as WONTFIX.
@Amit: I know you were concerned about the state of this bug; see the above comment, as I think we're not going to be able to fix it on the baremetal side.
I am not inclined to introduce into the installer reporting on one aspect of the state of the cluster. This is something the user should be able to ascertain from the cluster itself, not just immediately after installation but over the life of the cluster.
There is certainly room for improvement in the UX when failing workers keep the cluster installation from succeeding. However, any improvements along those lines would be better framed as general UX improvements for failing clusters rather than improvements specific to failing workers. That is beyond the scope of this bug and would be better handled as an RFE.
The situation remains the same. Since no one seems to be able to overcome this bug, we need to at least document that the post-deployment report shows false results.
Moving forward, for testing purposes we will need to stop relying on this report.
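One way to stop relying on the installer's report is to check the cluster directly. A hedged sketch that compares the worker replicas requested in the MachineSets against the number of Ready worker nodes; the namespace and node label are the standard ones, but adjust for custom setups:

```shell
# Sum the requested replicas across all MachineSets.
want=$(oc get machinesets -n openshift-machine-api \
         -o jsonpath='{range .items[*]}{.spec.replicas}{"\n"}{end}' \
       | awk '{s+=$1} END {print s}')
# Count worker nodes that are actually Ready.
got=$(oc get nodes -l node-role.kubernetes.io/worker= --no-headers 2>/dev/null \
       | grep -cw Ready)
# Flag the silent failure this bug describes.
[ "$want" -eq "$got" ] || echo "expected $want workers, found $got Ready"
```

This would have flagged the original report above (2 workers requested, 1 Ready) even though the installer exited cleanly.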
If the failure is in Ignition or before kubelet, I think it should live in the MCO; see
Hi this bug is reoccurring on 4.5, 4.7 as well - do we need separate bug per each version
(In reply to yigal dalal from comment #29)
> Hi this bug is reoccurring on 4.5, 4.7 as well - do we need separate bug per
> each version
No, you do not need separate bugs for each version. This BZ was closed as WONTFIX: it is not reoccurring but rather still occurring.