Bug 1816904

Summary:

installation process exits successfully while deployment of one of the worker nodes in install-config failed

Product:

OpenShift Container Platform

Reporter:

Victor Voronkov <vvoronko>

Component:

Installer

Assignee:

aos-install

Installer sub component:

openshift-installer

QA Contact:

Gaoyun Pei <gpei>

Status:

CLOSED WONTFIX

Severity:

high

Priority:

high

CC:

agarcial, augol, beth.white, cvultur, dhellmann, kiran, lshilin, mstaeble, ohochman, omichael, padillon, racedoro, sasha, stbenjam, walters, wking, ydalal, yprokule

Version:

4.4

Keywords:

AutomationBlocker, Reopened

Hardware:

Unspecified

OS:

Unspecified

Clones:

1859644

Last Closed:

2020-11-12 23:01:52 UTC

Type:

Bug

Bug Blocks:

1859644

Attachments:

Description	Flags
installation log	none
install-config.yaml without ssh keys section	none

Description Victor Voronkov 2020-03-25 03:59:46 UTC

Created attachment 1673272
installation log

Description of problem:
Image used - quay.io/openshift-release-dev/ocp-release:4.4.0-rc.2-x86_64

The install-config.yaml file defines 3 master hosts and 2 worker hosts.
The installation script finishes without errors and the log shows no errors, yet one of the worker nodes is not deployed.
  
How reproducible:

Steps to Reproduce:
1. Provision the hosts.
2. Generate install-config.yaml with 3 masters and 2 workers.
3. Run openshift-baremetal-install.

Actual results:
The installation finished without errors, but a sanity check on the cluster shows:

[kni@provisionhost-0 ~]$ oc get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
master-0.ocp-edge-cluster.qe.lab.redhat.com   Ready    master   40h   v1.17.1
master-1.ocp-edge-cluster.qe.lab.redhat.com   Ready    master   40h   v1.17.1
master-2.ocp-edge-cluster.qe.lab.redhat.com   Ready    master   40h   v1.17.1
worker-1.ocp-edge-cluster.qe.lab.redhat.com   Ready    worker   39h   v1.17.1

[kni@provisionhost-0 ~]$ oc get baremetalhost -n openshift-machine-api
NAME                 STATUS   PROVISIONING STATUS      CONSUMER                          BMC                                                                                               HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ocp-edge-cluster-master-0         redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/809abc4e-efd2-4958-9025-58d4d07007a6                      true     
openshift-master-1   OK       externally provisioned   ocp-edge-cluster-master-1         redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/bb1d1da7-48d6-46f3-98a5-9668603e45c0                      true     
openshift-master-2   OK       externally provisioned   ocp-edge-cluster-master-2         redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/4ce82172-547a-4dc5-8ade-50cd628a66a8                      true     
openshift-worker-0   error    inspecting               ocp-edge-cluster-worker-0-5j5bc   redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/59636f73-b26f-41d0-8ec0-14c83670b1df                      true     Introspection timeout
openshift-worker-1   OK       provisioned              ocp-edge-cluster-worker-0-sbmmd   redfish://[fd2e:6f44:5dd8:c956::1]:8000/redfish/v1/Systems/c04a6eb3-eb56-460f-bb55-ba47f7c863d8   unknown            true     


Expected results:
The installation should fail with an informative error, such as "Introspection timeout".

Comment 1 Victor Voronkov 2020-03-25 04:04:55 UTC

Created attachment 1673273
install-config.yaml without ssh keys section

Comment 2 Stephen Benjamin 2020-03-26 17:49:58 UTC

It looks like the installer should actually error if the number of requested worker replicas in the install-config wasn't met, but it's not doing that.
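
A minimal sketch of the check described here, assuming the requested count is the worker `replicas` value from install-config.yaml and the node list is the parsed output of `oc get nodes -o json`. The function names and the sample data are hypothetical and only mirror this report; this is not installer code.

```python
def is_ready_worker(node):
    """True if the node carries the worker role label and reports Ready=True."""
    labels = node.get("metadata", {}).get("labels", {})
    if "node-role.kubernetes.io/worker" not in labels:
        return False
    conditions = node.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions)


def missing_workers(requested_replicas, node_list):
    """Return how many requested workers are not Ready (never negative)."""
    ready = sum(1 for n in node_list.get("items", []) if is_ready_worker(n))
    return max(0, requested_replicas - ready)


def make_node(name, role):
    """Build a minimal Ready node object (hypothetical sample data only)."""
    return {
        "metadata": {"name": name,
                     "labels": {"node-role.kubernetes.io/" + role: ""}},
        "status": {"conditions": [{"type": "Ready", "status": "True"}]},
    }


# Sample mirroring this report: 2 workers requested, only worker-1 joined.
sample = {"items": [make_node("master-0", "master"),
                    make_node("master-1", "master"),
                    make_node("master-2", "master"),
                    make_node("worker-1", "worker")]}

shortfall = missing_workers(2, sample)
if shortfall:
    print(f"{shortfall} of 2 requested workers did not become Ready")
```

Whether such a check belongs in the installer or elsewhere is exactly what the rest of this bug discusses; the sketch only shows that the comparison itself is simple.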

Comment 3 Abhinav Dahiya 2020-03-26 18:20:18 UTC

(In reply to Stephen Benjamin from comment #2)
> It looks like the installer should actually error if the number of requested
> worker replicas in the install-config wasn't met, but it's not doing that.

The machine-api operator should report an error or go degraded if some of the machine objects fail to join the cluster. It's not the job of the installer to check for these.

Secondly, if no operators are unhappy, I don't think we should fail. This can easily be fixed as a day-2 operation or made available as a day-2 alert.

Moving to machine-api to add the degraded error. If the team feels this is not something they should be doing, feel free to close this as WONTFIX.

Comment 4 Alberto 2020-03-27 08:49:24 UTC

Compute errors are generally recoverable, and meeting the expected capacity at a given point in time is orthogonal to the product as a whole being able to provide service as expected, so I don't think the installer should fail based on this.

There are multiple transient errors that could prevent a machine from joining the cluster, e.g. cloud API unavailability, cloud rate limits/constraints, temporary connectivity issues, a bad spec input... while the operator is still operating well (i.e. CRDs and all controllers are functional), therefore we don't consider this to qualify as degraded.
For automated recovery/remediation of unhealthy machines, a MachineHealthCheck (MHC) resource can be created.

For any scenario like this there are alerts in place out of the box: if any machine is stuck in a phase != Running, i.e. it has no nodeRef, an alert will trigger. If there are more pending CSRs than expected for the number of machines in the cluster, an alert will trigger as well.
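
The "no nodeRef" condition mentioned here can also be checked by hand. A minimal sketch, assuming the parsed output of `oc get machines -n openshift-machine-api -o json`; the function name is hypothetical, and the sample data (including the "Provisioning" phase value and the node name) is made up to resemble this report.

```python
def machines_without_node(machine_list):
    """Return (name, phase) for every Machine that has no status.nodeRef."""
    stuck = []
    for machine in machine_list.get("items", []):
        status = machine.get("status", {})
        if not status.get("nodeRef"):
            name = machine.get("metadata", {}).get("name", "<unnamed>")
            stuck.append((name, status.get("phase", "unknown")))
    return stuck


# Hypothetical sample: one machine backed by a node, one without a node.
sample_machines = {
    "items": [
        {"metadata": {"name": "ocp-edge-cluster-worker-0-sbmmd"},
         "status": {"phase": "Running",
                    "nodeRef": {"kind": "Node",
                                "name": "worker-1.ocp-edge-cluster.qe.lab.redhat.com"}}},
        {"metadata": {"name": "ocp-edge-cluster-worker-0-5j5bc"},
         "status": {"phase": "Provisioning"}},  # phase value is assumed
    ]
}

for name, phase in machines_without_node(sample_machines):
    print(f"{name}: no nodeRef (phase {phase})")
```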


We might consider letting the operator go degraded and block upgrades if, e.g., a given threshold of machines with no node, out of the total, is met. However, I'd prefer to consider something like this only if there is more evidence from real scenarios where this proves to be useful in addition to the alerts. Please reopen if still relevant.

Comment 5 Amit Ugol 2020-03-30 10:35:58 UTC

Not sure why this was closed as WONTFIX. Usually WONTFIX means that introducing a fix might impact the product in a destructive way or that the product is at a stage where it's not accepting changes.

Comment 6 Constantin Vultur 2020-05-26 13:39:03 UTC

I've hit this problem on a newly built system.
While I understand the reason for closure, it would help a lot if we could take another look at this.
Maybe provide a recovery procedure for it.
We should assume that a customer will hit this. What do they do next?

Moving to ASSIGNED for a second thought.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          156m    Unable to apply 4.5.0-0.nightly-2020-05-26-063751: the cluster operator machine-config has not yet successfully rolled out
[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATUS   PROVISIONING STATUS   CONSUMER                                BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-machine-api   openshift-master-0-0                                  ocp-edge-cluster-cdv-0-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/02bd4f6a-ba56-431c-ab25-ab1a6f99c0ba                      true     
openshift-machine-api   openshift-master-0-1                                  ocp-edge-cluster-cdv-0-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/2d504635-5e44-402f-862f-666e45586400                      true     
openshift-machine-api   openshift-master-0-2                                  ocp-edge-cluster-cdv-0-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/d12a33c2-d60c-48e4-bd50-cc7a9dadfe0e                      true     
openshift-machine-api   openshift-worker-0-0   OK       provisioned           ocp-edge-cluster-cdv-0-worker-0-xhz76   redfish://192.168.123.1:8000/redfish/v1/Systems/814285d1-7aab-4c39-a656-a5a7004be254   unknown            true     
openshift-machine-api   openshift-worker-0-1   error    inspecting                                                    redfish://192.168.123.1:8000/redfish/v1/Systems/6cff79a3-f8ad-4fff-88cf-5ea0574d2cfb                      false    Introspection timeout
[kni@provisionhost-0-0 ~]$
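
Any recovery procedure would start by identifying the failing host. A minimal sketch, assuming the parsed output of `oc get bmh -n openshift-machine-api -o json` and that the STATUS, PROVISIONING STATUS and ERROR columns above correspond to the BareMetalHost status fields `operationalStatus`, `provisioning.state` and `errorMessage`. The function name and the sample data are hypothetical.

```python
def hosts_in_error(bmh_list):
    """Return name, provisioning state and error for each failing host."""
    failed = []
    for host in bmh_list.get("items", []):
        status = host.get("status", {})
        message = status.get("errorMessage", "")
        if status.get("operationalStatus") == "error" or message:
            failed.append({
                "name": host.get("metadata", {}).get("name", "<unnamed>"),
                "state": status.get("provisioning", {}).get("state", ""),
                "error": message,
            })
    return failed


# Hypothetical sample shaped like the two worker rows in the output above.
sample_hosts = {
    "items": [
        {"metadata": {"name": "openshift-worker-0-0"},
         "status": {"operationalStatus": "OK",
                    "provisioning": {"state": "provisioned"}}},
        {"metadata": {"name": "openshift-worker-0-1"},
         "status": {"operationalStatus": "error",
                    "provisioning": {"state": "inspecting"},
                    "errorMessage": "Introspection timeout"}},
    ]
}

for host in hosts_in_error(sample_hosts):
    print(f"{host['name']}: {host['state']}: {host['error']}")
```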

Comment 9 Alexander Chuzhoy 2020-06-30 16:15:04 UTC

reproduced with 4.5.0-0.nightly-2020-06-30-104148

Comment 13 Stephen Benjamin 2020-07-15 12:46:43 UTC

We've previously posted an enhancement for improving the debuggability of problems during the baremetal installation process[1]. One of the things covered there was the lack of indication of worker deployment failure. It manifests itself in 2 ways:

1) Minimum number of workers isn't deployed: this results in confusing error messages for a number of operators that can't run on control plane hosts, with no explicit error that workers didn't deploy

2) Some number of workers beyond the minimum failed to deploy: the cluster installs successfully as operators roll out, but there is little indication of a problem to the user.
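
The two cases above can be told apart with a simple comparison. This is only an illustrative sketch: the function name, the returned labels, and the `minimum_needed` parameter (the number of workers the operators need in order to roll out) are hypothetical.

```python
def classify_worker_rollout(requested, ready, minimum_needed):
    """Classify a worker rollout into the failure modes described above."""
    if ready < minimum_needed:
        # Case 1: operators that need workers cannot roll out.
        return "install-fails-without-clear-cause"
    if ready < requested:
        # Case 2: the install succeeds but some requested workers are missing.
        return "install-succeeds-with-missing-workers"
    return "all-workers-deployed"


# The situation in this report: 2 workers requested, 1 Ready,
# and (by assumption) 1 worker is enough for the operators to roll out.
print(classify_worker_rollout(requested=2, ready=1, minimum_needed=1))
```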

None of this is really baremetal specific, but on-premise is more likely to have unrecoverable fatal errors (e.g. bad BMC credentials) than cloud-based platforms.

The installer team has suggested we could do something there, possibly warning a user that the requested compute replicas haven't finished rolling out[2].

I still think something should be happening in MAO or the machine controllers. Alerts should be fired, or the operator should go degraded, but so far no one agrees on that.

For this bug, we could possibly at least try the installer route as a first pass at improving the situation.

[1] https://github.com/openshift/enhancements/pull/328
[2] https://github.com/openshift/enhancements/pull/328#pullrequestreview-412848558

Comment 15 Alberto 2020-07-16 08:36:37 UTC

>I still think something should be happening in MAO or the machine controllers. Alerts should be fired, or the operator should go degraded, but so far no one agrees on that.

Please see https://bugzilla.redhat.com/show_bug.cgi?id=1816904#c4. This already exists.
The machine API operator triggers alerts for any machine that has no backing node at any time.

In terms of "improving the debuggability", we should probably consider bubbling up those errors as machine providerStatus conditions, if that isn't done yet. We do that for every provider, see https://github.com/openshift/cluster-api-provider-aws/blob/master/pkg/apis/awsprovider/v1beta1/awsproviderstatus_types.go#L40

There are countless transient scenarios for "Failure to deploy a worker". This makes it impractical to put a reasonable generic semantic on top of it, and so there is little value in letting the overall operator go degraded in such a heterogeneous scenario.

Instead, I believe the details of these errors should be signaled on the individual machine resource conditions and any lower-level resources, just like we do for any other provider: https://github.com/openshift/cluster-api-provider-aws/blob/master/pkg/apis/awsprovider/v1beta1/awsproviderstatus_types.go#L40

Then, to communicate "Failure to deploy a worker", we already trigger alerts any time a machine has no node, regardless of the failure details and regardless of the provider. The details of each failure can then be analysed in the format described above.

Beyond all the above, regardless of the failure details and based on the overall health of the cluster (e.g. 99 out of 100 machines have no node), we might decide on our criteria for a semantic that represents a permanent global issue and choose to let the MAO go degraded in that case. But that's a separately scoped discussion.
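
To make the "99 out of 100" idea concrete, here is a minimal sketch of such a criterion. The function name and the default threshold of 0.5 are hypothetical; nothing in this bug settles on an actual value or on implementing this at all.

```python
def should_go_degraded(total_machines, machines_without_node, threshold=0.5):
    """True if the share of machines without a node reaches the threshold."""
    if total_machines <= 0:
        return False
    return machines_without_node / total_machines >= threshold


print(should_go_degraded(100, 99))  # the example from the comment above
print(should_go_degraded(5, 1))     # the 5-machine cluster in this report
```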

Additionally, fwiw in clouds the errors that happen to be transient and recoverable are mitigated by Machine Health checks.

Comment 16 Beth White 2020-07-27 16:03:46 UTC

*** Bug 1859644 has been marked as a duplicate of this bug. ***

Comment 20 Yossi Boaron 2020-09-29 16:20:45 UTC

*** Bug 1883564 has been marked as a duplicate of this bug. ***

Comment 23 Stephen Benjamin 2020-10-28 12:59:32 UTC

I am moving this back to the core installer team for review. We have tried several different avenues to fix this ourselves. Most recently, in https://github.com/openshift/installer/pull/4071, we tried a baremetal-only fix. I do not believe this should be a baremetal-only fix.

Comment #13 explains the failure cases. The UX of this is terrible, especially if the user ends up without enough workers to get routers and other operators running. All the installer tells them is that they have a half dozen failing operators, with no indication why (there aren't enough workers).

If a user requests X replicas for workers and they don't get them, the installer should make clear that this didn't happen. However, that viewpoint from the baremetal installer team seems to be a minority view.
 

If this won't be fixed, then please close out the bug as WONTFIX.

Comment 24 Stephen Benjamin 2020-10-28 13:00:32 UTC

@Amit: I know you were concerned about the state of this bug, see above comment as I think we're not going to be able to fix it on the baremetal side.

Comment 26 Matthew Staebler 2020-11-12 23:01:52 UTC

I am not inclined to introduce into the installer reporting on one aspect of the state of the cluster. This is something that the user should be able to ascertain from the cluster, not just immediately after installation but over the life of the cluster.

There is certainly room for improvement in the UX when failing workers keep the cluster installation from succeeding. However, any improvements along those lines would be better as general UX improvements for failing clusters rather than improvements specific to failing workers. That is beyond the scope of this bug and would be better as an RFE.

Comment 27 Amit Ugol 2020-11-15 14:26:53 UTC

The situation remains the same. Since no one seems to be able to overcome this bug, we need to at least document that the post-deployment report shows false results.
Moving forward, for testing purposes we will need to stop relying on this report.

Comment 28 Colin Walters 2020-12-14 23:05:24 UTC

If the failure is in Ignition or before kubelet, I think it should live in the MCO; see

https://github.com/coreos/ignition/issues/585#issuecomment-540252573
https://github.com/openshift/machine-config-operator/issues/1365

Comment 29 yigal dalal 2021-05-31 09:49:01 UTC

Hi, this bug is reoccurring on 4.5 and 4.7 as well. Do we need a separate bug for each version?

Comment 30 Matthew Staebler 2021-05-31 17:29:46 UTC

(In reply to yigal dalal from comment #29)
> Hi, this bug is reoccurring on 4.5 and 4.7 as well. Do we need a separate
> bug for each version?

No, you do not need separate bugs for each version. This BZ was closed as WONTFIX: It is not reoccurring but rather is still occurring.