Bug 1889333

Summary: [CNV][Chaos] Integrate with MachineHealthCheck
Product: OpenShift Container Platform Reporter: Piotr Kliczewski <pkliczew>
Component: assisted-installer Assignee: Piotr Kliczewski <pkliczew>
assisted-installer sub component: assisted-service QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED DEFERRED Docs Contact:
Severity: high    
Priority: medium CC: abeekhof, alazar, aos-bugs, asalkeld, cyosef, danken, ercohen, masayag, mfilanov, rfreiman, rgarcia, ycui, yshnaidm
Version: 4.6   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AI-Team-Projects
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-06 12:13:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1908661    

Description Piotr Kliczewski 2020-10-19 12:23:18 UTC
To make sure chaos scenarios do not affect user workloads, we need to enable MachineHealthCheck by default on freshly installed clusters.

Comment 1 Michael Filanov 2020-10-20 06:48:20 UTC
Not sure what this means. Are you talking about https://github.com/openshift/assisted-service/blob/master/deploy/assisted-service.yaml#L29 ?

Comment 3 Eran Cohen 2020-10-20 08:00:36 UTC
@yshnaidm  I guess we can do it the same way we create the BMH?
alazar, rom if we want to add it we should probably do it during the ignition generation.
Thoughts?

Comment 4 Dan Kenigsberg 2020-10-20 10:07:26 UTC
I think that the problem here is more profound: since the assisted installer is not an IPI, it does not integrate at all with the MachineHealthCheck (MHC). I think this bz should be changed to a request for extension: let assisted-installed clusters integrate with MHC, so that non-responsive nodes can be automatically recycled/restarted.

Comment 5 Moti Asayag 2020-11-25 19:16:40 UTC
(In reply to Eran Cohen from comment #3)
> @yshnaidm  I guess we can do it the same way we create the BMH?
> alazar, rom if we want to add it we should probably do
> it during the ignition generation.
> Thoughts?

In terms of implementation, if the purpose is to add a custom manifest to the cluster, such as the one described here:
https://docs.openshift.com/container-platform/4.5/machine_management/deploying-machine-health-checks.html#machine-health-checks-resource_deploying-machine-health-checks

This can be achieved by using the manifest API to provide the manifest after the cluster is created; it will then be rendered into the ignition file by:
https://github.com/openshift/assisted-service/blob/master/internal/ignition/ignition.go#L141
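A minimal sketch of what such a custom manifest could look like, assuming a MachineHealthCheck that targets worker machines. The field names follow the MachineHealthCheck resource in the linked documentation; the resource name, role, timeouts, and maxUnhealthy value are illustrative assumptions, and the helper functions are hypothetical. Uploading the result through the manifest API is not shown.

```python
import json


def build_mhc_manifest(name="worker-health-check", role="worker",
                       max_unhealthy="40%", timeout="300s"):
    """Return a MachineHealthCheck manifest as a plain dict.

    All default values here are illustrative assumptions, not values
    taken from this bug or from assisted-service.
    """
    return {
        "apiVersion": "machine.openshift.io/v1beta1",
        "kind": "MachineHealthCheck",
        "metadata": {
            "name": name,
            "namespace": "openshift-machine-api",
        },
        "spec": {
            # Select the machines this check watches by their role label.
            "selector": {
                "matchLabels": {
                    "machine.openshift.io/cluster-api-machine-role": role,
                }
            },
            # A node is unhealthy if Ready stays False or Unknown
            # for longer than the timeout.
            "unhealthyConditions": [
                {"type": "Ready", "status": "False", "timeout": timeout},
                {"type": "Ready", "status": "Unknown", "timeout": timeout},
            ],
            # Stop remediating when more than this share of the
            # selected machines is unhealthy.
            "maxUnhealthy": max_unhealthy,
        },
    }


def render_manifest(manifest):
    """Serialize the manifest as JSON, which is also valid YAML."""
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_manifest(build_mhc_manifest()))
```

The printed document could be saved as a manifest file and supplied as a custom manifest. Note that, as discussed later in this bug, an MHC only helps if the Machine API is actually able to provision and destroy nodes.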

Comment 6 yevgeny shnaidman 2020-11-26 06:33:12 UTC
@ercohen why do we need to integrate with Machine Health Check from the start? I mean, why should we create a specific manifest? Shouldn't openshift-installer do it, through some kind of configuration?

Comment 7 Piotr Kliczewski 2020-11-26 08:39:24 UTC
Please take a look at BZ #1889651 comments to have more context about this change.

Comment 13 Andrew Beekhof 2021-01-20 02:12:41 UTC
The lack of a provisioning network isn't specifically an issue, but we do need the Machine API to be functional and able to provision/destroy nodes.

Adding the Lifecycle squad for visibility

Comment 15 Angus Salkeld 2021-01-24 23:23:35 UTC
(In reply to Andrew Beekhof from comment #13)
> The lack of a provisioning network isn't specifically an issue, but we do
> need the Machine API to be functional and able to provision/destroy nodes.
> 
> Adding the Lifecycle squad for visibility

Currently in AI, creating/deleting machine objects has no effect as the bmh entities
have no BMC details or any other ability to provision (they are discovered, but in an unmanaged state).
There is work ahead to enable day 2 provisioning, but this is a while off.

Comment 16 Angus Salkeld 2021-03-22 21:44:22 UTC
Not working directly on AI at the moment. Releasing so someone else can work on it.

Comment 17 Piotr Kliczewski 2021-05-06 12:13:32 UTC
This feature is tracked by https://issues.redhat.com/browse/MGMT-4811 and we have decided to wait on node health check.