Bug 1800780

Summary: improve handling of GCP routes for load balancing purposes
Product: OpenShift Container Platform Reporter: Micah Abbott <miabbott>
Component: Machine Config Operator Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED DUPLICATE QA Contact: Michael Nguyen <mnguyen>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4 CC: adahiya, behoward, cglombek, smilner, vrutkovs, walters
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-29 14:14:59 UTC Type: Bug

Description Micah Abbott 2020-02-07 21:00:13 UTC
In BZ#1769847 it was reported that the control plane was dropping out during upgrades on GCP.

Improvements to the GCP route handling were made in openshift/machine-config-operator#1317, but additional discussion indicated that still more improvements were needed.

Copying some notes from Abhinav:

```
the MCD issues a reboot of the machine. the apiserver container is configured with graceful termination, such that no new connections are allowed and all current work is completed while its health check is marked as failing.
this gives the LB time to react and creates a graceful rolling of apiservers.

now when the reboot is issued, systemd starts shutting down services, and it shuts down the gcp-routes.service. since gcp-routes.service is designed to clean up when it receives stop, it removes the ip route immediately, dropping/closing connections to the apiserver. hence all the work done above to gracefully close connections to the apiserver is not being used here.
```
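
The ordering problem described in these notes can be illustrated with a toy model. This is only a sketch: the `Node` class, its fields, and the step names are hypothetical and do not correspond to any real MCD or RHCOS code. It simply counts how many in-flight connections are cut off depending on whether the route is removed before or after the apiserver finishes draining.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a control-plane node behind the GCP LB."""
    route_present: bool = True
    healthy: bool = True
    open_connections: int = 3
    dropped: int = 0
    log: list = field(default_factory=list)

    def remove_route(self):
        # Removing the route cuts off every connection still in flight.
        self.dropped += self.open_connections
        self.open_connections = 0
        self.route_present = False
        self.log.append("route removed")

    def drain_apiserver(self):
        # Graceful termination: fail the health check, finish current work.
        self.healthy = False
        self.open_connections = 0
        self.log.append("apiserver drained")


def shutdown(node, order):
    """Run the shutdown steps in the given order; return dropped connections."""
    for step in order:
        getattr(node, step)()
    return node.dropped


# Order reported in this bug: the route goes away first.
reported = shutdown(Node(), ["remove_route", "drain_apiserver"])  # 3 dropped
# Order the graceful termination was designed for: drain first.
desired = shutdown(Node(), ["drain_apiserver", "remove_route"])   # 0 dropped
```

In the model, the graceful drain only helps if the route outlives it, which is the point the notes make about `gcp-routes.service` being stopped early.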

One proposal is to move the GCP route handling out of RHCOS itself and into the MCD:

```
>> Hm; one preparatory thing that may help here is moving the route script out of RHCOS and into the MCD.

If the MCD knows what the route is, then the problem domain can be a whole lot simpler:

    burn the gcp-routes.sh and gcp-routes.service in RHCOS
    the MCD sets up only the route it needs

IMHO, that would be the better fix and solves my concerns about the correct route being set up for the service being served by the LB.
```
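
One way to picture "the MCD sets up only the route it needs" is as a reconciliation between the routes currently present and the routes actually required. The sketch below is an assumption for illustration only: the `reconcile` function and the example addresses are hypothetical, and it does not reflect how the MCD or the existing route script is implemented.

```python
def reconcile(current, needed):
    """Return (to_add, to_remove) so that only the needed routes remain.

    Both arguments are iterables of route targets (here, IP strings).
    """
    current, needed = set(current), set(needed)
    to_add = sorted(needed - current)
    to_remove = sorted(current - needed)
    return to_add, to_remove


# Hypothetical example: two routes exist, but only one is needed.
to_add, to_remove = reconcile(
    current={"10.0.0.2", "10.0.0.3"},
    needed={"10.0.0.2"},
)
# to_add == [] and to_remove == ["10.0.0.3"]
```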

With the caveat:

```
you would want RHCOS to be usable without machine-config-daemon running on it, like on the bootstrap host or a new control-plane node.
```

Comment 1 Christian Glombek 2020-02-19 13:29:10 UTC
Related:

PR to move gcp-routes.service into MCO's privileged gcp-routes-controller container: https://github.com/openshift/machine-config-operator/pull/1489

Comment 2 Micah Abbott 2020-02-26 19:04:22 UTC
*** Bug 1782536 has been marked as a duplicate of this bug. ***

Comment 3 Steve Milner 2020-03-03 14:19:36 UTC
As a side note, GCP routes are not really system configuration, upgrade (MCO), or operating system concerns. I understand that this functionality currently resides within the MCO, but it should really live where cloud configuration occurs, or possibly where special cloud workarounds exist (e.g. an agent such as afterburn).