In BZ#1769847 it was reported that the control plane was dropping out during upgrades on GCP. Improvements to the GCP route handling were made in openshift/machine-config-operator#1317, but further discussion indicated that more work was still needed.

Copying some notes from Abhinav:

```
The MCD issues a reboot of the machine. The apiserver container is configured
with graceful termination so that no new connections are accepted and all
current work is completed while its health check is marked as failing. This
gives the LB time to react and allows a graceful rolling of apiservers.

Now, when the reboot is issued, systemd starts shutting down services,
including gcp-routes.service. Since gcp-routes.service is designed to clean up
when it receives a stop, it removes the ip route immediately, dropping/closing
connections to the apiserver. As a result, all the work done above to
gracefully close apiserver connections is lost here.
```

One proposal is to move the GCP route handling out of RHCOS itself and into the MCD:

```
>> Hm; one preparatory thing that may help here is moving the route script out
of RHCOS and into the MCD.

If the MCD knows what the route is, then the problem domain can be a whole lot
simpler:
- remove gcp-routes.sh and gcp-routes.service from RHCOS
- have the MCD set up only the route it needs

IMHO, that would be the better fix, and it addresses my concerns about the
correct route being set up for the service being served by the LB.
```

With the caveat:

```
You would want RHCOS to be usable without the machine-config-daemon running on
it, e.g. on the bootstrap host or a newly provisioned control-plane node.
```
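For illustration, here is a minimal sketch of the pattern described above: the route that lets the node accept traffic for the internal load balancer VIP is added on start and torn down unconditionally on stop, so a systemd-initiated reboot removes it while the apiserver is still draining connections. The VIP value, paths, and exact ip invocations below are assumptions for illustration, not the actual RHCOS gcp-routes.sh.

```
#!/usr/bin/env bash
# Hypothetical sketch of the gcp-routes start/stop behavior discussed above.
# The VIP discovery, paths, and exact ip commands are assumptions.
set -euo pipefail

VIP="${VIP:-10.0.0.2}"   # internal LB address forwarded to this instance (assumed)

case "${1:-start}" in
  start)
    # Accept traffic addressed to the LB VIP locally so the apiserver can serve it.
    ip route add local "${VIP}/32" dev lo table local
    ;;
  stop)
    # Invoked by systemd during shutdown: the route disappears immediately,
    # cutting existing connections before the apiserver's graceful
    # termination has finished.
    ip route del local "${VIP}/32" dev lo table local
    ;;
esac
```

A unit running this as ExecStop would exhibit exactly the race described: systemd stops the service early in shutdown, so the LB-facing route vanishes before the apiserver's drain window closes.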
Related: PR to move gcp-routes.service into MCO's privileged gcp-routes-controller container: https://github.com/openshift/machine-config-operator/pull/1489
*** Bug 1782536 has been marked as a duplicate of this bug. ***
As a side note, GCP route handling is really not system configuration, upgrade logic (MCO), or operating system territory. I understand that this functionality currently resides within the MCO, but it really should live where cloud configuration occurs, or possibly where special cloud workarounds already exist (e.g. an agent such as afterburn).