In BZ#1769847 it was reported that the control plane was dropping out during upgrades on GCP. Improvements to the GCP route handling were made in openshift/machine-config-operator#1317, but further discussion indicated that more work was still needed.

Copying some notes from Abhinav:

```
The MCD issues a reboot of the machine. The apiserver container is configured
with graceful termination so that no new connections are accepted and all
current work is completed while its health check is marked as failing. This
gives the LB time to react and allows a graceful rolling of apiservers.

Now, when the reboot is issued, systemd starts shutting down services,
including gcp-routes.service. Since gcp-routes.service is designed to clean up
when it receives a stop, it removes the ip route immediately, dropping/closing
connections to the apiserver. As a result, all the work done above to
gracefully close apiserver connections is lost here.
```

One proposal is to move the GCP route handling out of RHCOS itself and into the MCD:

```
>> Hm; one preparatory thing that may help here is moving the route script out
of RHCOS and into the MCD.

If the MCD knows what the route is, then the problem domain can be a whole lot
simpler:
- remove gcp-routes.sh and gcp-routes.service from RHCOS
- have the MCD set up only the route it needs

IMHO, that would be the better fix, and it addresses my concerns about the
correct route being set up for the service being served by the LB.
```

With the caveat:

```
You would want RHCOS to be usable without the machine-config-daemon running on
it, e.g. on the bootstrap host or a newly provisioned control-plane node.
```
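For illustration, here is a minimal sketch of the pattern described above: the route that lets the node accept traffic for the internal load balancer VIP is added on start and torn down unconditionally on stop, so a systemd-initiated reboot removes it while the apiserver is still draining connections. The VIP value, paths, and exact ip invocations below are assumptions for illustration, not the actual RHCOS gcp-routes.sh.

```
#!/usr/bin/env bash
# Hypothetical sketch of the gcp-routes start/stop behavior discussed above.
# The VIP discovery, paths, and exact ip commands are assumptions.
set -euo pipefail

VIP="${VIP:-10.0.0.2}"   # internal LB address forwarded to this instance (assumed)

case "${1:-start}" in
  start)
    # Accept traffic addressed to the LB VIP locally so the apiserver can serve it.
    ip route add local "${VIP}/32" dev lo table local
    ;;
  stop)
    # Invoked by systemd during shutdown: the route disappears immediately,
    # cutting existing connections before the apiserver's graceful
    # termination has finished.
    ip route del local "${VIP}/32" dev lo table local
    ;;
esac
```

A unit running this as ExecStop would exhibit exactly the race described: systemd stops the service early in shutdown, so the LB-facing route vanishes before the apiserver's drain window closes.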
Related: PR to move gcp-routes.service into MCO's privileged gcp-routes-controller container: https://github.com/openshift/machine-config-operator/pull/1489
*** Bug 1782536 has been marked as a duplicate of this bug. ***
As a side note, GCP route handling is really not system configuration, upgrade logic (MCO), or operating system territory. I understand that this functionality currently resides within the MCO, but it really should live where cloud configuration occurs, or possibly where special cloud workarounds already exist (e.g. an agent such as afterburn).