On GCP, the API server drops out during upgrade. It should not. We need to fix this and then backport the fix to 4.2.
The cause could be LB timeouts that are too short, something related to blackholing, or the shutdown sequence, but upgrades have to work.
I tried to put together a timeline:
The first master rebooted:
23:09:55 (1) "" Reboot Node will reboot into config rendered-master-3af9469f08d2a9de67c09c6e43554356
!!NOTE: The graceful termination events are missing, which suggests this master was torn down forcefully, or the API server didn't have time to shut down properly.
Here are the messages we are missing around that time:
23:18:25 (1) "openshift-kube-apiserver" TerminationStart Received signal to terminate, becoming unready, but keeping serving
23:18:25 (1) "openshift-kube-apiserver" TerminationPreShutdownHooksFinished All pre-shutdown hooks have been finished
23:18:48 (1) "openshift-kube-controller-manager-operator" SATokenSignerControllerOK found expected kube-apiserver endpoints
23:19:35 (1) "openshift-kube-apiserver" TerminationMinimalShutdownDurationFinished The minimal shutdown duration of 1m10s finished
23:19:35 (1) "openshift-kube-apiserver" TerminationStoppedServing Server has stopped listening
23:19:35 (1) "openshift-kube-apiserver" TerminationGracefulTerminationFinished All pending requests processed
(These are from the other master's reboot, where the API server was shut down properly.)
To correlate these with the API server drop out:
Nov 04 23:09:55.315 E kube-apiserver failed contacting the API: Get https://api.ci-op-ycfivbvz-db044.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dversion&resourceVersion=35181&timeout=5m47s&timeoutSeconds=347&watch=true: dial tcp 22.214.171.124:6443: connect: connection refused
So it dropped out exactly at the time when the first master node was rebooted (probably forcefully).
For now, I do believe the first master was rebooted without giving the API server running on it time to deregister from the LB, which means the endpoint was still active but blackholed (connection refused).
Hmm, I wonder if it's related to the special service we have. Since everything is getting SIGTERM at the same time here, that service is highly likely to unset the routes at the same time cri-o may try to de-register.
I haven't fully dug into the routing/LB stuff on GCP, just noting that in passing.
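For context, a service that installs a route and clears it on SIGTERM behaves roughly like the sketch below. This is a hypothetical stand-in, not the actual gcp-routes script; the echo stands in for the real route manipulation:

```shell
#!/bin/bash
# Hypothetical sketch of a "clean up on TERM" service, the behavior
# gcp-routes is described as having. The echo stands in for the real
# route manipulation (something like `ip route del` for the LB
# forwarded IP).
cleanup() {
  echo "removing local route for LB forwarded IP"   # real service: ip route del ...
  exit 0
}
trap cleanup TERM INT

echo "route installed; node is serving LB traffic"

# For demonstration, deliver the TERM ourselves after a moment; in the
# real unit, systemd sends it at shutdown.
( sleep 1; kill -TERM $$ ) &
while true; do sleep 1; done
```

The point of this bug is precisely that this cleanup fires at SIGTERM time, concurrently with the API server's own graceful shutdown.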
This is blocking merging tests into the product to prevent regressions. Can we step up the priority of the investigation here? We cannot ship with this bug, and I also need to verify the other platforms.
If gcp-routes gets termed, I believe it does need to do something special.
There was discussion in Slack about this issue:
Clayton Coleman 1:33 PM
so on GCP thing :slightly_smiling_face:
the GCP-routes service - does it clear itself on TERM (remove the mapping)?
Abhinav Dahiya 2:12 PM
Yes, it cleans up after itself when stopped.
works together to prevent blackholes which might wedge rolling of api-server
Clayton Coleman 4:22 PM
so if it is removing itself before the apiserver gracefully shuts down, that’s the bug (edited)
the apiserver phases are “begin shutdown, stop accepting new connections, complete existing connections”
i think it’s exiting before the “stop accepting new connections” phase
which causes new connections from the LB to be targeted to dead backends
Abhinav Dahiya 4:37 PM
hmm, so on shutdown the service should wait for apiserver to complete existing connections before removing the forwarded_ip ?
gcp-routes-controller does the health checking and calls gcp-routes shutdown.. RHCOS will have to move the health check into the gcp-routes service if we need ^^ behavior
Clayton Coleman 4:41 PM
the current logic is there are a bunch of waits to ensure the endpoint is removed from the load balancer via the /ready check before we stop accepting connections
i don’t think gcp-routes should shutdown until apiserver stops accepting new connections
coupling that is kind of ugly
(this is why we have to move to SLB for load balancing eventually)
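If the health-check logic does move into the routes service, the ordering described above (don't remove the route until the apiserver stops accepting new connections) could be sketched as a shell helper. This is a hypothetical sketch, not gcp-routes-controller code; the host, port, and TCP-probe approach are assumptions:

```shell
#!/bin/bash
# Hypothetical sketch (not the actual gcp-routes-controller code): on
# shutdown, delay route removal until the local apiserver has stopped
# accepting new connections, so the LB health check takes the node out
# of rotation before the route disappears.

# Succeeds while the apiserver port still accepts TCP connections.
# A real implementation would more likely poll the /readyz endpoint;
# the host/port defaults here are assumptions for illustration.
probe() { (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; }

wait_for_apiserver_down() {
  local host=${1:-localhost} port=${2:-6443}
  while probe "$host" "$port"; do
    sleep 1
  done
}

# The shutdown path would then be:
#   wait_for_apiserver_down && <remove the forwarded-IP host route>
```

This couples the routes service to the apiserver's lifecycle, which (as noted above) is ugly, but it matches the "stop accepting new connections before removing the mapping" ordering.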
@Abhinav, do we have agreement that this is what is happening? Do we need to modify the `gcp-routes.service`?
> @Abhinav, do we have agreement that this is what is happening? Do we need to modify the `gcp-routes.service`?
Yeah, I think we need to make the gcp-routes.service smarter and move the logic of gcp-routes-controller to it...
I think part of the problem might be that the timeouts are too tight.
Consider the following situation:
- the controller shuts down
- until the health checks start failing (e.g. due to host-route removal), the node will continue to exist in the LB
- until the health checks start passing again, the node will not service requests.
Shutting down the "gcp-routes.service" systemd unit will result in the node being dropped from the LB a little faster. The better fix is to remove the node from the LB on shutdown or failure of the node; shutting down the route at the node level just reduces the time before GCP automatically drops the node out of the LB rotation.
I'm not sure how we can make the GCP route service "smarter." It has no awareness that requests are being serviced.
Reproducer from @walters on the PR:
It should be sufficient to deploy a 4.3 (or 4.2 for that matter) cluster in GCP, then cause the MCO to reboot the control plane by deploying a dummy MC that writes /etc/darkmuggle-was-here or so. During the drain/reboot process, the API server should remain fully available.
So probably first reproduce the failure by doing API calls in a loop (oc get pods in a shell, or whatever), then deploy this patch and verify you don't see API calls fail.
Verified on 4.3.0-0.nightly-2020-02-10-173426
Deployed the dummy MC while making API calls; the API calls were successful. I compared this to 4.2 using the same method, and many API calls were dropped for extended periods of time.
while true; do oc get pods -A --field-selector spec.nodeName=<node_name> ; sleep .25; done
watch -n .25 oc get pods -A --field-selector spec.nodeName=<node_name>
$ cat file.yaml
$ oc create -f file.yaml
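The content of file.yaml isn't shown; based on the reproducer description (a dummy MC that writes /etc/darkmuggle-was-here on the masters), it would look roughly like the following. The name, mode, and Ignition version are illustrative guesses, not taken from the bug:

```yaml
# Hypothetical dummy MachineConfig to force a rolling reboot of the
# control plane; details are illustrative.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-darkmuggle-was-here
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:,
        filesystem: root
        mode: 420
        path: /etc/darkmuggle-was-here
```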
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.