On GCP, the API server drops out during upgrade. It should not. We need to fix this and then backport the fix to 4.2.
The cause could be LB timeouts that are too short, something related to blackholing, or the shutdown sequence, but upgrades have to work.
I tried to put together a timeline:
The first master rebooted:
23:09:55 (1) "" Reboot Node will reboot into config rendered-master-3af9469f08d2a9de67c09c6e43554356
!!NOTE: The graceful termination events are missing, which suggests this master was torn down forcefully, or the API server didn't have time to shut down properly.
Here are the messages we are missing around that time:
23:18:25 (1) "openshift-kube-apiserver" TerminationStart Received signal to terminate, becoming unready, but keeping serving
23:18:25 (1) "openshift-kube-apiserver" TerminationPreShutdownHooksFinished All pre-shutdown hooks have been finished
23:18:48 (1) "openshift-kube-controller-manager-operator" SATokenSignerControllerOK found expected kube-apiserver endpoints
23:19:35 (1) "openshift-kube-apiserver" TerminationMinimalShutdownDurationFinished The minimal shutdown duration of 1m10s finished
23:19:35 (1) "openshift-kube-apiserver" TerminationStoppedServing Server has stopped listening
23:19:35 (1) "openshift-kube-apiserver" TerminationGracefulTerminationFinished All pending requests processed
(These are from the other master's reboot, where the API server was shut down properly.)
To correlate these with the API server drop out:
Nov 04 23:09:55.315 E kube-apiserver failed contacting the API: Get https://api.ci-op-ycfivbvz-db044.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dversion&resourceVersion=35181&timeout=5m47s&timeoutSeconds=347&watch=true: dial tcp 22.214.171.124:6443: connect: connection refused
So it dropped out exactly at the time when the first master node was rebooted (probably forcefully).
For now, I do believe the first master was rebooted without giving the API server running on it time to deregister from the LB, which means the endpoint was still active but blackholed (connection refused).
Hmm, I wonder if it's related to the special service we have. Since everything is getting SIGTERM at the same time here, that service is highly likely to unset the routes at the same time cri-o may try to de-register.
I haven't fully dug into the routing/LB stuff on GCP, just noting that in passing.
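For context, a service that installs a route and clears it on SIGTERM behaves roughly like the sketch below. This is a hypothetical stand-in, not the actual gcp-routes script; the echo stands in for the real route manipulation:

```shell
#!/bin/bash
# Hypothetical sketch of a "clean up on TERM" service, the behavior
# gcp-routes is described as having. The echo stands in for the real
# route manipulation (something like `ip route del` for the LB
# forwarded IP).
cleanup() {
  echo "removing local route for LB forwarded IP"   # real service: ip route del ...
  exit 0
}
trap cleanup TERM INT

echo "route installed; node is serving LB traffic"

# For demonstration, deliver the TERM ourselves after a moment; in the
# real unit, systemd sends it at shutdown.
( sleep 1; kill -TERM $$ ) &
while true; do sleep 1; done
```

The point of this bug is precisely that this cleanup fires at SIGTERM time, concurrently with the API server's own graceful shutdown.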
This is blocking merging tests into the product to prevent regressions. Can we step up the priority of the investigation here? We cannot ship with this bug, and I also need to verify the other platforms.
If gcp-routes gets termed, I believe it does need to do something special.
There was discussion in Slack about this issue:
Clayton Coleman 1:33 PM
so on GCP thing :slightly_smiling_face:
the GCP-routes service - does it clear itself on TERM (remove the mapping)?
Abhinav Dahiya 2:12 PM
Yes, it cleans up after itself when stopped.
works together to prevent blackholes which might wedge rolling of api-server
Clayton Coleman 4:22 PM
so if it is removing itself before the apiserver gracefully shuts down, that’s the bug (edited)
the apiserver phases are “begin shutdown, stop accepting new connections, complete existing connections”
i think it’s exiting before the “stop accepting new connections” phase
which causes new connections from the LB to be targeted to dead backends
Abhinav Dahiya 4:37 PM
hmm, so on shutdown the service should wait for apiserver to complete existing connections before removing the forwarded_ip ?
gcp-routes-controller does the health checking and calls gcp-routes shutdown.. RHCOS will have to move the health check into the gcp-routes service if we need ^^ behavior
Clayton Coleman 4:41 PM
the current logic is there are a bunch of waits to ensure the endpoint is removed from the load balancer via the /ready check before we stop accepting connections
i don’t think gcp-routes should shutdown until apiserver stops accepting new connections
coupling that is kind of ugly
(this is why we have to move to SLB for load balancing eventually)
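If the health-check logic does move into the routes service, the ordering described above (don't remove the route until the apiserver stops accepting new connections) could be sketched as a shell helper. This is a hypothetical sketch, not gcp-routes-controller code; the host, port, and TCP-probe approach are assumptions:

```shell
#!/bin/bash
# Hypothetical sketch (not the actual gcp-routes-controller code): on
# shutdown, delay route removal until the local apiserver has stopped
# accepting new connections, so the LB health check takes the node out
# of rotation before the route disappears.

# Succeeds while the apiserver port still accepts TCP connections.
# A real implementation would more likely poll the /readyz endpoint;
# the host/port defaults here are assumptions for illustration.
probe() { (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; }

wait_for_apiserver_down() {
  local host=${1:-localhost} port=${2:-6443}
  while probe "$host" "$port"; do
    sleep 1
  done
}

# The shutdown path would then be:
#   wait_for_apiserver_down && <remove the forwarded-IP host route>
```

This couples the routes service to the apiserver's lifecycle, which (as noted above) is ugly, but it matches the "stop accepting new connections before removing the mapping" ordering.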
@Abhinav, do we have agreement that this is what is happening? Do we need to modify the `gcp-routes.service`?
> @Abhinav, do we have agreement that this is what is happening? Do we need to modify the `gcp-routes.service`?
Yeah, I think we need to make the gcp-routes.service smarter and move the logic of gcp-routes-controller to it...
I think part of the problem might be that the timeouts are too tight.
Consider the following situation:
- the controller shuts down
- until the health checks start failing (e.g. due to host-route removal), the node will continue to exist in the LB
- until the health checks start passing again, the node will not service requests.
Shutting down the "gcp-routes.service" systemd unit will result in the node being dropped from the LB a little faster. The better fix is to remove the node from the LB on shutdown or failure of the node; shutting down the route at the node level just reduces the time before GCP automatically drops the node out of the LB rotation.
I'm not sure how we can make the GCP route service "smarter." It has no awareness that requests are being serviced.
Reproducer from @walters on the PR:
It should be sufficient to deploy a 4.3 (or 4.2 for that matter) cluster in GCP, then cause the MCO to reboot the control plane by deploying a dummy MC that writes /etc/darkmuggle-was-here or so. During the drain/reboot process, the API server should remain fully available.
So probably first reproduce the failure by doing API calls in a loop (oc get pods in a shell, or whatever), then deploy this patch and verify you don't see API calls fail.
Verified on 4.3.0-0.nightly-2020-02-10-173426
Deployed the dummy MC while making API calls; the API calls were successful. I compared this to 4.2 using the same method, and many API calls were dropped for extended periods of time.
while true; do oc get pods -A --field-selector spec.nodeName=<node_name> ; sleep .25; done
watch -n .25 oc get pods -A --field-selector spec.nodeName=<node_name>
$ cat file.yaml
$ oc create -f file.yaml
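The content of file.yaml isn't shown; based on the reproducer description (a dummy MC that writes /etc/darkmuggle-was-here on the masters), it would look roughly like the following. The name, mode, and Ignition version are illustrative guesses, not taken from the bug:

```yaml
# Hypothetical dummy MachineConfig to force a rolling reboot of the
# control plane; details are illustrative.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-darkmuggle-was-here
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:,
        filesystem: root
        mode: 420
        path: /etc/darkmuggle-was-here
```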
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.