Bug 1769847
| Field | Value |
| --- | --- |
| Summary | kube-apiserver does not remain available during rolling master upgrade in 4.3 on GCP |
| Product | OpenShift Container Platform |
| Component | RHCOS |
| Version | 4.3.0 |
| Target Release | 4.4.0 |
| Hardware / OS | Unspecified / Unspecified |
| Severity | urgent |
| Priority | unspecified |
| Status | CLOSED ERRATA |
| Type | Bug |
| Reporter | Clayton Coleman <ccoleman> |
| Assignee | Ben Howard <behoward> |
| QA Contact | Michael Nguyen <mnguyen> |
| CC | adahiya, aos-bugs, bbreard, dustymabe, imcleod, jligon, mfojtik, miabbott, nstielau, scuppett, sttts, walters |
| Clones | 1782536 (view as bug list) |
| Bug Blocks | 1782536 |
| Last Closed | 2020-05-13 21:52:24 UTC |
Description (Clayton Coleman, 2019-11-07 15:29:51 UTC)
I tried to put together a timeline. The first master rebooted:

```
23:09:55 (1) "" Reboot Node will reboot into config rendered-master-3af9469f08d2a9de67c09c6e43554356
```

!!NOTE: The graceful termination events are missing, which suggests this master was torn down forcefully or the API server didn't have time to shut down properly. Here are the messages we are missing around that time:

```
23:18:25 (1) "openshift-kube-apiserver" TerminationStart Received signal to terminate, becoming unready, but keeping serving
23:18:25 (1) "openshift-kube-apiserver" TerminationPreShutdownHooksFinished All pre-shutdown hooks have been finished
23:18:48 (1) "openshift-kube-controller-manager-operator" SATokenSignerControllerOK found expected kube-apiserver endpoints
23:19:35 (1) "openshift-kube-apiserver" TerminationMinimalShutdownDurationFinished The minimal shutdown duration of 1m10s finished
23:19:35 (1) "openshift-kube-apiserver" TerminationStoppedServing Server has stopped listening
23:19:35 (1) "openshift-kube-apiserver" TerminationGracefulTerminationFinished All pending requests processed
```

(These are from the other master's reboot, where the API server was shut down properly.) To correlate these with the API server drop-out:

```
Nov 04 23:09:55.315 E kube-apiserver failed contacting the API: Get https://api.ci-op-ycfivbvz-db044.origin-ci-int-gce.dev.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dversion&resourceVersion=35181&timeout=5m47s&timeoutSeconds=347&watch=true: dial tcp 35.227.50.101:6443: connect: connection refused
```

So it dropped out at exactly the time the first master node was rebooted (probably forcefully). For now, I believe the first master was rebooted without giving the API server running on it time to deregister from the LB, which means the endpoint was still active but blackholed (connection refused).
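One way to observe such a blackhole window from outside the cluster is to probe the external apiserver endpoint in a tight loop and log connection failures. This is a sketch, not from the bug itself; `probe_api`, its URL argument, and the probe count are illustrative, and it assumes `curl` is available:

```shell
#!/usr/bin/env bash
# Probe URL $1 up to $2 times, printing an ok/FAIL line per attempt so a
# "connection refused" gap during a master reboot shows up as FAIL lines.
# Returns the number of failed probes.
probe_api() {
  local url=$1 count=$2 fails=0
  for ((i = 0; i < count; i++)); do
    if curl -ks --connect-timeout 2 --max-time 5 "$url" >/dev/null; then
      echo "$(date -u +%H:%M:%S) ok"
    else
      echo "$(date -u +%H:%M:%S) FAIL"
      fails=$((fails + 1))
    fi
    sleep 1
  done
  return "$fails"
}

# Example (placeholder endpoint; substitute your cluster's apiserver URL):
# probe_api "https://api.example.openshift.com:6443/healthz" 600
```

Run against a dead endpoint, every probe prints FAIL at roughly one-second intervals, which makes the outage window easy to read off the timestamps.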
Hmm, I wonder if it's related to the special gcp-routes service we have:

- https://gitlab.cee.redhat.com/coreos/redhat-coreos/blob/master/overlay.d/05rhcos/usr/lib/systemd/system/gcp-routes.service
- https://gitlab.cee.redhat.com/coreos/redhat-coreos/blob/master/overlay.d/05rhcos/usr/sbin/gcp-routes.sh

Since everything is getting SIGTERM at the same time here, that service is highly likely to unset the routes at the same time as cri-o may try to de-register. I haven't fully dug into the routing/LB stuff on GCP, just noting that in passing.

This is blocking merging tests to the product to prevent regressions. Can we step up the priority of investigation here? We cannot ship with this bug, and I also need to verify the other platforms. If gcp-routes gets termed, I believe it does need to do something special.

There was discussion in Slack about this issue:

```
Clayton Coleman 1:33 PM
so on GCP thing :slightly_smiling_face:
the GCP-routes service - does it clear itself on TERM (remove the mapping)?

Abhinav Dahiya 2:12 PM
Yes, it cleans up after itself when stopped.
https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/481 + https://github.com/openshift/machine-config-operator/pull/1031 work together to prevent blackholes which might wedge rolling of api-server

Clayton Coleman 4:22 PM
ok so if it is removing itself before the apiserver gracefully shuts down, that's the bug (edited)
the apiserver phases are "begin shutdown, stop accepting new connections, complete existing connections"
i think it's exiting before the "stop accepting new connections" phase which causes new connections from the LB to be targeted to dead backends

Abhinav Dahiya 4:37 PM
hmm, so on shutdown the service should wait for apiserver to complete existing connections before removing the forwarded_ip?
gcp-routes-controller does the health checking and calls gcp-routes shutdown..
RHCOS will have to move the health-check to the gcp-routes service
we need ^^ behavior

Clayton Coleman 4:41 PM
the current logic is there are a bunch of waits to ensure the endpoint is removed from the load balancer via the /ready check before we stop accepting connections
i don't think gcp-routes should shutdown until apiserver stops accepting new connections
coupling that is kind of ugly (this is why we have to move to SLB for load balancing eventually)
```

@Abhinav, do we have agreement that this is what is happening? Do we need to modify the `gcp-routes.service`?

> @Abhinav, do we have agreement that this is what is happening? Do we need to modify the `gcp-routes.service`?
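The behavior being asked for — gcp-routes waiting for the local apiserver to stop accepting connections before tearing down the forwarded-IP route — could look roughly like this. A sketch only, not the actual gcp-routes.sh: the `wait_for_unready` helper, the readiness URL, and the commented route-removal command are all placeholders, and `curl` is assumed:

```shell
#!/usr/bin/env bash
# Hypothetical shutdown ordering: poll the local apiserver readiness
# endpoint ($1) and only return once it stops answering, up to $2 seconds.
# Only after that should the VIP route be removed, so the LB has dropped
# this backend before new connections would hit a dead route.
wait_for_unready() {
  local url=$1 timeout=$2 waited=0
  while curl -ks --max-time 2 "$url" >/dev/null; do
    sleep 1
    waited=$((waited + 1))
    if ((waited >= timeout)); then
      echo "timed out waiting for apiserver to become unready" >&2
      return 1
    fi
  done
  return 0
}

# In an ExecStop-style path (placeholder commands):
# wait_for_unready "https://localhost:6443/readyz" 120 || true
# ip route del local "$FORWARDED_IP" dev lo
```

The coupling Clayton calls "kind of ugly" is visible here: the route teardown has to know about the apiserver's readiness lifecycle instead of being a dumb networking unit.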
Yeah, I think we need to make gcp-routes.service smarter and move the logic of gcp-routes-controller into it...
I think part of the problem is that the timeouts are too tight. Consider the following situation:

- the controller shuts down
- until the health checks start failing (e.g. host-route removal), the node will continue to exist in the LB
- until the health checks start passing, the node will not service requests

Shutting down the gcp-routes.service systemd unit will result in the node being dropped from the LB a little faster. The better fix is to remove the node from the LB on shutdown or failure of the node; shutting down the route at the node level just reduces the time before GCP automatically drops the node out of the LB rotation. I'm not sure how we can make the GCP route service "smarter": it has no awareness that requests are being serviced.

Reproducer from @walters on the PR:

```
It should be sufficient to deploy a 4.3 (or 4.2 for that matter) cluster in GCP, then cause the MCO to reboot the control plane by deploying a dummy MC that writes /etc/darkmuggle-was-here or so. During the drain/reboot process, the API server should remain fully available. So probably first reproduce the failure by doing API calls in a loop (`oc get pods` in a shell or whatever), then deploy this patch and verify you don't see some API calls fail.
```

Verified on 4.3.0-0.nightly-2020-02-10-173426. Deployed a dummy MC while making API calls; the API calls were successful. I compared this to 4.2 using the same method and many API calls were dropped for extended periods of time.
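The reproducer's API-call loop can be wrapped to count failures instead of eyeballing scrolling output. A sketch under the reproducer's assumptions; `check_api_loop` is a hypothetical helper and the node name in the example is a placeholder:

```shell
#!/usr/bin/env bash
# Run a command repeatedly for $1 iterations, counting how many
# invocations fail; useful for counting dropped API calls while the
# control plane rolls. $2.. is the command to run (e.g. an oc query).
check_api_loop() {
  local iterations=$1
  shift
  local ok=0 fail=0
  for ((i = 0; i < iterations; i++)); do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    sleep 0.25
  done
  echo "ok=$ok fail=$fail"
}

# Example (placeholder node name), roughly 5 minutes of polling:
# check_api_loop 1200 oc get pods -A --field-selector spec.nodeName=master-0
```

On a fixed cluster the final `fail=` count should stay at 0 across the whole rolling reboot; on 4.2 it would show a burst of failures during each master's reboot window.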
```
$ while true; do oc get pods -A --field-selector spec.nodeName=<node_name>; sleep .25; done
```

or

```
$ watch -n .25 oc get pods -A --field-selector spec.nodeName=<node_name>
```

```
$ cat file.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: test-file-master
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test
$ oc create -f file.yaml
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.