Bug 1823950
Summary: | [e2e-metal-ipi] Attempts to connect to API server result in `read: connection reset by peer` | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Stephen Benjamin <stbenjam>
Component: | Machine Config Operator | Assignee: | Ben Nemec <bnemec>
Status: | CLOSED ERRATA | QA Contact: | Victor Voronkov <vvoronko>
Severity: | low | Docs Contact: |
Priority: | low | |
Version: | 4.5 | CC: | aos-bugs, bnemec, kboumedh, kgarriso, mfojtik, mkrejci, mnewby, smilner, vvoronko, wking
Target Milestone: | --- | |
Target Release: | 4.6.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Cause: unnecessary API VIP moves. Consequence: client connection errors. Fix: the API VIP health checks were changed to limit the number of times the VIP moves. Result: fewer errors caused by API VIP moves. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 15:57:47 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Stephen Benjamin
2020-04-14 20:34:28 UTC
Not sure if other platforms are seeing similar results, but e2e-metal-ipi is reliably running into this: https://prow.svc.ci.openshift.org/?type=periodic&job=*metal-ipi*

We run it only every 4 hours, but so far most runs have failed today due to this. The 16h one looks the same: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/46

(In reply to Stephen Benjamin from comment #0)
> Description of problem:
>
> Apr 14 09:38:59.900: FAIL: Post
> https://api.ostest.test.metalkube.org:6443/apis/user.openshift.io/v1/users:
> read tcp [fd2e:6f44:5dd8:c956::1]:53610->[fd2e:6f44:5dd8:c956::5]:6443:
> read: connection reset by peer

A search for this error reveals only 4 occurrences in the past 14 days, all on the e2e-metal-ipi job: https://search.svc.ci.openshift.org/?search=.*users%3A+read+tcp.*read%3A+connection+reset+by+peer.*&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

Going by the error, is IPv6 a factor?

> Looking at the search, for the kcm error, I see a lot more occurrences in
> the last 2 days (500+) than in the last 14d.
>
> https://search.svc.ci.openshift.org/?search=.*failed+to+renew+lease+kube-system%2Fkube-controller-manager%3A+timed+out+waiting+for+the+condition.*&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

This message is commonly emitted by HA controllers on shutdown and is unlikely to indicate a problem. Given that the reported failure is for 4.5 e2e jobs and doesn't appear to be affecting any 4.4 jobs, moving the target to 4.5.

Thanks, it could be IPv6. I do see more than just metal in that search query (one aws-upgrade job and a few openstack jobs), but it is definitely mostly metal IPI.

We've seen this again yesterday on https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/56. Any idea what we can do to troubleshoot this?

The first step I would suggest is to check operator status in the must-gather dump (cluster-scoped-resources/operator.openshift.io/*/cluster.yaml): https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/56/artifacts/e2e-metal-ipi/test/must-gather.tar

From what I can tell, none of the major operators (etcd, apiserver, etc.) appear to be unhealthy. In the absence of any indication that the cluster itself is unhealthy, I think there may be a connectivity problem between where the tests are running and the target cluster. I'm afraid I don't have any insight into how to go about diagnosing this possibility for a metal cluster.

Despite the must-gather's indication of cluster health, a clever person pointed out that the job's e2e.log shows evidence of the unhealthy states that are causing the test failures. Ideally it would be possible to see timestamped changes to cluster operator status that could be correlated with the logged errors, but that doesn't exist (yet). Can you bring up a metal IPI cluster in a manner similar to how the CI job does, and see if there is churn in the cluster operator status (i.e. `watch oc get co`) before or during execution of e2e?

Sorry for the churn; I was contemplating switching the component to networking until I noticed the indication of apiserver/controller-manager health problems in the log. Moving back to apiserver.

After further discussion, it sounds like this might be due to the HAProxy configuration looking at the wrong health endpoint.
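In HAProxy terms, that would mean probing the apiserver's readiness endpoint instead of its liveness endpoint. A minimal sketch of the pattern, with illustrative address and timing values (the real configuration is templated by the machine-config-operator):

backend masters
    # /readyz begins failing as soon as the apiserver starts a graceful
    # shutdown, so traffic is drained before clients see connection resets;
    # /healthz can keep returning 200 until it is too late.
    option httpchk GET /readyz HTTP/1.0        # previously: GET /healthz
    http-check expect status 200
    # inter/fall/rise values are illustrative, not the shipped ones
    server master-0-0 192.168.123.116:6443 check check-ssl verify none inter 1s fall 2 rise 3

The "status: 0/3 DOWN" through "2/2 UP" counters in the verification log further down appear to be exactly these rise/fall thresholds being counted off.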
From the PR: "[T]he /readyz endpoint is how the api communicates that it is gracefully shutting down. Once /readyz starts to report failure, we want to stop sending traffic to that backend. If we wait for /healthz, it may be too late."

https://github.com/openshift/machine-config-operator/pull/1724, which points to this bug, has merged. Should this be MODIFIED, or is this bug still being worked on?

The Bugzilla robot seemed to not do the right thing after the merge, but this bug still needs to be worked on anyway; this didn't fix our problem. Moving back to ASSIGNED, as we don't have a definite fix for this yet.

In our deployments, the same tooling works fine for 4.4 but not for 4.5.

After switching to NVMe disks for the masters, the deployment succeeded, so the issue seems to be disk related.

Reopening, as the initial fix was incomplete.
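The eventual fix, per the Doc Text above, limits how often the API VIP moves; the verification below checks the haproxy.cfg and keepalived.conf parameters against PR2075. As a rough, hypothetical sketch of the keepalived side of such a setup (script, priorities, and thresholds are illustrative, not the shipped values):

vrrp_script chk_apiserver {
    # Probe through the local HAProxy frontend (port 9445 in the logs below),
    # which can still reach a healthy apiserver on another master; fall/rise
    # smoothing keeps a single failed probe from changing this node's priority.
    script "/usr/bin/curl -o /dev/null -kLfs https://localhost:9445/readyz"
    interval 1
    fall 2
    rise 3
    weight 50
}

vrrp_instance api_vip {
    interface br-ex               # the interface holding the VIP in the verification below
    virtual_router_id 51
    priority 40
    advert_int 1
    virtual_ipaddress {
        192.168.123.5/24          # the API VIP observed in the verification below
    }
    track_script {
        chk_apiserver
    }
}

Because the script is tracked with a weight rather than as a hard failure, and because it probes via the local HAProxy rather than the local apiserver, stopping one kube-apiserver should not move the VIP at all; that is what the verification below exercises.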
Verified on build 4.6.0-0.ci-2020-09-16-082844. The /etc/haproxy/haproxy.cfg and /etc/keepalived/keepalived.conf params are as in PR2075.

[core@master-0-0 ~]$ sudo crictl ps | grep kube-apiserver
74644d04ac969  983002e4672e4f703c1d7da7efb241198cf9d52e2b4f8b98dec73fcb5c2bff2e  About an hour ago  Running  kube-apiserver  0  4f2059c4c4750
[core@master-0-0 ~]$ sudo crictl stop 74644d04ac969
74644d04ac969
[core@master-0-0 ~]$ ip a | grep 192.168
    inet 192.168.123.116/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
    inet 192.168.123.5/24 scope global secondary br-ex

The VIP stays on the same master even after a few minutes, while the HAProxy log shows:

2020-09-16T15:28:40.173394186+00:00 stdout F <134>Sep 16 15:28:40 haproxy[15]: Connect from ::ffff:10.129.0.60:54584 to ::ffff:10.129.0.2:9445 (main/TCP)
2020-09-16T15:28:40.183922551+00:00 stderr F [WARNING] 259/152840 (15) : Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
2020-09-16T15:28:40.184006537+00:00 stdout F <133>Sep 16 15:28:40 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
...
2020-09-16T15:30:58.520642503+00:00 stderr F [WARNING] 259/153058 (15) : Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
2020-09-16T15:30:58.520847019+00:00 stdout F <133>Sep 16 15:30:58 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
...
2020-09-16T15:31:03.684524295+00:00 stderr F [WARNING] 259/153103 (15) : Health check for server masters/master-0-0 failed, reason: Layer7 wrong status, code: 403, info: "Forbidden", check duration: 160ms, status: 0/3 DOWN.
2020-09-16T15:31:03.684571145+00:00 stdout F <133>Sep 16 15:31:03 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer7 wrong status, code: 403, info: "Forbidden", check duration: 160ms, status: 0/3 DOWN.
...
2020-09-16T15:31:05.709370392+00:00 stderr F [WARNING] 259/153105 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 8ms, status: 1/3 DOWN.
2020-09-16T15:31:05.709466318+00:00 stdout F <133>Sep 16 15:31:05 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 8ms, status: 1/3 DOWN.
...
2020-09-16T15:31:06.717679657+00:00 stderr F [WARNING] 259/153106 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/3 DOWN.
2020-09-16T15:31:06.717752699+00:00 stdout F <133>Sep 16 15:31:06 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/3 DOWN.
...
2020-09-16T15:31:07.726572143+00:00 stderr F [WARNING] 259/153107 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/2 UP.
2020-09-16T15:31:07.726878966+00:00 stderr F [WARNING] 259/153107 (15) : Server masters/master-0-0 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2020-09-16T15:31:07.726990626+00:00 stdout F <133>Sep 16 15:31:07 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/2 UP

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
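For reference, the manual verification above condenses to a few commands; a sketch (container IDs, the br-ex interface, and addresses are specific to that environment):

# Stop the local kube-apiserver to simulate an outage on this master.
CID=$(sudo crictl ps --name kube-apiserver -q)
sudo crictl stop "$CID"

# The API VIP (192.168.123.5 above) should remain on br-ex on this master,
# while the HAProxy log marks masters/master-0-0 DOWN and then UP again once
# /readyz recovers, without keepalived moving the VIP to another master.
ip -br addr show br-ex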