Bug 1823950 - [e2e-metal-ipi] Attempts to connect to API server result in `read: connection reset by peer`
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Nemec
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-14 20:34 UTC by Stephen Benjamin
Modified: 2020-09-16 15:42 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Unnecessary API VIP moves
Consequence: Client connection errors
Fix: Changed API VIP healthchecks to limit the number of times it moves
Result: Fewer errors caused by API VIP moves
Clone Of:
Environment:
Last Closed:
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1724 None closed Bug 1823950: [baremetal] Switch to /readyz for haproxy healthchecking 2020-09-21 13:12:45 UTC
Github openshift machine-config-operator pull 1893 None closed "Bug 1823950: [baremetal] Don't failover api vip if loadbalanced endpoint is responding" 2020-09-21 13:12:46 UTC
Github openshift machine-config-operator pull 2075 None closed Bug 1823950: Reverse haproxy and keepalived check timings 2020-09-21 13:12:45 UTC

Description Stephen Benjamin 2020-04-14 20:34:28 UTC
Description of problem:

Apologies for the vague title - but I've started seeing this a lot in the last few days on e2e-metal-ipi for about half our jobs.  The API server becomes unavailable and tests end up failing.

Logs report:

 Apr 14 09:38:59.900: FAIL: Post https://api.ostest.test.metalkube.org:6443/apis/user.openshift.io/v1/users: read tcp [fd2e:6f44:5dd8:c956::1]:53610->[fd2e:6f44:5dd8:c956::5]:6443: read: connection reset by peer 


As well as:

leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition


Version-Release number of selected component (if applicable):
4.5/master

How reproducible:
Sometimes

Steps to Reproduce:
Unknown

Additional info:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/43 is a metal-ipi job that exhibits this behavior


Looking at the search results for the kcm error, I see a lot more occurrences in the last 2 days (500+) than in the preceding 14 days.

https://search.svc.ci.openshift.org/?search=.*failed+to+renew+lease+kube-system%2Fkube-controller-manager%3A+timed+out+waiting+for+the+condition.*&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

Comment 1 Stephen Benjamin 2020-04-15 00:11:40 UTC
Not sure if other platforms are seeing similar results, but e2e-metal-ipi is reliably running into this: https://prow.svc.ci.openshift.org/?type=periodic&job=*metal-ipi*

We run it only every 4 hours, but so far most have failed today due to this. The 16h one looks the same: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/46

Comment 2 Maru Newby 2020-04-15 05:26:17 UTC
(In reply to Stephen Benjamin from comment #0)
> Description of problem:
> 
> 
>  Apr 14 09:38:59.900: FAIL: Post
> https://api.ostest.test.metalkube.org:6443/apis/user.openshift.io/v1/users:
> read tcp [fd2e:6f44:5dd8:c956::1]:53610->[fd2e:6f44:5dd8:c956::5]:6443:
> read: connection reset by peer 

A search for this error reveals only 4 occurrences in the past 14 days, all on the e2e-metal-ipi job:

https://search.svc.ci.openshift.org/?search=.*users%3A+read+tcp.*read%3A+connection+reset+by+peer.*&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

As per the error, is ipv6 a factor?

> Looking at the search, for the kcm error, I see a lot more occurrences in
> the last 2 days (500+) than in the last 14d.
> 
> https://search.svc.ci.openshift.org/?search=.*failed+to+renew+lease+kube-
> system%2Fkube-controller-manager%3A+timed+out+waiting+for+the+condition.
> *&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

This message is commonly emitted by HA controllers on shutdown, and is unlikely to indicate a problem.

Comment 3 Maru Newby 2020-04-15 05:31:03 UTC
Given that the reported failure is for 4.5 e2e jobs and doesn't appear to be affecting any 4.4 jobs, moving target to 4.5.

Comment 4 Stephen Benjamin 2020-04-16 12:50:31 UTC
Thanks, it could be IPv6.  I do see more than just metal in that search query (1 aws-upgrade, and a few openstack jobs), but it is definitely mostly metal IPI. We've seen this again yesterday on https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/56.


Any idea what we can do to troubleshoot this?

Comment 5 Maru Newby 2020-04-17 03:44:37 UTC
The first step I would suggest is to check operator status in the must-gather dump (cluster-scoped-resources/operator.openshift.io/*/cluster.yaml): 

https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/56/artifacts/e2e-metal-ipi/test/must-gather.tar

From what I can tell, none of the major operators (etcd, apiserver, etc.) appear to be unhealthy. In the absence of any indication that the cluster itself is unhealthy, there may be a connectivity problem between where the tests are running and the target cluster. I'm afraid I don't have any insight into how to go about diagnosing this possibility for a metal cluster.

Comment 6 Maru Newby 2020-04-17 04:22:32 UTC
Despite the must-gather's indication of cluster health, a clever person pointed out that the job's e2e.log shows evidence of the unhealthy states that are causing test failures. Ideally it would be possible to see timestamped changes to cluster operator status that could be correlated with the logged errors, but that doesn't exist (yet).

Can you bring up a metal ipi cluster in a manner similar to how the ci job is doing, and see if there is churn in the cluster operator status (i.e. `watch oc get co`) before or during execution of e2e?

Comment 7 Maru Newby 2020-04-17 04:30:23 UTC
Sorry for the churn, was contemplating switching the component to networking until I noticed the indication of apiserver/controller manager health problems in the log. Moving back to apiserver.

Comment 8 Stephen Benjamin 2020-05-11 18:28:36 UTC
After further discussion, it sounds like this might be due to the HAProxy configuration checking the wrong health endpoint.

From the PR:

"[T]he /readyz endpoint is how the api communicates that it
is gracefully shutting down. Once /readyz starts to report failure,
we want to stop sending traffic to that backend. If we wait for
/healthz, it may be too late."
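The change described in the linked PR can be sketched as a minimal HAProxy backend stanza. This is illustrative only, assuming a standard HAProxy HTTP health check; the server names, addresses, and timing values are assumptions, not the exact values from the PR:

```
# Illustrative sketch only -- server names, addresses, and check intervals
# are assumptions. The key change is health-checking /readyz rather than
# /healthz: /readyz begins failing when the apiserver starts a graceful
# shutdown, so haproxy can stop sending traffic to that backend before
# in-flight connections get reset.
backend masters
    balance roundrobin
    option httpchk GET /readyz HTTP/1.0
    server master-0 192.0.2.10:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master-1 192.0.2.11:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master-2 192.0.2.12:6443 check check-ssl verify none inter 1s fall 2 rise 3
```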

Comment 9 Steve Milner 2020-05-20 17:50:37 UTC
https://github.com/openshift/machine-config-operator/pull/1724 which points to this bug has merged. Should this be MODIFIED or is this bug still being worked on?

Comment 10 Stephen Benjamin 2020-05-20 18:04:35 UTC
The Bugzilla robot seemed to not do the right thing after the merge, but this bug still needs work anyway; that PR didn't fix our problem.

Comment 11 Ben Nemec 2020-05-21 14:09:40 UTC
Moving back to assigned as we don't have a definite fix for this yet.

Comment 12 Karim Boumedhel 2020-05-31 15:54:41 UTC
In our deployments, the same tooling works fine on 4.4 but not on 4.5.
After switching to NVMe disks for the masters, deployment succeeded, so the issue seems to be disk-related.

Comment 18 Ben Nemec 2020-09-11 14:43:07 UTC
Reopening as the initial fix was incomplete.

Comment 22 Victor Voronkov 2020-09-16 15:42:52 UTC
Verified on build 4.6.0-0.ci-2020-09-16-082844

/etc/haproxy/haproxy.cfg and /etc/keepalived/keepalived.conf params are as in PR2075
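For context, the keepalived side of this arrangement might look roughly like the following. This is a hedged sketch, not the actual contents of either file or the exact values from PR 2075; the script path, interval, weight, and instance name are all assumptions (port 9445 matches the haproxy frontend seen in the log below):

```
# Illustrative sketch only -- script path, interval, weight, and instance
# name are assumptions, not the values from PR 2075.
vrrp_script chk_ocp {
    # Check the locally load-balanced API endpoint. The VIP should only
    # fail over when the load-balanced endpoint itself stops responding,
    # not when a single apiserver backend is unhealthy.
    script "/usr/bin/curl -o /dev/null -kLfs https://localhost:9445/readyz"
    interval 1
    weight 50
}

vrrp_instance ostest_API {
    track_script {
        chk_ocp
    }
}
```

The intent of reversing the check timings is that haproxy marks an individual backend down (and recovers it) faster than keepalived decides to move the VIP, so transient apiserver restarts no longer trigger VIP failover.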


[core@master-0-0 ~]$ sudo crictl ps | grep kube-apiserver
74644d04ac969       983002e4672e4f703c1d7da7efb241198cf9d52e2b4f8b98dec73fcb5c2bff2e                                                                  About an hour ago   Running             kube-apiserver                                0                   4f2059c4c4750
[core@master-0-0 ~]$ sudo crictl stop 74644d04ac969
74644d04ac969
[core@master-0-0 ~]$ ip a | grep 192.168
    inet 192.168.123.116/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
    inet 192.168.123.5/24 scope global secondary br-ex

The VIP stays on the same master even after a few minutes, while the HAProxy log shows:

2020-09-16T15:28:40.173394186+00:00 stdout F <134>Sep 16 15:28:40 haproxy[15]: Connect from ::ffff:10.129.0.60:54584 to ::ffff:10.129.0.2:9445 (main/TCP)
2020-09-16T15:28:40.183922551+00:00 stderr F [WARNING] 259/152840 (15) : Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
2020-09-16T15:28:40.184006537+00:00 stdout F <133>Sep 16 15:28:40 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
...
2020-09-16T15:30:58.520642503+00:00 stderr F [WARNING] 259/153058 (15) : Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
2020-09-16T15:30:58.520847019+00:00 stdout F <133>Sep 16 15:30:58 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
...
2020-09-16T15:31:03.684524295+00:00 stderr F [WARNING] 259/153103 (15) : Health check for server masters/master-0-0 failed, reason: Layer7 wrong status, code: 403, info: "Forbidden", check duration: 160ms, status: 0/3 DOWN.
2020-09-16T15:31:03.684571145+00:00 stdout F <133>Sep 16 15:31:03 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer7 wrong status, code: 403, info: "Forbidden", check duration: 160ms, status: 0/3 DOWN.
...
2020-09-16T15:31:05.709370392+00:00 stderr F [WARNING] 259/153105 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 8ms, status: 1/3 DOWN.
2020-09-16T15:31:05.709466318+00:00 stdout F <133>Sep 16 15:31:05 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 8ms, status: 1/3 DOWN.
...
2020-09-16T15:31:06.717679657+00:00 stderr F [WARNING] 259/153106 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/3 DOWN.
2020-09-16T15:31:06.717752699+00:00 stdout F <133>Sep 16 15:31:06 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/3 DOWN.
...
2020-09-16T15:31:07.726572143+00:00 stderr F [WARNING] 259/153107 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/2 UP.
2020-09-16T15:31:07.726878966+00:00 stderr F [WARNING] 259/153107 (15) : Server masters/master-0-0 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2020-09-16T15:31:07.726990626+00:00 stdout F <133>Sep 16 15:31:07 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/2 UP

