Bug 1823950
Summary: | [e2e-metal-ipi] Attempts to connect to API server result in `read: connection reset by peer` | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Stephen Benjamin <stbenjam>
Component: | Machine Config Operator | Assignee: | Ben Nemec <bnemec>
Status: | CLOSED ERRATA | QA Contact: | Victor Voronkov <vvoronko>
Severity: | low | Docs Contact: |
Priority: | low | |
Version: | 4.5 | CC: | aos-bugs, bnemec, kboumedh, kgarriso, mfojtik, mkrejci, mnewby, smilner, vvoronko, wking
Target Milestone: | --- | |
Target Release: | 4.6.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Cause: unnecessary API VIP moves. Consequence: client connection errors. Fix: the API VIP health checks were changed to limit the number of times the VIP moves. Result: fewer errors caused by API VIP moves. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 15:57:47 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Stephen Benjamin
2020-04-14 20:34:28 UTC
Not sure if other platforms are seeing similar results, but e2e-metal-ipi is reliably running into this: https://prow.svc.ci.openshift.org/?type=periodic&job=*metal-ipi*

We run it only every 4 hours, but so far most runs have failed today due to this. The 16h one looks the same: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/46

(In reply to Stephen Benjamin from comment #0)
> Description of problem:
>
> Apr 14 09:38:59.900: FAIL: Post
> https://api.ostest.test.metalkube.org:6443/apis/user.openshift.io/v1/users:
> read tcp [fd2e:6f44:5dd8:c956::1]:53610->[fd2e:6f44:5dd8:c956::5]:6443:
> read: connection reset by peer

A search for this error reveals only 4 occurrences in the past 14 days, all on the e2e-metal-ipi job: https://search.svc.ci.openshift.org/?search=.*users%3A+read+tcp.*read%3A+connection+reset+by+peer.*&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

Going by the error, is IPv6 a factor?

> Looking at the search, for the kcm error, I see a lot more occurrences in
> the last 2 days (500+) than in the last 14d.
>
> https://search.svc.ci.openshift.org/?search=.*failed+to+renew+lease+kube-system%2Fkube-controller-manager%3A+timed+out+waiting+for+the+condition.*&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520

This message is commonly emitted by HA controllers on shutdown and is unlikely to indicate a problem. Given that the reported failure is for 4.5 e2e jobs and doesn't appear to be affecting any 4.4 jobs, moving the target to 4.5.

Thanks, it could be IPv6. I do see more than just metal in that search query (one aws-upgrade job and a few openstack jobs), but it is definitely mostly metal IPI.

We've seen this again yesterday on https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/56. Any idea what we can do to troubleshoot this?

The first step I would suggest is to check operator status in the must-gather dump (cluster-scoped-resources/operator.openshift.io/*/cluster.yaml): https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.5-e2e-metal-ipi/56/artifacts/e2e-metal-ipi/test/must-gather.tar

From what I can tell, none of the major operators (etcd, apiserver, etc.) appear to be unhealthy. In the absence of any indication that the cluster itself is unhealthy, I think there may be a connectivity problem between where the tests are running and the target cluster. I'm afraid I don't have any insight into how to go about diagnosing this possibility for a metal cluster.

Despite the must-gather's indication of cluster health, a clever person pointed out that the job's e2e.log shows evidence of the unhealthy states that are causing the test failures. Ideally it would be possible to see timestamped changes to cluster operator status that could be correlated with the logged errors, but that doesn't exist (yet). Can you bring up a metal IPI cluster in a manner similar to how the CI job does, and see if there is churn in the cluster operator status (i.e. `watch oc get co`) before or during execution of e2e?

Sorry for the churn; I was contemplating switching the component to networking until I noticed the indication of apiserver/controller-manager health problems in the log. Moving back to apiserver.

After further discussion, it sounds like this might be due to the HAProxy configuration looking at the wrong health endpoint.
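In HAProxy terms, that would mean probing the apiserver's readiness endpoint instead of its liveness endpoint. A minimal sketch of the pattern, with illustrative address and timing values (the real configuration is templated by the machine-config-operator):

backend masters
    # /readyz begins failing as soon as the apiserver starts a graceful
    # shutdown, so traffic is drained before clients see connection resets;
    # /healthz can keep returning 200 until it is too late.
    option httpchk GET /readyz HTTP/1.0        # previously: GET /healthz
    http-check expect status 200
    # inter/fall/rise values are illustrative, not the shipped ones
    server master-0-0 192.168.123.116:6443 check check-ssl verify none inter 1s fall 2 rise 3

The "status: 0/3 DOWN" through "2/2 UP" counters in the verification log further down appear to be exactly these rise/fall thresholds being counted off.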
From the PR: "[T]he /readyz endpoint is how the api communicates that it is gracefully shutting down. Once /readyz starts to report failure, we want to stop sending traffic to that backend. If we wait for /healthz, it may be too late."

https://github.com/openshift/machine-config-operator/pull/1724, which points to this bug, has merged. Should this be MODIFIED, or is this bug still being worked on?

The Bugzilla robot seemed to not do the right thing after the merge, but this bug still needs to be worked on anyway; this didn't fix our problem. Moving back to ASSIGNED, as we don't have a definite fix for this yet.

In our deployments, the same tooling works fine for 4.4 but not for 4.5.

After switching to NVMe disks for the masters, the deployment succeeded, so the issue seems to be disk related.

Reopening, as the initial fix was incomplete.
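The eventual fix, per the Doc Text above, limits how often the API VIP moves; the verification below checks the haproxy.cfg and keepalived.conf parameters against PR2075. As a rough, hypothetical sketch of the keepalived side of such a setup (script, priorities, and thresholds are illustrative, not the shipped values):

vrrp_script chk_apiserver {
    # Probe through the local HAProxy frontend (port 9445 in the logs below),
    # which can still reach a healthy apiserver on another master; fall/rise
    # smoothing keeps a single failed probe from changing this node's priority.
    script "/usr/bin/curl -o /dev/null -kLfs https://localhost:9445/readyz"
    interval 1
    fall 2
    rise 3
    weight 50
}

vrrp_instance api_vip {
    interface br-ex               # the interface holding the VIP in the verification below
    virtual_router_id 51
    priority 40
    advert_int 1
    virtual_ipaddress {
        192.168.123.5/24          # the API VIP observed in the verification below
    }
    track_script {
        chk_apiserver
    }
}

Because the script is tracked with a weight rather than as a hard failure, and because it probes via the local HAProxy rather than the local apiserver, stopping one kube-apiserver should not move the VIP at all; that is what the verification below exercises.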
Verified on build 4.6.0-0.ci-2020-09-16-082844. The /etc/haproxy/haproxy.cfg and /etc/keepalived/keepalived.conf params are as in PR2075.

[core@master-0-0 ~]$ sudo crictl ps | grep kube-apiserver
74644d04ac969  983002e4672e4f703c1d7da7efb241198cf9d52e2b4f8b98dec73fcb5c2bff2e  About an hour ago  Running  kube-apiserver  0  4f2059c4c4750
[core@master-0-0 ~]$ sudo crictl stop 74644d04ac969
74644d04ac969
[core@master-0-0 ~]$ ip a | grep 192.168
    inet 192.168.123.116/24 brd 192.168.123.255 scope global dynamic noprefixroute br-ex
    inet 192.168.123.5/24 scope global secondary br-ex

The VIP stays on the same master even after a few minutes, while the HAProxy log shows:

2020-09-16T15:28:40.173394186+00:00 stdout F <134>Sep 16 15:28:40 haproxy[15]: Connect from ::ffff:10.129.0.60:54584 to ::ffff:10.129.0.2:9445 (main/TCP)
2020-09-16T15:28:40.183922551+00:00 stderr F [WARNING] 259/152840 (15) : Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
2020-09-16T15:28:40.184006537+00:00 stdout F <133>Sep 16 15:28:40 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
...
2020-09-16T15:30:58.520642503+00:00 stderr F [WARNING] 259/153058 (15) : Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
2020-09-16T15:30:58.520847019+00:00 stdout F <133>Sep 16 15:30:58 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms, status: 0/3 DOWN.
...
2020-09-16T15:31:03.684524295+00:00 stderr F [WARNING] 259/153103 (15) : Health check for server masters/master-0-0 failed, reason: Layer7 wrong status, code: 403, info: "Forbidden", check duration: 160ms, status: 0/3 DOWN.
2020-09-16T15:31:03.684571145+00:00 stdout F <133>Sep 16 15:31:03 haproxy[15]: Health check for server masters/master-0-0 failed, reason: Layer7 wrong status, code: 403, info: "Forbidden", check duration: 160ms, status: 0/3 DOWN.
...
2020-09-16T15:31:05.709370392+00:00 stderr F [WARNING] 259/153105 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 8ms, status: 1/3 DOWN.
2020-09-16T15:31:05.709466318+00:00 stdout F <133>Sep 16 15:31:05 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 8ms, status: 1/3 DOWN.
...
2020-09-16T15:31:06.717679657+00:00 stderr F [WARNING] 259/153106 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/3 DOWN.
2020-09-16T15:31:06.717752699+00:00 stdout F <133>Sep 16 15:31:06 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/3 DOWN.
...
2020-09-16T15:31:07.726572143+00:00 stderr F [WARNING] 259/153107 (15) : Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/2 UP.
2020-09-16T15:31:07.726878966+00:00 stderr F [WARNING] 259/153107 (15) : Server masters/master-0-0 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2020-09-16T15:31:07.726990626+00:00 stdout F <133>Sep 16 15:31:07 haproxy[15]: Health check for server masters/master-0-0 succeeded, reason: Layer7 check passed, code: 200, info: "OK", check duration: 7ms, status: 2/2 UP

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
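For reference, the manual verification above condenses to a few commands; a sketch (container IDs, the br-ex interface, and addresses are specific to that environment):

# Stop the local kube-apiserver to simulate an outage on this master.
CID=$(sudo crictl ps --name kube-apiserver -q)
sudo crictl stop "$CID"

# The API VIP (192.168.123.5 above) should remain on br-ex on this master,
# while the HAProxy log marks masters/master-0-0 DOWN and then UP again once
# /readyz recovers, without keepalived moving the VIP to another master.
ip -br addr show br-ex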