1974350 – LB endpoint for API becomes unavailable briefly during openshift test suite

Bug 1974350 - LB endpoint for API becomes unavailable briefly during openshift test suite

Summary: LB endpoint for API becomes unavailable briefly during openshift test suite

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Yossi Boaron
QA Contact:	Victor Voronkov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-06-21 12:56 UTC by Stephen Benjamin
Modified:	2021-10-18 17:36 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-compact=all
Last Closed:	2021-10-18 17:35:50 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift baremetal-runtimecfg pull 145	0	None	open	Bug 1974350: HAProxy-monitor: send reload only if cfg file changed	2021-06-22 13:17:55 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:36:09 UTC

Description Stephen Benjamin 2021-06-21 12:56:30 UTC

Description of problem:

In CI, ha proxy is being reloaded making the API unavailable briefly (and causing tests to fail). HA proxy log:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1406897170153476096/artifacts/e2e-metal-ipi/gather-extra/artifacts/pods/openshift-kni-infra_haproxy-master-0_haproxy.log

<134>Jun 21 10:20:23 haproxy[33]: Connect from ::1:34004 to ::1:9445 (main/TCP)
The client send: reload
<134>Jun 21 10:20:24 haproxy[33]: Connect from ::1:34048 to ::1:9445 (main/TCP)
[WARNING] 171/102024 (48) : Failed to connect to the old process socket '/var/lib/haproxy/run/haproxy.sock'
[NOTICE] 171/102024 (48) : haproxy version is 2.2.13-5f3eb59
[NOTICE] 171/102024 (48) : path to executable is /usr/sbin/haproxy
[ALERT] 171/102024 (48) : Failed to get the sockets from the old process!
<133>Jun 21 10:20:24 haproxy[48]: Proxy main started.
<133>Jun 21 10:20:24 haproxy[48]: Proxy health_check_http_url started.
<133>Jun 21 10:20:24 haproxy[48]: Proxy stats started.
<133>Jun 21 10:20:24 haproxy[48]: Proxy masters started.

Around the same time we have a test failure:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1406897170153476096

Jun 21 10:20:29.541: FAIL: failed to wait for definition "com.example.crd-publish-openapi-test-common-group.v5.E2e-test-crd-publish-openapi-1571-crd" to be served with the right OpenAPI schema: failed to wait for OpenAPI spec validating condition: Get "https://api.ostest.test.metalkube.org:6443/openapi/v2?timeout=32s": read tcp 192.168.111.1:45138->192.168.111.5:6443: read: connection reset by peer; lastMsg: 



Version-Release number of selected component (if applicable):

4.9 nightlies (maybe 4.8 as well?)

How reproducible:

Sometimes


Steps to Reproduce:
1. Look at latest e2e-metal-ipi periodics - https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi

Actual results:

Random unit tests fail when API becomes briefly unavailable

Expected results:

Successful runs

Additional info:

Comment 2 Victor Voronkov 2021-06-29 08:14:49 UTC

Verified on 4.9.0-0.nightly-2021-06-27-223612

after stopping haproxy-monitor container it was restarted with next log:

2021-06-29T08:08:54.170174554+00:00 stderr F time="2021-06-29T08:08:54Z" level=info msg="Runtimecfg rendering template" path=/etc/haproxy/haproxy.cfg
2021-06-29T08:08:54.170589994+00:00 stderr F time="2021-06-29T08:08:54Z" level=info msg="Rendered cfg file equal to previous one, no need to reload" curConfig="{6443 9445 50000 [{master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::5b 6443} {master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::6c 6443} {master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::90 6443}] ::}"
2021-06-29T08:08:54.177990736+00:00 stderr F time="2021-06-29T08:08:54Z" level=info msg="API is reachable through HAProxy"

and haproxy container was not restarted, as expected:

[core@master-0-0 ~]$ sudo sudo crictl ps | grep haproxy
f7414f285a773       ffcbbee91d054bc8486274528e3c650d5214f54f5048aa06a15b9e25084f5ae8                                                         3 minutes ago       Running             haproxy-monitor                               2                   40e8c0810cd60
29ced7a5baab3       e4f01cff8172f8ae03a390717f6a23c6a7d748402d93bd7a978fe247ae478441                                                         24 hours ago        Running             haproxy                                       2                   40e8c0810cd60

Comment 5 errata-xmlrpc 2021-10-18 17:35:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Note You need to log in before you can comment on or make changes to this bug.