Description of problem: In CI, ha proxy is being reloaded making the API unavailable briefly (and causing tests to fail). HA proxy log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1406897170153476096/artifacts/e2e-metal-ipi/gather-extra/artifacts/pods/openshift-kni-infra_haproxy-master-0_haproxy.log <134>Jun 21 10:20:23 haproxy[33]: Connect from ::1:34004 to ::1:9445 (main/TCP) The client send: reload <134>Jun 21 10:20:24 haproxy[33]: Connect from ::1:34048 to ::1:9445 (main/TCP) [WARNING] 171/102024 (48) : Failed to connect to the old process socket '/var/lib/haproxy/run/haproxy.sock' [NOTICE] 171/102024 (48) : haproxy version is 2.2.13-5f3eb59 [NOTICE] 171/102024 (48) : path to executable is /usr/sbin/haproxy [ALERT] 171/102024 (48) : Failed to get the sockets from the old process! <133>Jun 21 10:20:24 haproxy[48]: Proxy main started. <133>Jun 21 10:20:24 haproxy[48]: Proxy health_check_http_url started. <133>Jun 21 10:20:24 haproxy[48]: Proxy stats started. <133>Jun 21 10:20:24 haproxy[48]: Proxy masters started. Around the same time we have a test failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1406897170153476096 Jun 21 10:20:29.541: FAIL: failed to wait for definition "com.example.crd-publish-openapi-test-common-group.v5.E2e-test-crd-publish-openapi-1571-crd" to be served with the right OpenAPI schema: failed to wait for OpenAPI spec validating condition: Get "https://api.ostest.test.metalkube.org:6443/openapi/v2?timeout=32s": read tcp 192.168.111.1:45138->192.168.111.5:6443: read: connection reset by peer; lastMsg: Version-Release number of selected component (if applicable): 4.9 nightlies (maybe 4.8 as well?) How reproducible: Sometimes Steps to Reproduce: 1. Look at latest e2e-metal-ipi periodics - https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi Actual results: Random unit tests fail when API becomes briefly unavailable Expected results: Successful runs Additional info:
Verified on 4.9.0-0.nightly-2021-06-27-223612 after stopping haproxy-monitor container it was restarted with next log: 2021-06-29T08:08:54.170174554+00:00 stderr F time="2021-06-29T08:08:54Z" level=info msg="Runtimecfg rendering template" path=/etc/haproxy/haproxy.cfg 2021-06-29T08:08:54.170589994+00:00 stderr F time="2021-06-29T08:08:54Z" level=info msg="Rendered cfg file equal to previous one, no need to reload" curConfig="{6443 9445 50000 [{master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::5b 6443} {master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::6c 6443} {master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com fd2e:6f44:5dd8::90 6443}] ::}" 2021-06-29T08:08:54.177990736+00:00 stderr F time="2021-06-29T08:08:54Z" level=info msg="API is reachable through HAProxy" and haproxy container was not restarted, as expected: [core@master-0-0 ~]$ sudo sudo crictl ps | grep haproxy f7414f285a773 ffcbbee91d054bc8486274528e3c650d5214f54f5048aa06a15b9e25084f5ae8 3 minutes ago Running haproxy-monitor 2 40e8c0810cd60 29ced7a5baab3 e4f01cff8172f8ae03a390717f6a23c6a7d748402d93bd7a978fe247ae478441 24 hours ago Running haproxy 2 40e8c0810cd60
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759