Sometime in the last few weeks I started seeing instances of the test "The HAProxy router should serve the correct routes when scoped to a single namespace and label set" fail with a 500, which is new. In general, for a 500 to happen either HAProxy or the backend process has to crash. I can't recall seeing that in the last few releases, so my immediate concern is that we have regressed via some new failure mode.

++ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: 10.131.1.174' http://10.131.1.174:1936/healthz
+ code=500
+ [[ 0 -eq 0 ]]
+ echo 500
+ [[ 500 -eq 200 ]]
+ [[ 500 -ne 503 ]]
+ exit 1
command terminated with exit code 1
error: exit status 1
500

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1359896257195151360

This seems to be happening ONLY on 4.7 and above, in about 0.75% of failures, as per:

https://search.ci.openshift.org/?search=%5C%2B+code%3D500&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=1&maxBytes=20971520&groupBy=job

(The 4.3 reference in those results was to a different problem.)

I think we have a new crash somewhere in HAProxy or the backend, and therefore it's likely we regressed HAProxy during 4.7 in a way that materializes as a 500. Needs investigation, blocker to start.
marking blocker+ at @ccoleman's request
If this is a flaky test, or hasn't gotten worse since 4.6 (harder to prove), then the blocker flag can be removed. If it is a flaky test, we need to identify why.
It is possible to see a 500 response if the healthz check fails. After chatting with @mmasters I'm going to investigate a fix for a possible race condition in the test. The test should either wait for the router pod to become ready before it checks healthz, or else tolerate some 500 responses while the router syncs.
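To illustrate the second option (tolerating 500 responses while the router syncs), here is a minimal sketch of a polling loop. It is not the actual test code or the eventual fix. The function names `probe_healthz` and `wait_for_healthz`, the default attempt count, and the delay are all assumptions made up for this example. Treating any status other than 200, 500, or 503 as an immediate failure is also an assumption.

```shell
# Hypothetical helper (not from the real test): prints only the HTTP
# status code returned by the URL in $1, using the same curl flags as
# the trace in the bug description.
probe_healthz() {
  curl -k -s -m 5 -o /dev/null -w '%{http_code}' "$1"
}

# Hypothetical helper: wait_for_healthz URL [MAX_ATTEMPTS] [DELAY_SECONDS]
# Returns 0 once the endpoint answers 200. Keeps retrying on 500 and 503,
# which this sketch assumes can occur while the router is still syncing.
# Any other status, or running out of attempts, returns 1.
wait_for_healthz() {
  local url="$1"
  local max_attempts="${2:-30}"   # assumed default, not from the source
  local delay="${3:-1}"           # assumed default, not from the source
  local attempt=1
  local code
  while [ "$attempt" -le "$max_attempts" ]; do
    # If the probe itself fails (e.g. curl times out), record 000.
    code=$(probe_healthz "$url") || code=000
    case "$code" in
      200) return 0 ;;
      500|503) sleep "$delay" ;;
      *) echo "unexpected status $code from $url" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
  done
  echo "gave up after $max_attempts attempts; last status $code" >&2
  return 1
}
```

With these definitions, `wait_for_healthz http://10.131.1.174:1936/healthz 30 1` would make up to 30 probes one second apart and fail only if the endpoint never returns 200 in that window. The first option, waiting for the router pod to report ready before probing at all, avoids the retry logic entirely and may be the cleaner fix.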
Changing targeted release to 4.8, can backport to 4.7.
Clearing blocker as we believe this is a race in the test. Will land the PR and verify with ongoing CI testing that it is just a test issue. Previous conversation for same 500 status code: https://coreos.slack.com/archives/CCH60A77E/p1585918525311400 (4.5).
There are no more failures of the "openshift-tests.[sig-network][Feature: Router] The HAProxy router should serve the correct routes when scoped to a single namespace and label set" test case in the v4.8 release:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-fips

Hence marking as 'verified'.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438