Bug 1928297 - HAProxy fails with 500 on some requests
Summary: HAProxy fails with 500 on some requests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Candace Holman
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-12 22:10 UTC by Clayton Coleman
Modified: 2022-08-04 22:32 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:44:28 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 25895 | 0 | None | open | Bug 1928297: Wait until router pod is running before checking health | 2021-02-15 16:01:46 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:45:11 UTC

Description Clayton Coleman 2021-02-12 22:10:01 UTC
Sometime in the last few weeks I started seeing instances of this test

"The HAProxy router should serve the correct routes when scoped to a single namespace and label set"

fail with a 500 (which is new). In general for a 500 to happen HAProxy has to crash or the backend process has to crash, which is something I can't recall seeing in the last few releases, so my immediate concern is that we have regressed via some new failure mode.


    ++ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: 10.131.1.174' http://10.131.1.174:1936/healthz
    + code=500
    + [[ 0 -eq 0 ]]
    + echo 500
    + [[ 500 -eq 200 ]]
    + [[ 500 -ne 503 ]]
    + exit 1
    command terminated with exit code 1
    
    error:
    exit status 1
    500

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1359896257195151360

This seems to ONLY be happening on 4.7 and above, in about 0.75% of failures as per:

https://search.ci.openshift.org/?search=%5C%2B+code%3D500&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=1&maxBytes=20971520&groupBy=job

(the 4.3 reference was to a different problem).

I think we have a new crash somewhere in haproxy or the backend, and therefore it's likely we regressed HAProxy during 4.7 in a way that materializes as a 500.  Needs investigation, blocker to start.

Comment 1 Jessica Forrester 2021-02-12 22:12:25 UTC
marking blocker+ at @ccoleman's request

Comment 2 Clayton Coleman 2021-02-12 22:14:06 UTC
If this is a flaky test, or hasn't gotten worse from 4.6 (harder to prove), then this can get blocker removed.  If this is a flaky test we need to identify why.

Comment 3 Candace Holman 2021-02-13 00:06:54 UTC
It is possible to see a 500 response if the healthz check fails.  After chatting with @mmasters, I'm going to investigate a fix for a possible race condition in the test.  The test should either wait for the router pod to become ready before it checks healthz, or else tolerate some 500 responses while the router syncs.
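
A minimal sketch of the second option (tolerating 500 while the router syncs), assuming the health check is a poll loop like the one in the trace in the description. The function names classify_code and wait_for_healthy, the POLL_INTERVAL variable, and the choice to treat a failed connection (curl's 000) as retryable are illustrative assumptions, not the code in the actual test or in the linked PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map an HTTP status code to a poll-loop action.
# In the original trace only 503 was retried and a 500 failed the test;
# here 500 is retried as well (assumption: the router is still syncing).
classify_code() {
  case "$1" in
    200) echo ok ;;
    000|500|503) echo retry ;;
    *) echo fail ;;
  esac
}

# Hypothetical poll loop. Usage: wait_for_healthy ATTEMPTS PROBE_CMD [ARGS...]
# PROBE_CMD is any command that prints an HTTP status code on stdout.
wait_for_healthy() {
  local attempts="$1"; shift
  local i code=""
  for ((i = 0; i < attempts; i++)); do
    # If the probe itself fails (e.g. connection refused), record 000.
    code="$("$@")" || code=000
    case "$(classify_code "$code")" in
      ok) return 0 ;;
      fail) echo "unexpected status: $code" >&2; return 1 ;;
    esac
    sleep "${POLL_INTERVAL:-1}"
  done
  echo "timed out; last status: $code" >&2
  return 1
}

# Example invocation (router address copied from the trace in the description):
# wait_for_healthy 30 curl -k -s -m 5 -o /dev/null -w '%{http_code}' \
#   http://10.131.1.174:1936/healthz
```

The first option (wait for the router pod to become ready before probing) avoids the retry on 500 altogether; the sketch only illustrates how a tolerant loop would differ from the one in the trace.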

Comment 4 Candace Holman 2021-02-13 02:06:14 UTC
Changing targeted release to 4.8, can backport to 4.7.

Comment 5 Andrew McDermott 2021-02-16 17:53:16 UTC
Clearing blocker as we believe this is a race in the test. Will land the PR and verify with ongoing CI testing that it is just a test issue.

Previous conversation for same 500 status code: https://coreos.slack.com/archives/CCH60A77E/p1585918525311400 (4.5).

Comment 7 Arvind iyengar 2021-03-29 14:57:35 UTC
There are no more failures for the "openshift-tests.[sig-network][Feature: Router] The HAProxy router should serve the correct routes when scoped to a single namespace and label set" test case in the v4.8 release:
------
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-fips
------

Hence marking as 'verified'.

Comment 10 errata-xmlrpc 2021-07-27 22:44:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

