1928297 – HAProxy fails with 500 on some requests

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Candace Holman
QA Contact:	Arvind iyengar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-02-12 22:10 UTC by Clayton Coleman
Modified:	2022-08-04 22:32 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:44:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25895	0	None	open	Bug 1928297: Wait until router pod is running before checking health	2021-02-15 16:01:46 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:45:11 UTC

Description Clayton Coleman 2021-02-12 22:10:01 UTC

Sometime in the last few weeks I started seeing instances of this test

"The HAProxy router should serve the correct routes when scoped to a single namespace and label set"

fail with a 500 (which is new). In general, for a 500 to happen either HAProxy or the backend process has to crash, which I can't recall seeing in the last few releases, so my immediate concern is that we have regressed via some new failure mode.


    ++ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: 10.131.1.174' http://10.131.1.174:1936/healthz
    + code=500
    + [[ 0 -eq 0 ]]
    + echo 500
    + [[ 500 -eq 200 ]]
    + [[ 500 -ne 503 ]]
    + exit 1
    command terminated with exit code 1
    
    error:
    exit status 1
    500

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1359896257195151360

This seems to ONLY be happening on 4.7 and above, in about 0.75% of failures as per:

https://search.ci.openshift.org/?search=%5C%2B+code%3D500&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=1&maxBytes=20971520&groupBy=job

(the 4.3 reference was to a different problem).

I think we have a new crash somewhere in haproxy or the backend, and therefore it's likely we regressed HAProxy during 4.7 in a way that materializes as a 500.  Needs investigation, blocker to start.

Comment 1 Jessica Forrester 2021-02-12 22:12:25 UTC

marking blocker+ at @ccoleman's request

Comment 2 Clayton Coleman 2021-02-12 22:14:06 UTC

If this is a flaky test, or hasn't gotten worse from 4.6 (harder to prove), then this can get blocker removed.  If this is a flaky test we need to identify why.

Comment 3 Candace Holman 2021-02-13 00:06:54 UTC

It is possible to see a 500 response if the healthz check fails.  After chatting with @mmasters I'm going to investigate a fix to a possible race condition in the test.  The test should either wait for the router pod to become ready before it checks the healthz endpoint, or else tolerate some 500 responses while the router syncs.
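
For illustration only, here is a minimal sketch of the second option (tolerating 500 while the router syncs). It is not the code from the linked pull request. The function names `classify_healthz`, `wait_for_healthz`, and `curl_healthz`, the `ROUTER_IP` variable, and the choice to retry when curl reports `000` are all assumptions; only the curl flags are taken from the trace in the description.

```shell
#!/bin/sh

# Map an HTTP status code from /healthz to a polling decision.
# Assumption: 500 and 503 both mean "router not ready yet", and 000
# (what curl prints when no response was received) is also retried.
classify_healthz() {
  case "$1" in
    200) echo ready ;;
    000|500|503) echo retry ;;
    *) echo fail ;;
  esac
}

# Poll until the probe reports 200, giving up after max_attempts tries.
# $1: command that prints one HTTP status code (hypothetical hook, so the
#     loop can be exercised without a live router)
# $2: maximum number of attempts
# $3: seconds to sleep between attempts
wait_for_healthz() {
  local probe="$1" max_attempts="$2" delay="$3"
  local attempt=1 code=""
  while [ "$attempt" -le "$max_attempts" ]; do
    # If the probe command itself fails, fall back to 000.
    code=$("$probe") || code=000
    case "$(classify_healthz "$code")" in
      ready) return 0 ;;
      fail)
        echo "unexpected status ${code}" >&2
        return 1
        ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "not ready after ${max_attempts} attempts (last status ${code})" >&2
  return 1
}

# Probe using the same curl flags as the trace; ROUTER_IP must be set by the caller.
curl_healthz() {
  curl -k -s -m 5 -o /dev/null -w '%{http_code}' \
    --header "Host: ${ROUTER_IP}" "http://${ROUTER_IP}:1936/healthz"
}
```

With these definitions, something like `ROUTER_IP=10.131.1.174 wait_for_healthz curl_healthz 30 2` would poll for up to roughly a minute. The bounded attempt count matters: a router that never becomes healthy still fails the test instead of hanging it.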

Comment 4 Candace Holman 2021-02-13 02:06:14 UTC

Changing targeted release to 4.8, can backport to 4.7.

Comment 5 Andrew McDermott 2021-02-16 17:53:16 UTC

Clearing blocker as we believe this is a race in the test. Will land the PR and verify with ongoing CI testing that it is just a test issue.

Previous conversation for same 500 status code: https://coreos.slack.com/archives/CCH60A77E/p1585918525311400 (4.5).

Comment 7 Arvind iyengar 2021-03-29 14:57:35 UTC

There are no more failures for the "openshift-tests.[sig-network][Feature: Router] The HAProxy router should serve the correct routes when scoped to a single namespace and label set" test case in the v4.8 release:
------
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-fips
------

Hence marking as 'verified'.

Comment 10 errata-xmlrpc 2021-07-27 22:44:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
