1987238 – A negative value applied for the "tlsInspectDelay" option caused the router pod to go into crashloop


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Ryan Fredette
QA Contact:	Arvind iyengar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-07-29 09:55 UTC by Arvind iyengar
Modified:	2022-08-04 22:32 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-18 17:43:18 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift router pull 322	0	None	open	Bug 1987238: Validate ROUTER_INSPECT_DELAY env value generating haproxy config	2021-07-29 19:02:42 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:43:26 UTC

Description Arvind iyengar 2021-07-29 09:55:16 UTC

Description of problem:
The newly introduced "tlsInspectDelay" parameter is set through the "TuningOptions" section of the ingresscontroller. When it is given a negative value, the router pod goes into a crash loop.

OpenShift release version:
Release version: 4.9.0-0.nightly-2021-07-27-125952

Cluster Platform:
OCP

How reproducible:
Always

Steps to Reproduce (in detail):
1. Deploy a cluster with the release version above or later.
2. Deploy an ingresscontroller, or modify an existing one, so that "tlsInspectDelay" is set to a negative value:
----
If all the values are set to something negative, they are ignored and the proxy is configured with default timer values:
spec:
  tuningOptions:
    tlsInspectDelay: -10s
----
3. Check the router pod status or the pod logs.


Actual results:
The pod fails to reload with the error below:
-----
[NOTICE] 209/044252 (21) : haproxy version is 2.2.15-5e8f49d
[NOTICE] 209/044252 (21) : path to executable is /usr/sbin/haproxy
[ALERT] 209/044252 (21) : parsing [/var/lib/haproxy/conf/haproxy.config:67] : 'tcp-request inspect-delay' expects a positive delay in milliseconds, in frontend 'public' (unexpected character '-')
[ALERT] 209/044252 (21) : parsing [/var/lib/haproxy/conf/haproxy.config:93] : 'tcp-request inspect-delay' expects a positive delay in milliseconds, in frontend 'public_ssl' (unexpected character '-')
[ALERT] 209/044252 (21) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 209/044252 (21) : Fatal errors found in configuration.
-----

Expected results:
The negative value should not be written into the router configuration. Ideally it should be discarded, and the router should load with the default value instead of failing.

Impact of the problem:
The "TuningOptions" parameters introduce many options for tuning haproxy performance and behaviour. A negative value applied by human error is a realistic possibility, and it leads to router crashes, which is not desirable.

Additional info:

The other tuning options applied via the "TuningOptions" section, such as the TCP server/client and tunnel timers, appear to work correctly: the negative value is discarded and the router loads with the default value.
------
Ingresscontroller configuration:
  tuningOptions:
    clientFinTimeout: -2s
    clientTimeout: -32s
    serverFinTimeout: -2s
    serverTimeout: -25s
    tunnelTimeout: -2h

Router status after the change:
oc -n openshift-ingress get pods -o wide               
NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-d8f4b6d59-f9bpq         1/1     Running   0          27h     10.131.0.12   ip-10-0-167-125.us-east-2.compute.internal   <none>           <none>
router-default-d8f4b6d59-qh4ff         1/1     Running   0          27h     10.128.2.7    ip-10-0-220-179.us-east-2.compute.internal   <none>           <none>
router-internalapps-6b5fdb4c89-4tjv8   2/2     Running   0          3m35s   10.131.0.35   ip-10-0-167-125.us-east-2.compute.internal   <none>           <none>

Router pod environment variables:
sh-4.4$ env | grep -i timeout
ROUTER_CLIENT_FIN_TIMEOUT=-2s
ROUTER_DEFAULT_CLIENT_TIMEOUT=-32s
ROUTER_DEFAULT_SERVER_TIMEOUT=-25s
ROUTER_DEFAULT_SERVER_FIN_TIMEOUT=-2s
ROUTER_DEFAULT_TUNNEL_TIMEOUT=-2h

haproxy.config
  timeout client 30s
  timeout client-fin 1s
  timeout server 30s
  timeout server-fin 1s

  # Long timeout for WebSocket connections.
  timeout tunnel 1h

-------

Comment 1 Arvind iyengar 2021-07-30 09:42:49 UTC

Verified in the "4.9.0-0.ci.test-2021-07-30-084757-ci-ln-06q9z1b-latest" release version. With this release, the router no longer crashes when a negative value is specified for the "tlsInspectDelay" parameter; the value is discarded and the router loads with the default value:
-------
 oc get clusterversion                      
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.ci.test-2021-07-30-084757-ci-ln-06q9z1b-latest   True        False         2m3s    Cluster version is 4.9.0-0.ci.test-2021-07-30-084757-ci-ln-06q9z1b-latest


After the change:

oc -n openshift-ingress-operator get ingresscontroller internalapps -o yaml | grep -i tuning -A1
  tuningOptions:
    tlsInspectDelay: -10s

oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
router-default-575d9dc464-7nx55        1/1     Running   0          28m     10.131.0.5    ip-10-0-165-4.us-east-2.compute.internal    <none>           <none>
router-default-575d9dc464-m9cqh        1/1     Running   0          28m     10.128.2.8    ip-10-0-250-44.us-east-2.compute.internal   <none>           <none>
router-internalapps-6747cd588d-tt6cj   2/2     Running   0          5s      10.131.0.27   ip-10-0-165-4.us-east-2.compute.internal    <none>           <none>

sh-4.4$ env | grep -i inspect
ROUTER_INSPECT_DELAY=-10s

sh-4.4$ cat haproxy.config
  frontend public
  bind :80 accept-proxy
  mode http
  tcp-request inspect-delay 5s <----
-------

Comment 5 errata-xmlrpc 2021-10-18 17:43:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
