Bug 1987238

Summary: A negative value applied for the "tlsInspectDelay" option caused the router pod to go into crashloop
Product: OpenShift Container Platform Reporter: Arvind iyengar <aiyengar>
Component: Networking Assignee: Ryan Fredette <rfredette>
Networking sub component: router QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: amcdermo, aos-bugs, skrenger
Version: 4.9   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-10-18 17:43:18 UTC Type: Bug

Description Arvind iyengar 2021-07-29 09:55:16 UTC
Description of problem:
With the newly introduced "tlsInspectDelay" parameter, set through the "TuningOptions" section of the ingresscontroller, a negative value causes the router pod to go into a crash loop.

OpenShift release version:
Release version: 4.9.0-0.nightly-2021-07-27-125952

Cluster Platform:
OCP

How reproducible:
Always

Steps to Reproduce (in detail):
1. Deploy a cluster with the release version above or later.
2. Deploy an ingresscontroller, or modify an existing one, with "tlsInspectDelay" set to a negative value:
----
spec:
  tuningOptions:
    tlsInspectDelay: -10s
----
3. Check the router pod status or the pod logs.
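
Steps 2 and 3 can also be run non-interactively. The following is a minimal sketch, not part of the original reproduction: the helper name tls_inspect_delay_patch is hypothetical, and the ingresscontroller name "internalapps" is assumed from the verification output later in this report. The helper only builds a JSON merge patch; the commented "oc" commands need a live cluster.

```shell
# Hypothetical helper: build a JSON merge patch that sets tlsInspectDelay.
tls_inspect_delay_patch() {
  printf '{"spec":{"tuningOptions":{"tlsInspectDelay":"%s"}}}\n' "$1"
}

# Show the patch that would be sent for a negative value.
tls_inspect_delay_patch "-10s"

# Against a live cluster (ingresscontroller name "internalapps" is an assumption):
# oc -n openshift-ingress-operator patch ingresscontroller/internalapps \
#   --type=merge -p "$(tls_inspect_delay_patch -10s)"
# oc -n openshift-ingress get pods -o wide
```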


Actual results:
The pod fails to reload with the error below:
-----
[NOTICE] 209/044252 (21) : haproxy version is 2.2.15-5e8f49d
[NOTICE] 209/044252 (21) : path to executable is /usr/sbin/haproxy
[ALERT] 209/044252 (21) : parsing [/var/lib/haproxy/conf/haproxy.config:67] : 'tcp-request inspect-delay' expects a positive delay in milliseconds, in frontend 'public' (unexpected character '-')
[ALERT] 209/044252 (21) : parsing [/var/lib/haproxy/conf/haproxy.config:93] : 'tcp-request inspect-delay' expects a positive delay in milliseconds, in frontend 'public_ssl' (unexpected character '-')
[ALERT] 209/044252 (21) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 209/044252 (21) : Fatal errors found in configuration.
-----

Expected results:
A negative value should not be written into the router configuration. Ideally it should be discarded, and the router should load with the default value instead of failing.
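
The expected behaviour amounts to "validate, otherwise fall back to the default". The following is a minimal sketch of that idea only; it is not the router's actual implementation. The function name duration_or_default is hypothetical, and the 5s default is taken from the haproxy.config shown in the verification comment.

```shell
# Hypothetical helper, not taken from the router source: print the value
# if it is a positive HAProxy-style duration, otherwise print the default.
duration_or_default() {
  value="$1"
  default="$2"
  # Accept a positive integer with an optional unit (us, ms, s, m, h, d).
  if printf '%s\n' "$value" | grep -Eq '^[1-9][0-9]*(us|ms|s|m|h|d)?$'; then
    printf '%s\n' "$value"
  else
    printf '%s\n' "$default"
  fi
}

duration_or_default "-10s" "5s"   # prints 5s
duration_or_default "10s" "5s"    # prints 10s
```

With a check of this kind, a value such as -10s never reaches the "tcp-request inspect-delay" line, so HAProxy does not reject the configuration.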

Impact of the problem:
The "TuningOptions" parameters introduce many options for tuning HAProxy performance and behavior. There is a good chance that a negative value applied by human error will crash the router, which is undesirable.

Additional info:

The other tuning options applied via the "TuningOptions" section, such as the TCP client/server and tunnel timers, appear to work correctly: the negative value is discarded and the router loads with the default value.
------
Ingresscontroller configuration:
  tuningOptions:
    clientFinTimeout: -2s
    clientTimeout: -32s
    serverFinTimeout: -2s
    serverTimeout: -25s
    tunnelTimeout: -2h

Router status after the change:
oc -n openshift-ingress get pods -o wide               
NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
router-default-d8f4b6d59-f9bpq         1/1     Running   0          27h     10.131.0.12   ip-10-0-167-125.us-east-2.compute.internal   <none>           <none>
router-default-d8f4b6d59-qh4ff         1/1     Running   0          27h     10.128.2.7    ip-10-0-220-179.us-east-2.compute.internal   <none>           <none>
router-internalapps-6b5fdb4c89-4tjv8   2/2     Running   0          3m35s   10.131.0.35   ip-10-0-167-125.us-east-2.compute.internal   <none>           <none>

Router pod environment variables:
sh-4.4$ env | grep -i timeout
ROUTER_CLIENT_FIN_TIMEOUT=-2s
ROUTER_DEFAULT_CLIENT_TIMEOUT=-32s
ROUTER_DEFAULT_SERVER_TIMEOUT=-25s
ROUTER_DEFAULT_SERVER_FIN_TIMEOUT=-2s
ROUTER_DEFAULT_TUNNEL_TIMEOUT=-2h

haproxy.config
  timeout client 30s
  timeout client-fin 1s
  timeout server 30s
  timeout server-fin 1s

  # Long timeout for WebSocket connections.
  timeout tunnel 1h

-------

Comment 1 Arvind iyengar 2021-07-30 09:42:49 UTC
Verified in release version "4.9.0-0.ci.test-2021-07-30-084757-ci-ln-06q9z1b-latest". With this release, the router no longer appears to crash when a negative value is specified for the "tlsInspectDelay" parameter; the value is discarded and the router loads with the default value:
-------
 oc get clusterversion                      
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.ci.test-2021-07-30-084757-ci-ln-06q9z1b-latest   True        False         2m3s    Cluster version is 4.9.0-0.ci.test-2021-07-30-084757-ci-ln-06q9z1b-latest


After the change:

oc -n openshift-ingress-operator get ingresscontroller internalapps -o yaml | grep -i tuning -A1
  tuningOptions:
    tlsInspectDelay: -10s

oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
router-default-575d9dc464-7nx55        1/1     Running   0          28m     10.131.0.5    ip-10-0-165-4.us-east-2.compute.internal    <none>           <none>
router-default-575d9dc464-m9cqh        1/1     Running   0          28m     10.128.2.8    ip-10-0-250-44.us-east-2.compute.internal   <none>           <none>
router-internalapps-6747cd588d-tt6cj   2/2     Running   0          5s      10.131.0.27   ip-10-0-165-4.us-east-2.compute.internal    <none>           <none>

sh-4.4$ env | grep -i inspect
ROUTER_INSPECT_DELAY=-10s

sh-4.4$ cat haproxy.config
  frontend public
  bind :80 accept-proxy
  mode http
  tcp-request inspect-delay 5s <----
-------
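
The manual check of haproxy.config above could be scripted. The following is a minimal sketch under assumptions: the helper names inspect_delays and inspect_delays_ok are hypothetical, and the config path in the usage comment is the one reported in the HAProxy error output in the description.

```shell
# Hypothetical helper: read an haproxy.config on stdin and print each
# 'tcp-request inspect-delay' value, one per line.
inspect_delays() {
  awk '$1 == "tcp-request" && $2 == "inspect-delay" { print $3 }'
}

# Succeeds only if at least one value is found and none starts with "-".
inspect_delays_ok() {
  delays="$(inspect_delays)"
  [ -n "$delays" ] && ! printf '%s\n' "$delays" | grep -q -e '^-'
}

# Example with a fragment like the one shown above.
printf 'frontend public\n  tcp-request inspect-delay 5s\n' | inspect_delays

# Inside the router pod (path taken from the error output in the description):
# inspect_delays_ok < /var/lib/haproxy/conf/haproxy.config && echo "inspect-delay OK"
```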

Comment 5 errata-xmlrpc 2021-10-18 17:43:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759