Bug 1896914

Summary: Route with `haproxy.router.openshift.io/timeout: 365d` kills the ingress controller
Product: OpenShift Container Platform
Reporter: Stephen Greene <sgreene>
Component: Networking
Assignee: Stephen Greene <sgreene>
Networking sub component: router
QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: medium
CC: aiyengar, aos-bugs, bbennett, bperkins, brad.williams, cblecker, erich, hongli, jeder, mjoseph, mmasters, mwoodson, nmalik, openshift-bugzilla-robot, rrackow, sdodson, wking
Version: 4.4
Keywords: ServiceDeliveryImpact, UpcomingSprint, Upgrades
Target Milestone: ---   
Target Release: 4.5.z   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1896905 Environment:
Last Closed: 2020-12-15 20:28:44 UTC Type: ---
Bug Depends On: 1896905    
Bug Blocks:    

Comment 2 Arvind iyengar 2020-11-25 08:48:29 UTC
Verified in the "4.5.0-0.ci.test-2020-11-25-061734-ci-ln-g7zsxx2" CI image. With the patch, the "timer overflow" no longer occurs, and router restarts proceed without disruption:
$ oc get clusterversion
NAME      VERSION                                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.ci.test-2020-11-25-061734-ci-ln-g7zsxx2   True        False         83m     Cluster version is 4.5.0-0.ci.test-2020-11-25-061734-ci-ln-g7zsxx2

Route with a very large timeout annotation:
$ oc annotate route service-unsecure haproxy.router.openshift.io/timeout=9999d  <---
$ oc describe route service-unsecure
Name:			service-unsecure
Namespace:		test1
Created:		26 minutes ago
Labels:			name=service-unsecure
Annotations:		haproxy.router.openshift.io/timeout=9999d <---

In the haproxy configuration: 

backend be_http:test1:service-unsecure
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  2147483647ms <--- [the value is capped at 2147483647 ms, about 24.85 days]
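
The capped value in the configuration above can be reproduced with a little arithmetic: 2147483647 is 2^31 - 1, the documented HAProxy limit for timer values in milliseconds, and 9999d is far larger than that. The following is a minimal Python sketch of such a clamp, not the router's actual implementation; the helper name `clamp_timeout`, the unit table, and the choice to always emit milliseconds are assumptions made for illustration.

```python
import re

# Largest timer value HAProxy accepts, in milliseconds (2^31 - 1, ~24.85 days).
HAPROXY_MAX_TIMEOUT_MS = 2**31 - 1  # 2147483647

# Milliseconds per unit suffix. "us" is omitted to keep the sketch in integers.
UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def clamp_timeout(annotation: str) -> str:
    """Hypothetical helper: turn e.g. '9999d' into a clamped '<n>ms' string."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", annotation.strip())
    if match is None:
        raise ValueError(f"unrecognized timeout: {annotation!r}")
    number, unit = match.groups()
    # Assumption: a bare number is interpreted as milliseconds.
    millis = int(number) * UNIT_MS[unit or "ms"]
    return f"{min(millis, HAPROXY_MAX_TIMEOUT_MS)}ms"


print(clamp_timeout("9999d"))  # 2147483647ms
print(clamp_timeout("365d"))   # 2147483647ms
print(clamp_timeout("30s"))    # 30000ms
```

Both 9999d (863,913,600,000 ms) and the 365d value from the bug summary (31,536,000,000 ms) exceed the limit, so both collapse to the same capped value, while ordinary timeouts pass through unchanged.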

After restarts, the router runs without errors, unlike older versions, where this annotation would have triggered a restart loop:

$ oc -n openshift-ingress get pods -o wide
NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE                                                NOMINATED NODE   READINESS GATES
router-default-676754f5c4-gkw2m        1/1     Running   0          96m    ci-ln-g7zsxx2-002ac-qnx24-worker-centralus2-9dbkk   <none>           <none>
router-default-676754f5c4-p6qls        1/1     Running   0          96m   ci-ln-g7zsxx2-002ac-qnx24-worker-centralus3-k4zx9   <none>           <none>
router-internalapps-6569bb474b-svlsj   2/2     Running   0          74s    ci-ln-g7zsxx2-002ac-qnx24-worker-centralus2-9dbkk   <none>           <none>

$ oc -n openshift-ingress logs router-internalapps-6569bb474b-svlsj -c router --tail 50
I1125 08:28:37.868271       1 template.go:298] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: adc7d59\nversionFromGit: v0.0.0-unknown\ngitTreeState: clean\nbuildDate: 2020-11-25T06:16:03Z\n"
I1125 08:28:37.870356       1 metrics.go:154] metrics "msg"="router health and metrics port listening on HTTP and HTTPS"  "address"=""
I1125 08:28:37.877969       1 router.go:164] template "msg"="creating a new template router"  "writeDir"="/var/lib/haproxy"
I1125 08:28:37.878036       1 router.go:239] template "msg"="router will coalesce reloads within an interval of each other"  "interval"="5s"
I1125 08:28:37.878550       1 router.go:301] template "msg"="watching for changes"  "path"="/etc/pki/tls/private"
I1125 08:28:37.878626       1 router.go:257] router "msg"="router is including routes in all namespaces"  
I1125 08:28:37.878717       1 reflector.go:175] Starting reflector *v1.Service (30m0s) from github.com/openshift/router/pkg/router/template/service_lookup.go:33
I1125 08:28:37.879889       1 reflector.go:175] Starting reflector *v1.Route (30m0s) from github.com/openshift/router/pkg/router/controller/factory/factory.go:116
I1125 08:28:37.879951       1 reflector.go:175] Starting reflector *v1.Endpoints (30m0s) from github.com/openshift/router/pkg/router/controller/factory/factory.go:116
E1125 08:28:37.986820       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I1125 08:28:38.021164       1 router.go:536] template "msg"="router reloaded"  "output"="[ALERT] 329/082837 (22) : sendmsg()/writev() failed in logger #1: No such file or directory (errno=2)\n - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1125 08:28:43.029006       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1125 08:28:55.589888       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I1125 08:29:00.591829       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
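
The log check above can also be scripted. The sketch below is illustrative only: it counts "router reloaded" entries, how many of them report "Health check ok", and how many lines carry the klog error severity prefix `E`. The function name `summarize_reloads` is hypothetical, and the sample lines are copied from the log output above.

```python
# Sample lines copied from the router log above (raw strings keep the literal \n).
LOG_LINES = [
    r'E1125 08:28:37.986820       1 haproxy.go:418] can\'t scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory',
    r'I1125 08:28:43.029006       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"',
    r'I1125 08:28:55.589888       1 router.go:536] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"',
]


def summarize_reloads(lines):
    """Hypothetical helper: return (reloads, healthy_reloads, error_lines)."""
    reloads = [line for line in lines if '"msg"="router reloaded"' in line]
    healthy = [line for line in reloads if "Health check ok" in line]
    # klog prefixes each line with its severity: I, W, E, or F.
    errors = [line for line in lines if line.startswith("E")]
    return len(reloads), len(healthy), len(errors)


print(summarize_reloads(LOG_LINES))  # (2, 2, 1)
```

In this sample every reload passes its health check; the single error line is the startup scrape failure that appears before the HAProxy socket exists.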

Comment 7 errata-xmlrpc 2020-12-15 20:28:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.5.23 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.