Description of problem:

Setting a `haproxy.router.openshift.io/timeout: 365d` annotation seems to bring down the ingress controller. Our cluster was updated to 4.4, after which all routes stopped working because one route had this annotation.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a route with the `haproxy.router.openshift.io/timeout: 365d` annotation (a minimal example manifest is included under Additional info below).
2. Observe that the ingress controller is unable to load the config:
```
[ALERT] 209/102809 (107) : parsing [/var/lib/haproxy/conf/haproxy.config:859] : timer overflow in argument '365d' to 'timeout server' (maximum value is 2147483647 ms or ~24.8 days)
[ALERT] 209/102809 (107) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 209/102809 (107) : Fatal errors found in configuration.
E0728 10:28:14.659940 1 limiter.go:165] error reloading router: exit status 1
```

Actual results:

Expected results:

Additional info:
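A minimal Route manifest reproducing the failure (namespace, name, and target service are illustrative):

```
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example
  namespace: example
  annotations:
    # Any value over 2147483647 ms (~24.8 days) trips the haproxy check.
    haproxy.router.openshift.io/timeout: 365d
spec:
  to:
    kind: Service
    name: example
```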
If we were previously running OCP 4.3, then this would be haproxy 1.8, and I see no mention of "timer overflow" in the 1.8 source tree. In OCP 4.4 we are running haproxy 2.0, and in that version I see plenty of references to "timer overflow". This may be new/additional verification in 2.0.

```
-*- mode: ag; default-directory: "~/git.haproxy.org/haproxy-2.0/src/" -*-
Ag started at Tue Jul 28 14:24:23

ag --literal --group --line-number --column --color --color-match 30\;43 --color-path 1\;32 --smart-case --stats -- timer\ overflow .

File: proto_tcp.c
1874:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",
1959:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",

File: cfgparse-listen.c
1101:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 s (~68 years).\n",
1135:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 s (~68 years).\n",
2016:32: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to stats refresh interval, maximum value is 2147483647 s (~68 years).\n",
3556:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to grace time, maximum value is 2147483647 ms (~24.8 days).\n",

File: hlua.c
8164:19: memprintf(err, "timer overflow in argument <%s> to <%s> (maximum value is 2147483647 ms or ~24.8 days)",

File: proxy.c
281:19: memprintf(err, "timer overflow in argument '%s' to 'timeout %s' (maximum value is 2147483647 ms or ~24.8 days)",
1072:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",

File: cfgparse-global.c
267:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 65535 ms.\n",
958:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: server.c
413:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",
2199:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2437:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2467:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2497:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2557:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",

File: tcp_rules.c
931:25: memprintf(err, "%s (timer overflow in '%s', maximum value is 2147483647 ms or ~24.8 days)", *err, args[2]);
1045:25: memprintf(err, "%s (timer overflow in '%s', maximum value is 2147483647 ms or ~24.8 days)", *err, args[2]);

File: cfgparse.c
1216:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 ms (~24.8 days).\n",
1308:32: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s %s>, maximum value is 2147483647 ms (~24.8 days).\n",
1496:32: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s %s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: ssl_sock.c
9192:19: memprintf(err, "timer overflow in argument '%s' to <%s> (maximum value is 2147483647 s or ~68 years).",

File: stick_table.c
773:36: ha_alert("parsing [%s:%d]: %s: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: flt_spoe.c
3500:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s %s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: cli.c
313:20: memprintf(err, "timer overflow in argument '%s' to '%s %s' (maximum value is 2147483647 ms or ~24.8 days)",

26 matches
12 files contained matches
113 files searched
4924881 bytes searched
0.004205 seconds
```
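For reference, here is a minimal sketch of that validation logic, written in Go rather than haproxy's C, with illustrative names: the timer argument is converted to milliseconds and rejected when it exceeds the signed 32-bit range (2147483647 ms, ~24.8 days), which is exactly what trips on 365d.

```
package main

import (
	"errors"
	"fmt"
	"math"
	"strconv"
	"strings"
)

var errTimerOverflow = errors.New("timer overflow (maximum value is 2147483647 ms or ~24.8 days)")

// parseTimerMS converts values such as "5s", "30m", or "365d" to milliseconds.
func parseTimerMS(arg string) (int64, error) {
	units := []struct {
		suffix string
		ms     int64
	}{
		// "ms" must be tried before "s" and "m".
		{"ms", 1}, {"s", 1000}, {"m", 60 * 1000},
		{"h", 3600 * 1000}, {"d", 24 * 3600 * 1000},
	}
	mult, num := int64(1), arg // bare numbers are milliseconds
	for _, u := range units {
		if strings.HasSuffix(arg, u.suffix) {
			mult, num = u.ms, strings.TrimSuffix(arg, u.suffix)
			break
		}
	}
	n, err := strconv.ParseInt(num, 10, 64)
	if err != nil {
		return 0, err
	}
	if n > math.MaxInt32/mult { // the stricter check present in haproxy 2.0
		return 0, errTimerOverflow
	}
	return n * mult, nil
}

func main() {
	for _, arg := range []string{"5s", "24d", "365d"} {
		if ms, err := parseTimerMS(arg); err != nil {
			fmt.Printf("%s: %v\n", arg, err)
		} else {
			fmt.Printf("%s: %d ms\n", arg, ms)
		}
	}
}
```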
Looking at a fix and will backport to 4.4.
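For context, one plausible shape for such a fix, sketched in Go with hypothetical names (not necessarily what the actual PR does): have the router's template helper clamp an out-of-range annotation value at HAProxy's maximum instead of writing it verbatim into haproxy.config.

```
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

const haproxyMaxMS = int64(2147483647) // 2^31-1 ms, ~24.8 days

var timerRE = regexp.MustCompile(`^(\d+)(ms|s|m|h|d)?$`)

var unitMS = map[string]int64{
	"": 1, "ms": 1, "s": 1000,
	"m": 60 * 1000, "h": 3600 * 1000, "d": 24 * 3600 * 1000,
}

// clampTimeout (hypothetical name) returns val unchanged when it fits,
// the HAProxy maximum when it overflows, and "" for unparseable input so
// that the router's default timeout applies instead.
func clampTimeout(val string) string {
	m := timerRE.FindStringSubmatch(val)
	if m == nil {
		return ""
	}
	n, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil || n > haproxyMaxMS/unitMS[m[2]] {
		// err here means the digits alone overflow int64, so clamp as well.
		return fmt.Sprintf("%dms", haproxyMaxMS)
	}
	return val
}

func main() {
	fmt.Println(clampTimeout("365d")) // 2147483647ms
	fmt.Println(clampTimeout("30s"))  // 30s
}
```

The advantage of clamping over rejecting is that existing routes keep working after an upgrade, rather than taking down the whole config reload.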
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
example: Up to 2 minute disruption in edge routing
example: Up to 90 seconds of API downtime
example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
example: Issue resolves itself after five minutes
example: Admin uses oc to fix things
example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
example: No, it's always been like this, we just never noticed
example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Customers upgrading from 4.3 to 4.4 with a route that has the haproxy.router.openshift.io/timeout annotation specifying a timeout larger than 24.8 days.

What is the impact? Is it serious enough to warrant blocking edges?
Routes are broken until all routes with the offending annotation are deleted or modified to remove the annotation.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Moderately. Administrator must use oc to list routes, find the offending route, and modify or delete it.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, from 4.3 to 4.4.
Note that the OpenShift Console and OAuth use routes, so if the administrator does not have a kubeconfig or a valid token and needs to use the Console or OAuth to authenticate, then the remediation becomes more involved: the administrator must use SSH to get a kubeconfig in order to use oc to fix the problem. Therefore comment 6 should be revised as follows:

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Customers upgrading from 4.3 to 4.4 with a route that has the haproxy.router.openshift.io/timeout annotation specifying a timeout larger than 24.8 days.

What is the impact? Is it serious enough to warrant blocking edges?
Routes are broken until all routes with the offending annotation are deleted or modified to remove the annotation.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Moderately involved to very involved. The administrator must use oc to list routes, find the offending route, and modify or delete it. SSH access may be required if the cluster administrator does not have a valid token for oc.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, from 4.3 to 4.4.
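To make the oc-based remediation concrete, here is one way to find routes carrying the annotation and then remove or lower it (a sketch: it assumes jq is available, uses placeholder names, and 24d is just an example value under the ~24.8-day ceiling):

```
# oc get routes --all-namespaces -o json | jq -r '
    .items[]
    | select(.metadata.annotations["haproxy.router.openshift.io/timeout"])
    | [.metadata.namespace, .metadata.name,
       .metadata.annotations["haproxy.router.openshift.io/timeout"]]
    | @tsv'

# oc -n <namespace> annotate route <name> haproxy.router.openshift.io/timeout-

# oc -n <namespace> annotate route <name> --overwrite haproxy.router.openshift.io/timeout=24d
```

The first command lists every route with the annotation along with its value; the second removes the annotation from an offending route; the third lowers it instead.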
*** Bug 1861501 has been marked as a duplicate of this bug. ***
Given that we've had hundreds of 4.3 to 4.4 upgrades already at this point without this issue having been flagged in any of them, I don't think we should block any edges until we have a fix for this. Depending on how the fix is implemented, we may wish to introduce a version of 4.3 that sets Upgradeable=False and ensure that all future 4.3 to 4.4 upgrades funnel through that version. Another solution would be to scrub the invalid input to a more sane value, which would affect the other side of the upgrade path, in that we'd require everyone upgrading from 4.3 to 4.4 to upgrade to the 4.4 version with the fix. The latter is the preferable path from the view of the OTA team.
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
PR will be reviewed soon.
Verified with 4.7.0-0.nightly-2020-11-12-200927 and passed. With the annotation in place, the rendered config now clamps the value to HAProxy's maximum (2147483647 ms) instead of failing to reload:

```
# oc annotate route service-unsecure haproxy.router.openshift.io/timeout=365d
route.route.openshift.io/service-unsecure annotated

# oc get route -oyaml
apiVersion: v1
items:
- apiVersion: route.openshift.io/v1
  kind: Route
  metadata:
    annotations:
      haproxy.router.openshift.io/timeout: 365d

# oc -n openshift-ingress exec router-default-f5454665-rfdm9 -- cat haproxy.config
<---snip--->
# Plain http backend or backend with TLS terminated at the edge or a
# secure backend with re-encryption.
backend be_http:hongli1:service-unsecure
  mode http
  option redispatch
  option forwardfor
  balance leastconn

  timeout server 2147483647ms
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633