Description of problem:

Setting a `haproxy.router.openshift.io/timeout: 365d` annotation seems to bring down the ingress controller. Our cluster was updated to 4.4, after which all routes stopped working because one route had this annotation.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a route with the `haproxy.router.openshift.io/timeout: 365d` annotation (a minimal example manifest is included under Additional info below).
2. Observe that the ingress controller is unable to load the config:
```
[ALERT] 209/102809 (107) : parsing [/var/lib/haproxy/conf/haproxy.config:859] : timer overflow in argument '365d' to 'timeout server' (maximum value is 2147483647 ms or ~24.8 days)
[ALERT] 209/102809 (107) : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] 209/102809 (107) : Fatal errors found in configuration.
E0728 10:28:14.659940 1 limiter.go:165] error reloading router: exit status 1
```

Actual results:

Expected results:

Additional info:
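A minimal Route manifest reproducing the failure (namespace, name, and target service are illustrative):

```
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example
  namespace: example
  annotations:
    # Any value over 2147483647 ms (~24.8 days) trips the haproxy check.
    haproxy.router.openshift.io/timeout: 365d
spec:
  to:
    kind: Service
    name: example
```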
If we were previously running OCP 4.3, then this would be haproxy 1.8, and I see no mention of "timer overflow" in the 1.8 source tree. In OCP 4.4 we are running haproxy 2.0, and in that version I see plenty of references to "timer overflow". This may be new/additional verification in 2.0.

```
-*- mode: ag; default-directory: "~/git.haproxy.org/haproxy-2.0/src/" -*-
Ag started at Tue Jul 28 14:24:23

ag --literal --group --line-number --column --color --color-match 30\;43 --color-path 1\;32 --smart-case --stats -- timer\ overflow .

File: proto_tcp.c
1874:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",
1959:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",

File: cfgparse-listen.c
1101:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 s (~68 years).\n",
1135:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 s (~68 years).\n",
2016:32: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to stats refresh interval, maximum value is 2147483647 s (~68 years).\n",
3556:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to grace time, maximum value is 2147483647 ms (~24.8 days).\n",

File: hlua.c
8164:19: memprintf(err, "timer overflow in argument <%s> to <%s> (maximum value is 2147483647 ms or ~24.8 days)",

File: proxy.c
281:19: memprintf(err, "timer overflow in argument '%s' to 'timeout %s' (maximum value is 2147483647 ms or ~24.8 days)",
1072:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",

File: cfgparse-global.c
267:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 65535 ms.\n",
958:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: server.c
413:19: memprintf(err, "timer overflow in argument '%s' to '%s' (maximum value is 2147483647 ms or ~24.8 days)",
2199:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2437:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2467:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2497:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",
2557:33: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s> of server %s, maximum value is 2147483647 ms (~24.8 days).\n",

File: tcp_rules.c
931:25: memprintf(err, "%s (timer overflow in '%s', maximum value is 2147483647 ms or ~24.8 days)", *err, args[2]);
1045:25: memprintf(err, "%s (timer overflow in '%s', maximum value is 2147483647 ms or ~24.8 days)", *err, args[2]);

File: cfgparse.c
1216:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 ms (~24.8 days).\n",
1308:32: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s %s>, maximum value is 2147483647 ms (~24.8 days).\n",
1496:32: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s %s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: ssl_sock.c
9192:19: memprintf(err, "timer overflow in argument '%s' to <%s> (maximum value is 2147483647 s or ~68 years).",

File: stick_table.c
773:36: ha_alert("parsing [%s:%d]: %s: timer overflow in argument <%s> to <%s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: flt_spoe.c
3500:31: ha_alert("parsing [%s:%d]: timer overflow in argument <%s> to <%s %s>, maximum value is 2147483647 ms (~24.8 days).\n",

File: cli.c
313:20: memprintf(err, "timer overflow in argument '%s' to '%s %s' (maximum value is 2147483647 ms or ~24.8 days)",

26 matches
12 files contained matches
113 files searched
4924881 bytes searched
0.004205 seconds
```
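For reference, here is a minimal sketch of that validation logic, written in Go rather than haproxy's C, with illustrative names: the timer argument is converted to milliseconds and rejected when it exceeds the signed 32-bit range (2147483647 ms, ~24.8 days), which is exactly what trips on 365d.

```
package main

import (
	"errors"
	"fmt"
	"math"
	"strconv"
	"strings"
)

var errTimerOverflow = errors.New("timer overflow (maximum value is 2147483647 ms or ~24.8 days)")

// parseTimerMS converts values such as "5s", "30m", or "365d" to milliseconds.
func parseTimerMS(arg string) (int64, error) {
	units := []struct {
		suffix string
		ms     int64
	}{
		// "ms" must be tried before "s" and "m".
		{"ms", 1}, {"s", 1000}, {"m", 60 * 1000},
		{"h", 3600 * 1000}, {"d", 24 * 3600 * 1000},
	}
	mult, num := int64(1), arg // bare numbers are milliseconds
	for _, u := range units {
		if strings.HasSuffix(arg, u.suffix) {
			mult, num = u.ms, strings.TrimSuffix(arg, u.suffix)
			break
		}
	}
	n, err := strconv.ParseInt(num, 10, 64)
	if err != nil {
		return 0, err
	}
	if n > math.MaxInt32/mult { // the stricter check present in haproxy 2.0
		return 0, errTimerOverflow
	}
	return n * mult, nil
}

func main() {
	for _, arg := range []string{"5s", "24d", "365d"} {
		if ms, err := parseTimerMS(arg); err != nil {
			fmt.Printf("%s: %v\n", arg, err)
		} else {
			fmt.Printf("%s: %d ms\n", arg, ms)
		}
	}
}
```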
Looking at a fix and will backport to 4.4.
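For context, one plausible shape for such a fix, sketched in Go with hypothetical names (not necessarily what the actual PR does): have the router's template helper clamp an out-of-range annotation value at HAProxy's maximum instead of writing it verbatim into haproxy.config.

```
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

const haproxyMaxMS = int64(2147483647) // 2^31-1 ms, ~24.8 days

var timerRE = regexp.MustCompile(`^(\d+)(ms|s|m|h|d)?$`)

var unitMS = map[string]int64{
	"": 1, "ms": 1, "s": 1000,
	"m": 60 * 1000, "h": 3600 * 1000, "d": 24 * 3600 * 1000,
}

// clampTimeout (hypothetical name) returns val unchanged when it fits,
// the HAProxy maximum when it overflows, and "" for unparseable input so
// that the router's default timeout applies instead.
func clampTimeout(val string) string {
	m := timerRE.FindStringSubmatch(val)
	if m == nil {
		return ""
	}
	n, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil || n > haproxyMaxMS/unitMS[m[2]] {
		// err here means the digits alone overflow int64, so clamp as well.
		return fmt.Sprintf("%dms", haproxyMaxMS)
	}
	return val
}

func main() {
	fmt.Println(clampTimeout("365d")) // 2147483647ms
	fmt.Println(clampTimeout("30s"))  // 30s
}
```

The advantage of clamping over rejecting is that existing routes keep working after an upgrade, rather than taking down the whole config reload.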
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
example: Up to 2 minute disruption in edge routing
example: Up to 90 seconds of API downtime
example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
example: Issue resolves itself after five minutes
example: Admin uses oc to fix things
example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
example: No, it's always been like this, we just never noticed
example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Customers upgrading from 4.3 to 4.4 with a route that has the haproxy.router.openshift.io/timeout annotation specifying a timeout larger than 24.8 days.

What is the impact? Is it serious enough to warrant blocking edges?
Routes are broken until all routes with the offending annotation are deleted or modified to remove the annotation.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Moderately. Administrator must use oc to list routes, find the offending route, and modify or delete it.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, from 4.3 to 4.4.
Note that the OpenShift Console and OAuth use routes, so if the administrator does not have a kubeconfig or a valid token and needs to use the Console or OAuth to authenticate, then the remediation becomes more involved: the administrator must use SSH to get a kubeconfig in order to use oc to fix the problem. Therefore comment 6 should be revised as follows:

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Customers upgrading from 4.3 to 4.4 with a route that has the haproxy.router.openshift.io/timeout annotation specifying a timeout larger than 24.8 days.

What is the impact? Is it serious enough to warrant blocking edges?
Routes are broken until all routes with the offending annotation are deleted or modified to remove the annotation.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Moderately involved to very involved. The administrator must use oc to list routes, find the offending route, and modify or delete it. SSH access may be required if the cluster administrator does not have a valid token for oc.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
Yes, from 4.3 to 4.4.
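To make the oc-based remediation concrete, here is one way to find routes carrying the annotation and then remove or lower it (a sketch: it assumes jq is available, uses placeholder names, and 24d is just an example value under the ~24.8-day ceiling):

```
# oc get routes --all-namespaces -o json | jq -r '
    .items[]
    | select(.metadata.annotations["haproxy.router.openshift.io/timeout"])
    | [.metadata.namespace, .metadata.name,
       .metadata.annotations["haproxy.router.openshift.io/timeout"]]
    | @tsv'

# oc -n <namespace> annotate route <name> haproxy.router.openshift.io/timeout-

# oc -n <namespace> annotate route <name> --overwrite haproxy.router.openshift.io/timeout=24d
```

The first command lists every route with the annotation along with its value; the second removes the annotation from an offending route; the third lowers it instead.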
*** Bug 1861501 has been marked as a duplicate of this bug. ***
Given that we've had hundreds of 4.3 to 4.4 upgrades already at this point without this issue having been flagged in any of them, I don't think we should block any edges until we have a fix for this. Depending on how the fix is implemented, we may wish to introduce a version of 4.3 that sets Upgradeable=False and ensure that all future 4.3 to 4.4 upgrades funnel through that version. Another solution would be to scrub the invalid input to a more sane value, which would affect the other side of the upgrade path, in that we'd require everyone upgrading from 4.3 to 4.4 to upgrade to the 4.4 version with the fix. The latter is the preferable path from the view of the OTA team.
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
PR will be reviewed soon.
Verified with 4.7.0-0.nightly-2020-11-12-200927 and passed. With the annotation in place, the rendered config now clamps the value to HAProxy's maximum (2147483647 ms) instead of failing to reload:

```
# oc annotate route service-unsecure haproxy.router.openshift.io/timeout=365d
route.route.openshift.io/service-unsecure annotated

# oc get route -oyaml
apiVersion: v1
items:
- apiVersion: route.openshift.io/v1
  kind: Route
  metadata:
    annotations:
      haproxy.router.openshift.io/timeout: 365d

# oc -n openshift-ingress exec router-default-f5454665-rfdm9 -- cat haproxy.config
<---snip--->
# Plain http backend or backend with TLS terminated at the edge or a
# secure backend with re-encryption.
backend be_http:hongli1:service-unsecure
  mode http
  option redispatch
  option forwardfor
  balance leastconn

  timeout server 2147483647ms
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633