Hide Forgot
Description of problem: Ingresscontroller change the replicas from default 2 to 3 for scaling up first time, all the ingress pods will be rolling update, it makes an unwanted rolling update for the current using ingress pods. However, from the second time the replicas scaling up will not be rolling update for all the ingress pods, it will only scale up the new replicas pods. It found the reason is the first time deploying with the initial deploymenmt doesn't sort the ENV in 'spec.template.spec.containers.env' fields. - env: - name: STATS_PORT value: "1936" - name: ROUTER_SERVICE_NAMESPACE value: openshift-ingress - name: DEFAULT_CERTIFICATE_DIR value: /etc/pki/tls/private - name: DEFAULT_DESTINATION_CA_PATH value: /var/run/configmaps/service-ca/service-ca.crt - name: ROUTER_SERVICE_NAME value: default - name: STATS_USERNAME valueFrom: secretKeyRef: key: statsUsername name: router-stats-default - name: STATS_PASSWORD valueFrom: secretKeyRef: key: statsPassword name: router-stats-default - name: ROUTER_METRICS_TYPE value: haproxy - name: ROUTER_METRICS_TLS_CERT_FILE value: /etc/pki/tls/metrics-certs/tls.crt - name: ROUTER_METRICS_TLS_KEY_FILE value: /etc/pki/tls/metrics-certs/tls.key - name: ROUTER_CANONICAL_HOSTNAME value: apps.ocp4.example.com - name: ROUTER_THREADS value: "4" - name: ROUTER_CIPHERS value: TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384 - name: SSL_MIN_VERSION value: TLSv1.2 - name: ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK value: "false" - name: ROUTER_ALLOW_WILDCARD_ROUTES value: "false" - name: ROUTER_SET_FORWARDED_HEADERS value: append - name: ROUTER_DISABLE_HTTP2 value: "true" Once the ingress controller has been change the replicas the default 2 to 3, the sorting of ENV has been changed, it will trigger the rolling update for all the ingress pods, will make the unwanted rolling update for all the ingress pods. - env: - name: DEFAULT_CERTIFICATE_DIR value: /etc/pki/tls/private - name: DEFAULT_DESTINATION_CA_PATH value: /var/run/configmaps/service-ca/service-ca.crt - name: ROUTER_ALLOW_WILDCARD_ROUTES value: "false" - name: ROUTER_CANONICAL_HOSTNAME value: apps.ocp4.example.com - name: ROUTER_CIPHERS value: TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384 - name: ROUTER_DISABLE_HTTP2 value: "true" - name: ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK value: "false" - name: ROUTER_METRICS_TLS_CERT_FILE value: /etc/pki/tls/metrics-certs/tls.crt - name: ROUTER_METRICS_TLS_KEY_FILE value: /etc/pki/tls/metrics-certs/tls.key - name: ROUTER_METRICS_TYPE value: haproxy - name: ROUTER_SERVICE_NAME value: default - name: ROUTER_SERVICE_NAMESPACE value: openshift-ingress - name: ROUTER_SET_FORWARDED_HEADERS value: append - name: ROUTER_THREADS value: "4" - name: SSL_MIN_VERSION value: TLSv1.2 - name: STATS_PASSWORD valueFrom: secretKeyRef: key: statsPassword name: router-stats-default - name: STATS_PORT value: "1936" - name: STATS_USERNAME valueFrom: secretKeyRef: key: statsUsername name: router-stats-default The user's environment is OCP4.7.13. I verified on OCP4.6.40 is the same, but verified on OCP4.8.2 doesn't happen and the sorting of ENV is correct. OpenShift release version: OCP 4.7.13 Cluster Platform: UPI How reproducible: Please refer to the details. Steps to Reproduce (in detail): 1. Simulated cluster installation has been completed, change the replicas to 2 and delete the deployment for recreating. # oc patch -n openshift-ingress-operator ingresscontroller/default --patch='{"spec": {"replicas": 2}}' --type=merge # oc delete -n openshift-ingress deployment/router-default deployment.apps "router-default" deleted # oc -n openshift-ingress get pods NAME READY STATUS RESTARTS AGE router-default-76fc4669d7-9qqct 0/1 Running 0 7s router-default-76fc4669d7-d8vq8 0/1 Running 0 7s # oc -n openshift-ingress get deployment/router-default -o yaml > router-default_initial_replicas.yaml 2. Change the replicas from 2 to 3 for scaling up the ingress pods. It found the current pods for 'router-default-76fc4669d7-d8vq8' and 'router-default-76fc4669d7-zpm2t' will be triggering the rolling update. # oc patch -n openshift-ingress-operator ingresscontroller/default --patch='{"spec": {"replicas": 3}}' --type=merge # oc -n openshift-ingress get pods NAME READY STATUS RESTARTS AGE router-default-76fc4669d7-9qqct 1/1 Running 0 106s router-default-76fc4669d7-d8vq8 1/1 Terminating 0 106s router-default-76fc4669d7-zpm2t 1/1 Terminating 0 26s router-default-77b7b74855-tqnht 0/1 Pending 0 3s router-default-77b7b74855-xwmwh 1/1 Running 0 59s # oc -n openshift-ingress get deployment/router-default -o yaml > router-default_change_replicas.yaml 3.Comparing the files of 'router-default_initial_replicas.yaml' and 'router-default_change_replicas.yaml', not only replicas fields will be changed, it found the sorting of ENV items also will be changed. Actual results: Ingresscontroller change the replicas from default 2 to 3 for scaling up first time, all the ingress pods will be rolling update. Expected results: Ingresscontroller change the replicas from default 2 to 3 for scaling up first time will NOT be rolling update for all the ingress pods. Only scale up the new replicas pods. Impact of the problem: Ingresscontroller change the replicas from default 2 to 3, all the ingress pods will be rolling update that it makes an unwanted rolling update for the current using ingress pods and losing the trascation traffic. Additional info: None. ** Please do not disregard the report template; filling the template out as much as possible will allow us to help you. Please consider attaching a must-gather archive (via `oc adm must-gather`). Please review must-gather contents for sensitive information before attaching any must-gathers to a bugzilla report. You may also mark the bug private if you wish.
This BZ is low severity and does not currently have any PR to merge, so we'll push this out to 4.10.
Hi, is there any update for fixing the problem of the router ENV sorting entries?
Hi Jie, I picked up this bug last week. I'm trying to replicate this bug and having issues doing so. I've tried 4.6.40, 4.7.13, and 4.10.0 versions on IPI (cluster-bot mostly). Every time, the ENV is always sorted with executing the reproducer as you have specified and I don't get a rolling update on initial scaling from 2 to 3. Could I get you to verify if this bug is reproducible on IPI or only with UPI?
Hi, Grant The environment is using offline UPI installation. The HAProxy router is using the endpointPublishingStrategy (Other: HostNetwork) binding on 80/443 ports. Could you please re-try to reproduce in UPI environment?
Okay - I was able to reproduce with HostNetwork as the endpointPublishingStrategy on a IPI cluster. So it appears the bug is specifically related to ENV sorting with that configuration. I will update you when I am able resolve the issue in code.
So I found the root cause of the bug. Like the description mentions, it has been fixed. It was fixed (whether intentionally or not) in 4.7.21 via https://github.com/openshift/cluster-ingress-operator/commit/32df7d28bc1f0ba63eedbbdd763c7124b34ce242. The problem is with the code not callling the deploymentTemplateHash() function specifically for HostNetwork endpointPublishingStrategy types. All other types were fine. The code was updated in 4.7.21 to always call deploymentTemplateHash() which will always sort the environment variables. As far as I know, we aren't planning to backport anything to 4.7, but I did add unit testing to verify we don't regress and this happens again. That can be found here: https://github.com/openshift/cluster-ingress-operator/pull/715.
Correction: We aren't planning to backport to 4.6 or older. It already is fixed in 4.7; therefore 4.7 doesn't need a backport of anything.
Closing this bug as there it is confirmed fixed. The linked PR is for internal unit testing.
Opening back up as I just learned we need a BZ open to merge our unit tests https://github.com/openshift/cluster-ingress-operator/pull/715.
As of writing, in the latest 4.7 version the env variable sorting issue is no more noted (tested with 4.7.44). The current PR merge does not require further testing from QE as it basically adds a check to verify the sorting for the test cases. Hence marking this as "verified".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069