Bug 1995953

Summary: Scaling up the ingresscontroller replicas for the first time triggers a rolling update of all the ingress pods
Product: OpenShift Container Platform
Reporter: Jie Wu <jiewu>
Component: Networking
Networking sub component: router
Assignee: Grant Spence <gspence>
QA Contact: Arvind Iyengar <aiyengar>
Status: CLOSED ERRATA
Severity: low
Priority: medium
CC: gspence, hongli, mmasters
Version: 4.7
Keywords: Reopened
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-08-10 10:36:53 UTC
Type: Bug

Description Jie Wu 2021-08-20 09:35:55 UTC
Description of problem:
When the ingresscontroller replicas are changed from the default 2 to 3 to scale up for the first time, all of the ingress pods go through a rolling update. This is an unwanted rolling update of the ingress pods that are currently serving traffic.

From the second scale-up onward, however, the existing ingress pods are not rolled; only the new replica pods are created.

The reason is that the initial deployment does not sort the ENV entries in the 'spec.template.spec.containers.env' field:
      - env:
        - name: STATS_PORT
          value: "1936"
        - name: ROUTER_SERVICE_NAMESPACE
          value: openshift-ingress
        - name: DEFAULT_CERTIFICATE_DIR
          value: /etc/pki/tls/private
        - name: DEFAULT_DESTINATION_CA_PATH
          value: /var/run/configmaps/service-ca/service-ca.crt
        - name: ROUTER_SERVICE_NAME
          value: default
        - name: STATS_USERNAME
          valueFrom:
            secretKeyRef:
              key: statsUsername
              name: router-stats-default
        - name: STATS_PASSWORD
          valueFrom:
            secretKeyRef:
              key: statsPassword
              name: router-stats-default
        - name: ROUTER_METRICS_TYPE
          value: haproxy
        - name: ROUTER_METRICS_TLS_CERT_FILE
          value: /etc/pki/tls/metrics-certs/tls.crt
        - name: ROUTER_METRICS_TLS_KEY_FILE
          value: /etc/pki/tls/metrics-certs/tls.key
        - name: ROUTER_CANONICAL_HOSTNAME
          value: apps.ocp4.example.com
        - name: ROUTER_THREADS
          value: "4"
        - name: ROUTER_CIPHERS
          value: TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        - name: SSL_MIN_VERSION
          value: TLSv1.2
        - name: ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK
          value: "false"
        - name: ROUTER_ALLOW_WILDCARD_ROUTES
          value: "false"
        - name: ROUTER_SET_FORWARDED_HEADERS
          value: append
        - name: ROUTER_DISABLE_HTTP2
          value: "true"

Once the ingresscontroller replicas are changed from the default 2 to 3, the ENV entries are emitted in sorted order. Because the pod template changes, this triggers an unwanted rolling update of all the ingress pods (a quick way to check the ordering is sketched after the second excerpt below).

      - env:
        - name: DEFAULT_CERTIFICATE_DIR
          value: /etc/pki/tls/private
        - name: DEFAULT_DESTINATION_CA_PATH
          value: /var/run/configmaps/service-ca/service-ca.crt
        - name: ROUTER_ALLOW_WILDCARD_ROUTES
          value: "false"
        - name: ROUTER_CANONICAL_HOSTNAME
          value: apps.ocp4.example.com
        - name: ROUTER_CIPHERS
          value: TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        - name: ROUTER_DISABLE_HTTP2
          value: "true"
        - name: ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK
          value: "false"
        - name: ROUTER_METRICS_TLS_CERT_FILE
          value: /etc/pki/tls/metrics-certs/tls.crt
        - name: ROUTER_METRICS_TLS_KEY_FILE
          value: /etc/pki/tls/metrics-certs/tls.key
        - name: ROUTER_METRICS_TYPE
          value: haproxy
        - name: ROUTER_SERVICE_NAME
          value: default
        - name: ROUTER_SERVICE_NAMESPACE
          value: openshift-ingress
        - name: ROUTER_SET_FORWARDED_HEADERS
          value: append
        - name: ROUTER_THREADS
          value: "4"
        - name: SSL_MIN_VERSION
          value: TLSv1.2
        - name: STATS_PASSWORD
          valueFrom:
            secretKeyRef:
              key: statsPassword
              name: router-stats-default
        - name: STATS_PORT
          value: "1936"
        - name: STATS_USERNAME
          valueFrom:
            secretKeyRef:
              key: statsUsername
              name: router-stats-default
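
A quick way to check whether the ENV entries of the live deployment are already in sorted order (a rough sketch using oc jsonpath output plus coreutils; the temporary file paths are arbitrary):

# oc -n openshift-ingress get deployment/router-default -o jsonpath='{.spec.template.spec.containers[0].env[*].name}' | tr ' ' '\n' > /tmp/env-actual.txt
# sort /tmp/env-actual.txt > /tmp/env-sorted.txt
# diff /tmp/env-actual.txt /tmp/env-sorted.txt && echo "ENV sorted" || echo "ENV NOT sorted"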

The user's environment is OCP 4.7.13.
I verified that OCP 4.6.40 behaves the same way, but the problem does not occur on OCP 4.8.2, where the ENV ordering is correct.

OpenShift release version:
OCP 4.7.13

Cluster Platform:
UPI

How reproducible:
Please refer to the details.

Steps to Reproduce (in detail):
1. Simulate the state right after cluster installation: set the replicas to 2 and delete the deployment so that it is recreated.

# oc patch -n openshift-ingress-operator ingresscontroller/default --patch='{"spec": {"replicas": 2}}' --type=merge

# oc delete -n openshift-ingress deployment/router-default
deployment.apps "router-default" deleted

# oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-76fc4669d7-9qqct   0/1     Running       0          7s
router-default-76fc4669d7-d8vq8   0/1     Running       0          7s

# oc -n openshift-ingress get deployment/router-default -o yaml > router-default_initial_replicas.yaml

2. Change the replicas from 2 to 3 to scale up the ingress pods.
The existing pods 'router-default-76fc4669d7-d8vq8' and 'router-default-76fc4669d7-zpm2t' go through a rolling update.

# oc patch -n openshift-ingress-operator  ingresscontroller/default --patch='{"spec": {"replicas": 3}}' --type=merge

# oc -n openshift-ingress get pods
NAME                              READY   STATUS        RESTARTS   AGE
router-default-76fc4669d7-9qqct   1/1     Running       0          106s
router-default-76fc4669d7-d8vq8   1/1     Terminating   0          106s
router-default-76fc4669d7-zpm2t   1/1     Terminating   0          26s
router-default-77b7b74855-tqnht   0/1     Pending       0          3s
router-default-77b7b74855-xwmwh   1/1     Running       0          59s

# oc -n openshift-ingress get deployment/router-default -o yaml > router-default_change_replicas.yaml

3. Compare 'router-default_initial_replicas.yaml' and 'router-default_change_replicas.yaml':
not only does the replicas field change,
the ordering of the ENV items changes as well.
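
A plain diff of the two saved manifests makes the reordering visible alongside the replicas change (a sketch; expect additional noise from fields such as status and resourceVersion):

# diff router-default_initial_replicas.yaml router-default_change_replicas.yaml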

Actual results:
Changing the ingresscontroller replicas from the default 2 to 3 for the first scale-up triggers a rolling update of all the ingress pods.

Expected results:
Changing the ingresscontroller replicas from the default 2 to 3 for the first scale-up should NOT trigger a rolling update of all the ingress pods; only the new replica pods should be created.

Impact of the problem:
Changing the ingresscontroller replicas from the default 2 to 3 rolls all the ingress pods, which causes an unwanted rolling update of the ingress pods currently serving traffic and loses in-flight transaction traffic.

Additional info:
None.

Comment 1 Miciah Dashiel Butler Masters 2021-08-31 16:37:25 UTC
This BZ is low severity and does not currently have any PR to merge, so we'll push this out to 4.10.

Comment 2 Jie Wu 2022-01-27 05:02:13 UTC
Hi, is there any update on fixing the router ENV sorting problem?

Comment 3 Grant Spence 2022-02-21 20:39:22 UTC
Hi Jie,

I picked up this bug last week. I'm trying to replicate it and am having trouble doing so. I've tried 4.6.40, 4.7.13, and 4.10.0 on IPI (mostly via cluster-bot). Every time, the ENV is sorted when I run the reproducer as you specified, and I don't get a rolling update on the initial scale-up from 2 to 3.

Could I get you to verify if this bug is reproducible on IPI or only with UPI?

Comment 4 Jie Wu 2022-02-22 00:34:54 UTC
Hi, Grant

The environment is an offline (disconnected) UPI installation.
The HAProxy router uses the HostNetwork endpointPublishingStrategy, binding to ports 80/443 on the nodes.
Could you please retry reproducing in a UPI environment?

Comment 5 Grant Spence 2022-02-22 16:29:29 UTC
Okay - I was able to reproduce with HostNetwork as the endpointPublishingStrategy on an IPI cluster, so the bug appears to be specific to ENV sorting with that configuration.

I will update you when I am able to resolve the issue in code.
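
For reference, an IngressController with the HostNetwork endpointPublishingStrategy can be created roughly like this (a minimal sketch; the controller name and domain below are placeholders, not values from this report):

# oc apply -f - <<'EOF'
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: test-hostnetwork
  namespace: openshift-ingress-operator
spec:
  domain: apps-test.ocp4.example.com
  replicas: 2
  endpointPublishingStrategy:
    type: HostNetwork
EOF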

Comment 6 Grant Spence 2022-03-03 21:11:54 UTC
So I found the root cause of the bug. As the description notes (4.8.2 is not affected), this has already been fixed. It was fixed (whether intentionally or not) in 4.7.21 via https://github.com/openshift/cluster-ingress-operator/commit/32df7d28bc1f0ba63eedbbdd763c7124b34ce242. The problem was that the code did not call the deploymentTemplateHash() function specifically for the HostNetwork endpointPublishingStrategy type; all other types were fine. The code was updated in 4.7.21 to always call deploymentTemplateHash(), which always sorts the environment variables.

As far as I know, we aren't planning to backport anything to 4.7, but I did add unit testing to verify we don't regress and reintroduce this. It can be found here: https://github.com/openshift/cluster-ingress-operator/pull/715.
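
For anyone on 4.7.x who wants to confirm whether their cluster already includes the fix, comparing the cluster version against 4.7.21 is enough; a sketch:

# oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'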

Comment 7 Grant Spence 2022-03-03 21:14:21 UTC
Correction: we aren't planning to backport to 4.6 or older. The fix is already in 4.7, so 4.7 does not need a backport.

Comment 8 Grant Spence 2022-03-03 22:08:12 UTC
Closing this bug as the issue is confirmed fixed. The linked PR only adds internal unit testing.

Comment 9 Grant Spence 2022-03-03 22:14:31 UTC
Reopening, as I just learned we need an open BZ in order to merge our unit tests: https://github.com/openshift/cluster-ingress-operator/pull/715.

Comment 12 Arvind iyengar 2022-03-14 08:08:51 UTC
As of this writing, the env variable sorting issue is no longer observed in the latest 4.7 release (tested with 4.7.44). The merged PR does not require further testing from QE, as it only adds a check that verifies the sorting in the test cases. Hence marking this as "verified".

Comment 15 errata-xmlrpc 2022-08-10 10:36:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069