Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2078524

Summary: Downgrading a cluster from 4.11 to 4.10 is failed
Product: OpenShift Container Platform Reporter: Shudi Li <shudili>
Component: Networking Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: router QA Contact: Shudi Li <shudili>
Status: CLOSED DEFERRED Docs Contact:
Severity: medium    
Priority: high CC: aos-bugs, gspence, hongli, mmasters
Version: 4.11   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-03-09 01:18:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2041616    
Bug Blocks:    

Description Shudi Li 2022-04-25 13:56:42 UTC
Description of problem: In 4.11, set the timeout of the liveness probe and readiness probe of the router deployment in the openshift-ingress namespace to 5s, then try to downgrade the cluster to 4.10. The timeout is expected to revert to the default of 1s during the downgrade.
However, more than 5 hours have passed and the downgrade is still at "waiting on ingress".


OpenShift release version:


Cluster Platform:
cluster access info: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/96936/

How reproducible:
Configure the timeout of the liveness probe and readiness probe, and then downgrade the cluster.

Steps to Reproduce (in detail):
1. Configure the timeout of the liveness probe and readiness probe
 % oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'  
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "router" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "router" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "router" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "router" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/router-default patched
%

2. Check the configured timeout of the liveness probe and readiness probe
% oc -n openshift-ingress get  deploy/router-default -o yaml | grep -A8 nessProbe:
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
--
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz/ready
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
% 

3. Downgrade the cluster to 4.10.0-0.nightly-2022-04-24-083512
% oc patch clusterversion/version --patch '{"spec":{"upstream":"https://amd64.ocp.releases.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched
% 

% oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-04-24-083512 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-04-24-083512
%

4. Run oc get clusterversion from time to time; the downgrade appears to be stuck at "waiting on ingress"
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-24-135651   True        True          3m39s   Working towards 4.10.0-0.nightly-2022-04-24-083512: 95 of 771 done (12% complete)
%

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-24-135651   True        True          31m     Unable to apply 4.10.0-0.nightly-2022-04-24-083512: an unknown error has occurred: MultipleErrors
%

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-24-135651   True        True          36m     Working towards 4.10.0-0.nightly-2022-04-24-083512: 610 of 771 done (79% complete)
%  

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-24-135651   True        True          53m     Working towards 4.10.0-0.nightly-2022-04-24-083512: 611 of 771 done (79% complete), waiting on ingress
%

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-24-135651   True        True          5h30m   Working towards 4.10.0-0.nightly-2022-04-24-083512: 611 of 771 done (79% complete), waiting on ingress
%

5. Check the timeout; it has changed back to 1s
% oc -n openshift-ingress get  deploy/router-default -o yaml | grep -A8 nessProbe:
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
--
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz/ready
            port: 1936
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
%
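
The probe checks in steps 2 and 5 can be reduced to the two values of interest. The following is a minimal sketch, assuming the Deployment YAML has the layout shown above (a livenessProbe: or readinessProbe: key followed by an indented timeoutSeconds: line). The helper name probe_timeouts is hypothetical and not part of oc; it matches lines textually rather than parsing YAML.

```shell
# probe_timeouts: read Deployment YAML on stdin and print one line per probe,
# in the form "<probe name> <timeoutSeconds>".
# Hypothetical helper; relies on the YAML layout shown in the steps above.
probe_timeouts() {
  awk '
    # Remember which probe block we are in (liveness or readiness only).
    /^[ \t]*(liveness|readiness)Probe:[ \t]*$/ {
      probe = $1; sub(/:$/, "", probe); next
    }
    # Any other probe key (for example startupProbe) ends the current block.
    /Probe:[ \t]*$/ { probe = "" }
    # First timeoutSeconds line after a remembered probe key belongs to it.
    probe != "" && /^[ \t]*timeoutSeconds:/ {
      print probe, $2; probe = ""
    }
  '
}

# Usage against a live cluster (same command as in the steps above):
#   oc -n openshift-ingress get deploy/router-default -o yaml | probe_timeouts
```

Before the downgrade this should report 5 for both probes, and after the ingress operator has been downgraded it should report 1, matching the grep output in steps 2 and 5.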

Actual results:
After more than 5 hours, the downgrade has still not completed.

Expected results:
The downgrade completes successfully in about 1 hour.
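
To tell a slow downgrade from a stuck one, it can help to pull just the blocking operator out of the STATUS column. The following is a minimal sketch, assuming the status text ends with ", waiting on <operator>" as in the output in step 4. The helper name waiting_on is hypothetical.

```shell
# waiting_on: read `oc get clusterversion` output on stdin and print whatever
# follows ", waiting on " in the STATUS column. Prints nothing if the phrase
# is absent. Hypothetical helper; it only matches text, it does not query
# the cluster itself.
waiting_on() {
  sed -n 's/.*, waiting on \(.*\)$/\1/p'
}

# Usage against a live cluster, checking every 5 minutes:
#   while sleep 300; do oc get clusterversion | waiting_on; done
```

If the same operator name is printed for hours while the "N of 771 done" count stays unchanged, as in step 4, the downgrade is likely blocked on that operator.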

Impact of the problem:


Additional info:




Comment 1 Shudi Li 2022-04-26 10:10:02 UTC
Changed the bug title to "Downgrading a cluster from 4.11 to 4.10 is failed":

1.
With the liveness probe and readiness probe timeout set to 5s, the downgrade from 4.11 to 4.10 has been stuck at "waiting on ingress" for more than 5 hours

2.
With the default liveness probe and readiness probe timeout of 1s, the downgrade from 4.11 to 4.10 reported "the cluster operator monitoring has not yet successfully rolled out"
a,
% oc get clusterversion                 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-25-171513   True        True          63m     Unable to apply 4.10.0-0.nightly-2022-04-24-083512: the cluster operator monitoring has not yet successfully rolled out
% 
b,
% oc get clusterversion                              
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-25-171513   True        True          169m    Unable to apply 4.10.0-0.nightly-2022-04-24-083512: the cluster operator monitoring has not yet successfully rolled out
%

Comment 3 Miciah Dashiel Butler Masters 2022-04-28 15:46:38 UTC
I was unable to reproduce the original problem.  I tried the following:

* Downgrade from 4.11.0-0.nightly-2022-04-24-135651 to 4.10.0-0.nightly-2022-04-24-083512.
* Downgrade from 4.11.0-0.nightly-2022-04-26-181148 to 4.10.12.
* Downgrade from 4.11.0-0.ci-2022-04-26-195435 to 4.10.0-0.nightly-2022-04-24-083512.

In all cases, the cluster operator monitoring got stuck, but the ingress operator downgraded fine (and reverted timeoutSeconds, as expected, after being downgraded).  

If you are able to reproduce the downgrade failure with the ingress operator, can you get the clusteroperator json or yaml, the router deployment json or yaml, and the ingress operator logs?  Alternatively, a must-gather archive would be helpful.
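
The three artifacts requested above can be gathered in one pass. The following is a minimal sketch, assuming oc is logged in to the affected cluster and that the operator runs as deploy/ingress-operator (container ingress-operator) in the openshift-ingress-operator namespace. The helper name collect_ingress_diagnostics and the output file names are hypothetical.

```shell
# collect_ingress_diagnostics DIR: save the ingress clusteroperator, the
# router deployment, and the ingress operator logs into DIR
# (default: ./ingress-diagnostics). Hypothetical helper.
collect_ingress_diagnostics() {
  dir=${1:-ingress-diagnostics}
  mkdir -p "$dir" || return 1

  # Status of the ingress cluster operator, including its conditions.
  oc get clusteroperator/ingress -o yaml > "$dir/clusteroperator-ingress.yaml"

  # Router deployment as currently applied (probe timeouts included).
  oc -n openshift-ingress get deploy/router-default -o yaml > "$dir/router-default.yaml"

  # Logs of the ingress operator container.
  oc -n openshift-ingress-operator logs deploy/ingress-operator -c ingress-operator > "$dir/ingress-operator.log"
}

# Usage:
#   collect_ingress_diagnostics ./bz2078524
```

A must-gather archive (oc adm must-gather) contains the same information and more, so this is only useful when a full archive is impractical.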

Comment 4 Shudi Li 2022-04-29 13:02:38 UTC
Downgrading from 4.11.0-0.nightly-2022-04-26-181148 to 4.10.0-0.nightly-2022-04-24-083512 does not hit the original "waiting on ingress" issue, but it does hit the issue "the cluster operator monitoring has not yet successfully rolled out".

1.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-26-181148   True        True          12m     Working towards 4.10.0-0.nightly-2022-04-24-083512: 117 of 771 done (15% complete)
%

2.
% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-26-181148   True        True          168m    Unable to apply 4.10.0-0.nightly-2022-04-24-083512: the cluster operator monitoring has not yet successfully rolled out
%

Comment 5 Shudi Li 2022-04-29 13:22:59 UTC
Downgrading from 4.11.0-0.nightly-2022-04-24-135651 to 4.10.0-0.nightly-2022-04-24-083512 reproduces the original "waiting on ingress" issue

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-24-135651   True        True          46m     Working towards 4.10.0-0.nightly-2022-04-24-083512: 611 of 771 done (79% complete), waiting on ingress
%

Please refer to the attached logs.txt for the router deployment json or yaml, the ingress operator logs, and the clusteroperator json or yaml.

Comment 12 Miciah Dashiel Butler Masters 2022-05-31 15:58:51 UTC
Thank you very much!  Based on comment 11, the downgrade failure is not dependent on the new feature, and it is not new in 4.11; the failure can be reproduced when downgrading 4.10→4.9 just by having an ingresscontroller with a domain outside the cluster's base domain when initiating the downgrade.  For that reason, I am marking this BZ as not a blocker.  

Note that fixing bug 2041616 should prevent the issue on AWS.
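
Before initiating a downgrade, it may be worth checking whether any ingresscontroller falls into this category. The following is a minimal sketch, assuming "outside the cluster's base domain" means the ingresscontroller domain is neither equal to the base domain nor a subdomain of it. The helper name domain_in_base is hypothetical.

```shell
# domain_in_base DOMAIN BASE: succeed if DOMAIN equals BASE or is a
# subdomain of BASE, fail otherwise. Hypothetical helper; the definition
# of "outside the base domain" used here is an assumption.
domain_in_base() {
  case "$1" in
    "$2" | *."$2") return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live cluster: report ingresscontrollers whose domain is
# not under the base domain from dns.config/cluster.
#   base=$(oc get dns.config/cluster -o jsonpath='{.spec.baseDomain}')
#   oc -n openshift-ingress-operator get ingresscontrollers \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.domain}{"\n"}{end}' |
#   while read -r name domain; do
#     domain_in_base "$domain" "$base" || echo "$name: $domain is outside $base"
#   done
```

The default ingresscontroller normally uses apps.<base domain>, so it passes this check; only additional ingresscontrollers with custom domains would be reported.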

Comment 14 Shiftzilla 2023-03-09 01:18:01 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9237