Bug 1927364

Summary: oc idle: Clusters upgrading with an idled workload do not have annotations on the workload's service
Product: OpenShift Container Platform Reporter: Stephen Greene <sgreene>
Component: Networking Assignee: Stephen Greene <sgreene>
Networking sub component: router QA Contact: Arvind iyengar <aiyengar>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: amcdermo, aos-bugs, hongli, mjoseph, openshift-bugzilla-robot, scuppett
Version: 4.6 Keywords: Upgrades
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: A cluster with an idled workload is upgraded from a prior version. Consequence: Once the cluster is upgraded to an OCP 4.6/4.7 release containing BZ#1900989, the idled workload does not wake on an HTTP request, due to the `oc idle` fixes and reworks. Fix: On ingress-operator startup, mirror any idle annotations from endpoints to services (in the latest 4.6/4.7 releases, idling is based on service idle annotations). Result: Unidling workloads after an upgrade works as expected.
Story Points: ---
Clone Of: 1927080 Environment:
Last Closed: 2021-03-25 04:45:13 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1927080    
Bug Blocks:    

Description Stephen Greene 2021-02-10 15:45:45 UTC
+++ This bug was initially created as a clone of Bug #1927080 +++

+++ This bug was initially created as a clone of Bug #1925245 +++

Description of problem:
Bug 1900989 fixes `oc idle` in 4.6 and 4.7 by, among other things, annotating a workload's service with the proper idle annotations in addition to the workload's endpoints. Clusters that have idled workloads and upgrade to a version containing the fixes for Bug 1900989 will run into issues: the idled workload cannot be unidled without manual user intervention, because the service idle annotations are needed for unidling to work going forward.
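
A rough way to check whether a given idled service is in this state is sketched below. This is only an illustration, not the operator's logic: the function names are hypothetical, the annotation prefix `idling.alpha.openshift.io/` is assumed to be the one used by `oc idle`, and matching on the YAML text is approximate (it can also match other fields that mention the prefix).

```shell
#!/bin/sh
# Sketch only: detect endpoints that carry idle annotations while the
# service of the same name does not. Names here are hypothetical.
OC="${OC:-oc}"                            # CLI to invoke
IDLE_PREFIX="idling.alpha.openshift.io/"  # assumed idle annotation prefix

# has_idle_annotations KIND NAME NAMESPACE
# Succeeds if the object's YAML mentions the idle annotation prefix.
has_idle_annotations() {
    $OC get "$1" "$2" -n "$3" -o yaml | grep -qF "$IDLE_PREFIX"
}

# needs_mirroring NAME NAMESPACE
# Succeeds when the endpoints look idled but the service has no idle annotations.
needs_mirroring() {
    has_idle_annotations endpoints "$1" "$2" &&
        ! has_idle_annotations service "$1" "$2"
}
```

For example, `needs_mirroring service-unsecure test1 && echo "service is missing idle annotations"` would flag the service used in the verification below, assuming the names match.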


Steps to Reproduce:
1. Idle a workload (ex: run `oc idle` on a service + deployment + route)
2. Upgrade the cluster to a cluster version containing the fixes for Bug 1900989


Actual results:
Curling the idled route does not "wake it up".

Expected results:
Unidling a route after an upgrade should always work without user intervention.

Additional info:

--- Additional comment from sgreene on 2021-02-04 16:54:10 UTC ---

Note that the fix for this bug should only be available in 4.6 and 4.7, since any clusters upgrading to 4.8 and beyond would already have the idle annotations mirrored over from 4.6.z/4.7.z (we can shave a couple of seconds off operator start time by not performing the idle annotations check in future releases).

--- Additional comment from sgreene on 2021-02-04 20:47:00 UTC ---

Workaround for customers upgrading with idled workloads to a version of 4.6.z/4.7.z with the new idle changes from Bug 1900989:

0) Wait for upgrade to complete
1) Remove idle annotations from idled endpoints (oc edit ...), noting the idled scalable resources and their prior replica count.
2) Manually scale idled scalable resources back up to the desired number of replicas (oc scale ...)
3) Route should now be unidled.
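
The workaround steps above could be scripted roughly as follows. This is a sketch under assumptions, not a supported procedure: the function name is hypothetical, the two annotation keys are assumed to be the ones `oc idle` sets, and `oc annotate` with a trailing `-` is used in place of `oc edit` to remove them. Run it only after the upgrade has completed.

```shell
#!/bin/sh
# Sketch only: manual unidle after upgrade. Function name is hypothetical.
OC="${OC:-oc}"   # CLI to invoke

# Assumed idle annotation keys set by `oc idle`.
IDLED_AT="idling.alpha.openshift.io/idled-at"
UNIDLE_TARGETS="idling.alpha.openshift.io/unidle-targets"

# unidle_manually NAMESPACE ENDPOINTS RESOURCE REPLICAS
unidle_manually() {
    ns="$1"; ep="$2"; resource="$3"; replicas="$4"

    # Print the endpoints first so the idled resources and their prior
    # replica counts can be noted before the annotations are removed.
    $OC get endpoints "$ep" -n "$ns" -o yaml

    # Step 1: a trailing "-" on a key removes that annotation.
    $OC annotate endpoints "$ep" -n "$ns" "${IDLED_AT}-" "${UNIDLE_TARGETS}-"

    # Step 2: scale the idled resource back to the desired replica count.
    $OC scale "$resource" -n "$ns" --replicas="$replicas"
}
```

With the resources from the verification comment below, an invocation might look like `unidle_manually test1 service-unsecure rc/web-server-rc 1`.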

Comment 1 Stephen Greene 2021-02-25 20:15:17 UTC
awaiting cherry pick

Comment 3 Arvind iyengar 2021-03-15 06:21:33 UTC
Tested with the upgrade from "4.5.34" to the "4.6.0-0.nightly-2021-03-13-204449" payload, which is the latest as of writing. After upgrading to that payload, the idled routes are woken up and become accessible via curl without any manual intervention:
------
$ oc get clusterversion                  
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.34    True        False         4m56s   Cluster version is 4.5.34

oc version
Client Version: 4.5.34
Server Version: 4.5.34
Kubernetes Version: v1.18.3+cdb0358

Create project resources and idle the route:
oc get all                    
NAME                      READY   STATUS    RESTARTS   AGE
pod/web-server-rc-x6g8h   1/1     Running   0          59s

NAME                                  DESIRED   CURRENT   READY   AGE
replicationcontroller/web-server-rc   1         1         1       60s

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/service-secure     ClusterIP   172.30.26.24    <none>        27443/TCP   60s
service/service-unsecure   ClusterIP   172.30.233.68   <none>        27017/TCP   60s

NAME                                        HOST/PORT                                                                 PATH   SERVICES           PORT   TERMINATION   WILDCARD
route.route.openshift.io/service-unsecure   service-unsecure-test1.apps.aiyengar-oc4534.qe.devcluster.openshift.com          service-unsecure   http                 None


oc idle service-unsecure
WARNING: idling when network policies are in place may cause connections to bypass network policy entirely
The service "test1/service-unsecure" has been marked as idled 
The service will unidle ReplicationController "test1/web-server-rc" to 1 replicas once it receives traffic 
ReplicationController "test1/web-server-rc" has been idled 


oc get all
NAME                                  DESIRED   CURRENT   READY   AGE
replicationcontroller/web-server-rc   0         0         0       116s

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/service-secure     ClusterIP   172.30.26.24    <none>        27443/TCP   116s
service/service-unsecure   ClusterIP   172.30.233.68   <none>        27017/TCP   116s

NAME                                        HOST/PORT                                                                 PATH   SERVICES           PORT   TERMINATION   WILDCARD
route.route.openshift.io/service-unsecure   service-unsecure-test1.apps.aiyengar-oc4534.qe.devcluster.openshift.com          service-unsecure   http                 None


* Triggering an upgrade results in success:
oc adm upgrade --to=4.6.0-0.nightly-2021-03-13-204449 --force
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to 4.6.0-0.nightly-2021-03-13-204449

oc get clusterversion                          
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.34    True        True          49m     Working towards 4.6.0-0.nightly-2021-03-13-204449: 29% complete
....
oc get clusterversion                                     
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-03-13-204449   True        False         2m52s   Cluster version is 4.6.0-0.nightly-2021-03-13-204449


* Curling the idled route succeeds and wakes up the backend pods:
oc get all                                 
NAME                                  DESIRED   CURRENT   READY   AGE
replicationcontroller/web-server-rc   0         0         0       98m

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/service-secure     ClusterIP   172.30.26.24    <none>        27443/TCP   98m
service/service-unsecure   ClusterIP   172.30.233.68   <none>        27017/TCP   98m

NAME                                        HOST/PORT                                                                 PATH   SERVICES           PORT   TERMINATION   WILDCARD
route.route.openshift.io/service-unsecure   service-unsecure-test1.apps.aiyengar-oc4534.qe.devcluster.openshift.com          service-unsecure   http                 None


curl service-unsecure-test1.apps.aiyengar-oc4534.qe.devcluster.openshift.com                                         
Hello-OpenShift web-server-rc-5772w http-8080


oc get all                          
NAME                      READY   STATUS    RESTARTS   AGE
pod/web-server-rc-5772w   1/1     Running   0          8s

NAME                                  DESIRED   CURRENT   READY   AGE
replicationcontroller/web-server-rc   1         1         1       99m

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/service-secure     ClusterIP   172.30.26.24    <none>        27443/TCP   99m
service/service-unsecure   ClusterIP   172.30.233.68   <none>        27017/TCP   99m

NAME                                        HOST/PORT                                                                 PATH   SERVICES           PORT   TERMINATION   WILDCARD
route.route.openshift.io/service-unsecure   service-unsecure-test1.apps.aiyengar-oc4534.qe.devcluster.openshift.com          service-unsecure   http                 None

------

Comment 6 errata-xmlrpc 2021-03-25 04:45:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.22 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0825