+++ This bug was initially created as a clone of Bug #1925245 +++

Description of problem:

Bug 1900989 fixes `oc idle` in 4.6 and 4.7 by annotating a workload's service with the proper idle annotations, in addition to the workload's endpoints, among other things. Clusters with idled workloads that upgrade to a cluster version containing the fixes for Bug 1900989 will run into issues with unidling: unidling the idled workload will not work without manual user intervention (the service idle annotations are needed for unidling to work going forward).

Steps to Reproduce:
1. Idle a workload (ex: run `oc idle` on a service + deployment + route)
2. Upgrade the cluster to a cluster version containing the fixes for Bug 1900989

Actual results:
Curling the idled route does not "wake it up".

Expected results:
Unidling a route after an upgrade should always work without user intervention.

Additional info:

--- Additional comment from sgreene on 2021-02-04 16:54:10 UTC ---

Note that the fix for this bug should only be needed in 4.6 and 4.7, since any clusters upgrading to 4.8 and beyond would already have the idle annotations mirrored over from 4.6.z/4.7.z (we can shave a couple of seconds off of operator start time by not performing the idle annotation check in future releases).

--- Additional comment from sgreene on 2021-02-04 20:47:00 UTC ---

Workaround for customers upgrading with idled workloads to a 4.6.z/4.7.z version containing the new idle changes from Bug 1900989:
0) Wait for the upgrade to complete.
1) Remove the idle annotations from the idled endpoints (oc edit ...), noting the idled scalable resources and their prior replica counts.
2) Manually scale the idled scalable resources back up to the desired number of replicas (oc scale ...).
3) The route should now be unidled.
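The workaround above can be sketched as a small script. The idle annotations on the endpoints record the scalable resources and their prior replica counts; a minimal sketch, assuming the unidle-targets annotation carries JSON of the form `[{"kind":...,"name":...,"replicas":...}]` (the sample value below is hypothetical, not taken from this bug):

```shell
# Hypothetical sample value of the unidle-targets idle annotation as it might
# appear on the idled endpoints (format assumed; on a live cluster you would
# read it with something like "oc get endpoints <name> -o yaml").
targets='[{"kind":"ReplicationController","name":"web-server-rc","replicas":1}]'

# Derive the manual "oc scale" command for the target without executing it,
# so the command can be reviewed before step 2 of the workaround.
name=$(printf '%s' "$targets" | grep -o '"name":"[^"]*"' | cut -d'"' -f4)
replicas=$(printf '%s' "$targets" | grep -o '"replicas":[0-9]*' | cut -d: -f2)
echo "oc scale rc/$name --replicas=$replicas"
# Prints: oc scale rc/web-server-rc --replicas=1
```

Record the annotation values before removing them in step 1, since scaling back up in step 2 needs the prior replica counts.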
Bumping the priority to urgent, as we want this to go into a 4.7.z release as early as possible, because it also needs to go into 4.6 as early as possible after that.
awaiting cherry pick
Verified in "4.7.0-0.nightly-2021-03-04-004412" release version. Upgrading a v4.6 cluster to the said payload, the idled route gets woken up and becomes accessible via curl without any manual intervention:

------------
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.20    True        False         20m     Cluster version is 4.6.20

$ oc get all
NAME                      READY   STATUS    RESTARTS   AGE
pod/web-server-rc-mltl9   1/1     Running   0          21s

NAME                                  DESIRED   CURRENT   READY   AGE
replicationcontroller/web-server-rc   1         1         1       21s

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/service-secure     ClusterIP   172.30.97.217   <none>        27443/TCP   22s
service/service-unsecure   ClusterIP   172.30.31.235   <none>        27017/TCP   22s

NAME                                        HOST/PORT                                                                   PATH   SERVICES           PORT   TERMINATION   WILDCARD
route.route.openshift.io/service-unsecure   service-unsecure-test1a.apps.aiyengar-oc4620.qe.devcluster.openshift.com          service-unsecure   http                 None

$ curl service-unsecure-test1a.apps.aiyengar-oc4620.qe.devcluster.openshift.com
Hello-OpenShift web-server-rc-mltl9 http-8080

$ oc idle service-unsecure
WARNING: idling when network policies are in place may cause connections to bypass network policy entirely
The service "test1a/service-unsecure" has been marked as idled
The service will unidle ReplicationController "test1a/web-server-rc" to 1 replicas once it receives traffic
ReplicationController "test1a/web-server-rc" has been idled

$ oc get all
NAME                                  DESIRED   CURRENT   READY   AGE
replicationcontroller/web-server-rc   0         0         0       87s

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/service-secure     ClusterIP   172.30.97.217   <none>        27443/TCP   87s
service/service-unsecure   ClusterIP   172.30.31.235   <none>        27017/TCP   87s

NAME                                        HOST/PORT                                                                   PATH   SERVICES           PORT   TERMINATION   WILDCARD
route.route.openshift.io/service-unsecure   service-unsecure-test1a.apps.aiyengar-oc4620.qe.devcluster.openshift.com          service-unsecure   http                 None

$ oc adm upgrade
Cluster version is 4.6.20

Updates:

VERSION                             IMAGE
4.7.0-0.nightly-2021-03-04-004412   registry.ci.openshift.org/ocp/release@sha256:7c742552fb326b32a6e4da1700ecbbbbe284ef4a69c0675167904fdcc2a6e2e5

$ oc adm upgrade --to=4.7.0-0.nightly-2021-03-04-004412
Updating to 4.7.0-0.nightly-2021-03-04-004412

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.20    True        True          60m     Working towards 4.7.0-0.nightly-2021-03-04-004412: 175 of 668 done (26% complete)
....
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-04-004412   True        False         175m    Cluster version is 4.7.0-0.nightly-2021-03-04-004412

$ curl service-unsecure-test1a.apps.aiyengar-oc4620.qe.devcluster.openshift.com
Hello-OpenShift web-server-rc-lrgs4 http-8080

$ oc get all
NAME                      READY   STATUS    RESTARTS   AGE
pod/web-server-rc-lrgs4   1/1     Running   0          24s

NAME                                  DESIRED   CURRENT   READY   AGE
replicationcontroller/web-server-rc   1         1         1       69m

NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/service-secure     ClusterIP   172.30.97.217   <none>        27443/TCP   69m
service/service-unsecure   ClusterIP   172.30.31.235   <none>        27017/TCP   69m

NAME                                        HOST/PORT                                                                   PATH   SERVICES           PORT   TERMINATION   WILDCARD
route.route.openshift.io/service-unsecure   service-unsecure-test1a.apps.aiyengar-oc4620.qe.devcluster.openshift.com          service-unsecure   http                 None
------------
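Beyond curling the route, the fix can also be checked directly: after the upgrade, the idle annotations that previously lived only on the endpoints should be mirrored onto the service. A conceptual sketch of that check, using hypothetical annotation values (on a live cluster they would come from something like `oc get endpoints service-unsecure -o yaml` and `oc get service service-unsecure -o yaml`):

```shell
# Hypothetical annotation values; the comparison logic is the point, not the
# values themselves.
ep_idled_at='2021-03-04T10:00:00Z'   # idle timestamp read from the endpoints
svc_idled_at='2021-03-04T10:00:00Z'  # same timestamp mirrored onto the service

# Unidling works going forward only if the service carries the annotation too.
if [ -n "$svc_idled_at" ] && [ "$ep_idled_at" = "$svc_idled_at" ]; then
  status="mirrored"   # post-fix state: no manual intervention needed
else
  status="missing"    # pre-fix state: manual workaround required
fi
echo "service idle annotation: $status"
```

If the annotation is missing from the service after the upgrade, the manual workaround from the earlier comment still applies.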
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.2 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0749