Bug 2005901 - KS, KCM and KA going Degraded during master nodes upgrade [NEEDINFO]
Summary: KS, KCM and KA going Degraded during master nodes upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard: LifecycleFrozen
Depends On:
Blocks:
Reported: 2021-09-20 12:54 UTC by Jan Chaloupka
Modified: 2022-03-10 16:12 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:12:09 UTC
Target Upstream Version:
Embargoed:
mfojtik: needinfo?


Attachments
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded) (17.91 KB, image/png)
2021-09-20 12:54 UTC, Jan Chaloupka
KCM - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded) (16.93 KB, image/png)
2021-09-20 12:56 UTC, Jan Chaloupka
KA - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded) (16.93 KB, image/png)
2021-09-20 12:56 UTC, Jan Chaloupka
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded) (17.91 KB, image/png)
2021-09-20 12:57 UTC, Jan Chaloupka


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-authentication-operator pull 538 | 0 | None | Merged | bug 2039670: Sync the library-go | 2022-01-21 12:39:17 UTC
Github | openshift cluster-kube-apiserver-operator pull 1275 | 0 | None | Merged | bug 2005901: Guard controller pdb | 2022-01-12 07:53:08 UTC
Github | openshift cluster-kube-apiserver-operator pull 1295 | 0 | None | Merged | bug 2005901: Sync the library-go | 2022-01-19 23:19:06 UTC
Github | openshift cluster-kube-controller-manager-operator pull 568 | 0 | None | Merged | bug 2005901: Guard controller pdb | 2022-01-07 11:44:18 UTC
Github | openshift cluster-kube-controller-manager-operator pull 588 | 0 | None | Merged | bug 2005901: Bump library-go | 2022-01-12 07:53:07 UTC
Github | openshift cluster-kube-controller-manager-operator pull 591 | 0 | None | open | bug 2005901: Sync library go | 2022-01-21 12:39:21 UTC
Github | openshift cluster-kube-scheduler-operator pull 373 | 0 | None | Merged | bug 2005901: Guard controller pdb | 2022-01-07 11:44:19 UTC
Github | openshift cluster-kube-scheduler-operator pull 396 | 0 | None | Merged | bug 2005901: Bump library-go | 2022-01-12 07:53:04 UTC
Github | openshift cluster-kube-scheduler-operator pull 397 | 0 | None | open | bug 2005901: Sync the library-go | 2022-01-21 12:39:22 UTC
Github | openshift library-go pull 1281 | 0 | None | Merged | bug 2005901: staticpod builder: initialize operand label selector separately | 2022-01-12 07:53:04 UTC
Github | openshift library-go pull 1287 | 0 | None | Merged | bug 2005901: guard controller: create the pdb if it does not exist | 2022-01-19 23:19:19 UTC
Github | openshift origin pull 26766 | 0 | None | Merged | bug 2005901: Allow KA guard probe to fail as designed | 2022-01-19 23:19:21 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:12:37 UTC

Description Jan Chaloupka 2021-09-20 12:54:52 UTC
Created attachment 1824629 [details]
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Description of problem:
Checking the last 115 jobs from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade/ ("1433654493412593664", "1433695382893760512", "1433703446451589120", "1433737883625197568", "1433745275154862080", "1433780536173662208", "1433788182586986496", "1433819059568250880", "1433839260707852288", "1433856995122745344", "1433877723872235520", "1433903880265011200", "1433926818208944128", "1433941032432570368", "1433964704052547584", "1433985471888756736", "1434007529066598400", "1434029702175002624", "1434051521623887872", "1434073950928769024", "1434095087838564352", "1434114952909557760", "1434143394703085568", "1434167027945181184", "1434184382335160320", "1434205390085558272", "1434229565730852864", "1434247354847858688", "1434269713793290240", "1434294199980658688", "1434343280841068544", "1434351140564111360", "1434399506228580352", "1434409340411842560", "1434447076791422976", "1434453531875610624", "1434487353568661504", "1434529592227401728", "1434558739649662976", "1434588325590601728", "1434602912591384576", "1434637821036990464", "1434649790867574784", "1434681235531108352", "1434705644606197760", "1434721329231171584", "1434769769571028992", "1434779545566711808", "1434810854880055296", "1434932999106859008", "1435287204233482240", "1435293628325957632", "1435335775729225728", "1435369395000971264", "1435562576384626688", "1435569519127957504", "1435604448561860608", "1435622695654920192", "1435648148486754304", "1435925603495710720", "1435932899873394688", "1435966907135037440", "1436003421353152512", "1436330398035480576", "1436363496152371200", "1436404814526287872", "1436481246967369728", "1436843648141496320", "1437206040121708544", "1437347542516895744", "1437369972601917440", "1437438673279782912", "1437781648878866432", "1437843290115280896", "1437860876945199104", "1438097470620962816", "1438107499562536960", "1438147825039839232", "1438188055109308416", "1438206454417854464", "1438245936542257152", "1438308798660874240", "1438344115946262528", "1438432227674296320", "1438439719040978944", "1438482509896617984", "1438508208925708288", "1438535336526352384", "1438558204857421824", "1438585684343394304", "1438630961670524928", "1438688875617718272", "1438696431564099584", "1438753378527088640", "1438776960233771008", "1438846270826352640", "1438892506228985856", "1438914022194810880", "1438955520907022336", "1438980934777966592", "1439017459603476480", "1439045437314043904", "1439078926621085696", "1439129330985734144", "1439192224179949568", "1439257819504185344", "1439323099492257792", "1439386032209399808", "1439461541865852928", "1439526935569895424", "1439588831945822208", "1439652767672045568", "1439718235556548608", "1439781112212623360", "1439867158736670720"), KS, KCM and KA go Degraded at the end of each master node upgrade.

From https://github.com/openshift/cluster-authentication-operator/blob/9efb3c1e5ac657aaa87f237d2c6aea586b7aad49/vendor/github.com/openshift/api/config/v1/types_cluster_operator.go#L161-L177

// Degraded indicates that the operator's current state does not match its
// desired state over a period of time resulting in a lower quality of service.
// The period of time may vary by component, but a Degraded state represents
// persistent observation of a condition.
...
// ... A service should not
// report Degraded during the course of a normal upgrade

Given the operator is going through an upgrade, reporting condition/Degraded=True is incorrect. The important piece of information here is "Degraded state represents persistent observation of a condition". The reported issue is not persistent, only temporary.
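
To make "persistent observation of a condition" concrete, here is a minimal Go sketch of one way an operator could avoid reporting Degraded for a short-lived failure. It is an illustration only and not the fix merged for this bug (the linked PRs add a guard controller PDB instead). The degradedTracker type, its observe method, the runDemo helper and the 5 minute grace period are all hypothetical names and values, not library-go APIs.

```go
package main

import (
	"fmt"
	"time"
)

// degradedTracker is a hypothetical helper, not part of library-go.
// It reports Degraded only after a failure has been observed
// continuously for at least gracePeriod.
type degradedTracker struct {
	gracePeriod  time.Duration
	failingSince time.Time // zero value means "currently healthy"
}

// observe records one health observation taken at time now and returns
// whether the operator should report Degraded=True.
func (t *degradedTracker) observe(healthy bool, now time.Time) bool {
	if healthy {
		// Any healthy observation resets the failure window.
		t.failingSince = time.Time{}
		return false
	}
	if t.failingSince.IsZero() {
		t.failingSince = now
	}
	return now.Sub(t.failingSince) >= t.gracePeriod
}

// runDemo feeds a made-up sequence of observations to the tracker:
// a short blip that recovers, followed by a failure that persists.
func runDemo() []bool {
	start := time.Date(2021, 9, 20, 12, 0, 0, 0, time.UTC)
	tracker := &degradedTracker{gracePeriod: 5 * time.Minute} // assumed value
	return []bool{
		tracker.observe(false, start),                     // blip starts
		tracker.observe(false, start.Add(2*time.Minute)),  // still within grace period
		tracker.observe(true, start.Add(3*time.Minute)),   // recovered, window resets
		tracker.observe(false, start.Add(4*time.Minute)),  // new failure starts
		tracker.observe(false, start.Add(10*time.Minute)), // failing for 6 minutes
	}
}

func main() {
	for i, degraded := range runDemo() {
		fmt.Printf("observation %d: Degraded=%v\n", i, degraded)
	}
}
```

In this sketch only the last observation yields Degraded=True, because the earlier failure recovered before the grace period elapsed. A suitable grace period would depend on the component, as the API comment quoted above notes.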

Comment 1 Jan Chaloupka 2021-09-20 12:56:24 UTC
Created attachment 1824630 [details]
KCM - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Comment 2 Jan Chaloupka 2021-09-20 12:56:52 UTC
Created attachment 1824632 [details]
KA - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Comment 3 Jan Chaloupka 2021-09-20 12:57:24 UTC
Created attachment 1824633 [details]
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Comment 4 Michal Fojtik 2021-11-19 23:09:31 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 5 Jan Chaloupka 2021-11-25 15:26:32 UTC
Will need more time to implement the relevant changes for the static pods to get guarded by a PDB

Comment 8 Jan Chaloupka 2022-01-07 11:44:43 UTC
Need more time to analyze why the CI tests in https://github.com/openshift/cluster-kube-apiserver-operator/pull/1275 are failing.

Comment 17 RamaKasturi 2022-02-04 09:00:48 UTC
Verified the bug in the build below. KA, KS and KCM did not go to the Degraded state during the master nodes upgrade. Below is the procedure I followed to verify this.

Procedure followed:
==================
1) Install a 4.9 cluster.
2) Pause the worker node upgrade using the command: oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/worker
3) Upgrade the master nodes to 4.10.0-rc.0 by running the 'oc adm upgrade --to-image=<version>' command.
4) None of the KA, KCM and KS operators went to the Degraded state, and the upgrade went fine.

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         117m    Cluster version is 4.10.0-rc.0


Based on the above, moving the bug to the verified state.
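
The manual check in step 4 could be partly automated. The following is a rough Go sketch, under stated assumptions, that scans the table printed by `oc get clusteroperators` and lists operators whose DEGRADED column is True. The degradedOperators function and the sample table are hypothetical and were not taken from this bug. The sketch also assumes every column before DEGRADED is non-empty, so that whitespace splitting keeps the columns aligned.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sample is a made-up table in the shape printed by `oc get clusteroperators`.
// The values are illustrative and do not come from the verified cluster.
const sample = `NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver            4.10.0-rc.0   True        True          True       3m      hypothetical message
kube-controller-manager   4.10.0-rc.0   True        False         False      117m
kube-scheduler            4.10.0-rc.0   True        False         False      117m
`

// degradedOperators returns the names of operators whose DEGRADED column
// is "True". The column index is taken from the header row instead of
// being hard-coded.
func degradedOperators(table string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(table))
	col := -1
	var degraded []string
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if col < 0 {
			// The first non-empty line is treated as the header.
			for i, f := range fields {
				if f == "DEGRADED" {
					col = i
				}
			}
			if col < 0 {
				return nil, fmt.Errorf("no DEGRADED column in header %q", scanner.Text())
			}
			continue
		}
		if len(fields) > col && fields[col] == "True" {
			degraded = append(degraded, fields[0])
		}
	}
	return degraded, scanner.Err()
}

func main() {
	names, err := degradedOperators(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("degraded operators:", names)
}
```

Against a live cluster, the input could instead be the captured output of `oc get clusteroperators kube-apiserver kube-controller-manager kube-scheduler`, polled repeatedly while the master nodes upgrade.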

Comment 19 errata-xmlrpc 2022-03-10 16:12:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

