Bug 2005901

Summary: KS, KCM and KA going Degraded during master nodes upgrade
Product: OpenShift Container Platform
Reporter: Jan Chaloupka <jchaloup>
Component: kube-scheduler
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: high
Docs Contact:
Priority: medium
Version: 4.10
CC: aos-bugs, maszulik, mfojtik, sdodson, wking
Target Milestone: ---
Flags: mfojtik: needinfo?
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: LifecycleFrozen
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-10 16:12:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)
none
KCM - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)
none
KA - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)
none
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)
none

Description Jan Chaloupka 2021-09-20 12:54:52 UTC
Created attachment 1824629 [details]
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Description of problem:
In the last 115 jobs from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade/ ("1433654493412593664", "1433695382893760512", "1433703446451589120", "1433737883625197568", "1433745275154862080", "1433780536173662208", "1433788182586986496", "1433819059568250880", "1433839260707852288", "1433856995122745344", "1433877723872235520", "1433903880265011200", "1433926818208944128", "1433941032432570368", "1433964704052547584", "1433985471888756736", "1434007529066598400", "1434029702175002624", "1434051521623887872", "1434073950928769024", "1434095087838564352", "1434114952909557760", "1434143394703085568", "1434167027945181184", "1434184382335160320", "1434205390085558272", "1434229565730852864", "1434247354847858688", "1434269713793290240", "1434294199980658688", "1434343280841068544", "1434351140564111360", "1434399506228580352", "1434409340411842560", "1434447076791422976", "1434453531875610624", "1434487353568661504", "1434529592227401728", "1434558739649662976", "1434588325590601728", "1434602912591384576", "1434637821036990464", "1434649790867574784", "1434681235531108352", "1434705644606197760", "1434721329231171584", "1434769769571028992", "1434779545566711808", "1434810854880055296", "1434932999106859008", "1435287204233482240", "1435293628325957632", "1435335775729225728", "1435369395000971264", "1435562576384626688", "1435569519127957504", "1435604448561860608", "1435622695654920192", "1435648148486754304", "1435925603495710720", "1435932899873394688", "1435966907135037440", "1436003421353152512", "1436330398035480576", "1436363496152371200", "1436404814526287872", "1436481246967369728", "1436843648141496320", "1437206040121708544", "1437347542516895744", "1437369972601917440", "1437438673279782912", "1437781648878866432", "1437843290115280896", "1437860876945199104", "1438097470620962816", "1438107499562536960", "1438147825039839232", 
"1438188055109308416", "1438206454417854464", "1438245936542257152", "1438308798660874240", "1438344115946262528", "1438432227674296320", "1438439719040978944", "1438482509896617984", "1438508208925708288", "1438535336526352384", "1438558204857421824", "1438585684343394304", "1438630961670524928", "1438688875617718272", "1438696431564099584", "1438753378527088640", "1438776960233771008", "1438846270826352640", "1438892506228985856", "1438914022194810880", "1438955520907022336", "1438980934777966592", "1439017459603476480", "1439045437314043904", "1439078926621085696", "1439129330985734144", "1439192224179949568", "1439257819504185344", "1439323099492257792", "1439386032209399808", "1439461541865852928", "1439526935569895424", "1439588831945822208", "1439652767672045568", "1439718235556548608", "1439781112212623360", "1439867158736670720"), KS, KCM and KA go Degraded at the end of each master node upgrade.

From https://github.com/openshift/cluster-authentication-operator/blob/9efb3c1e5ac657aaa87f237d2c6aea586b7aad49/vendor/github.com/openshift/api/config/v1/types_cluster_operator.go#L161-L177

// Degraded indicates that the operator's current state does not match its
// desired state over a period of time resulting in a lower quality of service.
// The period of time may vary by component, but a Degraded state represents
// persistent observation of a condition.
...
// ... A service should not
// report Degraded during the course of a normal upgrade

Given the operator is going through an upgrade, reporting condition/Degraded=True is incorrect. The important piece of information here is "Degraded state represents persistent observation of a condition". The reported issue is not persistent, only temporary.
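
To make the "persistent observation" wording concrete, here is a minimal Python sketch of a debounce: a failing observation is surfaced as Degraded only after it has held continuously for a grace period. This is an illustration only, not the operators' actual logic and not the fix applied for this bug; the function name degraded_after_grace, the sample format, and the 120-second grace period are all assumptions.

```python
from typing import Iterable, List, Tuple


def degraded_after_grace(
    samples: Iterable[Tuple[float, bool]],
    grace_seconds: float = 120.0,  # assumed value, purely illustrative
) -> List[Tuple[float, bool]]:
    """Map raw (timestamp, failing) samples to (timestamp, degraded).

    Degraded is reported only once `failing` has been continuously True
    for at least `grace_seconds`; any healthy sample resets the timer.
    Samples are expected in ascending timestamp order.
    """
    result = []
    failing_since = None  # timestamp at which the current failing streak began
    for ts, failing in samples:
        if not failing:
            failing_since = None
            result.append((ts, False))
            continue
        if failing_since is None:
            failing_since = ts
        result.append((ts, ts - failing_since >= grace_seconds))
    return result


# A short blip while a master node restarts never reaches the grace period.
blip = [(0, False), (30, True), (60, True), (90, True), (120, False)]
# A failure that keeps being observed is eventually reported as Degraded.
persistent = [(0, True), (60, True), (120, True), (180, True)]

print(degraded_after_grace(blip))
print(degraded_after_grace(persistent))
```

With these hypothetical samples, the blip is never reported as Degraded, while the persistent failure flips to Degraded once it has lasted 120 seconds.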

Comment 1 Jan Chaloupka 2021-09-20 12:56:24 UTC
Created attachment 1824630 [details]
KCM - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Comment 2 Jan Chaloupka 2021-09-20 12:56:52 UTC
Created attachment 1824632 [details]
KA - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Comment 3 Jan Chaloupka 2021-09-20 12:57:24 UTC
Created attachment 1824633 [details]
KS - Green interval indicates an operator going Degraded (red, yellow and blue intervals corresponds to master nodes getting upgraded)

Comment 4 Michal Fojtik 2021-11-19 23:09:31 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Whiteboard if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 5 Jan Chaloupka 2021-11-25 15:26:32 UTC
Will need more time to implement the relevant changes for the static pods to get guarded by a PDB.

Comment 8 Jan Chaloupka 2022-01-07 11:44:43 UTC
Need more time to analyze why the CI tests in https://github.com/openshift/cluster-kube-apiserver-operator/pull/1275 are failing.

Comment 17 RamaKasturi 2022-02-04 09:00:48 UTC
Verified the bug in the build below, and I see that KA, KS & KCM did not go to degraded state during master nodes upgrade. Below is the procedure I followed to verify the same.

Procedure followed:
==================
1) Install 4.9 cluster
2) Pause worker node upgrade using the command oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/worker
3) Upgrade master nodes to 4.10.0-rc.0 by running the 'oc adm upgrade --to-image=<version>' command
4) I did not see any operators like KA, KCM & KS going to degraded state and the upgrade went fine.

NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.0   True        False         117m    Cluster version is 4.10.0-rc.0


Based on the above, moving the bug to verified state.
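
The manual check in step 4 could be approximated with a small script. The Python sketch below takes the JSON printed by `oc get clusteroperators -o json` and lists the watched operators whose Degraded condition is True. The helper name degraded_operators and the sample payload are made up for illustration; the sketch relies only on the items[].metadata.name and items[].status.conditions[] fields of the ClusterOperator resource.

```python
import json
from typing import List, Sequence

# ClusterOperator names for KA, KCM and KS.
WATCHED = ("kube-apiserver", "kube-controller-manager", "kube-scheduler")


def degraded_operators(clusteroperators_json: str,
                       watched: Sequence[str] = WATCHED) -> List[str]:
    """Return the watched operators that report Degraded=True."""
    doc = json.loads(clusteroperators_json)
    degraded = []
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        if name not in watched:
            continue
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                degraded.append(name)
    return degraded


# Hypothetical payload, trimmed to the fields the function reads.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "kube-apiserver"},
         "status": {"conditions": [{"type": "Degraded", "status": "False"}]}},
        {"metadata": {"name": "kube-scheduler"},
         "status": {"conditions": [{"type": "Degraded", "status": "True"}]}},
        {"metadata": {"name": "dns"},
         "status": {"conditions": [{"type": "Degraded", "status": "True"}]}},
    ]
})

print(degraded_operators(sample))
```

Against a live cluster the input would come from the `oc` command above; running it repeatedly while the master nodes upgrade would give a rough equivalent of watching the operators by hand.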

Comment 19 errata-xmlrpc 2022-03-10 16:12:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056