Description of problem:
The kube-scheduler pod goes into CrashLoopBackOff state during a cluster upgrade when a custom scheduler policy is configured.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create a policy.cfg file with the contents below:
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "MaxGCEPDVolumeCount"},
    {"name" : "GeneralPredicates"},
    {"name" : "MaxAzureDiskVolumeCount"},
    {"name" : "MaxCSIVolumeCountPred"},
    {"name" : "CheckVolumeBinding"},
    {"name" : "MaxEBSVolumeCount"},
    {"name" : "MatchInterPodAffinity"},
    {"name" : "CheckNodeUnschedulable"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "PodToleratesNodeTaints"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
    {"name" : "NodeAffinityPriority", "weight" : 1},
    {"name" : "TaintTolerationPriority", "weight" : 1},
    {"name" : "ImageLocalityPriority", "weight" : 1},
    {"name" : "SelectorSpreadPriority", "weight" : 1},
    {"name" : "InterPodAffinityPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ]
}
2. oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
3. oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge
4. Run the upgrade command to perform an upgrade from 4.9 to 4.10.

Actual results:
The kube-scheduler pod goes into CrashLoopBackOff state.

Expected results:
The kube-scheduler pod should not go into CrashLoopBackOff state.

Additional info:
If the policy API is configured when upgrading from 4.9 to 4.10, the kube-scheduler is expected to crash loop. @mdame created https://github.com/openshift/cluster-kube-scheduler-operator/pull/391 to prevent the upgrade from proceeding until the policy API is no longer in use.
(In reply to RamaKasturi from comment #0)
> Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-01-05-181126
*** Bug 2041985 has been marked as a duplicate of this bug. ***
This issue needs to be resolved before 4.10 goes GA. All existing 4.9 installations need to be upgraded to the latest 4.9.x version that includes the fix in https://github.com/openshift/cluster-kube-scheduler-operator/pull/400 before upgrading to 4.10.
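One way to tell whether a 4.9 cluster is affected is to check whether the Scheduler object still references a policy ConfigMap. The sketch below is only an illustration: the helper names `policy_unset` and `precheck_policy` are hypothetical, and it assumes the policy name has already been read from the cluster, for example with the jsonpath query shown in the comment.

```shell
# On a live cluster the policy name could be read with:
#   name=$(oc get scheduler cluster -o jsonpath='{.spec.policy.name}')
# The query prints nothing when spec.policy is not set.

# Succeed (exit 0) when the given policy name is empty, i.e. the
# deprecated spec.policy field is not in use.
policy_unset() {
  [ -z "$1" ]
}

# Hypothetical pre-upgrade check: print a verdict for the given policy
# name and return non-zero when the 4.9 -> 4.10 upgrade should wait.
precheck_policy() {
  if policy_unset "$1"; then
    echo "ok: scheduler policy not set"
  else
    echo "blocked: scheduler policy '$1' is set; clear it before upgrading to 4.10"
    return 1
  fi
}
```

Calling `precheck_policy "$name"` before starting the upgrade reports whether the policy field still needs to be cleared.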
Marking TestBlocker to match duplicated bug https://bugzilla.redhat.com/show_bug.cgi?id=2041985. Blocks Azure Stack Hub testing.
Tried verifying the bug with the payload below. Even after clearing the policy field, the Upgradeable=False warning does not go away.
4.9.0-0.nightly-2022-01-21-203405
Tried verifying the bug with the build below. When the policy field is set, I do not see the Upgradeable=False warning.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-01-24-212243   True        False         4h19m   Cluster version is 4.9.0-0.nightly-2022-01-24-212243

[knarra@knarra ~]$ oc get Scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2022-01-25T05:39:38Z"
  generation: 2
  name: cluster
  resourceVersion: "29144"
  uid: a13e2eea-7e64-41fc-9935-2649028f2222
spec:
  mastersSchedulable: false
  policy:
    name: scheduler-policy
status: {}

[knarra@knarra ~]$ oc adm upgrade
Cluster version is 4.9.0-0.nightly-2022-01-24-212243

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.9
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.9.0-0.nightly-2022-01-24-212243 not found in the "stable-4.9" channel

Based on the above, moving the bug back to ASSIGNED state.
With the latest payload, I still hit the same issue:

[root@localhost tmp]# oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge
scheduler.config.openshift.io/cluster patched

[root@localhost tmp]# oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2022-01-26T08:40:38Z"
  generation: 2
  name: cluster
  resourceVersion: "31179"
  uid: cd4fa1ba-2845-4fc6-87c8-1f3ea986f08d
spec:
  mastersSchedulable: false
  policy:
    name: scheduler-policy
status: {}

[root@localhost tmp]# oc get pod
NAME                                                                       READY   STATUS             RESTARTS        AGE
...
openshift-kube-scheduler-guard-ip-10-0-139-80.us-east-2.compute.internal   0/1     Running            0               23m
openshift-kube-scheduler-guard-ip-10-0-184-92.us-east-2.compute.internal   1/1     Running            0               23m
openshift-kube-scheduler-guard-ip-10-0-193-71.us-east-2.compute.internal   1/1     Running            0               23m
openshift-kube-scheduler-ip-10-0-139-80.us-east-2.compute.internal         2/3     CrashLoopBackOff   9 (111s ago)    23m
openshift-kube-scheduler-ip-10-0-184-92.us-east-2.compute.internal         3/3     Running            0               49m
openshift-kube-scheduler-ip-10-0-193-71.us-east-2.compute.internal         3/3     Running            0               48m
...

[root@localhost tmp]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-01-24-212243   True        True          37m     Working towards 4.10.0-0.nightly-2022-01-25-023600: 117 of 769 done (15% complete), waiting up to 40 minutes on kube-scheduler
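To spot the failing scheduler pod without reading the whole listing, the pod table can be filtered on its STATUS column. This is a minimal sketch: `crashlooping_pods` is a hypothetical helper name, and it assumes the column layout shown in the `oc get pod` output above, with STATUS as the third field.

```shell
# Read `oc get pod` rows on stdin and print the names of pods whose
# STATUS column (third field) is CrashLoopBackOff.
# Possible use on a live cluster:
#   oc get pod -n openshift-kube-scheduler --no-headers | crashlooping_pods
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}
```

The RESTARTS value can contain spaces, as in `9 (111s ago)`, but it comes after the STATUS field, so the filter is not affected by it.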
The fix for https://bugzilla.redhat.com/show_bug.cgi?id=2037665#c12 was merged yesterday. So 4.9.0-0.nightly-2022-01-24-212243 will not have it yet. Any way to test today's build?
The bug is not a blocker for 4.9, only for 4.10, since the fix needs to be included in 4.9 before the 4.9 -> 4.10 upgrade. However, to get the fix merged I needed to set the target release to 4.9.z. Setting the blocker flag to - so that 4.9.18 can be promoted.
If all of the PRs have merged, then MODIFIED is the right state. The bot should move it to ON_QA when a new nightly is created, though I'm a bit surprised to have seen it do that yesterday. Anyway, let's hope a new nightly comes soon and this doesn't move to ON_QA prematurely again.
Verified the bug with the build below; it works as expected.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-01-27-035211   True        False         3h16m   Cluster version is 4.9.0-0.nightly-2022-01-27-035211

Below is the procedure I followed to verify the bug:
========================================================
1) Install the latest 4.9 nightly.
2) Create a configmap with the policy.cfg below:
[knarra@knarra ~]$ cat policy.cfg
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "MaxGCEPDVolumeCount"},
    {"name" : "GeneralPredicates"},
    {"name" : "MaxAzureDiskVolumeCount"},
    {"name" : "MaxCSIVolumeCountPred"},
    {"name" : "CheckVolumeBinding"},
    {"name" : "MaxEBSVolumeCount"},
    {"name" : "MatchInterPodAffinity"},
    {"name" : "CheckNodeUnschedulable"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "PodToleratesNodeTaints"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
    {"name" : "NodeAffinityPriority", "weight" : 1},
    {"name" : "TaintTolerationPriority", "weight" : 1},
    {"name" : "ImageLocalityPriority", "weight" : 1},
    {"name" : "SelectorSpreadPriority", "weight" : 1},
    {"name" : "InterPodAffinityPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ]
}
3) oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
4) oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge
5) Wait for the kube-scheduler pods to restart.
6) Run "oc adm upgrade". You will not be allowed to upgrade to 4.10, and the warning below is shown:
[knarra@knarra ~]$ oc adm upgrade
Cluster version is 4.9.0-0.nightly-2022-01-27-035211

Upgradeable=False

  Reason: Policy_PolicyFieldSpecified
  Message: Cluster operator kube-scheduler should not be upgraded between minor versions: PolicyUpgradeable: deprecated scheduler.policy field is set, and it is to be removed in the next release

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.9
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.9.0-0.nightly-2022-01-27-035211 not found in the "stable-4.9" channel
7) Remove the policy name from the policy field in the Scheduler cluster object.
8) Wait for the kube-scheduler pods to restart.
9) Run "oc adm upgrade" again.
10) The Upgradeable=False statement is gone from the `oc adm upgrade` output, and the user can upgrade to 4.10 without any issues.

Based on the above, moving the bug to VERIFIED state.
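Steps 7 to 10 can be sketched roughly as follows. This is an assumption-laden illustration, not the exact commands used during verification: `clear_scheduler_policy` and `upgrade_blocked` are hypothetical helper names, and removing the field with a `null` value relies on standard JSON merge patch semantics rather than anything stated in this report.

```shell
# Assumed way to clear the deprecated policy reference: in a JSON merge
# patch, setting a key to null removes it from the object.
clear_scheduler_policy() {
  oc patch Scheduler cluster --type=merge -p '{"spec":{"policy":null}}'
}

# Read `oc adm upgrade` output on stdin and succeed (exit 0) while it
# still contains the Upgradeable=False statement.
# Possible use on a live cluster:
#   oc adm upgrade | upgrade_blocked && echo "upgrade to 4.10 still blocked"
upgrade_blocked() {
  grep -q 'Upgradeable=False'
}
```

After clearing the field, the kube-scheduler pods need time to restart before the check stops reporting the block, as noted in step 8.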
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.19 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0340