Description of problem: During master node upgrades, when nodes are being drained there is currently no protection against two or more operands going down at the same time. If your component is required to be available during upgrades or other voluntary disruptions, please consider deploying a PodDisruptionBudget (PDB) to protect your operands. The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.

Example:
- https://github.com/openshift/cluster-authentication-operator/pull/476/files
- https://github.com/openshift/cluster-authentication-operator/pull/514/files

HowTo:
1. Create a PDB manifest.
2. Use conditional resources to deploy the PDB only when the infrastructure topology is other than SNO (single-node OpenShift).
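For reference, a minimal PDB manifest along the lines described above might look like the following sketch. The name, namespace, and labels here are illustrative (modeled on the AWS cloud-controller-manager output in the verification comments below); the actual manifests live in the operator repositories linked above.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: aws-cloud-controller-manager
  namespace: openshift-cloud-controller-manager
spec:
  # With 2 replicas, minAvailable: 1 allows at most one pod to be
  # evicted at a time during node drains.
  minAvailable: 1
  selector:
    matchLabels:
      infrastructure.openshift.io/cloud-controller-manager: AWS
      k8s-app: aws-cloud-controller-manager
```

The selector must match the labels on the operand pods, which is why the verification below checks both the PDB and the deployment's labels and selectors.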
If necessary, please consider backporting the PDB into 4.10 as well.
This is definitely something we should be adding for the CCCMO, and I think we will most likely backport it to 4.10.
Some feedback has been left on the PR to fix this; it will need to be updated before we can merge.
Checked on Alibaba and AWS: the PDB is there, and the labels and selectors are correct. Will check more providers tomorrow.

On Alibaba:

1. Install a fresh cluster:

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-16-163450   True        False         5h58m   Cluster version is 4.11.0-0.nightly-2022-04-16-163450
liuhuali@Lius-MacBook-Pro huali-test %

2. Check the PDB is there:

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                                    MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
alibabacloud-cloud-controller-manager   1               N/A               1                     6h18m
liuhuali@Lius-MacBook-Pro huali-test %

3. Check the labels and selectors are correct:

liuhuali@Lius-MacBook-Pro huali-test % oc get all -n openshift-cloud-controller-manager
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/alibaba-cloud-controller-manager-69bd7cbd9c-d4qrp   1/1     Running   0          6h21m
pod/alibaba-cloud-controller-manager-69bd7cbd9c-lkstz   1/1     Running   0          6h21m

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/alibaba-cloud-controller-manager   2/2     2            2           6h21m

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/alibaba-cloud-controller-manager-69bd7cbd9c   2         2         2       6h21m

liuhuali@Lius-MacBook-Pro huali-test % oc edit deploy alibaba-cloud-controller-manager -n openshift-cloud-controller-manager
Edit cancelled, no changes made.
...
  labels:
    infrastructure.openshift.io/cloud-controller-manager: AlibabaCloud
    k8s-app: alibaba-cloud-controller-manager
...
  selector:
    matchLabels:
      infrastructure.openshift.io/cloud-controller-manager: AlibabaCloud
      k8s-app: alibaba-cloud-controller-manager
...

On AWS:

1. Install a fresh cluster:

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-16-163450   True        False         92m     Cluster version is 4.11.0-0.nightly-2022-04-16-163450
liuhuali@Lius-MacBook-Pro huali-test %

2. Edit the featuregate, then wait for the nodes to restart successfully and for all nodes to return to Ready status. Change:

spec: {}

to:

spec:
  featureSet: TechPreviewNoUpgrade

liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate cluster
featuregate.config.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-128-92.us-east-2.compute.internal    Ready    worker   103m   v1.23.3+54654d2
ip-10-0-134-118.us-east-2.compute.internal   Ready    master   116m   v1.23.3+54654d2
ip-10-0-182-199.us-east-2.compute.internal   Ready    master   116m   v1.23.3+54654d2
ip-10-0-185-140.us-east-2.compute.internal   Ready    worker   110m   v1.23.3+54654d2
ip-10-0-212-38.us-east-2.compute.internal    Ready    worker   109m   v1.23.3+54654d2
ip-10-0-213-209.us-east-2.compute.internal   Ready    master   116m   v1.23.3+54654d2
liuhuali@Lius-MacBook-Pro huali-test %

3. Check the PDB is there:

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
aws-cloud-controller-manager   1               N/A               1                     24m
liuhuali@Lius-MacBook-Pro huali-test %

4. Check the labels and selectors are correct:

liuhuali@Lius-MacBook-Pro huali-test % oc get all -n openshift-cloud-controller-manager
NAME                                                READY   STATUS    RESTARTS   AGE
pod/aws-cloud-controller-manager-846dd9f85c-64kp6   1/1     Running   0          25m
pod/aws-cloud-controller-manager-846dd9f85c-m6kdh   1/1     Running   0          25m

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/aws-cloud-controller-manager   2/2     2            2           25m

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/aws-cloud-controller-manager-846dd9f85c   2         2         2       25m

liuhuali@Lius-MacBook-Pro huali-test % oc edit deploy aws-cloud-controller-manager -n openshift-cloud-controller-manager
Edit cancelled, no changes made.
...
  labels:
    infrastructure.openshift.io/cloud-controller-manager: AWS
    k8s-app: aws-cloud-controller-manager
...
  selector:
    matchLabels:
      infrastructure.openshift.io/cloud-controller-manager: AWS
      k8s-app: aws-cloud-controller-manager
...
Checked on GCP, Azure, IBM Cloud, vSphere, and OpenStack: the PDB is there, and the labels and selectors are correct. Moving this to Verified.

clusterversion: 4.11.0-0.nightly-2022-04-16-163450

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
gcp-cloud-controller-manager   1               N/A               1                     70m
liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
azure-cloud-controller-manager   1               N/A               1                     67m
liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
ibmcloud-cloud-controller-manager   1               N/A               1                     111m
liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
vsphere-cloud-controller-manager   1               N/A               1                     82m
liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
openstack-cloud-controller-manager   1               N/A               1                     33m
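The "labels and selectors are correct" check above boils down to verifying that every key/value pair in the PDB's spec.selector.matchLabels is present on the controller-manager pods. A minimal sketch of that matching rule (plain matchLabels only, ignoring matchExpressions), using the label values taken from the AWS output in this bug:

```python
def selector_matches(match_labels: dict, pod_labels: dict) -> bool:
    """A matchLabels selector matches a pod iff every key/value pair
    in the selector is also present in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())

# Selector copied from the aws-cloud-controller-manager deployment above.
pdb_selector = {
    "infrastructure.openshift.io/cloud-controller-manager": "AWS",
    "k8s-app": "aws-cloud-controller-manager",
}

# Pod labels: extra labels on the pod (e.g. pod-template-hash) are fine.
pod_labels = {
    "infrastructure.openshift.io/cloud-controller-manager": "AWS",
    "k8s-app": "aws-cloud-controller-manager",
    "pod-template-hash": "846dd9f85c",
}

print(selector_matches(pdb_selector, pod_labels))  # True
```

If the selector named a label the pods do not carry, the PDB would match zero pods and provide no eviction protection, which is why each provider's output was checked individually.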
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069