Bug 2051457 - [RFE] PDB for cloud-controller-manager to avoid going too many replicas down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: dmoiseev
QA Contact: Huali Liu
URL:
Whiteboard:
Depends On:
Blocks: 2099499
Reported: 2022-02-07 10:01 UTC by Jan Chaloupka
Modified: 2022-08-10 10:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:47:50 UTC
Target Upstream Version:
Embargoed:



Links
System                   ID                                                             Private   Priority   Status   Summary                                 Last Updated
Github                   openshift cluster-cloud-controller-manager-operator pull 174   0         None       open     Bug 2051457: CCM PodDisruptionBudgets   2022-02-23 16:20:33 UTC
Red Hat Product Errata   RHSA-2022:5069                                                 0         None       None     None                                    2022-08-10 10:48:15 UTC

Description Jan Chaloupka 2022-02-07 10:01:06 UTC
Description of problem:

During a master node upgrade, when nodes are being drained, there is currently no protection against two or more operands going down at the same time. If your component is required to be available during upgrades or other voluntary disruptions, please consider deploying a PDB to protect your operands. The effort is tracked in https://issues.redhat.com/browse/WRKLDS-293.

Example:
- https://github.com/openshift/cluster-authentication-operator/pull/476/files + https://github.com/openshift/cluster-authentication-operator/pull/514/files

HowTo:
1. Create a PDB manifest.
2. Use conditional resources to deploy the PDB only when the infrastructure topology is not SNO (single-node OpenShift).
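
The two HowTo steps above can be sketched in shell as follows. This is only an illustration, not the operator's implementation (see the linked PR for that). The function name `render_ccm_pdb` is hypothetical; the resource name, the `k8s-app` label and `minAvailable: 1` are borrowed from the AWS verification output later in this bug; and the topology value is assumed to come from `.status.controlPlaneTopology` on the cluster Infrastructure resource, where `SingleReplica` denotes SNO.

```shell
#!/bin/sh
# Sketch: print a PodDisruptionBudget manifest unless the cluster is SNO.
# Assumption: name, namespace and labels mirror the AWS output in this bug;
# minAvailable=1 matches the MIN AVAILABLE column reported by `oc get pdb`.
render_ccm_pdb() {
  topology="$1"   # value of .status.controlPlaneTopology (hypothetical input)
  if [ "$topology" = "SingleReplica" ]; then
    # With a single replica, minAvailable=1 would leave zero allowed
    # disruptions and block node drains, so no PDB is emitted.
    return 0
  fi
  cat <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: aws-cloud-controller-manager
  namespace: openshift-cloud-controller-manager
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: aws-cloud-controller-manager
EOF
}
```

On a live cluster the topology can be read with `oc get infrastructure cluster -o jsonpath='{.status.controlPlaneTopology}'` and passed to the function to inspect what would be deployed.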

Comment 1 Jan Chaloupka 2022-02-07 10:07:57 UTC
If necessary, please consider backporting the PDB into 4.10 as well.

Comment 2 Joel Speed 2022-02-07 17:42:25 UTC
This is definitely something we should add for the CCCMO, and most likely backport to 4.10.

Comment 3 Joel Speed 2022-03-09 12:47:56 UTC
Some feedback has been left on the PR that fixes this; the PR will need to be updated before we can merge.

Comment 5 Huali Liu 2022-04-18 11:43:12 UTC
Checked on Alibaba and AWS: the PDB is there, and the labels and selectors are correct. Will check more providers tomorrow.

on alibaba:
1. Install a fresh cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-16-163450   True        False         5h58m   Cluster version is 4.11.0-0.nightly-2022-04-16-163450
liuhuali@Lius-MacBook-Pro huali-test % 

2. Check pdb is there
liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                                    MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
alibabacloud-cloud-controller-manager   1               N/A               1                     6h18m
liuhuali@Lius-MacBook-Pro huali-test % 

3. Check labels and selectors are correct
liuhuali@Lius-MacBook-Pro huali-test % oc get all -n openshift-cloud-controller-manager
NAME                                                    READY   STATUS    RESTARTS   AGE
pod/alibaba-cloud-controller-manager-69bd7cbd9c-d4qrp   1/1     Running   0          6h21m
pod/alibaba-cloud-controller-manager-69bd7cbd9c-lkstz   1/1     Running   0          6h21m

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/alibaba-cloud-controller-manager   2/2     2            2           6h21m

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/alibaba-cloud-controller-manager-69bd7cbd9c   2         2         2       6h21m
liuhuali@Lius-MacBook-Pro huali-test % oc edit deploy alibaba-cloud-controller-manager -n openshift-cloud-controller-manager
Edit cancelled, no changes made.
...
  labels:
    infrastructure.openshift.io/cloud-controller-manager: AlibabaCloud
    k8s-app: alibaba-cloud-controller-manager
...
  selector:
    matchLabels:
      infrastructure.openshift.io/cloud-controller-manager: AlibabaCloud
      k8s-app: alibaba-cloud-controller-manager
...
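
The label and selector check above opens the Deployment in an editor. A non-interactive variant might look like the sketch below. It reads the PDB's `status.expectedPods` and the Deployment's `spec.replicas`; the helper name `pdb_covers_deployment` and the `NS` variable are made up for illustration, and equal counts are only a rough signal that the PDB selector matches the Deployment's pods, not proof.

```shell
#!/bin/sh
# Sketch: compare the number of pods a PDB expects with a Deployment's replicas.
NS=openshift-cloud-controller-manager

pdb_covers_deployment() {
  pdb="$1"; deploy="$2"
  # expectedPods: number of pods the disruption controller counts for this PDB
  expected="$(oc get pdb "$pdb" -n "$NS" -o jsonpath='{.status.expectedPods}')"
  # replicas: desired pod count of the Deployment
  replicas="$(oc get deployment "$deploy" -n "$NS" -o jsonpath='{.spec.replicas}')"
  echo "pdb expectedPods=$expected deployment replicas=$replicas"
  # Succeeds only when both values were read and agree
  [ -n "$expected" ] && [ "$expected" = "$replicas" ]
}
```

With the names from the output above, the call would be `pdb_covers_deployment alibabacloud-cloud-controller-manager alibaba-cloud-controller-manager`.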

on aws:
1. Install a fresh cluster
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-16-163450   True        False         92m     Cluster version is 4.11.0-0.nightly-2022-04-16-163450
liuhuali@Lius-MacBook-Pro huali-test % 

2. Edit the featuregate, then wait for the nodes to restart until all nodes report Ready status
change 
spec: {}
to
spec:
  featureSet: TechPreviewNoUpgrade

liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate cluster
featuregate.config.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-128-92.us-east-2.compute.internal    Ready    worker   103m   v1.23.3+54654d2
ip-10-0-134-118.us-east-2.compute.internal   Ready    master   116m   v1.23.3+54654d2
ip-10-0-182-199.us-east-2.compute.internal   Ready    master   116m   v1.23.3+54654d2
ip-10-0-185-140.us-east-2.compute.internal   Ready    worker   110m   v1.23.3+54654d2
ip-10-0-212-38.us-east-2.compute.internal    Ready    worker   109m   v1.23.3+54654d2
ip-10-0-213-209.us-east-2.compute.internal   Ready    master   116m   v1.23.3+54654d2
liuhuali@Lius-MacBook-Pro huali-test % 
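
The featuregate edit in this step could also be done without an editor. The sketch below is an assumption about how one might script it, not what was run here; the function names are hypothetical. Note that `oc wait` may return immediately if it is called before the nodes have started restarting, so the node list is still worth checking by hand.

```shell
#!/bin/sh
# Sketch: set the feature set with a merge patch instead of `oc edit`.
enable_tech_preview() {
  # Caution: TechPreviewNoUpgrade cannot be undone and prevents upgrades.
  oc patch featuregate cluster --type merge \
    -p '{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'
}

# Sketch: block until every node reports the Ready condition.
wait_for_nodes_ready() {
  # First argument is an optional timeout; 30m is an arbitrary default.
  oc wait --for=condition=Ready nodes --all --timeout="${1:-30m}"
}
```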

3. Check pdb is there
liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
aws-cloud-controller-manager   1               N/A               1                     24m
liuhuali@Lius-MacBook-Pro huali-test % 

4. Check labels and selectors are correct
liuhuali@Lius-MacBook-Pro huali-test % oc get all -n openshift-cloud-controller-manager
NAME                                                READY   STATUS    RESTARTS   AGE
pod/aws-cloud-controller-manager-846dd9f85c-64kp6   1/1     Running   0          25m
pod/aws-cloud-controller-manager-846dd9f85c-m6kdh   1/1     Running   0          25m

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/aws-cloud-controller-manager   2/2     2            2           25m

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/aws-cloud-controller-manager-846dd9f85c   2         2         2       25m
liuhuali@Lius-MacBook-Pro huali-test % oc edit deploy aws-cloud-controller-manager -n openshift-cloud-controller-manager
Edit cancelled, no changes made. 
...
  labels:
    infrastructure.openshift.io/cloud-controller-manager: AWS
    k8s-app: aws-cloud-controller-manager
...
  selector:
    matchLabels:
      infrastructure.openshift.io/cloud-controller-manager: AWS
      k8s-app: aws-cloud-controller-manager
...

Comment 7 Huali Liu 2022-04-19 04:05:08 UTC
Checked on GCP, Azure, IBM, vSphere and OpenStack: the PDB is there, and the labels and selectors are correct. Moving this to Verified.
clusterversion: 4.11.0-0.nightly-2022-04-16-163450

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
gcp-cloud-controller-manager   1               N/A               1                     70m

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
azure-cloud-controller-manager   1               N/A               1                     67m

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
ibmcloud-cloud-controller-manager   1               N/A               1                     111m

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                               MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
vsphere-cloud-controller-manager   1               N/A               1                     82m

liuhuali@Lius-MacBook-Pro huali-test % oc get pdb -n openshift-cloud-controller-manager
NAME                                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
openstack-cloud-controller-manager   1               N/A               1                     33m
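
For reference, the ALLOWED DISRUPTIONS value of 1 in the listings above is consistent with simple arithmetic: for an integer `minAvailable`, allowed disruptions are the healthy pods minus `minAvailable`, never below zero. The sketch below just restates that arithmetic; the function name is hypothetical and it assumes both replicas are healthy, as the `oc get all` output for Alibaba and AWS shows.

```shell
#!/bin/sh
# Sketch: allowed disruptions for a PDB with an integer minAvailable.
allowed_disruptions() {
  healthy="$1"; min_available="$2"
  n=$((healthy - min_available))
  # The budget never goes negative
  if [ "$n" -lt 0 ]; then n=0; fi
  echo "$n"
}

# Two healthy replicas with minAvailable=1, as in this bug
allowed_disruptions 2 1
```

With a single replica the same arithmetic gives 0, which is why the PDB is only wanted on topologies other than SNO.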

Comment 10 errata-xmlrpc 2022-08-10 10:47:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

