Description of problem:

As an aggregated API server, packageserver should follow the OpenShift high-availability conventions [1] so that it does not disrupt the apiserver. Moreover, the AggregatedAPIDown alert is very sensitive to disruption, and packageserver may be the origin of CI failures during upgrades because of it [2]. As part of an effort to ease the transition from non-HA to HA, the monitoring team loosened the alert in 4.8 [3], but reverted the change in 4.9, expecting all aggregated APIs to tolerate disruptions. Thus, it would be best to also backport this to 4.8.

In an HA topology, packageserver is expected to have:
- hard pod anti-affinity on hostname
- a rolling update strategy with maxUnavailable of 1
- a pod disruption budget
(See the manifest sketch after this report for what these settings look like.)

[1] https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability
[2] https://search.ci.openshift.org/?search=alert+AggregatedAPIDown+fired+for.%2Bv1.packages.operators.coreos.com.%2B&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1970624

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
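For reference, a minimal sketch of the three expected settings on the packageserver Deployment and its PDB. The affinity, strategy, and PDB fields match the verification output later in this bug; everything else (replica count, container spec) is illustrative only:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: packageserver
  namespace: openshift-operator-lifecycle-manager
spec:
  replicas: 2                  # illustrative
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # at most one replica down during a rollout
  selector:
    matchLabels:
      app: packageserver
  template:
    metadata:
      labels:
        app: packageserver
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard anti-affinity
          - labelSelector:
              matchLabels:
                app: packageserver
            topologyKey: kubernetes.io/hostname             # spread across nodes
      containers:
      - name: packageserver
        image: "<packageserver image>"   # elided
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: packageserver-pdb
  namespace: openshift-operator-lifecycle-manager
spec:
  maxUnavailable: 1    # voluntary evictions may take down at most one pod
  selector:
    matchLabels:
      app: packageserver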
Note: the PDB resource is documented not in the High Availability section itself, but in the Upgrade and Reconfiguration section that follows it [1].

[1] https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#upgrade-and-reconfiguration
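For what it's worth, a quick way to watch whether the PDB currently permits an eviction during an upgrade or drain (the resource name and namespace match the verification output below; disruptionsAllowed is a standard policy/v1 PDB status field):

oc get pdb packageserver-pdb -n openshift-operator-lifecycle-manager \
  -o jsonpath='{.status.disruptionsAllowed}{"\n"}'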
1. Create an HA cluster.

mac:openshift-tests-private jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-30-232019   True        False         137m    Cluster version is 4.9.0-0.nightly-2021-08-30-232019

mac:openshift-tests-private jianzhang$ oc exec catalog-operator-5556959747-b58n4 -- olm --version
OLM version: 0.18.3
git commit: 01e1cf8ca9b4ec532d4b134b11e09bed8efc5b60

mac:openshift-tests-private jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-129-64.us-east-2.compute.internal    Ready    master   166m   v1.22.0-rc.0+249ab87
ip-10-0-136-66.us-east-2.compute.internal    Ready    worker   162m   v1.22.0-rc.0+249ab87
ip-10-0-169-191.us-east-2.compute.internal   Ready    worker   162m   v1.22.0-rc.0+249ab87
ip-10-0-178-145.us-east-2.compute.internal   Ready    master   167m   v1.22.0-rc.0+249ab87
ip-10-0-195-135.us-east-2.compute.internal   Ready    worker   158m   v1.22.0-rc.0+249ab87
ip-10-0-206-170.us-east-2.compute.internal   Ready    master   168m   v1.22.0-rc.0+249ab87

2. Check the pod anti-affinity configuration and confirm that the two pods run on different nodes.

mac:openshift-tests-private jianzhang$ oc get deployment packageserver -o yaml
...
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: packageserver
        topologyKey: kubernetes.io/hostname

mac:openshift-tests-private jianzhang$ oc get pods -l app=packageserver -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
packageserver-bdbb545d6-55mhs   1/1     Running   0          170m   10.129.0.5    ip-10-0-178-145.us-east-2.compute.internal   <none>           <none>
packageserver-bdbb545d6-72ppk   1/1     Running   0          170m   10.128.0.41   ip-10-0-206-170.us-east-2.compute.internal   <none>           <none>

3. Recreate one packageserver pod.

mac:openshift-tests-private jianzhang$ oc delete pods packageserver-bdbb545d6-55mhs
pod "packageserver-bdbb545d6-55mhs" deleted

mac:openshift-tests-private jianzhang$ oc get pods -l app=packageserver -o wide
NAME                            READY   STATUS              RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
packageserver-bdbb545d6-72ppk   1/1     Running             0          171m   10.128.0.41   ip-10-0-206-170.us-east-2.compute.internal   <none>           <none>
packageserver-bdbb545d6-hqh4m   0/1     ContainerCreating   0          7s     <none>        ip-10-0-129-64.us-east-2.compute.internal    <none>           <none>

mac:openshift-tests-private jianzhang$ oc get pods -l app=packageserver -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
packageserver-bdbb545d6-72ppk   1/1     Running   0          171m   10.128.0.41   ip-10-0-206-170.us-east-2.compute.internal   <none>           <none>
packageserver-bdbb545d6-hqh4m   1/1     Running   0          20s    10.130.0.39   ip-10-0-129-64.us-east-2.compute.internal    <none>           <none>

LGTM, the packageserver pods never run on the same node.

4. Create a non-HA (SNO) cluster, then check the pods and the PDB.

[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                        STATUS   ROLES           AGE    VERSION
ip-10-0-159-81.us-east-2.compute.internal   Ready    master,worker   133m   v1.22.0-rc.0+249ab87

[cloud-user@preserve-olm-env jian]$ oc get deployment packageserver -o yaml
apiVersion: apps/v1
...
spec:
  affinity: {}

I guess we don't need the PDB in SNO since SNO doesn't support HA, but it has no negative impact on SNO either, since maxUnavailable=1. I will verify it; please let me know if there are any problems, thanks!
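The convention also calls for a rolling update strategy with maxUnavailable of 1, which the session above does not capture. A hypothetical spot-check on the same cluster (jsonpath fields are standard Deployment fields; the namespace is the one shown in the PDB output below):

oc get deployment packageserver -n openshift-operator-lifecycle-manager \
  -o jsonpath='{.spec.strategy.type}{" maxUnavailable="}{.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}'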
[cloud-user@preserve-olm-env jian]$ oc get pdb
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
packageserver-pdb   N/A             1                 1                     134m

[cloud-user@preserve-olm-env jian]$ oc get pdb packageserver-pdb -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-08-31T03:47:52Z"
  generation: 1
  name: packageserver-pdb
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 824bda17-3f7d-4665-81a4-4f825b612de8
  resourceVersion: "6737"
  uid: 7ae3700d-3b0b-4885-a3a7-ca0655bd0fb9
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: packageserver
status:
  conditions:
  - lastTransitionTime: "2021-08-31T03:50:24Z"
    message: ""
    observedGeneration: 1
    reason: SufficientPods
    status: "True"
    type: DisruptionAllowed
  currentHealthy: 1
  desiredHealthy: 0
  disruptionsAllowed: 1
  expectedPods: 1
  observedGeneration: 1
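Reading the status above: for a maxUnavailable-based PDB, the disruption controller (as I understand it) computes

  desiredHealthy     = expectedPods - maxUnavailable   = 1 - 1 = 0
  disruptionsAllowed = currentHealthy - desiredHealthy = 1 - 0 = 1

which matches the fields shown (expectedPods: 1, desiredHealthy: 0, currentHealthy: 1, disruptionsAllowed: 1). So even with a single replica, one voluntary eviction is still allowed (DisruptionAllowed=True, reason SufficientPods), which is why the PDB never blocks a drain on SNO.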
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759