Bug 1978340 - packageserver isn't following the OpenShift HA conventions
Summary: packageserver isn't following the OpenShift HA conventions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: tflannag
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-01 15:31 UTC by Damien Grisonnet
Modified: 2021-10-26 06:05 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:37:30 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift/operator-framework-olm pull 137 (open): "Bug 1978340: Ensure the PackageServer CSV contains a hard pod anti-affinity configuration", last updated 2021-07-27 22:14:42 UTC
- Red Hat Product Errata RHSA-2021:3759, last updated 2021-10-18 17:37:51 UTC

Description Damien Grisonnet 2021-07-01 15:31:30 UTC
Description of problem:

As an aggregated API server, packageserver should follow the OpenShift high-availability conventions [1] so that it does not disrupt the Kubernetes API server. Moreover, the AggregatedAPIDown alert is very sensitive to disruption, and packageserver may be the source of CI failures during upgrades because of it [2].

As part of an effort to ease the transition from non-HA to HA, the monitoring team loosened the alert in 4.8 [3], but reverted the change in 4.9, expecting all aggregated APIs to tolerate disruptions. Thus, it would be best to backport this fix to 4.8 as well.

In an HA topology, packageserver is expected to have the following (see the sketch after the references below):
- hard pod anti-affinity on hostname
- a rolling update strategy with maxUnavailable of 1
- a pod disruption budget

[1] https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability
[2] https://search.ci.openshift.org/?search=alert+AggregatedAPIDown+fired+for.%2Bv1.packages.operators.coreos.com.%2B&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1970624
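
For illustration, a minimal sketch of the deployment-side settings the conventions ask for. The app: packageserver label and the namespace match the objects shown in the comments below; the container spec is elided, and the exact manifest OLM ships may differ:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: packageserver
  namespace: openshift-operator-lifecycle-manager
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # roll one replica at a time so one server keeps serving
  selector:
    matchLabels:
      app: packageserver
  template:
    metadata:
      labels:
        app: packageserver
    spec:
      affinity:
        podAntiAffinity:
          # hard anti-affinity: the scheduler must place the replicas
          # on nodes with different hostnames
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: packageserver
            topologyKey: kubernetes.io/hostname
      # containers elided
      ...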


Comment 2 Damien Grisonnet 2021-07-12 12:38:03 UTC
The PDB resource is documented after the High Availability section as part of the Upgrade and Reconfiguration section [1].

[1] https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#upgrade-and-reconfiguration
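
For illustration, a minimal PDB of the kind the conventions describe; this matches the object that ends up shipping, shown in Comment 5 below:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: packageserver-pdb
  namespace: openshift-operator-lifecycle-manager
spec:
  # allow voluntary eviction of at most one packageserver pod at a time
  maxUnavailable: 1
  selector:
    matchLabels:
      app: packageserver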

Comment 5 Jian Zhang 2021-08-31 06:30:25 UTC
1, Create an HA cluster.
mac:openshift-tests-private jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-30-232019   True        False         137m    Cluster version is 4.9.0-0.nightly-2021-08-30-232019
mac:openshift-tests-private jianzhang$ oc exec catalog-operator-5556959747-b58n4 -- olm --version
OLM version: 0.18.3
git commit: 01e1cf8ca9b4ec532d4b134b11e09bed8efc5b60


mac:openshift-tests-private jianzhang$ oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-129-64.us-east-2.compute.internal    Ready    master   166m   v1.22.0-rc.0+249ab87
ip-10-0-136-66.us-east-2.compute.internal    Ready    worker   162m   v1.22.0-rc.0+249ab87
ip-10-0-169-191.us-east-2.compute.internal   Ready    worker   162m   v1.22.0-rc.0+249ab87
ip-10-0-178-145.us-east-2.compute.internal   Ready    master   167m   v1.22.0-rc.0+249ab87
ip-10-0-195-135.us-east-2.compute.internal   Ready    worker   158m   v1.22.0-rc.0+249ab87
ip-10-0-206-170.us-east-2.compute.internal   Ready    master   168m   v1.22.0-rc.0+249ab87

2, Check the pod anti-affinity configuration and verify that the two pods run on different nodes.
mac:openshift-tests-private jianzhang$ oc get deployment packageserver  -o yaml
...
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: packageserver
            topologyKey: kubernetes.io/hostname


mac:openshift-tests-private jianzhang$ oc get pods -l app=packageserver -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
packageserver-bdbb545d6-55mhs   1/1     Running   0          170m   10.129.0.5    ip-10-0-178-145.us-east-2.compute.internal   <none>           <none>
packageserver-bdbb545d6-72ppk   1/1     Running   0          170m   10.128.0.41   ip-10-0-206-170.us-east-2.compute.internal   <none>           <none>

3, Recreate one of the packageserver pods.
mac:openshift-tests-private jianzhang$ oc delete pods packageserver-bdbb545d6-55mhs 
pod "packageserver-bdbb545d6-55mhs" deleted
mac:openshift-tests-private jianzhang$ oc get pods -l app=packageserver -o wide
NAME                            READY   STATUS              RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
packageserver-bdbb545d6-72ppk   1/1     Running             0          171m   10.128.0.41   ip-10-0-206-170.us-east-2.compute.internal   <none>           <none>
packageserver-bdbb545d6-hqh4m   0/1     ContainerCreating   0          7s     <none>        ip-10-0-129-64.us-east-2.compute.internal    <none>           <none>
mac:openshift-tests-private jianzhang$ oc get pods -l app=packageserver -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
packageserver-bdbb545d6-72ppk   1/1     Running   0          171m   10.128.0.41   ip-10-0-206-170.us-east-2.compute.internal   <none>           <none>
packageserver-bdbb545d6-hqh4m   1/1     Running   0          20s    10.130.0.39   ip-10-0-129-64.us-east-2.compute.internal    <none>           <none>

LGTM, the packageserver pods are never running on the same node.

4, Create a non-HA (single-node) cluster and check the pods and the PDB.

[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                        STATUS   ROLES           AGE    VERSION
ip-10-0-159-81.us-east-2.compute.internal   Ready    master,worker   133m   v1.22.0-rc.0+249ab87

[cloud-user@preserve-olm-env jian]$ oc get deployment packageserver -o yaml
apiVersion: apps/v1
...
    spec:
      affinity: {}


I guess we don't need the PDB in SNO since SNO doesn't support HA. But it has no negative impact on SNO since maxUnavailable=1 still lets the single pod be evicted (see the PDB status below, and the arithmetic after it).
I will verify it; please let me know if there is any problem, thanks!

[cloud-user@preserve-olm-env jian]$ oc get pdb
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
packageserver-pdb   N/A             1                 1                     134m
[cloud-user@preserve-olm-env jian]$ oc get pdb packageserver-pdb -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-08-31T03:47:52Z"
  generation: 1
  name: packageserver-pdb
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 824bda17-3f7d-4665-81a4-4f825b612de8
  resourceVersion: "6737"
  uid: 7ae3700d-3b0b-4885-a3a7-ca0655bd0fb9
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: packageserver
status:
  conditions:
  - lastTransitionTime: "2021-08-31T03:50:24Z"
    message: ""
    observedGeneration: 1
    reason: SufficientPods
    status: "True"
    type: DisruptionAllowed
  currentHealthy: 1
  desiredHealthy: 0
  disruptionsAllowed: 1
  expectedPods: 1
  observedGeneration: 1
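
How the status above follows from the spec (the disruption controller's arithmetic for a maxUnavailable budget):

  # desiredHealthy     = expectedPods   - maxUnavailable = 1 - 1 = 0
  # disruptionsAllowed = currentHealthy - desiredHealthy = 1 - 0 = 1

So even with a single pod, the budget still permits one voluntary eviction.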

Comment 8 errata-xmlrpc 2021-10-18 17:37:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

