Description of problem:
It is possible for all the pods of a Service to fail their readiness probe. In that case the Endpoints object related to that Service will only have subsets listing `notReadyAddresses`, and the `addresses` key will be missing. kuryr-controller enters a crashloop when it encounters such an object.

Version-Release number of selected component (if applicable):
3.11

How reproducible:
Always.

Steps to Reproduce:
1. Create pods with a dummy readinessProbe that will always fail.
2. Create a Service exposing those pods.
3. Double-check that the Endpoints object related to that Service only contains subsets with `notReadyAddresses` and none with `addresses`.

Actual results:
kuryr-controller enters a crashloop.

Expected results:
kuryr-controller handles such Endpoints objects gracefully and doesn't crash.

Additional info:
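For illustration only, here is a minimal Python sketch of the kind of defensive handling the expected result implies. It is not the actual kuryr-controller code or the shipped fix; the function name `ready_addresses` is hypothetical, and the only assumption is that the Endpoints object is available as a plain dict.

```python
def ready_addresses(endpoints):
    """Return the IPs of all ready addresses in an Endpoints dict.

    Subsets that only carry `notReadyAddresses` have no `addresses`
    key, so .get() with a fallback is used instead of direct indexing.
    """
    ips = []
    for subset in endpoints.get('subsets') or []:
        for address in subset.get('addresses') or []:
            ips.append(address['ip'])
    return ips


# Hypothetical, trimmed Endpoints object with not-ready pods only.
not_ready_only = {
    'subsets': [{
        'notReadyAddresses': [{'ip': '10.11.1.147'}, {'ip': '10.11.1.233'}],
        'ports': [{'port': 8080, 'protocol': 'TCP'}],
    }],
}
print(ready_addresses(not_ready_only))  # prints []
```

Indexing `subset['addresses']` directly on the same input would raise a KeyError, which is the class of failure described above.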
Verified with: v3.11.394

Created the following demo deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: demo
      labels:
        app: demo
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: demo
      template:
        metadata:
          labels:
            app: demo
        spec:
          containers:
          - name: demo
            image: kuryr/demo
            ports:
            - containerPort: 8080
            readinessProbe:
              httpGet:
                path: /healthz
                port: 8089
              initialDelaySeconds: 15
              timeoutSeconds: 1

And created a service:

    apiVersion: v1
    kind: Service
    metadata:
      name: demo
      labels:
        app: demo
    spec:
      selector:
        app: demo
      ports:
      - port: 80
        protocol: TCP
        targetPort: 8080

Pods:

    $ oc get pods -l app=demo
    NAME                   READY     STATUS    RESTARTS   AGE
    demo-7c768cff5-2bjd6   0/1       Running   0          3h
    demo-7c768cff5-8cjbp   0/1       Running   0          3h
    demo-7c768cff5-8rzn9   0/1       Running   0          3h

Endpoints:

    $ oc -o yaml get endpoints demo
    apiVersion: v1
    kind: Endpoints
    metadata:
      annotations:
        openstack.org/kuryr-lbaas-spec: '{"versioned_object.data": {"ip": "172.30.213.127", "lb_ip": null, "ports": [{"versioned_object.data": {"name": null, "port": 80, "protocol": "TCP", "targetPort": "8080"}, "versioned_object.name": "LBaaSPortSpec", "versioned_object.namespace": "kuryr_kubernetes", "versioned_object.version": "1.1"}], "project_id": "c40b96037b09463bac299e29aacb674d", "security_groups_ids": ["e91e809a-a29a-4fb9-9ca5-b8e6688d23ad", "cfe78c8f-48ab-4c4c-ac9e-f51c20060d2b"], "subnet_id": "3aafbd77-5315-478f-a064-bbe307d408e4", "type": "ClusterIP"}, "versioned_object.name": "LBaaSServiceSpec", "versioned_object.namespace": "kuryr_kubernetes", "versioned_object.version": "1.0"}'
      creationTimestamp: "2021-03-02T04:49:35Z"
      name: demo
      namespace: default
      resourceVersion: "57086"
      selfLink: /api/v1/namespaces/default/endpoints/demo
      uid: af9b2bb8-7b12-11eb-9244-fa163ef06fb8
    subsets:
    - notReadyAddresses:
      - ip: 10.11.1.147
        nodeName: master-2.openshift.example.com
        targetRef:
          kind: Pod
          name: demo-7c768cff5-8cjbp
          namespace: default
          resourceVersion: "57059"
          uid: a6f731a9-7b12-11eb-9244-fa163ef06fb8
      - ip: 10.11.1.233
        nodeName: app-node-0.openshift.example.com
        targetRef:
          kind: Pod
          name: demo-7c768cff5-8rzn9
          namespace: default
          resourceVersion: "57058"
          uid: a6fab10d-7b12-11eb-9244-fa163ef06fb8
      - ip: 10.11.1.45
        nodeName: master-0.openshift.example.com
        targetRef:
          kind: Pod
          name: demo-7c768cff5-2bjd6
          namespace: default
          resourceVersion: "57060"
          uid: a6fc0f98-7b12-11eb-9244-fa163ef06fb8
      ports:
      - port: 8080
        protocol: TCP

Kuryr pods:

    $ oc get pods -n kuryr
    NAME                                READY     STATUS    RESTARTS   AGE
    kuryr-cni-ds-2r9b2                  2/2       Running   0          9h
    kuryr-cni-ds-7245d                  2/2       Running   0          9h
    kuryr-cni-ds-cqz5s                  2/2       Running   0          9h
    kuryr-cni-ds-lfc2c                  2/2       Running   0          9h
    kuryr-cni-ds-lzbkj                  2/2       Running   0          9h
    kuryr-cni-ds-tl96l                  2/2       Running   0          9h
    kuryr-cni-ds-xrjfr                  2/2       Running   0          9h
    kuryr-cni-ds-zxprh                  2/2       Running   0          9h
    kuryr-controller-6bf6f8958f-j6pch   1/1       Running   0          9h
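
The precondition for this verification (every subset lists only `notReadyAddresses`) can also be checked programmatically. The following is a rough Python sketch, assuming the Endpoints object has been dumped as JSON (for example with `oc -o json get endpoints demo`); the helper name `only_not_ready` and the trimmed sample document are illustrative, not part of any tool.

```python
import json


def only_not_ready(endpoints):
    """True if there is at least one subset and no subset has `addresses`."""
    subsets = endpoints.get('subsets') or []
    return bool(subsets) and all(
        'addresses' not in subset and subset.get('notReadyAddresses')
        for subset in subsets
    )


# Trimmed sample in the shape of the Endpoints output above.
sample = json.loads('''
{
  "kind": "Endpoints",
  "subsets": [
    {
      "notReadyAddresses": [
        {"ip": "10.11.1.147"},
        {"ip": "10.11.1.233"},
        {"ip": "10.11.1.45"}
      ],
      "ports": [{"port": 8080, "protocol": "TCP"}]
    }
  ]
}
''')
print(only_not_ready(sample))  # prints True
```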
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 3.11.394 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0637