Bug 1929216 - KeyError: 'addresses' in kuryr-controller when Endpoints' slice only lists notReadyAddresses
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Michał Dulko
QA Contact: Itzik Brown
URL:
Whiteboard:
Depends On:
Blocks: 1980957
Reported: 2021-02-16 13:55 UTC by Michał Dulko
Modified: 2021-07-10 06:32 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1980957 (view as bug list)
Environment:
Last Closed: 2021-03-03 12:28:00 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 458 0 None closed Bug 1929216: Handle Endpoints missing `addresses` field 2021-07-21 22:06:09 UTC
Red Hat Product Errata RHSA-2021:0637 0 None None None 2021-03-03 12:28:12 UTC

Description Michał Dulko 2021-02-16 13:55:59 UTC
Description of problem:
It is possible for all the pods of a Service to fail their readiness probes. In that case, the Endpoints object related to that Service will only have slices listing `notReadyAddresses`, and the `addresses` key will be missing. kuryr-controller then hits KeyError: 'addresses' and enters a crashloop.
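
The failure mode can be illustrated with a minimal Python sketch. This is not the kuryr-kubernetes code or the actual fix; the function name `ready_addresses` and the plain-dict representation of the Endpoints object are assumptions made for illustration. It shows why indexing `subset['addresses']` fails when a slice only carries `notReadyAddresses`, and how `dict.get` avoids the KeyError.

```python
from typing import Any, Dict, List


def ready_addresses(endpoints: Dict[str, Any]) -> List[str]:
    """Collect ready IPs from an Endpoints dict, tolerating missing keys.

    Hypothetical helper for illustration only; not taken from kuryr-kubernetes.
    """
    ips = []
    for subset in endpoints.get("subsets") or []:
        # subset["addresses"] would raise KeyError when the slice only
        # contains notReadyAddresses, so fall back to an empty list instead.
        for address in subset.get("addresses") or []:
            ips.append(address["ip"])
    return ips


# An Endpoints object whose only slice lists notReadyAddresses.
not_ready_only = {
    "subsets": [
        {
            "notReadyAddresses": [{"ip": "10.11.1.147"}],
            "ports": [{"port": 8080, "protocol": "TCP"}],
        }
    ]
}

print(ready_addresses(not_ready_only))
```

With this kind of defensive access, an Endpoints object without any ready pods is treated as having no members rather than raising an exception.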

Version-Release number of selected component (if applicable):
3.11

How reproducible:
Always.

Steps to Reproduce:
1. Create pods with a dummy readinessProbe that always fails.
2. Create a Service exposing those pods.
3. Double-check that the Endpoints object related to that Service only contains slices with `notReadyAddresses` and none with `addresses`.
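
Step 3 can also be checked programmatically. The following is a minimal sketch, assuming the Endpoints object has been exported as JSON (for example with `oc get endpoints <name> -o json`) and loaded into a dict; the helper name `only_not_ready` and the trimmed sample object are hypothetical.

```python
import json


def only_not_ready(endpoints):
    """Return True if every slice lists notReadyAddresses and no ready addresses."""
    subsets = endpoints.get("subsets") or []
    return bool(subsets) and all(
        bool(subset.get("notReadyAddresses")) and not subset.get("addresses")
        for subset in subsets
    )


# Hypothetical, trimmed Endpoints object used only for illustration.
sample = json.loads("""
{"kind": "Endpoints",
 "subsets": [{"notReadyAddresses": [{"ip": "10.11.1.147"}],
              "ports": [{"port": 8080, "protocol": "TCP"}]}]}
""")

print(only_not_ready(sample))
```

If the helper returns True, the object is in the state that triggers the crash described in this report.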

Actual results:
kuryr-controller will enter a crashloop.

Expected results:
kuryr-controller handles such Endpoints objects gracefully and doesn't crash.

Additional info:

Comment 3 Itzik Brown 2021-03-02 08:15:38 UTC
Verified with: v3.11.394

Created the following demo Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
  labels:
    app: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: kuryr/demo
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8089
          initialDelaySeconds: 15
          timeoutSeconds: 1

And created a service:
apiVersion: v1
kind: Service
metadata:
  name: demo
  labels:
    app: demo
spec:
  selector:
    app: demo
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080


Pods:
$ oc get pods -l app=demo
NAME                   READY     STATUS    RESTARTS   AGE
demo-7c768cff5-2bjd6   0/1       Running   0          3h
demo-7c768cff5-8cjbp   0/1       Running   0          3h
demo-7c768cff5-8rzn9   0/1       Running   0          3h

Endpoints:
$ oc -o yaml get endpoints demo
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    openstack.org/kuryr-lbaas-spec: '{"versioned_object.data": {"ip": "172.30.213.127",
      "lb_ip": null, "ports": [{"versioned_object.data": {"name": null, "port": 80,
      "protocol": "TCP", "targetPort": "8080"}, "versioned_object.name": "LBaaSPortSpec",
      "versioned_object.namespace": "kuryr_kubernetes", "versioned_object.version":
      "1.1"}], "project_id": "c40b96037b09463bac299e29aacb674d", "security_groups_ids":
      ["e91e809a-a29a-4fb9-9ca5-b8e6688d23ad", "cfe78c8f-48ab-4c4c-ac9e-f51c20060d2b"],
      "subnet_id": "3aafbd77-5315-478f-a064-bbe307d408e4", "type": "ClusterIP"}, "versioned_object.name":
      "LBaaSServiceSpec", "versioned_object.namespace": "kuryr_kubernetes", "versioned_object.version":
      "1.0"}'
  creationTimestamp: "2021-03-02T04:49:35Z"
  name: demo
  namespace: default
  resourceVersion: "57086"
  selfLink: /api/v1/namespaces/default/endpoints/demo
  uid: af9b2bb8-7b12-11eb-9244-fa163ef06fb8
subsets:
- notReadyAddresses:
  - ip: 10.11.1.147
    nodeName: master-2.openshift.example.com
    targetRef:
      kind: Pod
      name: demo-7c768cff5-8cjbp
      namespace: default
      resourceVersion: "57059"
      uid: a6f731a9-7b12-11eb-9244-fa163ef06fb8
  - ip: 10.11.1.233
    nodeName: app-node-0.openshift.example.com
    targetRef:
      kind: Pod
      name: demo-7c768cff5-8rzn9
      namespace: default
      resourceVersion: "57058"
      uid: a6fab10d-7b12-11eb-9244-fa163ef06fb8
  - ip: 10.11.1.45
    nodeName: master-0.openshift.example.com
    targetRef:
      kind: Pod
      name: demo-7c768cff5-2bjd6
      namespace: default
      resourceVersion: "57060"
      uid: a6fc0f98-7b12-11eb-9244-fa163ef06fb8
  ports:
  - port: 8080
    protocol: TCP

Kuryr pods:
$ oc get pods -n kuryr
NAME                                READY     STATUS    RESTARTS   AGE
kuryr-cni-ds-2r9b2                  2/2       Running   0          9h
kuryr-cni-ds-7245d                  2/2       Running   0          9h
kuryr-cni-ds-cqz5s                  2/2       Running   0          9h
kuryr-cni-ds-lfc2c                  2/2       Running   0          9h
kuryr-cni-ds-lzbkj                  2/2       Running   0          9h
kuryr-cni-ds-tl96l                  2/2       Running   0          9h
kuryr-cni-ds-xrjfr                  2/2       Running   0          9h
kuryr-cni-ds-zxprh                  2/2       Running   0          9h
kuryr-controller-6bf6f8958f-j6pch   1/1       Running   0          9h

Comment 5 errata-xmlrpc 2021-03-03 12:28:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 3.11.394 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0637

