Bug 2105706 - Race condition with pendingCloudPrivateIPConfigsOps in EgressIP code
Summary: Race condition with pendingCloudPrivateIPConfigsOps in EgressIP code
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.12
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: 4.12.0
Assignee: Andreas Karis
QA Contact: jechen
URL:
Whiteboard:
: 2105712 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-07-09 20:02 UTC by Andreas Karis
Modified: 2023-01-17 19:52 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:51:57 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 1231 0 None open Bug 2109945: [Downstream Merge: 04-08-2022] 2022-08-03 19:04:34 UTC
Github ovn-org ovn-kubernetes pull 3066 0 None open EgressIP: Merge ops into pendingCloudPrivateIPConfigsOps on add 2022-07-10 17:30:39 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:52:28 UTC

Description Andreas Karis 2022-07-09 20:02:37 UTC
Description of problem:

When editing an EgressIP:
> oc edit egressip egressip

~~~
    1 # Please edit the object below. Lines beginning with a '#' will be ignored,
    2 # and an empty file will abort the edit. If an error occurs while saving this file will be
    3 # reopened with the relevant failures.
    4 #
    5 apiVersion: k8s.ovn.org/v1
    6 kind: EgressIP
    7 metadata:
    8   annotations:
    9     kubectl.kubernetes.io/last-applied-configuration: |
   10       {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip"},"spec":{"egressIPs":["10.0.129.5"],"namespaceSelector":{"matchLabels":{"kubernetes.io/metadata.name":"egressip"}}}}
   11   creationTimestamp: "2022-07-09T19:48:26Z"
   12   generation: 8
   13   name: egressip
   14   resourceVersion: "38677"
   15   uid: 84cb1880-39ce-4ffa-bdd6-1adaeb8507ce
   16 spec:
   17   egressIPs:
   18   - 10.0.129.7
   19   namespaceSelector:
   20     matchLabels:
   21       kubernetes.io/metadata.name: egressip
   22 status:
   23   items:
   24   - egressIP: 10.0.129.7
   25     node: ci-ln-birng3t-1d09d-qkr8s-worker-centralus1-28kjf
~~~

To:
~~~
    1 # Please edit the object below. Lines beginning with a '#' will be ignored,
    2 # and an empty file will abort the edit. If an error occurs while saving this file will be
    3 # reopened with the relevant failures.
    4 #
    5 apiVersion: k8s.ovn.org/v1
    6 kind: EgressIP
    7 metadata:
    8   annotations:
    9     kubectl.kubernetes.io/last-applied-configuration: |
   10       {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip"},"spec":{"egressIPs":["10.0.129.5"],"namespaceSelector":{"matchLabels":{"kubernetes.io/metadata.name":"egressip"}}}}
   11   creationTimestamp: "2022-07-09T19:48:26Z"
   12   generation: 8
   13   name: egressip
   14   resourceVersion: "38677"
   15   uid: 84cb1880-39ce-4ffa-bdd6-1adaeb8507ce
   16 spec:
   17   egressIPs:
   18   - 10.0.129.7
   19   namespaceSelector:
   20     matchLabels:
   21       kubernetes.io/metadata.name: egressip
   22 status:
   23   items:
   24   - egressIP: 10.0.129.10
   25     node: ci-ln-birng3t-1d09d-qkr8s-worker-centralus1-28kjf
~~~

The code is subject to races whenever executeCloudPrivateIPConfigChange executed the following code ...
~~~
oc.eIPC.pendingCloudPrivateIPConfigsOps[egressIPName] = ops
~~~

... before  removePendingOpsAndGetResyncs called by reconcileCloudPrivateIPConfig has a chance to process the delete operation.

This happens because the add operation unconditionally overwrites the current entry in oc.eIPC.pendingCloudPrivateIPConfigsOps[egressIPName].

The delete operation thus reports a failure and is requeued and the following log message is the result:
~~~
[akaris@linux 2101992]$ oc logs -c ovnkube-master -n openshift-ovn-kubernetes ovnkube-master-gtwvn -f | grep "no pending operation found"
I0709 19:54:42.118862       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0709 19:55:12.129453       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0709 19:55:42.114429       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0709 19:56:12.118145       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0709 19:57:42.119788       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0709 19:59:12.117858       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0709 20:00:42.116703       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip

~~~

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Andreas Karis 2022-07-10 16:35:30 UTC
*** Bug 2105712 has been marked as a duplicate of this bug. ***

Comment 3 Andreas Karis 2022-07-10 17:55:26 UTC
How to reproduce:

* spawn cluster on Azure
~~~
[akaris@linux 2101992]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-08-231743   True        False         30m     Cluster version is 4.11.0-0.nightly-2022-07-08-231743
~~~
* create an egressip setup, e.g. [0]
* edit the egressIP and change the IP address, e.g. from:
~~~
  egressIPs:
  - 10.0.129.7
~~~
to:
~~~
  egressIPs:
  - 10.0.129.8
~~~

* check the ovnkube-master leader's logs for message:
~~~
[akaris@linux 2101992]$ p=ovnkube-master-wxp4z; oc logs -n openshift-ovn-kubernetes $p -c ovnkube-master -f | grep 'will try again later: no pending operation found for EgressIP: egressip'
I0710 17:47:11.513301       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:47:41.498750       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:48:11.500208       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:48:41.502791       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:50:11.499971       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:51:41.499073       1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
~~~

That message shows even though the update was successful:
~~~
[akaris@linux 2101992]$ oc get egressip
NAME       EGRESSIPS    ASSIGNED NODE                                     ASSIGNED EGRESSIPS
egressip   10.0.129.8   ci-ln-4p9hg2b-1d09d-c4pt7-worker-eastus21-g2f59   10.0.129.8
~~~

This is a race, so several updates to the IP might be needed.
Potentially you might have to restart the ovnkube-master pod if you can't hit it, at all.

==================================================

[0]

[akaris@linux 2101992]$ cat fedora.yaml 
# oc adm policy add-scc-to-user privileged -z default
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fedora-deployment
  labels:
    app: fedora-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fedora-pod
  template:
    metadata:
      labels:
        app: fedora-pod
    spec:
      containers:
      - name: fedora
        image: fedora
        command:
          - sleep
          - infinity
        imagePullPolicy: IfNotPresent
        securityContext:
          runAsUser: 0
          capabilities:
            add:
              - "SETFCAP"
              - "CAP_NET_RAW"
              - "CAP_NET_ADMIN"
[akaris@linux 2101992]$ cat egressip.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip
spec:
  egressIPs:
  - 10.0.129.7
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: egressip
[akaris@linux 2101992]$ cat label.sh 
#!/bin/bash

EGRESS_ASSIGNABLE_LABEL="k8s.ovn.org/egress-assignable"
n=$(oc get nodes -o name -l node-role.kubernetes.io/worker=  | head -1)
oc label $n ${EGRESS_ASSIGNABLE_LABEL}="" --overwrite

echo "When picking the egressip, select it from the following range:"
oc get -o yaml $n | grep egress-ipconfig

oc new-project egressip
oc adm policy add-scc-to-user privileged -z default
oc apply -f fedora.yaml
oc apply -f egressip.yaml

Comment 5 Andreas Karis 2022-08-09 15:47:50 UTC
Downstream merge https://github.com/openshift/ovn-kubernetes/pull/1231 landed

Comment 6 jechen 2022-08-15 20:03:12 UTC
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-08-15-092951   True        False         65m     Cluster version is 4.12.0-0.nightly-2022-08-15-092951


$ oc get node
NAME                                         STATUS   ROLES                  AGE   VERSION
jechen-0815a-7nnn7-master-0                  Ready    control-plane,master   98m   v1.24.0+da80cd0
jechen-0815a-7nnn7-master-1                  Ready    control-plane,master   98m   v1.24.0+da80cd0
jechen-0815a-7nnn7-master-2                  Ready    control-plane,master   99m   v1.24.0+da80cd0
jechen-0815a-7nnn7-worker-southcentralus-1   Ready    worker                 81m   v1.24.0+da80cd0
jechen-0815a-7nnn7-worker-southcentralus-2   Ready    worker                 82m   v1.24.0+da80cd0
jechen-0815a-7nnn7-worker-southcentralus-3   Ready    worker                 82m   v1.24.0+da80cd0


$ oc label node jechen-0815a-7nnn7-worker-southcentralus-1 "k8s.ovn.org/egress-assignable"=""
node/jechen-0815a-7nnn7-worker-southcentralus-1 labeled


$ cat config_egressip1_ovn_ns_team_red_azure.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egressip1
spec:
  egressIPs:
  - 10.0.1.201
  namespaceSelector:
    matchLabels:
      team: red 


$ oc new-project test

$ oc create -f config_egressip1_ovn_ns_team_red_azure.yaml
egressip.k8s.ovn.org/egressip1 created

$ oc get egressips.k8s.ovn.org 
NAME        EGRESSIPS    ASSIGNED NODE                                ASSIGNED EGRESSIPS
egressip1   10.0.1.201   jechen-0815a-7nnn7-worker-southcentralus-1   10.0.1.201


$ oc get egressip egressip1 -oyaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  creationTimestamp: "2022-08-15T19:17:30Z"
  generation: 3
  name: egressip1
  resourceVersion: "64866"
  uid: 7dbfee19-db99-4e64-950a-f73a9dbfd3e8
spec:
  egressIPs:
  - 10.0.1.201
  namespaceSelector:
    matchLabels:
      team: red
status:
  items:
  - egressIP: 10.0.1.201
    node: jechen-0815a-7nnn7-worker-southcentralus-1


$  oc get -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'  -n openshift-ovn-kubernetes  cm ovn-kubernetes-master
{"holderIdentity":"jechen-0815a-7nnn7-master-0","leaseDurationSeconds":60,"acquireTime":"2022-08-15T17:35:59Z","renewTime":"2022-08-15T19:20:48Z","leaderTransitions":0}


$ oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-master --field-selector=spec.nodeName=jechen-0815a-7nnn7-master-0 -o jsonpath={.items[*].metadata.name}
ovnkube-master-5dd9t


# updated egressip from 10.0.1.201 to 10.0.1.202, then check log on leader node ovnkube-master pod
$ p=ovnkube-master-5dd9t; oc logs -n openshift-ovn-kubernetes $p -c ovnkube-master -f | grep 'will try again later: no pending operation found for EgressIP: egressip'

==> no CloudPrivateIPConfig retry message with "will try again later: no pending operation found for EgressIP: egressip" was found in log



$ oc get egressip
NAME        EGRESSIPS    ASSIGNED NODE                                ASSIGNED EGRESSIPS
egressip1   10.0.1.202   jechen-0815a-7nnn7-worker-southcentralus-1   10.0.1.202


$ oc get egressip egressip1 -oyaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  creationTimestamp: "2022-08-15T19:17:30Z"
  generation: 5
  name: egressip1
  resourceVersion: "65451"
  uid: 7dbfee19-db99-4e64-950a-f73a9dbfd3e8
spec:
  egressIPs:
  - 10.0.1.202
  namespaceSelector:
    matchLabels:
      team: red
status:
  items:
  - egressIP: 10.0.1.202
    node: jechen-0815a-7nnn7-worker-southcentralus-1

==> egressip was updated correctly

Comment 7 jechen 2022-08-15 20:04:19 UTC
verified in 4.12.0-0.nightly-2022-08-15-092951 above.

Comment 12 errata-xmlrpc 2023-01-17 19:51:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399


Note You need to log in before you can comment on or make changes to this bug.