Bug 2105706
| Summary: | Race condition with pendingCloudPrivateIPConfigsOps in EgressIP code | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Andreas Karis <akaris> |
| Component: | Networking | Assignee: | Andreas Karis <akaris> |
| Networking sub component: | ovn-kubernetes | QA Contact: | jechen <jechen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | ffernand, vlaad, zzhao |
| Version: | 4.12 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.12.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-01-17 19:51:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
*** Bug 2105712 has been marked as a duplicate of this bug. *** How to reproduce:
* spawn cluster on Azure
~~~
[akaris@linux 2101992]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-07-08-231743 True False 30m Cluster version is 4.11.0-0.nightly-2022-07-08-231743
~~~
* create an egressip setup, e.g. [0]
* edit the egressIP and change the IP address, e.g. from:
~~~
egressIPs:
- 10.0.129.7
~~~
to:
~~~
egressIPs:
- 10.0.129.8
~~~
* check the ovnkube-master leader's logs for message:
~~~
[akaris@linux 2101992]$ p=ovnkube-master-wxp4z; oc logs -n openshift-ovn-kubernetes $p -c ovnkube-master -f | grep 'will try again later: no pending operation found for EgressIP: egressip'
I0710 17:47:11.513301 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:47:41.498750 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:48:11.500208 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:48:41.502791 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:50:11.499971 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
I0710 17:51:41.499073 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip
~~~
That message shows even though the update was successful:
~~~
[akaris@linux 2101992]$ oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip 10.0.129.8 ci-ln-4p9hg2b-1d09d-c4pt7-worker-eastus21-g2f59 10.0.129.8
~~~
This is a race, so several updates to the IP might be needed.
Potentially you might have to restart the ovnkube-master pod if you can't hit it, at all.
==================================================
[0]
[akaris@linux 2101992]$ cat fedora.yaml
# oc adm policy add-scc-to-user privileged -z default
apiVersion: apps/v1
kind: Deployment
metadata:
name: fedora-deployment
labels:
app: fedora-deployment
spec:
replicas: 1
selector:
matchLabels:
app: fedora-pod
template:
metadata:
labels:
app: fedora-pod
spec:
containers:
- name: fedora
image: fedora
command:
- sleep
- infinity
imagePullPolicy: IfNotPresent
securityContext:
runAsUser: 0
capabilities:
add:
- "SETFCAP"
- "CAP_NET_RAW"
- "CAP_NET_ADMIN"
[akaris@linux 2101992]$ cat egressip.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
name: egressip
spec:
egressIPs:
- 10.0.129.7
namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: egressip
[akaris@linux 2101992]$ cat label.sh
#!/bin/bash
EGRESS_ASSIGNABLE_LABEL="k8s.ovn.org/egress-assignable"
n=$(oc get nodes -o name -l node-role.kubernetes.io/worker= | head -1)
oc label $n ${EGRESS_ASSIGNABLE_LABEL}="" --overwrite
echo "When picking the egressip, select it from the following range:"
oc get -o yaml $n | grep egress-ipconfig
oc new-project egressip
oc adm policy add-scc-to-user privileged -z default
oc apply -f fedora.yaml
oc apply -f egressip.yaml
Downstream merge https://github.com/openshift/ovn-kubernetes/pull/1231 landed $ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-2022-08-15-092951 True False 65m Cluster version is 4.12.0-0.nightly-2022-08-15-092951
$ oc get node
NAME STATUS ROLES AGE VERSION
jechen-0815a-7nnn7-master-0 Ready control-plane,master 98m v1.24.0+da80cd0
jechen-0815a-7nnn7-master-1 Ready control-plane,master 98m v1.24.0+da80cd0
jechen-0815a-7nnn7-master-2 Ready control-plane,master 99m v1.24.0+da80cd0
jechen-0815a-7nnn7-worker-southcentralus-1 Ready worker 81m v1.24.0+da80cd0
jechen-0815a-7nnn7-worker-southcentralus-2 Ready worker 82m v1.24.0+da80cd0
jechen-0815a-7nnn7-worker-southcentralus-3 Ready worker 82m v1.24.0+da80cd0
$ oc label node jechen-0815a-7nnn7-worker-southcentralus-1 "k8s.ovn.org/egress-assignable"=""
node/jechen-0815a-7nnn7-worker-southcentralus-1 labeled
$ cat config_egressip1_ovn_ns_team_red_azure.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
name: egressip1
spec:
egressIPs:
- 10.0.1.201
namespaceSelector:
matchLabels:
team: red
$ oc new-project test
$ oc create -f config_egressip1_ovn_ns_team_red_azure.yaml
egressip.k8s.ovn.org/egressip1 created
$ oc get egressips.k8s.ovn.org
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip1 10.0.1.201 jechen-0815a-7nnn7-worker-southcentralus-1 10.0.1.201
$ oc get egressip egressip1 -oyaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
creationTimestamp: "2022-08-15T19:17:30Z"
generation: 3
name: egressip1
resourceVersion: "64866"
uid: 7dbfee19-db99-4e64-950a-f73a9dbfd3e8
spec:
egressIPs:
- 10.0.1.201
namespaceSelector:
matchLabels:
team: red
status:
items:
- egressIP: 10.0.1.201
node: jechen-0815a-7nnn7-worker-southcentralus-1
$ oc get -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' -n openshift-ovn-kubernetes cm ovn-kubernetes-master
{"holderIdentity":"jechen-0815a-7nnn7-master-0","leaseDurationSeconds":60,"acquireTime":"2022-08-15T17:35:59Z","renewTime":"2022-08-15T19:20:48Z","leaderTransitions":0}
$ oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-master --field-selector=spec.nodeName=jechen-0815a-7nnn7-master-0 -o jsonpath={.items[*].metadata.name}
ovnkube-master-5dd9t
# updated egressip from 10.0.1.201 to 10.0.1.202, then check log on leader node ovnkube-master pod
$ p=ovnkube-master-5dd9t; oc logs -n openshift-ovn-kubernetes $p -c ovnkube-master -f | grep 'will try again later: no pending operation found for EgressIP: egressip'
==> no CloudPrivateIPConfig retry message with "will try again later: no pending operation found for EgressIP: egressip" was found in log
$ oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip1 10.0.1.202 jechen-0815a-7nnn7-worker-southcentralus-1 10.0.1.202
$ oc get egressip egressip1 -oyaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
creationTimestamp: "2022-08-15T19:17:30Z"
generation: 5
name: egressip1
resourceVersion: "65451"
uid: 7dbfee19-db99-4e64-950a-f73a9dbfd3e8
spec:
egressIPs:
- 10.0.1.202
namespaceSelector:
matchLabels:
team: red
status:
items:
- egressIP: 10.0.1.202
node: jechen-0815a-7nnn7-worker-southcentralus-1
==> egressip was updated correctly
verified in 4.12.0-0.nightly-2022-08-15-092951 above. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399 |
Description of problem: When editing an EgressIP: > oc edit egressip egressip ~~~ 1 # Please edit the object below. Lines beginning with a '#' will be ignored, 2 # and an empty file will abort the edit. If an error occurs while saving this file will be 3 # reopened with the relevant failures. 4 # 5 apiVersion: k8s.ovn.org/v1 6 kind: EgressIP 7 metadata: 8 annotations: 9 kubectl.kubernetes.io/last-applied-configuration: | 10 {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip"},"spec":{"egressIPs":["10.0.129.5"],"namespaceSelector":{"matchLabels":{"kubernetes.io/metadata.name":"egressip"}}}} 11 creationTimestamp: "2022-07-09T19:48:26Z" 12 generation: 8 13 name: egressip 14 resourceVersion: "38677" 15 uid: 84cb1880-39ce-4ffa-bdd6-1adaeb8507ce 16 spec: 17 egressIPs: 18 - 10.0.129.7 19 namespaceSelector: 20 matchLabels: 21 kubernetes.io/metadata.name: egressip 22 status: 23 items: 24 - egressIP: 10.0.129.7 25 node: ci-ln-birng3t-1d09d-qkr8s-worker-centralus1-28kjf ~~~ To: ~~~ 1 # Please edit the object below. Lines beginning with a '#' will be ignored, 2 # and an empty file will abort the edit. If an error occurs while saving this file will be 3 # reopened with the relevant failures. 4 # 5 apiVersion: k8s.ovn.org/v1 6 kind: EgressIP 7 metadata: 8 annotations: 9 kubectl.kubernetes.io/last-applied-configuration: | 10 {"apiVersion":"k8s.ovn.org/v1","kind":"EgressIP","metadata":{"annotations":{},"name":"egressip"},"spec":{"egressIPs":["10.0.129.5"],"namespaceSelector":{"matchLabels":{"kubernetes.io/metadata.name":"egressip"}}}} 11 creationTimestamp: "2022-07-09T19:48:26Z" 12 generation: 8 13 name: egressip 14 resourceVersion: "38677" 15 uid: 84cb1880-39ce-4ffa-bdd6-1adaeb8507ce 16 spec: 17 egressIPs: 18 - 10.0.129.7 19 namespaceSelector: 20 matchLabels: 21 kubernetes.io/metadata.name: egressip 22 status: 23 items: 24 - egressIP: 10.0.129.10 25 node: ci-ln-birng3t-1d09d-qkr8s-worker-centralus1-28kjf ~~~ The code is subject to races whenever executeCloudPrivateIPConfigChange executed the following code ... ~~~ oc.eIPC.pendingCloudPrivateIPConfigsOps[egressIPName] = ops ~~~ ... before removePendingOpsAndGetResyncs called by reconcileCloudPrivateIPConfig has a chance to process the delete operation. This happens because the add operation unconditionally overwrites the current entry in oc.eIPC.pendingCloudPrivateIPConfigsOps[egressIPName]. The delete operation thus reports a failure and is requeued and the following log message is the result: ~~~ [akaris@linux 2101992]$ oc logs -c ovnkube-master -n openshift-ovn-kubernetes ovnkube-master-gtwvn -f | grep "no pending operation found" I0709 19:54:42.118862 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip I0709 19:55:12.129453 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip I0709 19:55:42.114429 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip I0709 19:56:12.118145 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip I0709 19:57:42.119788 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip I0709 19:59:12.117858 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip I0709 20:00:42.116703 1 obj_retry.go:1119] *v1.CloudPrivateIPConfig retry delete failed for 10.0.129.7, will try again later: no pending operation found for EgressIP: egressip ~~~ Version-Release number of selected component (if applicable): How reproducible: Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info: