
Bug 2079439

Summary: OVN Pods Assigned Same IP Simultaneously
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Reporter: Devan Goodwin <dgoodwin>
Assignee: Tim Rozet <trozet>
QA Contact: Anurag Saxena <anusaxen>
CC: deads, dhellmann, surya, trozet, vpickard, wking
Last Closed: 2022-08-10 11:08:39 UTC
Type: Bug

Description Devan Goodwin 2022-04-27 14:38:18 UTC
A rare problem has been identified (only one instance found so far, but work is underway to detect it automatically) where two pods run simultaneously with the same IP.

Problem was found in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1519175974577508352

Originally surfaced as a test failure with:

"error trying to reach service: x509: certificate is valid for cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc, cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc.cluster.local, not api.openshift-apiserver.svc"

In must-gather, pod/apiserver-f7c88fc74-bjjzj in the openshift-apiserver namespace and pod/cluster-storage-operator-b7f7dd8b4-z6t2h in the openshift-cluster-storage-operator namespace are both using the IP 10.129.0.6. That explains the certificate error above: a connection intended for api.openshift-apiserver.svc was answered by the storage operator pod sharing the same IP.


Despite its possible rarity, the implications are severe, so TRT is filing this at high severity and requests relatively urgent attention. TRT will continue working to refine tests to detect this automatically.

Comment 1 David Eads 2022-04-27 15:50:21 UTC
https://github.com/openshift/origin/pull/27062 is an attempt to catch this condition more reliably and associate the failure better.

Given the implications of improper connections being made in the cluster, I think this problem is a blocker even if it is relatively rare.

Comment 2 Tim Rozet 2022-04-27 19:31:36 UTC
This is a regression in 4.11 caused by https://github.com/openshift/ovn-kubernetes/pull/1010/commits/999f344459d089b61f4780f11d0f90e8e7974501

During upgrade or restart of ovnkube-master, completed pods may have their networking recreated. Additionally, this sequence could happen (see the sketch below):
1. pod A is created
2. pod A goes complete, its IP is freed
3. pod B is created and gets the same IP
4. pod A is deleted, and we accidentally free the IP pod B is now using
5. pod C is created and gets a duplicate of pod B's IP
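
To make the failure mode concrete, here is a minimal hypothetical Go sketch (a toy model, not the actual ovn-kubernetes allocator code): releasing an IP keyed only by address, without checking which pod currently owns it, reproduces steps 4 and 5.

package main

import "fmt"

// allocator is a toy IP pool; "used" tracks which pod owns each IP.
type allocator struct {
	free []string
	used map[string]string
}

func (a *allocator) allocate(pod string) string {
	ip := a.free[0]
	a.free = a.free[1:]
	a.used[ip] = pod
	return ip
}

// Buggy release: frees the IP without verifying that the releasing pod
// still owns it. A fixed version would require used[ip] == pod.
func (a *allocator) release(ip string) {
	delete(a.used, ip)
	a.free = append(a.free, ip)
}

func main() {
	a := &allocator{free: []string{"10.129.0.6"}, used: map[string]string{}}
	ipA := a.allocate("pod-a") // 1. pod A is created
	a.release(ipA)             // 2. pod A completes, IP freed
	ipB := a.allocate("pod-b") // 3. pod B gets the same IP
	a.release(ipA)             // 4. deleting pod A frees pod B's IP
	ipC := a.allocate("pod-c") // 5. pod C gets a duplicate of pod B's IP
	fmt.Println(ipB == ipC)    // prints true
}

In this toy model the fix is to make release conditional on ownership, so a stale delete for pod A cannot free an address that has since been handed to pod B.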

Working on a fix...

Comment 3 Devan Goodwin 2022-04-28 13:53:03 UTC
This looks like a legit hit just on the PR rehearsal; it may be that this is very common.

Tim, does this still fit your theory above?

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328

Three overlaps detected; working with the first one:

reason/ReusedPodIP podIP 10.128.0.60 is currently assigned to multiple pods: ns/openshift-image-registry pod/image-registry-57dd7f8-b2mqb node/ip-10-0-186-38.ec2.internal uid/3d633d60-d441-43c9-9af0-d42ac6b177be;ns/openshift-kube-controller-manager pod/installer-8-ip-10-0-186-38.ec2.internal node/ip-10-0-186-38.ec2.internal uid/d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66

❯ jq -r '.items[] | select(.metadata.uid == "3d633d60-d441-43c9-9af0-d42ac6b177be").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-image-registry/pods.json
2022-04-27T22:57:46Z Initialized=True
2022-04-28T00:55:11Z Ready=False ContainersNotReady
2022-04-28T00:55:11Z ContainersReady=False ContainersNotReady
2022-04-27T22:57:46Z PodScheduled=True

❯ jq -r '.items[] | select(.metadata.uid == "d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-kube-controller-manager/pods.json
2022-04-27T23:19:21Z Initialized=True
2022-04-27T23:19:33Z Ready=False ContainersNotReady
2022-04-27T23:19:33Z ContainersReady=False ContainersNotReady
2022-04-27T23:19:21Z PodScheduled=True

The registry pod has deletion timestamp: 2022-04-28T00:54:44Z

A couple events from: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328/artifacts/e2e-aws-single-node-upgrade/gather-must-gather/artifacts/event-filter.html

22:58:35 image-registry-57dd7f8-b2mqb Add eth0 [10.128.0.60/23] from ovn-kubernetes
23:19:23 installer-8-ip-10-0-186-38.ec2.internal Add eth0 [10.128.0.60/23] from ovn-kubernetes


So to me it looks like the registry pod was given 10.128.0.60 when it started around 22:58, and it was still live at 23:19 when the controller manager pod was given the same IP.
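
For reference, here is a hypothetical Go sketch of this kind of check (field names are standard Kubernetes pod fields; this is not the actual test code from https://github.com/openshift/origin/pull/27062): it groups pods from must-gather pods.json files by status.podIP and flags IPs assigned to more than one live pod.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// podList models only the fields of a Kubernetes PodList that we need.
type podList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			HostNetwork bool `json:"hostNetwork"`
		} `json:"spec"`
		Status struct {
			Phase string `json:"phase"`
			PodIP string `json:"podIP"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	byIP := map[string][]string{}
	for _, path := range os.Args[1:] { // e.g. go run detect.go */pods.json
		data, err := os.ReadFile(path)
		if err != nil {
			panic(err)
		}
		var pl podList
		if err := json.Unmarshal(data, &pl); err != nil {
			panic(err)
		}
		for _, p := range pl.Items {
			// Skip host-networked pods (they legitimately share the node IP)
			// and pods that have no IP or have already finished.
			if p.Spec.HostNetwork || p.Status.PodIP == "" ||
				p.Status.Phase == "Succeeded" || p.Status.Phase == "Failed" {
				continue
			}
			key := p.Status.PodIP
			byIP[key] = append(byIP[key], p.Metadata.Namespace+"/"+p.Metadata.Name)
		}
	}
	for ip, pods := range byIP {
		if len(pods) > 1 {
			fmt.Printf("reused podIP %s is assigned to multiple pods: %v\n", ip, pods)
		}
	}
}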

Comment 8 errata-xmlrpc 2022-08-10 11:08:39 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069