
Bug 2079439

Summary: OVN Pods Assigned Same IP Simultaneously
Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Reporter: Devan Goodwin <dgoodwin>
Assignee: Tim Rozet <trozet>
QA Contact: Anurag Saxena <anusaxen>
CC: deads, dhellmann, surya, trozet, vpickard, wking
Last Closed: 2022-08-10 11:08:39 UTC
Type: Bug

Description Devan Goodwin 2022-04-27 14:38:18 UTC
A rare problem has been identified (only one instance found so far, but work is underway to detect it automatically) where two pods run simultaneously with the same IP.

Problem was found in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1519175974577508352

Originally surfaced as a test failure with:

"error trying to reach service: x509: certificate is valid for cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc, cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc.cluster.local, not api.openshift-apiserver.svc"

In must-gather, pod/apiserver-f7c88fc74-bjjzj in the openshift-apiserver namespace and pod/cluster-storage-operator-b7f7dd8b4-z6t2h in the openshift-cluster-storage-operator namespace are both using the IP 10.129.0.6. That explains the certificate error above: a connection intended for api.openshift-apiserver.svc was answered by the storage operator pod sharing the same IP.


Despite its possible rarity, the implications are severe, so TRT is filing this at high severity and requests relatively urgent attention. TRT will continue working to refine tests to detect this automatically.

Comment 1 David Eads 2022-04-27 15:50:21 UTC
https://github.com/openshift/origin/pull/27062 is an attempt to catch this condition more reliably and associate the failure better.

Given the implications of improper connections being made in the cluster, I think this problem is a blocker even if it is relatively rare.

Comment 2 Tim Rozet 2022-04-27 19:31:36 UTC
This is a regression in 4.11 caused by https://github.com/openshift/ovn-kubernetes/pull/1010/commits/999f344459d089b61f4780f11d0f90e8e7974501

During upgrade or restart of ovnkube-master, completed pods may have their networking recreated. Additionally, this sequence could happen (see the sketch below):
1. pod A is created
2. pod A goes complete, its IP is freed
3. pod B is created and gets the same IP
4. pod A is deleted, and we accidentally free the IP pod B is now using
5. pod C is created and gets a duplicate of pod B's IP
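
To make the failure mode concrete, here is a minimal hypothetical Go sketch (a toy model, not the actual ovn-kubernetes allocator code): releasing an IP keyed only by address, without checking which pod currently owns it, reproduces steps 4 and 5.

package main

import "fmt"

// allocator is a toy IP pool; "used" tracks which pod owns each IP.
type allocator struct {
	free []string
	used map[string]string
}

func (a *allocator) allocate(pod string) string {
	ip := a.free[0]
	a.free = a.free[1:]
	a.used[ip] = pod
	return ip
}

// Buggy release: frees the IP without verifying that the releasing pod
// still owns it. A fixed version would require used[ip] == pod.
func (a *allocator) release(ip string) {
	delete(a.used, ip)
	a.free = append(a.free, ip)
}

func main() {
	a := &allocator{free: []string{"10.129.0.6"}, used: map[string]string{}}
	ipA := a.allocate("pod-a") // 1. pod A is created
	a.release(ipA)             // 2. pod A completes, IP freed
	ipB := a.allocate("pod-b") // 3. pod B gets the same IP
	a.release(ipA)             // 4. deleting pod A frees pod B's IP
	ipC := a.allocate("pod-c") // 5. pod C gets a duplicate of pod B's IP
	fmt.Println(ipB == ipC)    // prints true
}

In this toy model the fix is to make release conditional on ownership, so a stale delete for pod A cannot free an address that has since been handed to pod B.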

Working on a fix...

Comment 3 Devan Goodwin 2022-04-28 13:53:03 UTC
This looks like a legit hit just on the PR rehearsal; it may be that this is very common.

Tim, does this still fit your theory above?

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328

Three overlaps detected; working with the first one:

reason/ReusedPodIP podIP 10.128.0.60 is currently assigned to multiple pods: ns/openshift-image-registry pod/image-registry-57dd7f8-b2mqb node/ip-10-0-186-38.ec2.internal uid/3d633d60-d441-43c9-9af0-d42ac6b177be;ns/openshift-kube-controller-manager pod/installer-8-ip-10-0-186-38.ec2.internal node/ip-10-0-186-38.ec2.internal uid/d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66

❯ jq -r '.items[] | select(.metadata.uid == "3d633d60-d441-43c9-9af0-d42ac6b177be").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-image-registry/pods.json
2022-04-27T22:57:46Z Initialized=True
2022-04-28T00:55:11Z Ready=False ContainersNotReady
2022-04-28T00:55:11Z ContainersReady=False ContainersNotReady
2022-04-27T22:57:46Z PodScheduled=True

❯ jq -r '.items[] | select(.metadata.uid == "d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-kube-controller-manager/pods.json
2022-04-27T23:19:21Z Initialized=True
2022-04-27T23:19:33Z Ready=False ContainersNotReady
2022-04-27T23:19:33Z ContainersReady=False ContainersNotReady
2022-04-27T23:19:21Z PodScheduled=True

The registry pod has deletion timestamp: 2022-04-28T00:54:44Z

A couple events from: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328/artifacts/e2e-aws-single-node-upgrade/gather-must-gather/artifacts/event-filter.html

22:58:35 image-registry-57dd7f8-b2mqb Add eth0 [10.128.0.60/23] from ovn-kubernetes
23:19:23 installer-8-ip-10-0-186-38.ec2.internal Add eth0 [10.128.0.60/23] from ovn-kubernetes


So to me it looks like the registry pod was given 10.128.0.60 when it started around 22:58, and it was still live at 23:19 when the controller manager pod was given the same IP.
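
For reference, here is a hypothetical Go sketch of this kind of check (field names are standard Kubernetes pod fields; this is not the actual test code from https://github.com/openshift/origin/pull/27062): it groups pods from must-gather pods.json files by status.podIP and flags IPs assigned to more than one live pod.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// podList models only the fields of a Kubernetes PodList that we need.
type podList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			HostNetwork bool `json:"hostNetwork"`
		} `json:"spec"`
		Status struct {
			Phase string `json:"phase"`
			PodIP string `json:"podIP"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	byIP := map[string][]string{}
	for _, path := range os.Args[1:] { // e.g. go run detect.go */pods.json
		data, err := os.ReadFile(path)
		if err != nil {
			panic(err)
		}
		var pl podList
		if err := json.Unmarshal(data, &pl); err != nil {
			panic(err)
		}
		for _, p := range pl.Items {
			// Skip host-networked pods (they legitimately share the node IP)
			// and pods that have no IP or have already finished.
			if p.Spec.HostNetwork || p.Status.PodIP == "" ||
				p.Status.Phase == "Succeeded" || p.Status.Phase == "Failed" {
				continue
			}
			key := p.Status.PodIP
			byIP[key] = append(byIP[key], p.Metadata.Namespace+"/"+p.Metadata.Name)
		}
	}
	for ip, pods := range byIP {
		if len(pods) > 1 {
			fmt.Printf("reused podIP %s is assigned to multiple pods: %v\n", ip, pods)
		}
	}
}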

Comment 8 errata-xmlrpc 2022-08-10 11:08:39 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069