Bug 2079439 - OVN Pods Assigned Same IP Simultaneously
Summary: OVN Pods Assigned Same IP Simultaneously
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-27 14:38 UTC by Devan Goodwin
Modified: 2022-08-10 11:08 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:08:39 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 1064 0 None Merged Bug 2079439: [DownstreamMerge] 4-29-22 2022-05-02 19:04:59 UTC
Github ovn-org ovn-kubernetes pull 2957 0 None Merged Fixes various issues with completed pods 2022-05-02 19:04:55 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:08:50 UTC

Description Devan Goodwin 2022-04-27 14:38:18 UTC
A rare problem has been identified (only one instance found so far, but work is underway to detect it automatically) where two pods are running simultaneously with the same IP.

The problem was found in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1519175974577508352

Originally surfaced as a test failure with:

"error trying to reach service: x509: certificate is valid for cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc, cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc.cluster.local, not api.openshift-apiserver.svc"

In must-gather, in the openshift-apiserver namespace, pod/apiserver-f7c88fc74-bjjzj and in the openshift-cluster-storage-operator namespace, pod/cluster-storage-operator-b7f7dd8b4-z6t2h, you will find both pods using the IP: 10.129.0.6


Despite its possible rarity, the implications are severe, so TRT is filing this as sev-high and requests relatively urgent attention. TRT will continue working to refine tests to automatically detect this.

Comment 1 David Eads 2022-04-27 15:50:21 UTC
https://github.com/openshift/origin/pull/27062 is an attempt to catch this condition more reliably and associate the failure better.

Given the implications of improper connections being made in the cluster, I think this problem is a blocker even if it is relatively rare.

Comment 2 Tim Rozet 2022-04-27 19:31:36 UTC
This is a regression in 4.11 caused by https://github.com/openshift/ovn-kubernetes/pull/1010/commits/999f344459d089b61f4780f11d0f90e8e7974501

During upgrade or restart of ovnkube-master, completed pods may have their networking recreated. Additionally this could happen:
1. pod A is created
2. pod A goes complete, IP is freed
3. pod B is created, gets the same IP
4. pod A is deleted, now we accidentally free the IP pod B is now using
5. pod C is created, gets duplicate IP as B

Working on a fix...

Comment 3 Devan Goodwin 2022-04-28 13:53:03 UTC
This looks like a legit hit on just the PR rehearsal; it may be that this is very common.

Tim, does this still fit your theory above?

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328

Three overlaps were detected; working with the first one:

reason/ReusedPodIP podIP 10.128.0.60 is currently assigned to multiple pods: ns/openshift-image-registry pod/image-registry-57dd7f8-b2mqb node/ip-10-0-186-38.ec2.internal uid/3d633d60-d441-43c9-9af0-d42ac6b177be;ns/openshift-kube-controller-manager pod/installer-8-ip-10-0-186-38.ec2.internal node/ip-10-0-186-38.ec2.internal uid/d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66

❯ jq -r '.items[] | select(.metadata.uid == "3d633d60-d441-43c9-9af0-d42ac6b177be").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-image-registry/pods.json
2022-04-27T22:57:46Z Initialized=True
2022-04-28T00:55:11Z Ready=False ContainersNotReady
2022-04-28T00:55:11Z ContainersReady=False ContainersNotReady
2022-04-27T22:57:46Z PodScheduled=True

❯ jq -r '.items[] | select(.metadata.uid == "d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-kube-controller-manager/pods.json
2022-04-27T23:19:21Z Initialized=True
2022-04-27T23:19:33Z Ready=False ContainersNotReady
2022-04-27T23:19:33Z ContainersReady=False ContainersNotReady
2022-04-27T23:19:21Z PodScheduled=True

The registry pod has deletion timestamp: 2022-04-28T00:54:44Z

A couple events from: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328/artifacts/e2e-aws-single-node-upgrade/gather-must-gather/artifacts/event-filter.html

22:58:35 image-registry-57dd7f8-b2mqb Add eth0 [10.128.0.60/23] from ovn-kubernetes
23:19:23 installer-8-ip-10-0-186-38.ec2.internal Add eth0 [10.128.0.60/23] from ovn-kubernetes


So to me that looks like the registry pod was given 10.128.0.60 when it started around 22:58, and it was still live at 23:19 when the controller-manager pod was given the same IP.

Comment 8 errata-xmlrpc 2022-08-10 11:08:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

