Bug 2079439
| Summary: | OVN Pods Assigned Same IP Simultaneously | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Devan Goodwin <dgoodwin> |
| Component: | Networking | Assignee: | Tim Rozet <trozet> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | urgent | CC: | deads, dhellmann, surya, trozet, vpickard, wking |
| Version: | 4.11 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 11:08:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Devan Goodwin
2022-04-27 14:38:18 UTC
https://github.com/openshift/origin/pull/27062 is an attempt to catch this condition more reliably and to better associate the failure with it. Given the implications of improper connections being made in the cluster, I think this problem is a blocker even if it is relatively rare.

This is a regression in 4.11 caused by https://github.com/openshift/ovn-kubernetes/pull/1010/commits/999f344459d089b61f4780f11d0f90e8e7974501

During an upgrade or restart of ovnkube-master, completed pods may have their networking recreated. Additionally, this could happen:

1. Pod A is created.
2. Pod A goes complete, and its IP is freed.
3. Pod B is created and gets the same IP.
4. Pod A is deleted, and we accidentally free the IP that pod B is now using.
5. Pod C is created and gets the same IP as pod B.

Working on a fix...

This looks like a legitimate hit on just the PR rehearsal, so it may be that this is very common. Tim, does this still fit your theory above?

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328

Three overlaps detected.
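The five-step sequence above can be replayed with a toy allocator. This is only a sketch under stated assumptions: `NaiveAllocator`, `OwnerAwareAllocator`, and the two replay helpers are hypothetical names, the pool is made up, and neither class reflects how ovn-kubernetes actually implements IPAM or how the fix was written. The owner check is simply one possible guard against a stale release.

```python
class NaiveAllocator:
    """Hypothetical allocator that releases by IP alone, with no owner check."""

    def __init__(self, pool):
        self.free = list(pool)  # available IPs, handed out front first
        self.used = set()

    def allocate(self):
        ip = self.free.pop(0)
        self.used.add(ip)
        return ip

    def release(self, ip):
        # Illustrates the bug: any caller holding the IP string can free it,
        # even if another pod has been given that IP in the meantime.
        if ip in self.used:
            self.used.discard(ip)
            self.free.insert(0, ip)


class OwnerAwareAllocator:
    """Hypothetical variant that records which pod UID owns each IP."""

    def __init__(self, pool):
        self.free = list(pool)
        self.owner = {}  # ip -> pod uid

    def allocate(self, uid):
        ip = self.free.pop(0)
        self.owner[ip] = uid
        return ip

    def release(self, ip, uid):
        # Ignore a release from a pod that no longer owns the IP.
        if self.owner.get(ip) != uid:
            return False
        del self.owner[ip]
        self.free.insert(0, ip)
        return True


POOL = ["10.128.0.60", "10.128.0.61"]  # made-up pool for illustration


def replay_naive():
    alloc = NaiveAllocator(POOL)
    ip_a = alloc.allocate()   # 1. pod A is created
    alloc.release(ip_a)       # 2. pod A completes, IP is freed
    ip_b = alloc.allocate()   # 3. pod B is created, gets the same IP
    alloc.release(ip_a)       # 4. pod A is deleted, IP is freed a second time
    ip_c = alloc.allocate()   # 5. pod C is created
    return ip_b, ip_c


def replay_owner_aware():
    alloc = OwnerAwareAllocator(POOL)
    ip_a = alloc.allocate("A")
    alloc.release(ip_a, "A")
    ip_b = alloc.allocate("B")
    alloc.release(ip_a, "A")  # stale release is rejected: B owns the IP now
    ip_c = alloc.allocate("C")
    return ip_b, ip_c


if __name__ == "__main__":
    print("naive:", replay_naive())
    print("owner-aware:", replay_owner_aware())
```

In the naive replay, pods B and C end up with the same address. In the owner-aware replay, the second release from pod A is ignored and pod C receives a different address.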
Working with the first one:

reason/ReusedPodIP podIP 10.128.0.60 is currently assigned to multiple pods: ns/openshift-image-registry pod/image-registry-57dd7f8-b2mqb node/ip-10-0-186-38.ec2.internal uid/3d633d60-d441-43c9-9af0-d42ac6b177be;ns/openshift-kube-controller-manager pod/installer-8-ip-10-0-186-38.ec2.internal node/ip-10-0-186-38.ec2.internal uid/d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66

    ❯ jq -r '.items[] | select(.metadata.uid == "3d633d60-d441-43c9-9af0-d42ac6b177be").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-image-registry/pods.json
    2022-04-27T22:57:46Z Initialized=True
    2022-04-28T00:55:11Z Ready=False ContainersNotReady
    2022-04-28T00:55:11Z ContainersReady=False ContainersNotReady
    2022-04-27T22:57:46Z PodScheduled=True

    ❯ jq -r '.items[] | select(.metadata.uid == "d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-kube-controller-manager/pods.json
    2022-04-27T23:19:21Z Initialized=True
    2022-04-27T23:19:33Z Ready=False ContainersNotReady
    2022-04-27T23:19:33Z ContainersReady=False ContainersNotReady
    2022-04-27T23:19:21Z PodScheduled=True

The registry pod has deletion timestamp: 2022-04-28T00:54:44Z

A couple of events from: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328/artifacts/e2e-aws-single-node-upgrade/gather-must-gather/artifacts/event-filter.html

    22:58:35 image-registry-57dd7f8-b2mqb Add eth0 [10.128.0.60/23] from ovn-kubernetes
    23:19:23 installer-8-ip-10-0-186-38.ec2.internal Add eth0 [10.128.0.60/23] from ovn-kubernetes

So to me it looks like the registry pod was given 10.128.0.60 when it started around 22:58, and it was still live at 23:19 when the kube-controller-manager installer pod was given the same IP.
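The kind of overlap reported in the ReusedPodIP message above can be searched for in a gathered pod list. The following is a minimal sketch, not the check added in the origin PR: `find_reused_pod_ips` and the sample data are hypothetical, and the filter (skip host-network pods and pods in a terminal phase) is an assumption about which pods may legitimately share an address. It relies only on standard Kubernetes Pod fields (`metadata.uid`, `spec.hostNetwork`, `status.podIP`, `status.phase`).

```python
from collections import defaultdict

# Assumption: pods in these phases have released their IP and may share it.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def find_reused_pod_ips(pod_list):
    """Return {ip: [pod descriptions]} for IPs held by more than one live pod.

    pod_list is a parsed Kubernetes PodList (for example, a loaded pods.json).
    """
    by_ip = defaultdict(list)
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        status = pod.get("status", {})
        ip = status.get("podIP")
        if not ip:
            continue  # not yet assigned an address
        if spec.get("hostNetwork"):
            continue  # host-network pods share the node IP by design
        if status.get("phase") in TERMINAL_PHASES:
            continue  # completed pods are assumed to have given up their IP
        meta = pod["metadata"]
        by_ip[ip].append(
            f"ns/{meta['namespace']} pod/{meta['name']} uid/{meta['uid']}"
        )
    return {ip: pods for ip, pods in by_ip.items() if len(pods) > 1}


# Made-up sample data; names, UIDs, and phases are illustrative only.
sample = {
    "items": [
        {"metadata": {"namespace": "ns-a", "name": "pod-1", "uid": "uid-1"},
         "spec": {},
         "status": {"podIP": "10.128.0.60", "phase": "Running"}},
        {"metadata": {"namespace": "ns-b", "name": "pod-2", "uid": "uid-2"},
         "spec": {},
         "status": {"podIP": "10.128.0.60", "phase": "Running"}},
        {"metadata": {"namespace": "ns-c", "name": "pod-3", "uid": "uid-3"},
         "spec": {},
         "status": {"podIP": "10.128.0.60", "phase": "Succeeded"}},
        {"metadata": {"namespace": "ns-d", "name": "pod-4", "uid": "uid-4"},
         "spec": {"hostNetwork": True},
         "status": {"podIP": "10.0.186.38", "phase": "Running"}},
    ]
}

if __name__ == "__main__":
    for ip, pods in find_reused_pod_ips(sample).items():
        print(f"podIP {ip} is assigned to multiple pods: {';'.join(pods)}")
```

A snapshot check like this sees only the final state. The event timestamps quoted above are still needed to confirm that both pods were live with the address at the same time.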
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069 |