A rare problem (only one instance found so far, but work is underway to better detect it automatically) has been identified where two pods run simultaneously with the same IP. The problem was found in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1519175974577508352

It originally surfaced as a test failure with: "error trying to reach service: x509: certificate is valid for cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc, cluster-storage-operator-metrics.openshift-cluster-storage-operator.svc.cluster.local, not api.openshift-apiserver.svc"

In the must-gather, pod/apiserver-f7c88fc74-bjjzj in the openshift-apiserver namespace and pod/cluster-storage-operator-b7f7dd8b4-z6t2h in the openshift-cluster-storage-operator namespace are both using the IP 10.129.0.6.

Despite its possible rarity, the implications are severe, so TRT is filing this as sev high and requests relatively urgent attention if possible. TRT will continue working to refine tests to automatically detect this.
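A duplicate assignment like this can be spotted in a must-gather by grouping pod records by `status.podIP` across the per-namespace pods.json files. The sketch below is a hypothetical helper, not TRT's actual detection code; it assumes the standard Kubernetes pod List layout found in must-gathers:

```python
from collections import defaultdict


def find_duplicate_pod_ips(pod_lists):
    """Group live pods by status.podIP and report IPs held by more
    than one pod. `pod_lists` is an iterable of parsed pods.json
    documents (one List per namespace, as in a must-gather)."""
    by_ip = defaultdict(list)
    for doc in pod_lists:
        for pod in doc.get("items", []):
            ip = pod.get("status", {}).get("podIP")
            phase = pod.get("status", {}).get("phase")
            # Completed pods legitimately give up their IP, so only
            # count pods that are still Pending or Running.
            if ip and phase in ("Pending", "Running"):
                key = (pod["metadata"]["namespace"], pod["metadata"]["name"])
                by_ip[ip].append(key)
    # Any IP mapped to more than one live pod is a duplicate.
    return {ip: pods for ip, pods in by_ip.items() if len(pods) > 1}
```

For the case above, this would report 10.129.0.6 shared by openshift-apiserver/apiserver-f7c88fc74-bjjzj and openshift-cluster-storage-operator/cluster-storage-operator-b7f7dd8b4-z6t2h.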
https://github.com/openshift/origin/pull/27062 is an attempt to catch this condition more reliably and to attribute the failure more clearly. Given the implications of improper connections being made within the cluster, I think this problem is a blocker even if it is relatively rare.
This is a regression in 4.11 caused by https://github.com/openshift/ovn-kubernetes/pull/1010/commits/999f344459d089b61f4780f11d0f90e8e7974501

During an upgrade or restart of ovnkube-master, completed pods may have their networking recreated. Additionally, this sequence could happen:

1. Pod A is created.
2. Pod A goes complete; its IP is freed.
3. Pod B is created and gets the same IP.
4. Pod A is deleted, and we accidentally free the IP that pod B is now using.
5. Pod C is created and gets a duplicate of pod B's IP.

Working on a fix...
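The race above can be illustrated with a minimal model of an allocator that frees an IP on pod deletion without checking whether the IP has since been handed to another pod. All names here are hypothetical illustration, not the actual ovn-kubernetes code:

```python
class NaiveIPAllocator:
    """Minimal model of the buggy behavior: the IP recorded for a pod
    is released unconditionally at deletion time, even if it was
    already freed at completion and reassigned to another pod."""

    def __init__(self, pool):
        self.free = list(pool)   # available IPs (LIFO)
        self.assigned = {}       # pod name -> last IP recorded for it

    def allocate(self, pod):
        ip = self.free.pop()
        self.assigned[pod] = ip
        return ip

    def release_on_complete(self, pod):
        # The IP returns to the pool, but the pod object (and its
        # recorded IP) still exists in the API.
        self.free.append(self.assigned[pod])

    def release_on_delete(self, pod):
        # BUG: frees the recorded IP without checking whether it was
        # already released at completion time and reassigned since.
        ip = self.assigned.pop(pod)
        if ip not in self.free:
            self.free.append(ip)


alloc = NaiveIPAllocator(["10.128.0.61", "10.128.0.60"])
alloc.allocate("pod-a")             # 1. pod A gets 10.128.0.60
alloc.release_on_complete("pod-a")  # 2. A completes, IP freed
alloc.allocate("pod-b")             # 3. pod B reuses 10.128.0.60
alloc.release_on_delete("pod-a")    # 4. BUG: frees B's live IP
ip_c = alloc.allocate("pod-c")      # 5. pod C also gets 10.128.0.60
```

After step 5, pods B and C are both live with 10.128.0.60, which is the duplicate-IP condition seen in the job above.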
This looks like a legitimate hit just on the PR rehearsal, so it may be that this is very common. Tim, does this still fit your theory above?

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328

Three overlaps were detected; working with the first one:

reason/ReusedPodIP podIP 10.128.0.60 is currently assigned to multiple pods: ns/openshift-image-registry pod/image-registry-57dd7f8-b2mqb node/ip-10-0-186-38.ec2.internal uid/3d633d60-d441-43c9-9af0-d42ac6b177be;ns/openshift-kube-controller-manager pod/installer-8-ip-10-0-186-38.ec2.internal node/ip-10-0-186-38.ec2.internal uid/d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66

❯ jq -r '.items[] | select(.metadata.uid == "3d633d60-d441-43c9-9af0-d42ac6b177be").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-image-registry/pods.json
2022-04-27T22:57:46Z Initialized=True
2022-04-28T00:55:11Z Ready=False ContainersNotReady
2022-04-28T00:55:11Z ContainersReady=False ContainersNotReady
2022-04-27T22:57:46Z PodScheduled=True

❯ jq -r '.items[] | select(.metadata.uid == "d0a54e02-ed9c-4709-92bb-d9ddd0ecaa66").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason' openshift-kube-controller-manager/pods.json
2022-04-27T23:19:21Z Initialized=True
2022-04-27T23:19:33Z Ready=False ContainersNotReady
2022-04-27T23:19:33Z ContainersReady=False ContainersNotReady
2022-04-27T23:19:21Z PodScheduled=True

The registry pod has deletion timestamp: 2022-04-28T00:54:44Z

A couple of events from: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27062/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1519436264984547328/artifacts/e2e-aws-single-node-upgrade/gather-must-gather/artifacts/event-filter.html

22:58:35 image-registry-57dd7f8-b2mqb Add eth0 [10.128.0.60/23] from ovn-kubernetes
23:19:23 installer-8-ip-10-0-186-38.ec2.internal Add eth0 [10.128.0.60/23] from ovn-kubernetes

So to me it looks like the registry pod was given 10.128.0.60 when it started around 22:58, and it was still live at 23:19 when the controller manager pod was given the same IP.
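The overlap inferred above can be checked mechanically: the registry pod held the IP from its eth0 Add event until its deletion timestamp, and the installer pod's Add event falls inside that window. A small sketch using the timestamps copied from the data above:

```python
from datetime import datetime


def parse(ts):
    """Parse a Kubernetes-style RFC 3339 UTC timestamp."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


# Timestamps from the events and pod data above.
registry_ip_assigned  = parse("2022-04-27T22:58:35Z")
registry_deleted      = parse("2022-04-28T00:54:44Z")
installer_ip_assigned = parse("2022-04-27T23:19:23Z")

# The installer pod received 10.128.0.60 while the registry pod
# still held it, confirming the duplicate assignment.
overlap = registry_ip_assigned <= installer_ip_assigned <= registry_deleted
print(overlap)  # → True
```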
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069