Description of problem:

We hit this issue in CI:

~~~
{ fail [github.com/openshift/origin/test/extended/scheduling/pods.go:151]: Apr 11 06:38:49.338: ns/openshift-etcd pod etcd-quorum-guard-7cbbc8db97-d9bv8 and pod etcd-quorum-guard-7cbbc8db97-5pfvx are running on the same node: ip-10-0-183-222.us-east-2.compute.internal}
~~~

Here's the situation from the must-gather:

~~~
[akaris@linux sdn2958]$ omg get pods -A -o wide | grep etcd
openshift-etcd   etcd-ip-10-0-140-126.us-east-2.compute.internal   4/4   Running   0   55m     10.0.140.126   ip-10-0-140-126.us-east-2.compute.internal
openshift-etcd   etcd-ip-10-0-183-222.us-east-2.compute.internal   4/4   Running   0   57m     10.0.183.222   ip-10-0-183-222.us-east-2.compute.internal
openshift-etcd   etcd-ip-10-0-247-37.us-east-2.compute.internal    4/4   Running   0   54m     10.0.247.37    ip-10-0-247-37.us-east-2.compute.internal
openshift-etcd   etcd-quorum-guard-7cbbc8db97-5pfvx                1/1   Running   0   1h14m   10.0.183.222   ip-10-0-183-222.us-east-2.compute.internal
openshift-etcd   etcd-quorum-guard-7cbbc8db97-d9bv8                1/1   Running   0   1h14m   10.0.183.222   ip-10-0-183-222.us-east-2.compute.internal
openshift-etcd   etcd-quorum-guard-7cbbc8db97-vbjqr                1/1   Running   0   1h14m   10.0.140.126   ip-10-0-140-126.us-east-2.compute.internal
~~~

This looks like a scheduler issue to me. The quorum-guard pods have the following affinity/anti-affinity rules:

~~~
[akaris@linux sdn2958]$ omg get pod -n openshift-etcd etcd-quorum-guard-7cbbc8db97-5pfvx -o yaml | grep -i affinity -A10
        f:affinity:
          .: {}
          f:podAffinity:
            .: {}
            f:requiredDuringSchedulingIgnoredDuringExecution: {}
          f:podAntiAffinity:
            .: {}
            f:requiredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"guard"}:
            .: {}
            f:args: {}
            f:command: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
--
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - etcd
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - etcd-quorum-guard
        topologyKey: kubernetes.io/hostname
  containers:
  - args:
~~~

Cluster version, for reference:

~~~
[akaris@linux sdn2958]$ omg get co kube-apiserver -o yaml | tail
    name: ''
    resource: apirequestcounts
  versions:
  - name: raw-internal
    version: 4.11.0-0.nightly-2022-04-11-055105
  - name: kube-apiserver
    version: 1.23.3
  - name: operator
    version: 4.11.0-0.nightly-2022-04-11-055105
~~~

Both of these pods were scheduled at exactly the same time:

~~~
gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn/1513396005347790848/build-log.txt:Apr 11 06:38:45.719 - 2887s I ns/openshift-etcd pod/etcd-quorum-guard-7cbbc8db97-d9bv8 uid/3cdd1c21-d150-4a08-af4a-2d10ee0bd93f constructed/true reason/Scheduled node/ip-10-0-183-222.us-east-2.compute.internal
gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn/1513396005347790848/build-log.txt:Apr 11 06:38:45.719 - 2887s I ns/openshift-etcd pod/etcd-quorum-guard-7cbbc8db97-5pfvx uid/9248a8ca-8493-423f-80ae-48ec9a66ca9d constructed/true reason/Scheduled node/ip-10-0-183-222.us-east-2.compute.internal
gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn/1513396005347790848/build-log.txt:Apr 11 06:38:46.169 I ns/openshift-etcd pod/etcd-quorum-guard-7cbbc8db97-5pfvx node/ip-10-0-183-222.us-east-2.compute.internal uid/9248a8ca-8493-423f-80ae-48ec9a66ca9d reason/Scheduled node/ip-10-0-183-222.us-east-2.compute.internal
gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn/1513396005347790848/build-log.txt:Apr 11 06:38:46.170 I ns/openshift-etcd pod/etcd-quorum-guard-7cbbc8db97-d9bv8 node/ip-10-0-183-222.us-east-2.compute.internal uid/3cdd1c21-d150-4a08-af4a-2d10ee0bd93f reason/Scheduled node/ip-10-0-183-222.us-east-2.compute.internal
~~~

So I suspect that the scheduler might not be honoring the required podAntiAffinity when two replicas of the same ReplicaSet race through scheduling at the same instant.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-04-11-055105 (kube-apiserver 1.23.3)

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Two etcd-quorum-guard replicas are running on the same node (ip-10-0-183-222.us-east-2.compute.internal), despite the required podAntiAffinity on k8s-app=etcd-quorum-guard with topologyKey kubernetes.io/hostname.

Expected results:
Each etcd-quorum-guard replica is scheduled to a different node, so the anti-affinity rule is never violated.

Additional info:
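For anyone checking a live cluster rather than a CI must-gather, here is a minimal sketch of a check for the violation. It assumes the quorum-guard pods carry the k8s-app=etcd-quorum-guard label used in the podAntiAffinity selector above; any node printed by the awk filter hosts more than one replica:

~~~
# List quorum-guard pods with their nodes and flag any node that hosts
# more than one replica (label selector taken from the anti-affinity rule).
oc get pods -n openshift-etcd -l k8s-app=etcd-quorum-guard \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName --no-headers \
  | awk '{count[$2]++} END {for (n in count) if (count[n] > 1) print n " hosts " count[n] " quorum-guard pods"}'
~~~

Against the state captured in the must-gather above, this would report ip-10-0-183-222.us-east-2.compute.internal hosting 2 quorum-guard pods.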
Dup of https://bugzilla.redhat.com/show_bug.cgi?id=2062459

*** This bug has been marked as a duplicate of bug 2062459 ***
Feel free to reopen if you think otherwise.