Bug 2062459
Summary: Ingress pods scheduled on the same node
Product: OpenShift Container Platform
Reporter: Ken Zhang <kenzhang>
Component: kube-scheduler
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: high
Priority: unspecified
Version: 4.10
CC: akaris, aos-bugs, cblecker, deads, dgoodwin, jchaloup, mfojtik, sippy, wking
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Environment: [sig-scheduling][Early] The HAProxy router pods should be scheduled on different nodes [Suite:openshift/conformance/parallel]
Last Closed: 2022-08-10 10:53:13 UTC
Type: Bug
Bug Blocks: 2089336
Description (Ken Zhang, 2022-03-09 19:42:12 UTC)
Ken, I cannot find the events from your screenshot in the linked CI job.

Yeah, the screenshot doesn't match, but the events show the bug pretty clearly: `router-default-79dfc95ff7-wtzl6` and `router-default-79dfc95ff7-f96fj` in the linked run.

Details from the events, with the pods David points out in comment 2 both getting scheduled to the same node in the same second:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1500434000819261440/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-ingress" and (.reason == "Scheduled" or .reason == "Killing")) | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
2022-03-06T11:50:02Z null router-default-79dfc95ff7-f96fj Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-f96fj to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T11:50:02Z null router-default-79dfc95ff7-wtzl6 Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-wtzl6 to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T12:26:05Z null router-default-79dfc95ff7-27b2v Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-27b2v to ip-10-0-225-12.us-west-1.compute.internal
2022-03-06T12:26:06Z 1 router-default-79dfc95ff7-wtzl6 Killing: Stopping container router
2022-03-06T12:26:15Z 1 router-default-79dfc95ff7-f96fj Killing: Stopping container router
2022-03-06T12:26:15Z null router-default-79dfc95ff7-ltwrt Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-ltwrt to ip-10-0-155-172.us-west-1.compute.internal
2022-03-06T12:29:40Z 1 router-default-79dfc95ff7-27b2v Killing: Stopping container router
2022-03-06T12:29:40Z null router-default-79dfc95ff7-t6d8g Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-t6d8g to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T12:33:05Z null router-default-79dfc95ff7-59cgm Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-59cgm to ip-10-0-225-12.us-west-1.compute.internal
2022-03-06T12:33:08Z 2 router-default-79dfc95ff7-ltwrt Killing: Stopping container router
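A minimal sketch for spotting the same symptom on a live cluster instead of CI artifacts, assuming only the openshift-ingress namespace used in the events above: count how many ingress pods land on each node.

$ oc get pods -n openshift-ingress -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
    | sort | uniq -c
# any count greater than 1 means two or more ingress pods share a node

Against the events above, ip-10-0-129-93.us-west-1.compute.internal would show a count of 2.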
Moving back to ASSIGNED: per [1], openshift/kubernetes#1210 is a debugging aid and not a fix.

[1]: https://github.com/openshift/kubernetes/pull/1210#issuecomment-1068235121

"The openshift-etcd pods should be scheduled on different nodes" appears to be failing 8% of the time on metal OVN. This means that etcd quorum is not protected by the PDB.

https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=4.11.*metal.*ovn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

*** Bug 2080471 has been marked as a duplicate of this bug. ***

Hello Ravi, I tried to verify the issue by checking the link below [1]. The only time I see it passing was at [2], i.e. 44 hours ago, but after that I see it failing with the error at [3]. Any idea if we have a bug tracking this? I think we should wait until this issue is fixed. WDYS?

[1] https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1529051880641007616
[3] blob:https://prow.ci.openshift.org/47301f28-6c2f-4a64-9815-031b18b036ab

Thanks, kasturi

Hi Kasturi, there is an issue with the build controller SA, which should be exempted from pod security; Standa has opened a PR and it should solve the problem. I looked at the CI search again and the failures seem unrelated to the symptom we usually see.

Ravi, yes, agreed. But can we wait until we have the test passing at least a couple of times before the bug is moved to the verified state?

Sure. We should wait till we have a clear signal. No point in rushing to close this BZ.

Hello Ravi, I tried to verify the bug again. This time I am not sure of the reason it failed, but I do see the messages below when checking the logs at [1]; could you please help take a look? Thanks!

{Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success
 name: '[sig-scheduling][Early] The openshift-etcd pods should be scheduled on different nodes [Suite:openshift/conformance/parallel]'
 testsuitename: openshift-tests-upgrade
 summary: 'Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success'
 passes:
 - jobrunid: "1536907377092071424"
   humanurl: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907377092071424
   gcsartifacturl: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907377092071424/artifacts
 - jobrunid: "1536907379621236736"
   humanurl: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907379621236736
   gcsartifacturl: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907379621236736/artifacts
 failures: []
 skips: []
}

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907382129430528

Thanks, kasturi

Moving the test back to the assigned state because, when I looked at the CI logs, I still see that the pod got scheduled onto the same node.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1535199601215148032
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1535286255833583616

That test is just for launch jobs by cluster bot, and when I looked at the failed jobs they seem to be upgrades from 4.10, which doesn't include the fix. So, moving back to `ON_QA`.

xref: https://coreos.slack.com/archives/C01CQA76KMX/p1655314005818299
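When triaging whether the co-scheduling comes from the scheduler itself or from missing constraints on the workload, it can help to dump whatever spread rules the router Deployment declares. A hedged inspection sketch follows; the name router-default is taken from the pod names in the events above, and which of these fields the ingress operator actually sets can vary by release.

$ oc -n openshift-ingress get deployment router-default -o json \
    | jq '.spec.template.spec | {affinity, topologySpreadConstraints}'
# prints the podAntiAffinity and topology spread rules, if any, that the scheduler is expected to honor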
Looking at [1], I do see there are failures, but they are not related to the actual error originally reported in this bug; it is something to do with the environment (based on my observations in the log). Based on that, moving the test to verified.

[1] https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069