Bug 2062459
Summary: Ingress pods scheduled on the same node
Product: OpenShift Container Platform
Reporter: Ken Zhang <kenzhang>
Component: kube-scheduler
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: high
Priority: unspecified
Version: 4.10
CC: akaris, aos-bugs, cblecker, deads, dgoodwin, jchaloup, mfojtik, sippy, wking
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Environment: [sig-scheduling][Early] The HAProxy router pods should be scheduled on different nodes [Suite:openshift/conformance/parallel]
Last Closed: 2022-08-10 10:53:13 UTC
Type: Bug
Bug Blocks: 2089336
Description (Ken Zhang, 2022-03-09 19:42:12 UTC)
Ken, I cannot find the events from your screenshot in the linked CI job.

Yeah, the screenshot doesn't match, but the events show the bug pretty clearly: `router-default-79dfc95ff7-wtzl6` and `router-default-79dfc95ff7-f96fj` in the linked run.

Details from the events, with the pods David points out in comment 2 both getting scheduled to the same node in the same second:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1500434000819261440/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-ingress" and (.reason == "Scheduled" or .reason == "Killing")) | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
2022-03-06T11:50:02Z null router-default-79dfc95ff7-f96fj Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-f96fj to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T11:50:02Z null router-default-79dfc95ff7-wtzl6 Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-wtzl6 to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T12:26:05Z null router-default-79dfc95ff7-27b2v Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-27b2v to ip-10-0-225-12.us-west-1.compute.internal
2022-03-06T12:26:06Z 1 router-default-79dfc95ff7-wtzl6 Killing: Stopping container router
2022-03-06T12:26:15Z 1 router-default-79dfc95ff7-f96fj Killing: Stopping container router
2022-03-06T12:26:15Z null router-default-79dfc95ff7-ltwrt Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-ltwrt to ip-10-0-155-172.us-west-1.compute.internal
2022-03-06T12:29:40Z 1 router-default-79dfc95ff7-27b2v Killing: Stopping container router
2022-03-06T12:29:40Z null router-default-79dfc95ff7-t6d8g Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-t6d8g to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T12:33:05Z null router-default-79dfc95ff7-59cgm Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-59cgm to ip-10-0-225-12.us-west-1.compute.internal
2022-03-06T12:33:08Z 2 router-default-79dfc95ff7-ltwrt Killing: Stopping container router
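A minimal sketch for spotting the same symptom on a live cluster instead of CI artifacts, assuming only the openshift-ingress namespace used in the events above: count how many ingress pods land on each node.

$ oc get pods -n openshift-ingress -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
    | sort | uniq -c
# any count greater than 1 means two or more ingress pods share a node

Against the events above, ip-10-0-129-93.us-west-1.compute.internal would show a count of 2.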
Moving back to ASSIGNED: per [1], openshift/kubernetes#1210 is a debugging aid and not a fix.

[1]: https://github.com/openshift/kubernetes/pull/1210#issuecomment-1068235121

"The openshift-etcd pods should be scheduled on different nodes" appears to be failing 8% of the time on metal OVN. This means that etcd quorum is not protected by the PDB.

https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=4.11.*metal.*ovn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

*** Bug 2080471 has been marked as a duplicate of this bug. ***

Hello Ravi, I tried to verify the issue by checking the link below [1]. The only time I see it passing was at [2], i.e. 44 hours ago, but after that I see it failing with the error at [3]. Any idea if we have a bug tracking this? I think we should wait until this issue is fixed. WDYS?

[1] https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1529051880641007616
[3] blob:https://prow.ci.openshift.org/47301f28-6c2f-4a64-9815-031b18b036ab

Thanks, kasturi

Hi Kasturi, there is an issue with the build controller SA, which should be exempted from pod security; Standa has opened a PR and it should solve the problem. I looked at the CI search again and the failures seem unrelated to the symptom we usually see.

Ravi, yes, agreed. But can we wait until we have the test passing at least a couple of times before the bug is moved to the verified state?

Sure. We should wait till we have a clear signal. No point in rushing to close this BZ.

Hello Ravi, I tried to verify the bug again. This time I am not sure of the reason it failed, but I do see the messages below when checking the logs at [1]; could you please help take a look? Thanks!

{Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success
 name: '[sig-scheduling][Early] The openshift-etcd pods should be scheduled on different nodes [Suite:openshift/conformance/parallel]'
 testsuitename: openshift-tests-upgrade
 summary: 'Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success'
 passes:
 - jobrunid: "1536907377092071424"
   humanurl: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907377092071424
   gcsartifacturl: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907377092071424/artifacts
 - jobrunid: "1536907379621236736"
   humanurl: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907379621236736
   gcsartifacturl: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907379621236736/artifacts
 failures: []
 skips: []
}

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907382129430528

Thanks, kasturi

Moving the test back to the assigned state because, when I looked at the CI logs, I still see that the pod got scheduled onto the same node.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1535199601215148032
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1535286255833583616

That test is just for launch jobs by cluster bot, and when I looked at the failed jobs they seem to be upgrades from 4.10, which doesn't include the fix. So, moving back to `ON_QA`.

xref: https://coreos.slack.com/archives/C01CQA76KMX/p1655314005818299
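When triaging whether the co-scheduling comes from the scheduler itself or from missing constraints on the workload, it can help to dump whatever spread rules the router Deployment declares. A hedged inspection sketch follows; the name router-default is taken from the pod names in the events above, and which of these fields the ingress operator actually sets can vary by release.

$ oc -n openshift-ingress get deployment router-default -o json \
    | jq '.spec.template.spec | {affinity, topologySpreadConstraints}'
# prints the podAntiAffinity and topology spread rules, if any, that the scheduler is expected to honor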
Looking at [1], I do see there are failures, but they are not related to the actual error originally reported in this bug; it is something to do with the environment (based on my observations in the log). Based on that, moving the test to verified.

[1] https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069