Created attachment 1864996 [details]
Events showing both pods scheduled on same node

While debugging a disruption test failure, "[sig-imageregistry] Image registry remains available using new connections", we noticed that the two ingress pods were scheduled on the same node. A couple of example job runs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1500434000819261440
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1500433998298484736

See the attached screenshot of the events showing that the two pods were scheduled on the same node.

Ingress has an anti-affinity rule that should prevent this:

https://github.com/openshift/cluster-ingress-operator/blob/5040f65551851b3ee284f0803bfdd1c64631c4c6/pkg/operator/controller/ingress/deployment.go#L337-L357

But somehow the pods ended up on the same node.
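For anyone reproducing this, one way to see what anti-affinity rule actually landed on the router deployment in a live cluster (this is just a generic inspection command, not something from the CI artifacts) is:

$ oc -n openshift-ingress get deployment/router-default -o json \
    | jq '.spec.template.spec.affinity.podAntiAffinity'

Comparing that output against the rule built in the deployment.go lines linked above would show what topology key and label selector were in effect when the two pods were scheduled.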
Ken, I cannot find the events from your screenshot in the linked CI job.
Yeah, the screenshot doesn't match, but the events show the bug pretty clearly: `router-default-79dfc95ff7-wtzl6` and `router-default-79dfc95ff7-f96fj` in the linked run.
Details from the events, with the pods David points out in comment 2 both getting scheduled to the same node in the same second:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1500434000819261440/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-ingress" and (.reason == "Scheduled" or .reason == "Killing")) | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | sort
2022-03-06T11:50:02Z null router-default-79dfc95ff7-f96fj Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-f96fj to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T11:50:02Z null router-default-79dfc95ff7-wtzl6 Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-wtzl6 to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T12:26:05Z null router-default-79dfc95ff7-27b2v Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-27b2v to ip-10-0-225-12.us-west-1.compute.internal
2022-03-06T12:26:06Z 1 router-default-79dfc95ff7-wtzl6 Killing: Stopping container router
2022-03-06T12:26:15Z 1 router-default-79dfc95ff7-f96fj Killing: Stopping container router
2022-03-06T12:26:15Z null router-default-79dfc95ff7-ltwrt Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-ltwrt to ip-10-0-155-172.us-west-1.compute.internal
2022-03-06T12:29:40Z 1 router-default-79dfc95ff7-27b2v Killing: Stopping container router
2022-03-06T12:29:40Z null router-default-79dfc95ff7-t6d8g Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-t6d8g to ip-10-0-129-93.us-west-1.compute.internal
2022-03-06T12:33:05Z null router-default-79dfc95ff7-59cgm Scheduled: Successfully assigned openshift-ingress/router-default-79dfc95ff7-59cgm to ip-10-0-225-12.us-west-1.compute.internal
2022-03-06T12:33:08Z 2 router-default-79dfc95ff7-ltwrt Killing: Stopping container router
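A variant of the same query that just counts Scheduled events per node makes collisions quicker to spot (this is a convenience reworking of the command above, not output from the CI artifacts; counts above one are only suggestive, since pods are legitimately rescheduled during the upgrade, and the timestamps in the full listing are what confirm the same-second collision):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1500434000819261440/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json \
    | jq -r '.items[] | select(.metadata.namespace == "openshift-ingress" and .reason == "Scheduled") | .message' \
    | awk '{print $NF}' | sort | uniq -c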
Moving back to ASSIGNED, per [1], openshift/kubernetes#1210 is a debugging aid and not a fix. [1]: https://github.com/openshift/kubernetes/pull/1210#issuecomment-1068235121
"The openshift-etcd pods should be scheduled on different nodes" appears to be failing 8% of the time on metal OVN. This means that etcd quorum is not protected by the PDB. https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=4.11.*metal.*ovn&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
*** Bug 2080471 has been marked as a duplicate of this bug. ***
Hello Ravi, I tried to verify the issue by checking the link at [1]. The only time I see it passing was at [2], i.e., about 44 hours ago; after that I see it failing with the error at [3]. Any idea if we have a bug tracking this? I think we should wait until that issue is fixed. WDYS?

[1] https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1529051880641007616
[3] blob:https://prow.ci.openshift.org/47301f28-6c2f-4a64-9815-031b18b036ab

Thanks,
kasturi
Hi Kasturi, there is an issue with the build controller SA, which should be exempted from pod security; Standa has opened a PR that should solve the problem. I looked at the CI search again, and the failures seem unrelated to the symptom we usually see.
Ravi, yes, agreed. But can we wait until we have the test passing at least a couple of times before the bug is moved to the verified state?
Sure. We should wait until we have a clear signal. No point in rushing to close this BZ.
Hello Ravi, I tried to verify the bug again, but this time I am not sure why it failed. I do see the messages below when checking the logs at [1]; could you please help take a look? Thanks!

{Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success
  name: '[sig-scheduling][Early] The openshift-etcd pods should be scheduled on different nodes [Suite:openshift/conformance/parallel]'
  testsuitename: openshift-tests-upgrade
  summary: 'Passed 2 times, failed 0 times, skipped 0 times: we require at least 3 attempts to have a chance at success'
  passes:
  - jobrunid: "1536907377092071424"
    humanurl: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907377092071424
    gcsartifacturl: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907377092071424/artifacts
  - jobrunid: "1536907379621236736"
    humanurl: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907379621236736
    gcsartifacturl: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-origin-27244-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907379621236736/artifacts
  failures: []
  skips: []
}

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregator-periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade/1536907382129430528

Thanks,
kasturi
Moving this back to the ASSIGNED state because, when I looked at the CI logs below, I still see that the pods got scheduled onto the same node.

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1535199601215148032
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure-modern/1535286255833583616
Those are just launch jobs run by cluster-bot, and when I looked at the failed jobs they appear to be upgrades from 4.10, which does not include the fix. So, moving back to `ON_QA`.

xref: https://coreos.slack.com/archives/C01CQA76KMX/p1655314005818299
Looking at [1], I do see failures, but they are not related to the error originally reported in this bug; based on my observations in the logs, they appear to be environment-related. Based on that, I am moving the bug to VERIFIED.

[1] https://search.ci.openshift.org/?search=The+openshift-etcd+pods+should+be+scheduled+on+different+nodes&maxAge=168h&context=0&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069