Description of problem:
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-vsphere-upi-serial-4.2/47/build-log.txt

Failing tests:

[sig-scheduling] SchedulerPriorities [Serial] Pod should avoid nodes that have avoidPod annotation [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPriorities [Serial] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]

Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190905-002104.xml
error: 2 fail, 45 pass, 177 skip (1h43m46s)
2019/09/05 00:21:05 Container test in pod e2e-vsphere-upi-serial failed, exit code 1, reason Error
2019/09/05 00:27:31 Copied 329.80Mi of artifacts from e2e-vsphere-upi-serial to /logs/artifacts/e2e-vsphere-upi-serial
2019/09/05 00:27:33 Ran for 2h27m23s
error: could not run steps: step e2e-vsphere-upi-serial failed: template pod "e2e-vsphere-upi-serial" failed: the pod ci-op-iqbzrlsz/e2e-vsphere-upi-serial failed after 2h24m51s (failed containers: test): ContainerFailed one or more containers exited

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-04-215255

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Here is another one: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-vsphere-upi-serial-4.2/106
Update: the pod anti-affinity flake may be fixed by the upstream changes in https://github.com/openshift/origin/pull/23805, which still need to be picked (can we move the target of this bug to 4.3?).

The errors in the avoidPod test seem to be timeouts coming from just trying to create the balanced pods on the nodes: https://github.com/openshift/origin/blob/d680d680da/vendor/k8s.io/kubernetes/test/e2e/scheduling/priorities.go#L150. That is odd, so we'll have to keep an eye on it.
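To illustrate what that balanced-pod step is doing, here is a tiny sketch of how I understand the helper sizes the per-node filler pods (bring every node up to the cluster's highest usage fraction before the priority under test is exercised). The function and numbers below are my own illustration, not the vendored code:

package main

import "fmt"

// topUpRequest sketches the balanced-pod idea as I understand it: each node
// gets a filler pod sized to raise its usage to the cluster's highest
// fraction, so that scoring starts from a level playing field. Illustration
// only, not the vendored helper.
func topUpRequest(nodeFraction, maxFraction float64, allocatable int64) int64 {
	if nodeFraction >= maxFraction {
		return 0
	}
	return int64((maxFraction - nodeFraction) * float64(allocatable))
}

func main() {
	// Hypothetical node at 40% CPU usage topped up to a cluster maximum of 90%,
	// with 1500 millicores allocatable.
	fmt.Println(topUpRequest(0.40, 0.90, 1500), "millicores requested for the filler pod")
}

If a node is already heavily utilized there is little headroom left for that filler pod, which may be related to the timeouts seen here.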
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.2/152
Encountered in 4.3: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/939

fail [k8s.io/kubernetes/test/e2e/framework/util.go:1167]: Expected
    <string>: 2lh1htqp-9f2ed-5jn8q-worker-9s7dv
not to equal
    <string>: 2lh1htqp-9f2ed-5jn8q-worker-9s7dv
Failing in the 4.4 informing jobs today. Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1291

fail [k8s.io/kubernetes/test/e2e/scheduling/priorities.go:238]: Expected
    <string>: ci-op-356v08rd-0ba00-mr87q-worker-centralus2-nq7xp
not to equal
    <string>: ci-op-356v08rd-0ba00-mr87q-worker-centralus2-nq7xp
This is one of the top flakes on the test job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-installer-e2e-azure-serial-4.4

Raising severity accordingly.
This is consistently failing:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-installer-e2e-gcp-ovn-4.4
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-installer-e2e-azure-ovn-4.4
Sorry - please ignore the above comment; it was posted to the wrong bugzilla and I'm unable to delete it.
There was a PR open to try to address this by picking upstream changes to this test (https://github.com/openshift/origin/pull/23805). I rebased that PR today and it was automatically closed, indicating that the upstream changes have been pulled into master. This seems to be another scheduler test that doesn't flake upstream but is a problem on our more highly utilized nodes. I will be looking more into it; these are tricky to debug.
I looked more into the failures and think the issue here could be due to how priorities (score plugins) work. They aren't 100% deterministic; they only attempt to score nodes based on conditions in the cluster. IMO, this e2e is probably adequately covered by existing integration tests for this priority.

The schedulers in these tests are initialized with the following configs:

I0317 19:31:11.115464 1 factory.go:219] Creating scheduler with fit predicates 'map[CheckNodeUnschedulable:{} CheckVolumeBinding:{} GeneralPredicates:{} MatchInterPodAffinity:{} MaxAzureDiskVolumeCount:{} MaxCSIVolumeCountPred:{} MaxEBSVolumeCount:{} MaxGCEPDVolumeCount:{} NoDiskConflict:{} NoVolumeZoneConflict:{} PodToleratesNodeTaints:{}]' and priority functions 'map[BalancedResourceAllocation:{} ImageLocalityPriority:{} InterPodAffinityPriority:{} LeastRequestedPriority:{} NodeAffinityPriority:{} NodePreferAvoidPodsPriority:{} SelectorSpreadPriority:{} TaintTolerationPriority:{}]'

So InterPodAffinityPriority is enabled, but I noticed that LeastRequestedPriority is also enabled. This is key because, in each of the CI failures I saw, the pod always landed on the node with the lowest resource consumption (against the anti-affinity rule). Example logs:

===================
STEP: Compute Cpu, Mem Fraction after create balanced pods.
Mar 17 20:04:59.393: INFO: ComputeCPUMemFraction for node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus1-bjgcc
Mar 17 20:04:59.591: INFO: Pod for on the node: a3abb2d5-c75c-4a1f-978b-59f5c7d4b5f3-0, Cpu: 397, Mem: 4870328320
Mar 17 20:04:59.591: INFO: Pod for on the node: pod-with-label-security-s1, Cpu: 100, Mem: 209715200
Mar 17 20:04:59.591: INFO: Pod for on the node: tuned-lgcfz, Cpu: 10, Mem: 52428800
Mar 17 20:04:59.591: INFO: Pod for on the node: dns-default-9mxgr, Cpu: 110, Mem: 283115520
Mar 17 20:04:59.591: INFO: Pod for on the node: image-registry-764ccc8f68-b9b9l, Cpu: 100, Mem: 268435456
Mar 17 20:04:59.591: INFO: Pod for on the node: node-ca-qch42, Cpu: 10, Mem: 10485760
Mar 17 20:04:59.591: INFO: Pod for on the node: migrator-86f4f4c84f-xzzhl, Cpu: 100, Mem: 209715200
Mar 17 20:04:59.591: INFO: Pod for on the node: machine-config-daemon-snlmb, Cpu: 40, Mem: 104857600
Mar 17 20:04:59.591: INFO: Pod for on the node: alertmanager-main-2, Cpu: 210, Mem: 256901120
Mar 17 20:04:59.592: INFO: Pod for on the node: node-exporter-rsqzk, Cpu: 112, Mem: 209715200
Mar 17 20:04:59.592: INFO: Pod for on the node: multus-6666t, Cpu: 10, Mem: 157286400
Mar 17 20:04:59.592: INFO: Pod for on the node: ovs-6zp5t, Cpu: 200, Mem: 419430400
Mar 17 20:04:59.592: INFO: Pod for on the node: sdn-4m5s6, Cpu: 100, Mem: 209715200
Mar 17 20:04:59.592: INFO: Node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus1-bjgcc, totalRequestedCPUResource: 1499, cpuAllocatableMil: 1500, cpuFraction: 0.9993333333333333
Mar 17 20:04:59.592: INFO: Node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus1-bjgcc, totalRequestedMemResource: 7157272576, memAllocatableVal: 7157272576, memFraction: 1
STEP: Compute Cpu, Mem Fraction after create balanced pods.
Mar 17 20:04:59.592: INFO: ComputeCPUMemFraction for node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus2-tqjz7
Mar 17 20:04:59.702: INFO: Pod for on the node: 5466f525-8be3-45e6-b1d1-1f55a9824914-0, Cpu: 0, Mem: 3167440896
Mar 17 20:04:59.702: INFO: Pod for on the node: tuned-8hnm6, Cpu: 10, Mem: 52428800
Mar 17 20:04:59.702: INFO: Pod for on the node: csi-snapshot-controller-operator-5d8f7747b5-4vxnf, Cpu: 10, Mem: 52428800
Mar 17 20:04:59.702: INFO: Pod for on the node: dns-default-2wlr5, Cpu: 110, Mem: 283115520
Mar 17 20:04:59.702: INFO: Pod for on the node: node-ca-9vqmm, Cpu: 10, Mem: 10485760
Mar 17 20:04:59.702: INFO: Pod for on the node: router-default-ddd5dcd96-pt5wc, Cpu: 100, Mem: 268435456
Mar 17 20:04:59.702: INFO: Pod for on the node: machine-config-daemon-knshh, Cpu: 40, Mem: 104857600
Mar 17 20:04:59.702: INFO: Pod for on the node: certified-operators-694d5468cc-mv8nl, Cpu: 10, Mem: 104857600
Mar 17 20:04:59.702: INFO: Pod for on the node: community-operators-6f8f9f4d4c-l4cvm, Cpu: 10, Mem: 104857600
Mar 17 20:04:59.702: INFO: Pod for on the node: redhat-marketplace-56f5879f57-zmcv6, Cpu: 10, Mem: 104857600
Mar 17 20:04:59.702: INFO: Pod for on the node: redhat-operators-654574895d-9sjpp, Cpu: 10, Mem: 104857600
Mar 17 20:04:59.702: INFO: Pod for on the node: alertmanager-main-1, Cpu: 210, Mem: 256901120
Mar 17 20:04:59.702: INFO: Pod for on the node: grafana-5cd66d9fd7-t8wkk, Cpu: 110, Mem: 125829120
Mar 17 20:04:59.702: INFO: Pod for on the node: node-exporter-l7lfh, Cpu: 112, Mem: 209715200
Mar 17 20:04:59.702: INFO: Pod for on the node: prometheus-adapter-8449dcccbd-rdbtc, Cpu: 10, Mem: 20971520
Mar 17 20:04:59.702: INFO: Pod for on the node: prometheus-k8s-0, Cpu: 480, Mem: 1293942784
Mar 17 20:04:59.702: INFO: Pod for on the node: multus-xdp86, Cpu: 10, Mem: 157286400
Mar 17 20:04:59.702: INFO: Pod for on the node: ovs-tcz96, Cpu: 200, Mem: 419430400
Mar 17 20:04:59.702: INFO: Pod for on the node: sdn-m9r77, Cpu: 100, Mem: 209715200
Mar 17 20:04:59.702: INFO: Node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus2-tqjz7, totalRequestedCPUResource: 1652, cpuAllocatableMil: 1500, cpuFraction: 1
Mar 17 20:04:59.702: INFO: Node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus2-tqjz7, totalRequestedMemResource: 7157272576, memAllocatableVal: 7157272576, memFraction: 1
STEP: Compute Cpu, Mem Fraction after create balanced pods.
Mar 17 20:04:59.702: INFO: ComputeCPUMemFraction for node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus3-j8lvw
Mar 17 20:04:59.780: INFO: Pod for on the node: d8787a3a-24b0-43c8-9a05-d6c3ad00d4e1-0, Cpu: 0, Mem: 2947239936
Mar 17 20:04:59.780: INFO: Pod for on the node: tuned-skmh9, Cpu: 10, Mem: 52428800
Mar 17 20:04:59.780: INFO: Pod for on the node: csi-snapshot-controller-548f84b6c6-zbmtq, Cpu: 10, Mem: 52428800
Mar 17 20:04:59.780: INFO: Pod for on the node: dns-default-w5t4x, Cpu: 110, Mem: 283115520
Mar 17 20:04:59.780: INFO: Pod for on the node: node-ca-65667, Cpu: 10, Mem: 10485760
Mar 17 20:04:59.780: INFO: Pod for on the node: router-default-ddd5dcd96-ggsnm, Cpu: 100, Mem: 268435456
Mar 17 20:04:59.780: INFO: Pod for on the node: machine-config-daemon-njbkc, Cpu: 40, Mem: 104857600
Mar 17 20:04:59.780: INFO: Pod for on the node: alertmanager-main-0, Cpu: 210, Mem: 256901120
Mar 17 20:04:59.780: INFO: Pod for on the node: kube-state-metrics-588f488fd4-28btj, Cpu: 30, Mem: 125829120
Mar 17 20:04:59.780: INFO: Pod for on the node: node-exporter-xplrp, Cpu: 112, Mem: 209715200
Mar 17 20:04:59.780: INFO: Pod for on the node: openshift-state-metrics-598766fdd8-bt2kd, Cpu: 120, Mem: 199229440
Mar 17 20:04:59.780: INFO: Pod for on the node: prometheus-adapter-8449dcccbd-zndcl, Cpu: 10, Mem: 20971520
Mar 17 20:04:59.780: INFO: Pod for on the node: prometheus-k8s-1, Cpu: 480, Mem: 1293942784
Mar 17 20:04:59.780: INFO: Pod for on the node: telemeter-client-6df759b4b7-w69k7, Cpu: 210, Mem: 440401920
Mar 17 20:04:59.780: INFO: Pod for on the node: multus-rckkf, Cpu: 10, Mem: 157286400
Mar 17 20:04:59.780: INFO: Pod for on the node: ovs-jwqpf, Cpu: 200, Mem: 419430400
Mar 17 20:04:59.780: INFO: Pod for on the node: sdn-7bfz2, Cpu: 100, Mem: 209715200
Mar 17 20:04:59.780: INFO: Node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus3-j8lvw, totalRequestedCPUResource: 1862, cpuAllocatableMil: 1500, cpuFraction: 1
Mar 17 20:04:59.780: INFO: Node: ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus3-j8lvw, totalRequestedMemResource: 7157272576, memAllocatableVal: 7157272576, memFraction: 1
====================

You can see that the lowest-requested node is ci-op-w7mkl8jd-0ba00-r9gqz-worker-centralus1-bjgcc, and you can also see that pod-with-label-security-s1 is on that node. This is ultimately where our test pod with the anti-affinity for that label ends up, I believe because its resource requests outweigh its anti-affinity for the one pod already on the node.

I updated the PR above (https://github.com/openshift/origin/pull/23805) to increase the anti-affinity weight in the test to the maximum (100), which I believe may help but will not permanently solve this problem. To solve it, I would like to either evaluate the requests on these nodes and reduce them where possible, or remove this test from e2e entirely.
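For reference, here is a minimal sketch of what a maximum-weight soft anti-affinity term looks like when built with the core/v1 API types, assuming the existing pod is labeled security=S1 as its name suggests. This illustrates the approach; it is not the exact PR diff:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// maxWeightAntiAffinity builds a preferred (soft) anti-affinity term against
// pods labeled security=S1, using the maximum allowed weight of 100.
func maxWeightAntiAffinity() *v1.Affinity {
	return &v1.Affinity{
		PodAntiAffinity: &v1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []v1.WeightedPodAffinityTerm{
				{
					Weight: 100, // valid range is 1-100
					PodAffinityTerm: v1.PodAffinityTerm{
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{"security": "S1"},
						},
						TopologyKey: "kubernetes.io/hostname",
					},
				},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", maxWeightAntiAffinity())
}

Because the term is preferred rather than required, it only contributes to scoring alongside the other priorities, which matches the caveat above that the higher weight may help but is not a guarantee.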
A similar case is showing up in release 4.3 as well.

Prow link: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/1163

Error message: [sig-scheduling] SchedulerPriorities [Serial] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]
The problem here is similar to the one noted in comment 10, namely that the nodes are already overcommitted. Samples from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/1445/build-log.txt:

Apr 9 17:02:52.945: INFO: ComputeCPUMemFraction for node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus1-9dzxr
Apr 9 17:02:53.141: INFO: Pod for on the node: 9c599173-67e4-41cf-9628-7d18841f9c04-0, Cpu: 397, Mem: 4870328320
Apr 9 17:02:53.141: INFO: Pod for on the node: pod-with-label-security-s1, Cpu: 100, Mem: 209715200
Apr 9 17:02:53.141: INFO: Pod for on the node: tuned-jfn4b, Cpu: 10, Mem: 52428800
Apr 9 17:02:53.141: INFO: Pod for on the node: dns-default-j44v6, Cpu: 110, Mem: 283115520
Apr 9 17:02:53.141: INFO: Pod for on the node: node-ca-7bc8v, Cpu: 10, Mem: 10485760
Apr 9 17:02:53.141: INFO: Pod for on the node: router-default-865d6d797b-wqtkt, Cpu: 100, Mem: 268435456
Apr 9 17:02:53.141: INFO: Pod for on the node: migrator-64dc564487-fvhfw, Cpu: 100, Mem: 209715200
Apr 9 17:02:53.141: INFO: Pod for on the node: machine-config-daemon-bkzr7, Cpu: 40, Mem: 104857600
Apr 9 17:02:53.141: INFO: Pod for on the node: alertmanager-main-2, Cpu: 210, Mem: 256901120
Apr 9 17:02:53.141: INFO: Pod for on the node: node-exporter-cdkgd, Cpu: 112, Mem: 209715200
Apr 9 17:02:53.141: INFO: Pod for on the node: multus-mxdvh, Cpu: 10, Mem: 157286400
Apr 9 17:02:53.141: INFO: Pod for on the node: ovs-4bpv9, Cpu: 200, Mem: 419430400
Apr 9 17:02:53.141: INFO: Pod for on the node: sdn-bmbff, Cpu: 100, Mem: 209715200
Apr 9 17:02:53.141: INFO: Node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus1-9dzxr, totalRequestedCPUResource: 1499, cpuAllocatableMil: 1500, cpuFraction: 0.9993333333333333
Apr 9 17:02:53.141: INFO: Node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus1-9dzxr, totalRequestedMemResource: 7157272576, memAllocatableVal: 7157272576, memFraction: 1
STEP: Compute Cpu, Mem Fraction after create balanced pods.
Apr 9 17:02:53.141: INFO: ComputeCPUMemFraction for node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus2-z7lmz
Apr 9 17:02:53.257: INFO: Pod for on the node: 9e298f73-555c-45ed-914d-e9d7c1b03955-0, Cpu: 0, Mem: 3215675392
Apr 9 17:02:53.257: INFO: Pod for on the node: tuned-295cv, Cpu: 10, Mem: 52428800
Apr 9 17:02:53.257: INFO: Pod for on the node: csi-snapshot-controller-operator-5c49f4d4cc-xrjfs, Cpu: 10, Mem: 52428800
Apr 9 17:02:53.257: INFO: Pod for on the node: dns-default-h6tq8, Cpu: 110, Mem: 283115520
Apr 9 17:02:53.257: INFO: Pod for on the node: node-ca-f69g4, Cpu: 10, Mem: 10485760
Apr 9 17:02:53.257: INFO: Pod for on the node: machine-config-daemon-bgfjx, Cpu: 40, Mem: 104857600
Apr 9 17:02:53.257: INFO: Pod for on the node: certified-operators-6c84647684-vnzcm, Cpu: 10, Mem: 104857600
Apr 9 17:02:53.257: INFO: Pod for on the node: community-operators-6b678dd54-7bp89, Cpu: 10, Mem: 104857600
Apr 9 17:02:53.257: INFO: Pod for on the node: redhat-operators-6cfdb6f49-7wd46, Cpu: 10, Mem: 104857600
Apr 9 17:02:53.257: INFO: Pod for on the node: alertmanager-main-0, Cpu: 210, Mem: 256901120
Apr 9 17:02:53.257: INFO: Pod for on the node: grafana-6b8b9b4d89-5tn76, Cpu: 110, Mem: 125829120
Apr 9 17:02:53.257: INFO: Pod for on the node: kube-state-metrics-564646876-f64mz, Cpu: 30, Mem: 125829120
Apr 9 17:02:53.257: INFO: Pod for on the node: node-exporter-tw9hf, Cpu: 112, Mem: 209715200
Apr 9 17:02:53.257: INFO: Pod for on the node: openshift-state-metrics-6cbf5bcf54-w45k9, Cpu: 120, Mem: 199229440
Apr 9 17:02:53.257: INFO: Pod for on the node: prometheus-adapter-fb66dcf9f-czcbm, Cpu: 10, Mem: 20971520
Apr 9 17:02:53.257: INFO: Pod for on the node: prometheus-k8s-1, Cpu: 480, Mem: 1293942784
Apr 9 17:02:53.257: INFO: Pod for on the node: multus-rqsc2, Cpu: 10, Mem: 157286400
Apr 9 17:02:53.257: INFO: Pod for on the node: ovs-5vd6n, Cpu: 200, Mem: 419430400
Apr 9 17:02:53.257: INFO: Pod for on the node: sdn-vkjff, Cpu: 100, Mem: 209715200
Apr 9 17:02:53.257: INFO: Node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus2-z7lmz, totalRequestedCPUResource: 1692, cpuAllocatableMil: 1500, cpuFraction: 1
Apr 9 17:02:53.257: INFO: Node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus2-z7lmz, totalRequestedMemResource: 7157272576, memAllocatableVal: 7157272576, memFraction: 1
STEP: Compute Cpu, Mem Fraction after create balanced pods.
Apr 9 17:02:53.257: INFO: ComputeCPUMemFraction for node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus3-vhjnm
Apr 9 17:02:53.338: INFO: Pod for on the node: 112c5b65-d773-4255-b25a-ff4c7606c176-0, Cpu: 0, Mem: 2898997248
Apr 9 17:02:53.338: INFO: Pod for on the node: tuned-l64jl, Cpu: 10, Mem: 52428800
Apr 9 17:02:53.338: INFO: Pod for on the node: csi-snapshot-controller-8c54cd5d9-zrxm7, Cpu: 10, Mem: 52428800
Apr 9 17:02:53.338: INFO: Pod for on the node: dns-default-vxd5m, Cpu: 110, Mem: 283115520
Apr 9 17:02:53.338: INFO: Pod for on the node: image-registry-55ccb977d4-kzltv, Cpu: 100, Mem: 268435456
Apr 9 17:02:53.338: INFO: Pod for on the node: node-ca-tzwgg, Cpu: 10, Mem: 10485760
Apr 9 17:02:53.338: INFO: Pod for on the node: router-default-865d6d797b-vgjzc, Cpu: 100, Mem: 268435456
Apr 9 17:02:53.338: INFO: Pod for on the node: machine-config-daemon-m9l5s, Cpu: 40, Mem: 104857600
Apr 9 17:02:53.338: INFO: Pod for on the node: redhat-marketplace-68856bd858-bqxd7, Cpu: 10, Mem: 104857600
Apr 9 17:02:53.338: INFO: Pod for on the node: alertmanager-main-1, Cpu: 210, Mem: 256901120
Apr 9 17:02:53.338: INFO: Pod for on the node: node-exporter-zsmvw, Cpu: 112, Mem: 209715200
Apr 9 17:02:53.338: INFO: Pod for on the node: prometheus-adapter-fb66dcf9f-ts78w, Cpu: 10, Mem: 20971520
Apr 9 17:02:53.338: INFO: Pod for on the node: prometheus-k8s-0, Cpu: 480, Mem: 1293942784
Apr 9 17:02:53.338: INFO: Pod for on the node: telemeter-client-758749469-t5pkv, Cpu: 210, Mem: 440401920
Apr 9 17:02:53.338: INFO: Pod for on the node: multus-wlfh5, Cpu: 10, Mem: 157286400
Apr 9 17:02:53.338: INFO: Pod for on the node: ovs-m7w7n, Cpu: 200, Mem: 419430400
Apr 9 17:02:53.338: INFO: Pod for on the node: sdn-ccc27, Cpu: 100, Mem: 209715200
Apr 9 17:02:53.338: INFO: Node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus3-vhjnm, totalRequestedCPUResource: 1822, cpuAllocatableMil: 1500, cpuFraction: 1
Apr 9 17:02:53.338: INFO: Node: ci-op-r214mv6c-0ba00-pnvvd-worker-centralus3-vhjnm, totalRequestedMemResource: 7157264384, memAllocatableVal: 7157264384, memFraction: 1

The first node (ci-op-r214mv6c-0ba00-pnvvd-worker-centralus1-9dzxr), which happens to be the least crowded one, already runs pod-with-label-security-s1; it is therefore also the node picked for the test pod, and this leads to the failure.
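To make the fractions above easier to read, here is a tiny standalone sketch of how I understand those cpuFraction/memFraction values to be derived (total requests over allocatable, capped at 1). This is my own simplification, not the vendored ComputeCPUMemFraction helper:

package main

import "fmt"

// fraction mirrors how I read the test's math: total requested over node
// allocatable, capped at 1, which appears to be why every overcommitted node
// above simply reports a fraction of 1.
func fraction(requested, allocatable int64) float64 {
	f := float64(requested) / float64(allocatable)
	if f > 1 {
		f = 1
	}
	return f
}

func main() {
	// Values copied from the centralus1-9dzxr node in the log above.
	const (
		requestedCPUMilli   = 1499
		allocatableCPUMilli = 1500
		requestedMemBytes   = 7157272576
		allocatableMemBytes = 7157272576
	)
	fmt.Printf("cpuFraction=%v memFraction=%v\n",
		fraction(requestedCPUMilli, allocatableCPUMilli),
		fraction(requestedMemBytes, allocatableMemBytes))
}

The cap also means the reported numbers hide how far beyond allocatable the requests actually are on the other two nodes.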
We aren't seeing this test flake in the 4.5 suite: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-ocp-installer-e2e-azure-serial-4.5

So, given the resource usage we've analyzed here, we're thinking that backporting this BZ will resolve the problem: https://bugzilla.redhat.com/show_bug.cgi?id=1812583
Based on the previous comment, I'm going to lower the priority of this to high. We need to wait for all the requests-related PRs to merge and then verify this again.
The following backports from https://bugzilla.redhat.com/show_bug.cgi?id=1812583 landed in 4.4:

https://github.com/openshift/cluster-etcd-operator/pull/255
https://github.com/openshift/cluster-kube-apiserver-operator/pull/796
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/377
https://github.com/openshift/cluster-kube-scheduler-operator/pull/228
https://github.com/openshift/cluster-network-operator/pull/531
https://github.com/openshift/cluster-openshift-apiserver-operator/pull/342

These should solve the issue. I'm moving this on for verification.
I don't think this should be verified. It is still a top flake: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-installer-e2e-azure-serial-4.4 It may have been verified in independent testing because running a small cluster for this single test won't have the resource consumption seen in the flakes, and this needs to be verified against the high-usage CI clusters. Is there a way to take this off of verified? Can I just move it back to assigned? BTW: there is an upstream issue which references what may be the issue with this test: https://github.com/kubernetes/kubernetes/issues/88174 - "Increase the weights of scoring plugins that implement explicit preferences expressed in the pod spec"
I am not sure what needs to be done here, but I feel it should not have been moved to the MODIFIED state; maybe we could keep it at POST? I will wait for maszulik to comment on this.
> Is there a way to take this off of verified? Can I just move it back to assigned?

Yeah, I'll move it back to you with the target release set to 4.5. It moved because the related PR merged.
Following up on this, this test no longer appears to be consistently flaking on testgrid. It may be that the backported changes Maciej linked above just took more time than I thought to have an effect, or something else has changed to enable these tests to run more consistently, but because it's not red anymore I think we can go ahead and verify it now.
Moving the bug back to the assigned state because I see that one of the tests passes and the other test always gets skipped for the reason below.

started: (0/89/97) "[sig-scheduling] SchedulerPriorities [Serial] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]"
skip [k8s.io/kubernetes/test/e2e/scheduling/priorities.go:161]: Requires at least 2 nodes (not 0)
skipped: (1m2s) 2020-05-04T06:57:30 "[sig-scheduling] SchedulerPriorities [Serial] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]"

Tried running the same on my local setup and still see the same error.

[ramakasturinarra@dhcp35-60 origin]$ ./_output/local/bin/linux/amd64/openshift-tests run-test "[sig-scheduling] SchedulerPriorities [Serial] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]"
I0504 19:52:36.520035 28074 test_context.go:423] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
May 4 19:52:36.562: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
May 4 19:52:36.835: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
May 4 19:52:37.750: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
May 4 19:52:37.750: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
May 4 19:52:37.750: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
May 4 19:52:38.079: INFO: e2e test version: v1.18.0-rc.1
May 4 19:52:38.358: INFO: kube-apiserver version: v1.18.0-rc.1
May 4 19:52:38.669: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/framework.go:1413
[BeforeEach] [Top Level]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/framework.go:1413
[BeforeEach] [Top Level]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:58
[BeforeEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:178
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename sched-priority
May 4 19:52:39.589: INFO: About to run a Kube e2e test, ensuring namespace is privileged
May 4 19:52:42.351: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/priorities.go:140
May 4 19:52:42.659: INFO: Waiting up to 1m0s for all nodes to be ready
May 4 19:53:45.944: INFO: Waiting for terminating namespaces to be deleted...
May 4 19:53:46.170: INFO: Waiting up to 5m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
May 4 19:53:46.883: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
May 4 19:53:46.883: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
[It] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/priorities.go:159
May 4 19:53:46.883: INFO: Requires at least 2 nodes (not 0)
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:179
May 4 19:53:46.883: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sched-priority-4363" for this suite.
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/priorities.go:137
May 4 19:53:47.337: INFO: Running AfterSuite actions on all nodes
May 4 19:53:47.337: INFO: Running AfterSuite actions on node 1
skip [k8s.io/kubernetes/test/e2e/scheduling/priorities.go:161]: Requires at least 2 nodes (not 0)

[ramakasturinarra@dhcp35-60 origin]$ ./_output/local/bin/linux/amd64/openshift-tests run-test "[sig-scheduling] SchedulerPriorities [Serial] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]"
I0504 19:54:02.332074 28252 test_context.go:423] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
May 4 19:54:02.452: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
May 4 19:54:03.459: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
May 4 19:54:04.381: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
May 4 19:54:04.381: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
May 4 19:54:04.381: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
May 4 19:54:04.690: INFO: e2e test version: v1.18.0-rc.1
May 4 19:54:04.989: INFO: kube-apiserver version: v1.18.0-rc.1
May 4 19:54:05.297: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/framework.go:1413
[BeforeEach] [Top Level]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/framework.go:1413
[BeforeEach] [Top Level]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:58
[BeforeEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:178
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename sched-priority
May 4 19:54:06.127: INFO: About to run a Kube e2e test, ensuring namespace is privileged
May 4 19:54:08.470: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/priorities.go:140
May 4 19:54:08.686: INFO: Waiting up to 1m0s for all nodes to be ready
May 4 19:55:13.310: INFO: Waiting for terminating namespaces to be deleted...
May 4 19:55:13.583: INFO: Waiting up to 5m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
May 4 19:55:14.350: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
May 4 19:55:14.350: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
[It] Pod should be scheduled to node that don't match the PodAntiAffinity terms [Suite:openshift/conformance/serial] [Suite:k8s]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/priorities.go:159
May 4 19:55:14.350: INFO: Requires at least 2 nodes (not 0)
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:179
May 4 19:55:14.351: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sched-priority-6575" for this suite.
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  /home/ramakasturinarra/automation/OpenShift/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/scheduling/priorities.go:137
May 4 19:55:14.802: INFO: Running AfterSuite actions on all nodes
May 4 19:55:14.802: INFO: Running AfterSuite actions on node 1
skip [k8s.io/kubernetes/test/e2e/scheduling/priorities.go:161]: Requires at least 2 nodes (not 0)

Below are the jobs where I see this test getting skipped:
============================================================
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.5/1021
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.5/981
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.5/986
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.5/673
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.5/974
It seems that this test disappeared from the testgrid flakes because it is always skipping now (for some reason it fails the NodeCount check). I opened https://github.com/openshift/origin/pull/24944 to test removing this check, to see whether the test runs and passes without it.
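For context, here is a self-contained sketch of the kind of node-count guard that produces the "Requires at least 2 nodes (not 0)" skip seen above. This is my own reimplementation of the pattern, not the vendored e2e skipper code:

package main

import "fmt"

// skipUnlessNodeCountIsAtLeast mimics the guard at the top of the test: if the
// framework knows about fewer schedulable nodes than required, the test is
// skipped rather than run.
func skipUnlessNodeCountIsAtLeast(have, want int) (bool, string) {
	if have < want {
		return true, fmt.Sprintf("Requires at least %d nodes (not %d)", want, have)
	}
	return false, ""
}

func main() {
	// In the skipped runs above the framework's node count came back as 0,
	// so the guard always tripped and the scheduling logic never ran.
	if skip, reason := skipUnlessNodeCountIsAtLeast(0, 2); skip {
		fmt.Println("skip:", reason)
	}
}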
The above PR fixes that e2eskipper check, so these should start showing up in CI test runs now
Verified that the test has not been skipped for the last two days; I will watch it for a few more days and then move the bug to the verified state.
Moving the bug to verified, as I see that the test passed in today's CI run as well, and running it locally on the cluster also works fine. I do see that in the runs below the test gets skipped, but with no failures I think we can mark the bug verified. It's running now, and passing when it does run, which means it's not flaking (which was the original intent of the BZ).

Runs in which the test gets skipped:
====================================
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.5/1126
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.5/1124
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.5/1114
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.5/1125
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.5/681
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.5/673
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.5/81
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.5/79
I've started seeing this test fail consistently in 4.3.z since Saturday, June 27, in PR https://github.com/openshift/origin/pull/25215

Also seeing hits in other releases via https://search.apps.build01.ci.devcluster.openshift.com/?search=Pod+should+be+scheduled+to+node+that+don%27t+match+the+PodAnti&maxAge=24h&context=2&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

That said, is it possible the upstream k8s change pulled in with this BZ is applicable to 4.3 and would help to at least reduce the flakes? Or was that change k8s version specific?
Figured out the answer to my question in #Comment 32: skipper/skipper.go is not present in https://github.com/openshift/origin/tree/release-4.3/vendor/k8s.io/kubernetes/test/e2e/framework so I will open a separate 4.3.x bug for the consistent failures of this test seen there.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409