Bug 1934085
| Summary: | Scheduling conformance tests failing in a single node cluster | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Chaloupka <jchaloup> |
| Component: | kube-scheduler | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.8 | CC: | aos-bugs, mdame, mfojtik, rfreiman |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:49:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jan Chaloupka
2021-03-02 13:38:59 UTC
[sig-scheduling] SchedulerPreemption [Serial] PodTopologySpread Preemption validates proper pods are preempted [Suite:openshift/conformance/serial] [Suite:k8s]
- requires 2 nodes by default
PodTopologySpread only makes sense with two or more nodes: the feature places pods so as to minimize skew between topology domains, which is not applicable to a single domain.
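To illustrate why the constraint is vacuous on one node, skew can be sketched as the difference between the most- and least-loaded topology domains. This is a simplified model of the upstream calculation, not the test's actual code; the function name is ours:

```python
# Simplified model of PodTopologySpread skew: the difference between the
# most- and least-populated topology domains (here, nodes).
from collections import Counter


def skew(pod_assignments):
    """pod_assignments: list of topology-domain names, one entry per pod."""
    counts = Counter(pod_assignments)
    return max(counts.values()) - min(counts.values())


# With two nodes (two domains), placement changes the skew, so MaxSkew=1
# is a meaningful constraint for the scheduler to enforce.
assert skew(["node-a", "node-a", "node-b"]) == 1

# With a single node there is exactly one domain: skew is always 0, the
# constraint can never be violated, and the test has nothing to verify.
assert skew(["node-a"] * 4) == 0
```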
[sig-scheduling] SchedulerPredicates [Serial] validates that there is no conflict between pods with same hostPort but different hostIP and protocol [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
```
Feb 25 16:22:30.656: INFO: At 2021-02-25 16:21:36 +0000 UTC - event for without-label: {kubelet ip-10-0-183-130.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_without-label_e2e-sched-pred-8365_bbec2192-36d7-44ac-abd8-069723f8c565_0(35c716f7bae318cb47da4ac2fb917dca29cbe5750d5a78da76c69e1c7df0cfd2): [e2e-sched-pred-8365/without-label:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'failed to find netid for namespace: e2e-sched-pred-8365, netnamespaces.network.openshift.io "e2e-sched-pred-8365" not found
```
Possibly a flake
[sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
- the test does not require two nodes, though it creates two pods, each consuming 2/3 of the "scheduling.k8s.io/foo" extended resource, which together request 4/3 of the capacity
- the idea is to have two pods (one low priority, the other high priority); once a third pod gets scheduled, the test makes sure only the low priority pod is preempted and the high priority pod is never preempted.
We might update the setup to have as many priority pods as there are nodes. We still need at least two pods so we can check that only the low priority pod is ever preempted. We might have each node run two pods: the first node runs a low and a high priority pod (each consuming 2/5 of the extended resource) and all other nodes run only high priority pods:
- 2/5 + 2/5 will consume 4/5, leaving no resources for the third (preemptor) pod
- the third pod will then always have to preempt the low priority pod while still keeping the original intention of the test
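The arithmetic behind the proposed redesign can be checked explicitly. This is a back-of-the-envelope sketch using the fractions from the comment above (capacity normalized to 1), not code from the test:

```python
# Sanity-check the proposed redesign's resource math with exact fractions.
from fractions import Fraction

capacity = Fraction(1)        # a node's "scheduling.k8s.io/foo" capacity
low = Fraction(2, 5)          # low-priority pod's request
high = Fraction(2, 5)         # high-priority pod's request
preemptor = Fraction(2, 5)    # the third pod's request

used = low + high             # 2/5 + 2/5 = 4/5 consumed
free = capacity - used        # 1/5 remains

# The preemptor cannot fit in the remaining free space...
assert preemptor > free

# ...but it fits once the low-priority pod is evicted, so the scheduler is
# forced to preempt, and the test can assert the victim is the low-priority
# pod rather than the high-priority one.
assert preemptor <= free + low
```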
[sig-scheduling] SchedulerPredicates [Serial] PodTopologySpread Filtering validates 4 pods with MaxSkew=1 are evenly distributed into 2 nodes [Suite:openshift/conformance/serial] [Suite:k8s]
- requires 2 nodes by default
PodTopologySpread only makes sense with two or more nodes: the feature places pods so as to minimize skew between topology domains, which is not applicable to a single domain.
[sig-scheduling] SchedulerPriorities [Serial] PodTopologySpread Scoring validates pod should be preferably scheduled to node which makes the matching pods more evenly distributed [Suite:openshift/conformance/serial] [Suite:k8s]
- priorities require at least two nodes to get evaluated
[sig-scheduling] SchedulerPreemption [Serial] validates lower priority pod preemption by critical pod [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
- the same as in the case of "[sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works", i.e. create two pods (instead of one), each consuming 2/5 of the extended resource
Summarized:
- skip:
[sig-scheduling] SchedulerPreemption [Serial] PodTopologySpread Preemption validates proper pods are preempted [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPredicates [Serial] PodTopologySpread Filtering validates 4 pods with MaxSkew=1 are evenly distributed into 2 nodes [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPriorities [Serial] PodTopologySpread Scoring validates pod should be preferably scheduled to node which makes the matching pods more evenly distributed [Suite:openshift/conformance/serial] [Suite:k8s]
- redesign:
[sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
[sig-scheduling] SchedulerPreemption [Serial] validates lower priority pod preemption by critical pod [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
- flaking:
[sig-scheduling] SchedulerPredicates [Serial] validates that there is no conflict between pods with same hostPort but different hostIP and protocol [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
Checking other jobs:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/16290/rehearse-16290-pull-ci-openshift-machine-config-operator-master-e2e-aws-single-node-serial/1366431373765644288
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/16290/rehearse-16290-pull-ci-openshift-machine-config-operator-master-e2e-aws-single-node-serial/1366397921024544768
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/16290/rehearse-16290-pull-ci-openshift-machine-config-operator-master-e2e-aws-single-node-serial/1366370466431766528
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/16290/rehearse-16290-pull-ci-openshift-machine-config-operator-master-e2e-aws-single-node-serial/1366339159647588352
Only 5 sig-scheduling tests (to skip or to redesign) are failing:
[sig-scheduling] SchedulerPreemption [Serial] PodTopologySpread Preemption validates proper pods are preempted [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPredicates [Serial] PodTopologySpread Filtering validates 4 pods with MaxSkew=1 are evenly distributed into 2 nodes [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPriorities [Serial] PodTopologySpread Scoring validates pod should be preferably scheduled to node which makes the matching pods more evenly distributed [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
[sig-scheduling] SchedulerPreemption [Serial] validates lower priority pod preemption by critical pod [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
Upstream PR: https://github.com/kubernetes/kubernetes/pull/100128

Waiting for upstream review.

Waiting for https://github.com/openshift/origin/pull/26054 to land.

https://github.com/openshift/origin/pull/26054 merged. Re-running the tests, the following sig-scheduling tests are failing now:
- [sig-scheduling] SchedulerPredicates [Serial] validates that NodeSelector is respected if not matching [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
- [sig-scheduling] SchedulerPredicates [Serial] validates that NodeAffinity is respected if not matching [Suite:openshift/conformance/serial] [Suite:k8s]
- [sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
- [sig-scheduling] SchedulerPredicates [Serial] validates pod overhead is considered along with resource limits of pods that are allowed to run verify pod overhead is accounted for [Suite:openshift/conformance/serial] [Suite:k8s]

All due to (the image-registry pod name suffix may differ):
```
May 10 13:35:36.601: INFO: Timed out waiting for the following pods to schedule
May 10 13:35:36.601: INFO: openshift-image-registry/image-registry-746897d64f-stgls
May 10 13:35:36.601: FAIL: Timed out after 10m0s waiting for stable cluster.
```
The kube-scheduler logs say:
```
I0510 14:50:03.461339       1 factory.go:338] "Unable to schedule pod; no fit; waiting" pod="openshift-image-registry/image-registry-746897d64f-stgls" err="0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules."
```
From image-registry-746897d64f-stgls's manifest:
```
"spec": {
  "affinity": {
    "podAntiAffinity": {
      "requiredDuringSchedulingIgnoredDuringExecution": [
        {
          "labelSelector": {
            "matchLabels": {
              "docker-registry": "default"
            }
          },
          "namespaces": [
            "openshift-image-registry"
          ],
          "topologyKey": "kubernetes.io/hostname"
        }
      ]
    }
  }
},
```
Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/17822/rehearse-17822-pull-ci-openshift-origin-master-e2e-aws-single-node-serial/1391727856265990144/artifacts/e2e-aws-single-node-serial/gather-extra/artifacts/pods.json, there are two instances of the image-registry-746897d64f pod. That is why the second instance of the pod can't be scheduled.

Based on the previous comment, none of the original tests:
[sig-scheduling] SchedulerPreemption [Serial] PodTopologySpread Preemption validates proper pods are preempted [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPredicates [Serial] PodTopologySpread Filtering validates 4 pods with MaxSkew=1 are evenly distributed into 2 nodes [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPriorities [Serial] PodTopologySpread Scoring validates pod should be preferably scheduled to node which makes the matching pods more evenly distributed [Suite:openshift/conformance/serial] [Suite:k8s]
[sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
[sig-scheduling] SchedulerPreemption [Serial] validates lower priority pod preemption by critical pod [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]
are failing. Moving to MODIFIED.

Verified all the cases mentioned in comment 9 via link [1] and see that they do not have any failures in the past 48 hours, so moving the bug to verified state.
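The scheduling failure above follows directly from the manifest: with required pod anti-affinity keyed on `kubernetes.io/hostname`, two pods matching `docker-registry=default` can never share a node, so on a single-node cluster the rollout's second replica is unschedulable. A simplified sketch of that filter (ours, not scheduler code):

```python
# Simplified model of requiredDuringSchedulingIgnoredDuringExecution pod
# anti-affinity with topologyKey kubernetes.io/hostname: a node rejects a
# candidate pod if any pod already on it matches the anti-affinity selector.
def anti_affinity_allows(node_pods, selector):
    """node_pods: label dicts of pods already running on the node."""
    matches = lambda labels: all(labels.get(k) == v for k, v in selector.items())
    return not any(matches(pod) for pod in node_pods)


selector = {"docker-registry": "default"}   # from the manifest's matchLabels
registry = {"docker-registry": "default"}   # an image-registry pod's labels

# The first replica lands on the empty single node.
assert anti_affinity_allows([], selector)

# The second replica is rejected by the only node, matching the scheduler
# log: "0/1 nodes are available: ... didn't match pod anti-affinity rules."
assert not anti_affinity_allows([registry], selector)
```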
[1] https://search.ci.openshift.org/

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438