Bug 1916489 - [sig-scheduling] SchedulerPriorities [Serial] fails with "Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels to be ready"
Summary: [sig-scheduling] SchedulerPriorities [Serial] fails with "Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels to be ready"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-14 21:51 UTC by Mike Dame
Modified: 2021-07-27 22:36 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:36:14 UTC
Target Upstream Version:


Attachments: None


Links
System ID | Status | Summary | Last Updated
Github openshift/kubernetes pull 526 | open | Bug 1916489: (e2e/scheduler) Ensure minimum memory limit in createBalancedPodForNodes | 2021-02-15 19:35:33 UTC
Github openshift/origin pull 26054 | None | None | 2021-04-08 19:54:51 UTC
Red Hat Product Errata RHSA-2021:2438 | None | None | 2021-07-27 22:36:35 UTC

Internal Links: 1999288

Description Mike Dame 2021-01-14 21:51:13 UTC
CI Run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/494/pull-ci-openshift-kubernetes-master-e2e-aws-serial/1349654189046763520

Full test output: 
Jan 14 11:57:53.712: INFO: ComputeCPUMemFraction for node: ip-10-0-139-64.us-east-2.compute.internal
Jan 14 11:57:53.770: INFO: Pod for on the node: aws-ebs-csi-driver-node-9q7bl, Cpu: 30, Mem: 157286400
Jan 14 11:57:53.770: INFO: Pod for on the node: tuned-gd7p9, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: dns-default-psmx6, Cpu: 65, Mem: 137363456
Jan 14 11:57:53.770: INFO: Pod for on the node: image-registry-869d5f5f94-5vvzp, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.770: INFO: Pod for on the node: node-ca-5tf98, Cpu: 10, Mem: 10485760
Jan 14 11:57:53.770: INFO: Pod for on the node: ingress-canary-6wpnz, Cpu: 10, Mem: 20971520
Jan 14 11:57:53.770: INFO: Pod for on the node: router-default-84fdf5895-hjhzq, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.770: INFO: Pod for on the node: migrator-6754b8bcd7-9zj7c, Cpu: 100, Mem: 209715200
Jan 14 11:57:53.770: INFO: Pod for on the node: machine-config-daemon-srq4g, Cpu: 40, Mem: 104857600
Jan 14 11:57:53.770: INFO: Pod for on the node: certified-operators-rzhcs, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: redhat-marketplace-jmpk9, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: redhat-operators-g2n99, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: alertmanager-main-0, Cpu: 8, Mem: 283115520
Jan 14 11:57:53.770: INFO: Pod for on the node: alertmanager-main-2, Cpu: 8, Mem: 283115520
Jan 14 11:57:53.770: INFO: Pod for on the node: grafana-64ff674fbf-94q2d, Cpu: 5, Mem: 125829120
Jan 14 11:57:53.770: INFO: Pod for on the node: node-exporter-85wt9, Cpu: 9, Mem: 220200960
Jan 14 11:57:53.770: INFO: Pod for on the node: prometheus-adapter-857d7d57b-xs286, Cpu: 1, Mem: 26214400
Jan 14 11:57:53.770: INFO: Pod for on the node: prometheus-k8s-0, Cpu: 76, Mem: 1262485504
Jan 14 11:57:53.770: INFO: Pod for on the node: thanos-querier-784758586-p72zt, Cpu: 9, Mem: 96468992
Jan 14 11:57:53.770: INFO: Pod for on the node: multus-xzgpv, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.770: INFO: Pod for on the node: network-metrics-daemon-k7tn2, Cpu: 20, Mem: 125829120
Jan 14 11:57:53.770: INFO: Pod for on the node: network-check-source-9c8c79599-hsrch, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: network-check-target-hmtj9, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.770: INFO: Pod for on the node: ovs-gmxmv, Cpu: 100, Mem: 419430400
Jan 14 11:57:53.770: INFO: Pod for on the node: sdn-8jxrc, Cpu: 110, Mem: 230686720
Jan 14 11:57:53.770: INFO: Node: ip-10-0-139-64.us-east-2.compute.internal, totalRequestedCPUResource: 971, cpuAllocatableMil: 1500, cpuFraction: 0.6473333333333333
Jan 14 11:57:53.770: INFO: Node: ip-10-0-139-64.us-east-2.compute.internal, totalRequestedMemResource: 4932501504, memAllocatableVal: 6946533376, memFraction: 0.7100666241727995
Jan 14 11:57:53.770: INFO: ComputeCPUMemFraction for node: ip-10-0-152-250.us-east-2.compute.internal
Jan 14 11:57:53.829: INFO: Pod for on the node: aws-ebs-csi-driver-node-kghhl, Cpu: 30, Mem: 157286400
Jan 14 11:57:53.829: INFO: Pod for on the node: tuned-6hcnf, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.829: INFO: Pod for on the node: dns-default-wjw5k, Cpu: 65, Mem: 137363456
Jan 14 11:57:53.829: INFO: Pod for on the node: node-ca-dff2h, Cpu: 10, Mem: 10485760
Jan 14 11:57:53.829: INFO: Pod for on the node: ingress-canary-stk68, Cpu: 10, Mem: 20971520
Jan 14 11:57:53.829: INFO: Pod for on the node: machine-config-daemon-lvsc9, Cpu: 40, Mem: 104857600
Jan 14 11:57:53.829: INFO: Pod for on the node: node-exporter-k76nn, Cpu: 9, Mem: 220200960
Jan 14 11:57:53.829: INFO: Pod for on the node: multus-fdwzh, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.829: INFO: Pod for on the node: network-metrics-daemon-jq586, Cpu: 20, Mem: 125829120
Jan 14 11:57:53.829: INFO: Pod for on the node: network-check-target-kzqlq, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.829: INFO: Pod for on the node: ovs-5grbc, Cpu: 100, Mem: 419430400
Jan 14 11:57:53.829: INFO: Pod for on the node: sdn-fw96p, Cpu: 110, Mem: 230686720
Jan 14 11:57:53.829: INFO: Node: ip-10-0-152-250.us-east-2.compute.internal, totalRequestedCPUResource: 524, cpuAllocatableMil: 1500, cpuFraction: 0.34933333333333333
Jan 14 11:57:53.829: INFO: Node: ip-10-0-152-250.us-east-2.compute.internal, totalRequestedMemResource: 1898971136, memAllocatableVal: 6946533376, memFraction: 0.2733696123250297
Jan 14 11:57:53.829: INFO: ComputeCPUMemFraction for node: ip-10-0-199-24.us-east-2.compute.internal
Jan 14 11:57:53.890: INFO: Pod for on the node: aws-ebs-csi-driver-node-qtzv7, Cpu: 30, Mem: 157286400
Jan 14 11:57:53.890: INFO: Pod for on the node: tuned-85p7k, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.890: INFO: Pod for on the node: dns-default-ht8s4, Cpu: 65, Mem: 137363456
Jan 14 11:57:53.890: INFO: Pod for on the node: image-registry-869d5f5f94-h9sw5, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.890: INFO: Pod for on the node: node-ca-ldgx8, Cpu: 10, Mem: 10485760
Jan 14 11:57:53.890: INFO: Pod for on the node: ingress-canary-c6xxx, Cpu: 10, Mem: 20971520
Jan 14 11:57:53.890: INFO: Pod for on the node: router-default-84fdf5895-v4vj4, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.890: INFO: Pod for on the node: machine-config-daemon-xx5j2, Cpu: 40, Mem: 104857600
Jan 14 11:57:53.890: INFO: Pod for on the node: community-operators-crklm, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.890: INFO: Pod for on the node: alertmanager-main-1, Cpu: 8, Mem: 283115520
Jan 14 11:57:53.890: INFO: Pod for on the node: kube-state-metrics-57c97ff89f-tn48h, Cpu: 4, Mem: 125829120
Jan 14 11:57:53.890: INFO: Pod for on the node: node-exporter-mq9nx, Cpu: 9, Mem: 220200960
Jan 14 11:57:53.890: INFO: Pod for on the node: openshift-state-metrics-55dcf8647f-q9slk, Cpu: 3, Mem: 199229440
Jan 14 11:57:53.890: INFO: Pod for on the node: prometheus-adapter-857d7d57b-qstp7, Cpu: 1, Mem: 26214400
Jan 14 11:57:53.890: INFO: Pod for on the node: prometheus-k8s-1, Cpu: 76, Mem: 1262485504
Jan 14 11:57:53.890: INFO: Pod for on the node: telemeter-client-599d88dcf7-d745n, Cpu: 3, Mem: 440401920
Jan 14 11:57:53.890: INFO: Pod for on the node: thanos-querier-784758586-dbgcr, Cpu: 9, Mem: 96468992
Jan 14 11:57:53.890: INFO: Pod for on the node: multus-f8hf8, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.890: INFO: Pod for on the node: network-metrics-daemon-j75ff, Cpu: 20, Mem: 125829120
Jan 14 11:57:53.890: INFO: Pod for on the node: network-check-target-m75hq, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.890: INFO: Pod for on the node: ovs-l6s95, Cpu: 100, Mem: 419430400
Jan 14 11:57:53.890: INFO: Pod for on the node: sdn-5trvp, Cpu: 110, Mem: 230686720
Jan 14 11:57:53.890: INFO: Node: ip-10-0-199-24.us-east-2.compute.internal, totalRequestedCPUResource: 838, cpuAllocatableMil: 1500, cpuFraction: 0.5586666666666666
Jan 14 11:57:53.890: INFO: Node: ip-10-0-199-24.us-east-2.compute.internal, totalRequestedMemResource: 4922015744, memAllocatableVal: 6946533376, memFraction: 0.7085571287982825
Jan 14 11:57:53.923: INFO: Waiting for running...
Jan 14 11:58:04.001: INFO: Waiting for running...
Jan 14 11:58:09.082: INFO: Waiting for running...
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  k8s.io/kubernetes@v1.20.0/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-sched-priority-6094".
STEP: Found 13 events.
Jan 14 12:08:09.163: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: { } Scheduled: Successfully assigned e2e-sched-priority-6094/2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0 to ip-10-0-139-64.us-east-2.compute.internal
Jan 14 12:08:09.163: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: { } Scheduled: Successfully assigned e2e-sched-priority-6094/40ea718f-7423-45cf-a151-188223206df7-0 to ip-10-0-152-250.us-east-2.compute.internal
Jan 14 12:08:09.163: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for e8c095c8-5e7f-4384-9ee2-57cb23c56524-0: { } Scheduled: Successfully assigned e2e-sched-priority-6094/e8c095c8-5e7f-4384-9ee2-57cb23c56524-0 to ip-10-0-199-24.us-east-2.compute.internal
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:56:53 +0000 UTC - event for e2e-sched-priority-6094: {namespace-security-allocation-controller } CreatedSCCRanges: created SCC ranges
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:55 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {multus } AddedInterface: Add eth0 [10.128.2.60/23]
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:58 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {kubelet ip-10-0-139-64.us-east-2.compute.internal} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-27-k8s-gcr-io-pause-3-2-pZ7kB8CESdFpJJRs" already present on machine
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:58 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {kubelet ip-10-0-139-64.us-east-2.compute.internal} Created: Created container 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:58 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {kubelet ip-10-0-139-64.us-east-2.compute.internal} Started: Started container 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {multus } AddedInterface: Add eth0 [10.131.2.52/23]
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {kubelet ip-10-0-152-250.us-east-2.compute.internal} Started: Started container 40ea718f-7423-45cf-a151-188223206df7-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {kubelet ip-10-0-152-250.us-east-2.compute.internal} Created: Created container 40ea718f-7423-45cf-a151-188223206df7-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {kubelet ip-10-0-152-250.us-east-2.compute.internal} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-27-k8s-gcr-io-pause-3-2-pZ7kB8CESdFpJJRs" already present on machine
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:09 +0000 UTC - event for e8c095c8-5e7f-4384-9ee2-57cb23c56524-0: {kubelet ip-10-0-199-24.us-east-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = pod set memory limit 10485760 too low; should be at least 12582912
Jan 14 12:08:09.178: INFO: POD                                     NODE                                        PHASE    GRACE  CONDITIONS
Jan 14 12:08:09.178: INFO: 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0  ip-10-0-139-64.us-east-2.compute.internal   Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:53 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:59 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:59 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:53 +0000 UTC  }]
Jan 14 12:08:09.178: INFO: 40ea718f-7423-45cf-a151-188223206df7-0  ip-10-0-152-250.us-east-2.compute.internal  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:04 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:06 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:06 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:03 +0000 UTC  }]
Jan 14 12:08:09.178: INFO: e8c095c8-5e7f-4384-9ee2-57cb23c56524-0  ip-10-0-199-24.us-east-2.compute.internal   Pending         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC ContainersNotReady containers with unready status: [e8c095c8-5e7f-4384-9ee2-57cb23c56524-0]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC ContainersNotReady containers with unready status: [e8c095c8-5e7f-4384-9ee2-57cb23c56524-0]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC  }]
Jan 14 12:08:09.178: INFO: 
Jan 14 12:08:09.216: INFO: skipping dumping cluster info - cluster too large
Jan 14 12:08:09.216: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sched-priority-6094" for this suite.
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/priorities.go:131
Jan 14 12:08:09.258: INFO: Running AfterSuite actions on all nodes
Jan 14 12:08:09.258: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/priorities.go:302]: Unexpected error:
    <*errors.errorString | 0xc003414030>: {
        s: "Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels \"startPodsID=efbfd066-92b1-474f-95a7-51e44359eb83\" to be running",
    }
    Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels "startPodsID=efbfd066-92b1-474f-95a7-51e44359eb83" to be running
occurred
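
Analysis: the rejected memory limit can be derived directly from the fractions logged above. createBalancedPodForNodes sizes a filler pod for each node to bring its requested memory up to the most-loaded node's fraction. The most-loaded node is ip-10-0-139-64 with totalRequestedMemResource 4932501504; ip-10-0-199-24 already requests 4922015744, so its filler pod is sized at 4932501504 - 4922015744 = 10485760 bytes (10 MiB). That is exactly the limit the runtime rejected in the FailedCreatePodSandBox event ("pod set memory limit 10485760 too low; should be at least 12582912"), so that pod stayed Pending and the wait for running pods timed out.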

Upstream PR

Comment 2 Mike Dame 2021-02-04 18:12:10 UTC
The upstream PR for this is undergoing review: https://github.com/kubernetes/kubernetes/pull/98073

The main question is how ensuring a minimum limit like this could affect other runtimes, and I need to reach out to the containers teams to answer that.
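
For illustration, here is a minimal Go sketch of the kind of floor the linked PRs introduce in createBalancedPodForNodes. This is not the verbatim upstream diff: the helper name and signature are hypothetical, and the 12582912-byte minimum is taken from the FailedCreatePodSandBox error above rather than from any runtime's documented contract.

    package main

    import "fmt"

    // Floor observed in the sandbox error ("should be at least 12582912", ~12 MiB).
    const minBalancedPodMemory int64 = 12582912

    // balancedPodMemory (hypothetical helper) returns the memory to request for a
    // filler pod so the node reaches maxMemFraction, clamped so it never falls
    // below what the container runtime will accept.
    func balancedPodMemory(maxMemFraction float64, memAllocatable, memRequested int64) int64 {
        need := int64(maxMemFraction*float64(memAllocatable)) - memRequested
        if need < minBalancedPodMemory {
            need = minBalancedPodMemory
        }
        return need
    }

    func main() {
        // Values for ip-10-0-199-24 from the log: unclamped, this is the
        // ~10485760 bytes the runtime rejected; the clamp raises it to 12582912.
        fmt.Println(balancedPodMemory(0.7100666241727995, 6946533376, 4922015744))
    }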

Comment 4 RamaKasturi 2021-02-24 15:24:34 UTC
Moving the bug back to ASSIGNED because, per the discussion with dev, we should wait for [1] to merge before verifying the bug; also, when searching in link [2], I still see it failing.

[1] https://github.com/openshift/origin/pull/25915
[2] https://search.ci.openshift.org/?search=Pod+should+be+preferably+scheduled+to+nodes+pod+can+tolerate&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 5 Mike Dame 2021-04-08 19:54:55 UTC
The currently linked PR has been merged; this should take effect in the 1.21 k8s bump for origin.

Comment 10 errata-xmlrpc 2021-07-27 22:36:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

