Bug 1916489 - [sig-scheduling] SchedulerPriorities [Serial] fails with "Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels to be ready"
Summary: [sig-scheduling] SchedulerPriorities [Serial] fails with "Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels to be ready"
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-14 21:51 UTC by Mike Dame
Modified: 2021-04-13 13:12 UTC (History)
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:




Links
Github openshift/kubernetes pull 526 (open): Bug 1916489: (e2e/scheduler) Ensure minimum memory limit in createBalancedPodForNodes (last updated 2021-02-15 19:35:33 UTC)
Github openshift/origin pull 26054 (last updated 2021-04-08 19:54:51 UTC)

Description Mike Dame 2021-01-14 21:51:13 UTC
CI Run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/494/pull-ci-openshift-kubernetes-master-e2e-aws-serial/1349654189046763520

Full test output: 
Jan 14 11:57:53.712: INFO: ComputeCPUMemFraction for node: ip-10-0-139-64.us-east-2.compute.internal
Jan 14 11:57:53.770: INFO: Pod for on the node: aws-ebs-csi-driver-node-9q7bl, Cpu: 30, Mem: 157286400
Jan 14 11:57:53.770: INFO: Pod for on the node: tuned-gd7p9, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: dns-default-psmx6, Cpu: 65, Mem: 137363456
Jan 14 11:57:53.770: INFO: Pod for on the node: image-registry-869d5f5f94-5vvzp, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.770: INFO: Pod for on the node: node-ca-5tf98, Cpu: 10, Mem: 10485760
Jan 14 11:57:53.770: INFO: Pod for on the node: ingress-canary-6wpnz, Cpu: 10, Mem: 20971520
Jan 14 11:57:53.770: INFO: Pod for on the node: router-default-84fdf5895-hjhzq, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.770: INFO: Pod for on the node: migrator-6754b8bcd7-9zj7c, Cpu: 100, Mem: 209715200
Jan 14 11:57:53.770: INFO: Pod for on the node: machine-config-daemon-srq4g, Cpu: 40, Mem: 104857600
Jan 14 11:57:53.770: INFO: Pod for on the node: certified-operators-rzhcs, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: redhat-marketplace-jmpk9, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: redhat-operators-g2n99, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: alertmanager-main-0, Cpu: 8, Mem: 283115520
Jan 14 11:57:53.770: INFO: Pod for on the node: alertmanager-main-2, Cpu: 8, Mem: 283115520
Jan 14 11:57:53.770: INFO: Pod for on the node: grafana-64ff674fbf-94q2d, Cpu: 5, Mem: 125829120
Jan 14 11:57:53.770: INFO: Pod for on the node: node-exporter-85wt9, Cpu: 9, Mem: 220200960
Jan 14 11:57:53.770: INFO: Pod for on the node: prometheus-adapter-857d7d57b-xs286, Cpu: 1, Mem: 26214400
Jan 14 11:57:53.770: INFO: Pod for on the node: prometheus-k8s-0, Cpu: 76, Mem: 1262485504
Jan 14 11:57:53.770: INFO: Pod for on the node: thanos-querier-784758586-p72zt, Cpu: 9, Mem: 96468992
Jan 14 11:57:53.770: INFO: Pod for on the node: multus-xzgpv, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.770: INFO: Pod for on the node: network-metrics-daemon-k7tn2, Cpu: 20, Mem: 125829120
Jan 14 11:57:53.770: INFO: Pod for on the node: network-check-source-9c8c79599-hsrch, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.770: INFO: Pod for on the node: network-check-target-hmtj9, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.770: INFO: Pod for on the node: ovs-gmxmv, Cpu: 100, Mem: 419430400
Jan 14 11:57:53.770: INFO: Pod for on the node: sdn-8jxrc, Cpu: 110, Mem: 230686720
Jan 14 11:57:53.770: INFO: Node: ip-10-0-139-64.us-east-2.compute.internal, totalRequestedCPUResource: 971, cpuAllocatableMil: 1500, cpuFraction: 0.6473333333333333
Jan 14 11:57:53.770: INFO: Node: ip-10-0-139-64.us-east-2.compute.internal, totalRequestedMemResource: 4932501504, memAllocatableVal: 6946533376, memFraction: 0.7100666241727995
Jan 14 11:57:53.770: INFO: ComputeCPUMemFraction for node: ip-10-0-152-250.us-east-2.compute.internal
Jan 14 11:57:53.829: INFO: Pod for on the node: aws-ebs-csi-driver-node-kghhl, Cpu: 30, Mem: 157286400
Jan 14 11:57:53.829: INFO: Pod for on the node: tuned-6hcnf, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.829: INFO: Pod for on the node: dns-default-wjw5k, Cpu: 65, Mem: 137363456
Jan 14 11:57:53.829: INFO: Pod for on the node: node-ca-dff2h, Cpu: 10, Mem: 10485760
Jan 14 11:57:53.829: INFO: Pod for on the node: ingress-canary-stk68, Cpu: 10, Mem: 20971520
Jan 14 11:57:53.829: INFO: Pod for on the node: machine-config-daemon-lvsc9, Cpu: 40, Mem: 104857600
Jan 14 11:57:53.829: INFO: Pod for on the node: node-exporter-k76nn, Cpu: 9, Mem: 220200960
Jan 14 11:57:53.829: INFO: Pod for on the node: multus-fdwzh, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.829: INFO: Pod for on the node: network-metrics-daemon-jq586, Cpu: 20, Mem: 125829120
Jan 14 11:57:53.829: INFO: Pod for on the node: network-check-target-kzqlq, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.829: INFO: Pod for on the node: ovs-5grbc, Cpu: 100, Mem: 419430400
Jan 14 11:57:53.829: INFO: Pod for on the node: sdn-fw96p, Cpu: 110, Mem: 230686720
Jan 14 11:57:53.829: INFO: Node: ip-10-0-152-250.us-east-2.compute.internal, totalRequestedCPUResource: 524, cpuAllocatableMil: 1500, cpuFraction: 0.34933333333333333
Jan 14 11:57:53.829: INFO: Node: ip-10-0-152-250.us-east-2.compute.internal, totalRequestedMemResource: 1898971136, memAllocatableVal: 6946533376, memFraction: 0.2733696123250297
Jan 14 11:57:53.829: INFO: ComputeCPUMemFraction for node: ip-10-0-199-24.us-east-2.compute.internal
Jan 14 11:57:53.890: INFO: Pod for on the node: aws-ebs-csi-driver-node-qtzv7, Cpu: 30, Mem: 157286400
Jan 14 11:57:53.890: INFO: Pod for on the node: tuned-85p7k, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.890: INFO: Pod for on the node: dns-default-ht8s4, Cpu: 65, Mem: 137363456
Jan 14 11:57:53.890: INFO: Pod for on the node: image-registry-869d5f5f94-h9sw5, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.890: INFO: Pod for on the node: node-ca-ldgx8, Cpu: 10, Mem: 10485760
Jan 14 11:57:53.890: INFO: Pod for on the node: ingress-canary-c6xxx, Cpu: 10, Mem: 20971520
Jan 14 11:57:53.890: INFO: Pod for on the node: router-default-84fdf5895-v4vj4, Cpu: 100, Mem: 268435456
Jan 14 11:57:53.890: INFO: Pod for on the node: machine-config-daemon-xx5j2, Cpu: 40, Mem: 104857600
Jan 14 11:57:53.890: INFO: Pod for on the node: community-operators-crklm, Cpu: 10, Mem: 52428800
Jan 14 11:57:53.890: INFO: Pod for on the node: alertmanager-main-1, Cpu: 8, Mem: 283115520
Jan 14 11:57:53.890: INFO: Pod for on the node: kube-state-metrics-57c97ff89f-tn48h, Cpu: 4, Mem: 125829120
Jan 14 11:57:53.890: INFO: Pod for on the node: node-exporter-mq9nx, Cpu: 9, Mem: 220200960
Jan 14 11:57:53.890: INFO: Pod for on the node: openshift-state-metrics-55dcf8647f-q9slk, Cpu: 3, Mem: 199229440
Jan 14 11:57:53.890: INFO: Pod for on the node: prometheus-adapter-857d7d57b-qstp7, Cpu: 1, Mem: 26214400
Jan 14 11:57:53.890: INFO: Pod for on the node: prometheus-k8s-1, Cpu: 76, Mem: 1262485504
Jan 14 11:57:53.890: INFO: Pod for on the node: telemeter-client-599d88dcf7-d745n, Cpu: 3, Mem: 440401920
Jan 14 11:57:53.890: INFO: Pod for on the node: thanos-querier-784758586-dbgcr, Cpu: 9, Mem: 96468992
Jan 14 11:57:53.890: INFO: Pod for on the node: multus-f8hf8, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.890: INFO: Pod for on the node: network-metrics-daemon-j75ff, Cpu: 20, Mem: 125829120
Jan 14 11:57:53.890: INFO: Pod for on the node: network-check-target-m75hq, Cpu: 10, Mem: 157286400
Jan 14 11:57:53.890: INFO: Pod for on the node: ovs-l6s95, Cpu: 100, Mem: 419430400
Jan 14 11:57:53.890: INFO: Pod for on the node: sdn-5trvp, Cpu: 110, Mem: 230686720
Jan 14 11:57:53.890: INFO: Node: ip-10-0-199-24.us-east-2.compute.internal, totalRequestedCPUResource: 838, cpuAllocatableMil: 1500, cpuFraction: 0.5586666666666666
Jan 14 11:57:53.890: INFO: Node: ip-10-0-199-24.us-east-2.compute.internal, totalRequestedMemResource: 4922015744, memAllocatableVal: 6946533376, memFraction: 0.7085571287982825
Jan 14 11:57:53.923: INFO: Waiting for running...
Jan 14 11:58:04.001: INFO: Waiting for running...
Jan 14 11:58:09.082: INFO: Waiting for running...
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  k8s.io/kubernetes@v1.20.0/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-sched-priority-6094".
STEP: Found 13 events.
Jan 14 12:08:09.163: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: { } Scheduled: Successfully assigned e2e-sched-priority-6094/2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0 to ip-10-0-139-64.us-east-2.compute.internal
Jan 14 12:08:09.163: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: { } Scheduled: Successfully assigned e2e-sched-priority-6094/40ea718f-7423-45cf-a151-188223206df7-0 to ip-10-0-152-250.us-east-2.compute.internal
Jan 14 12:08:09.163: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for e8c095c8-5e7f-4384-9ee2-57cb23c56524-0: { } Scheduled: Successfully assigned e2e-sched-priority-6094/e8c095c8-5e7f-4384-9ee2-57cb23c56524-0 to ip-10-0-199-24.us-east-2.compute.internal
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:56:53 +0000 UTC - event for e2e-sched-priority-6094: {namespace-security-allocation-controller } CreatedSCCRanges: created SCC ranges
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:55 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {multus } AddedInterface: Add eth0 [10.128.2.60/23]
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:58 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {kubelet ip-10-0-139-64.us-east-2.compute.internal} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-27-k8s-gcr-io-pause-3-2-pZ7kB8CESdFpJJRs" already present on machine
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:58 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {kubelet ip-10-0-139-64.us-east-2.compute.internal} Created: Created container 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:57:58 +0000 UTC - event for 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0: {kubelet ip-10-0-139-64.us-east-2.compute.internal} Started: Started container 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {multus } AddedInterface: Add eth0 [10.131.2.52/23]
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {kubelet ip-10-0-152-250.us-east-2.compute.internal} Started: Started container 40ea718f-7423-45cf-a151-188223206df7-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {kubelet ip-10-0-152-250.us-east-2.compute.internal} Created: Created container 40ea718f-7423-45cf-a151-188223206df7-0
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:06 +0000 UTC - event for 40ea718f-7423-45cf-a151-188223206df7-0: {kubelet ip-10-0-152-250.us-east-2.compute.internal} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-27-k8s-gcr-io-pause-3-2-pZ7kB8CESdFpJJRs" already present on machine
Jan 14 12:08:09.163: INFO: At 2021-01-14 11:58:09 +0000 UTC - event for e8c095c8-5e7f-4384-9ee2-57cb23c56524-0: {kubelet ip-10-0-199-24.us-east-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = pod set memory limit 10485760 too low; should be at least 12582912
Jan 14 12:08:09.178: INFO: POD                                     NODE                                        PHASE    GRACE  CONDITIONS
Jan 14 12:08:09.178: INFO: 2baf3fc0-1778-4dcd-be46-e7c0881c1fc0-0  ip-10-0-139-64.us-east-2.compute.internal   Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:53 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:59 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:59 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:57:53 +0000 UTC  }]
Jan 14 12:08:09.178: INFO: 40ea718f-7423-45cf-a151-188223206df7-0  ip-10-0-152-250.us-east-2.compute.internal  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:04 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:06 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:06 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:03 +0000 UTC  }]
Jan 14 12:08:09.178: INFO: e8c095c8-5e7f-4384-9ee2-57cb23c56524-0  ip-10-0-199-24.us-east-2.compute.internal   Pending         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC ContainersNotReady containers with unready status: [e8c095c8-5e7f-4384-9ee2-57cb23c56524-0]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC ContainersNotReady containers with unready status: [e8c095c8-5e7f-4384-9ee2-57cb23c56524-0]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-14 11:58:09 +0000 UTC  }]
Jan 14 12:08:09.178: INFO: 
Jan 14 12:08:09.216: INFO: skipping dumping cluster info - cluster too large
Jan 14 12:08:09.216: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-sched-priority-6094" for this suite.
[AfterEach] [sig-scheduling] SchedulerPriorities [Serial]
  k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/priorities.go:131
Jan 14 12:08:09.258: INFO: Running AfterSuite actions on all nodes
Jan 14 12:08:09.258: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/priorities.go:302]: Unexpected error:
    <*errors.errorString | 0xc003414030>: {
        s: "Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels \"startPodsID=efbfd066-92b1-474f-95a7-51e44359eb83\" to be running",
    }
    Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels "startPodsID=efbfd066-92b1-474f-95a7-51e44359eb83" to be running
occurred

Upstream PR
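
For reference, the memory limit in the FailedCreatePodSandBox event above falls straight out of the fraction math in the log: createBalancedPodForNodes pads each node up to roughly the highest observed memory fraction, and ip-10-0-199-24 was already so close to the busiest node that its balancing pod only needed about 10 MiB, below the 12 MiB minimum the runtime enforces, so the pod never left Pending and the test timed out. The Go sketch below reproduces that arithmetic from the logged values; it is an illustration, not the e2e helper itself, and the real function also folds CPU and a ratio argument into the target fraction.

// Illustrative reconstruction of the failure arithmetic; helper names and the
// exact target-fraction handling in the real e2e code differ slightly.
package main

import "fmt"

func main() {
	// Values taken from the ComputeCPUMemFraction log lines above.
	memAllocatable := float64(6946533376)
	maxMemFraction := 0.7100666241727995  // busiest node, ip-10-0-139-64
	nodeMemFraction := 0.7085571287982825 // ip-10-0-199-24

	// The balancing pod for ip-10-0-199-24 only has to cover the gap between
	// the two fractions, which is tiny on this cluster.
	needed := int64((maxMemFraction - nodeMemFraction) * memAllocatable)
	fmt.Printf("balancing pod memory: %d bytes (~%.1f MiB)\n", needed, float64(needed)/(1<<20))
	// Prints roughly 10485760 bytes (10 MiB), matching the limit the kubelet
	// rejected as "too low; should be at least 12582912" (12 MiB).
}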

Comment 2 Mike Dame 2021-02-04 18:12:10 UTC
The upstream PR for this is undergoing review: https://github.com/kubernetes/kubernetes/pull/98073

The main question is how ensuring a minimum limit like this could affect other runtimes, and I need to reach out to the container teams to answer that.
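
As a rough sketch of what "ensuring a minimum limit" means here (the constant name, value, and placement in the merged patch may differ; 12 MiB simply mirrors the minimum the runtime reported in the sandbox error):

package main

import "fmt"

// Assumed floor, taken from the runtime's reported minimum in the sandbox
// error above; the upstream change may source or size this differently.
const minBalancedPodMemory = int64(12 * 1024 * 1024)

// clampBalancedPodMemory keeps the balancing pod's memory request/limit at or
// above the floor so the sandbox can always be created.
func clampBalancedPodMemory(needed int64) int64 {
	if needed < minBalancedPodMemory {
		return minBalancedPodMemory
	}
	return needed
}

func main() {
	fmt.Println(clampBalancedPodMemory(10485760)) // bumped to 12582912
	fmt.Println(clampBalancedPodMemory(52428800)) // left unchanged
}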

Comment 4 RamaKasturi 2021-02-24 15:24:34 UTC
Moving the bug back to ASSIGNED because, per the discussion with the developer, we should wait for [1] to merge before verifying the bug; a search in [2] also still shows the test failing.

[1] https://github.com/openshift/origin/pull/25915
[2] https://search.ci.openshift.org/?search=Pod+should+be+preferably+scheduled+to+nodes+pod+can+tolerate&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 5 Mike Dame 2021-04-08 19:54:55 UTC
The currently linked PR has been merged; this should take effect with the Kubernetes 1.21 bump in origin.

