Bug 1997657 - Kubelet rejects pods that use resources that should be freed by completed pods
Summary: Kubelet rejects pods that use resources that should be freed by completed pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Harshal Patil
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-25 15:40 UTC by Ryan Phillips
Modified: 2021-10-18 17:49 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:49:05 UTC
Target Upstream Version:




Links:
- GitHub: openshift/kubernetes pull 920 (last updated 2021-08-30 17:06:46 UTC)
- Red Hat Product Errata: RHSA-2021:3759 (last updated 2021-10-18 17:49:19 UTC)

Description Ryan Phillips 2021-08-25 15:40:52 UTC
Description of problem:
https://github.com/kubernetes/kubernetes/issues/104560

There appears to be a change of behavior / possible regression in k8s 1.22 with respect to pod lifecycle behavior.

We've started to see cases where, when multiple pods are created (often as part of a k8s Job), the following happens:

Multiple pods are created, e.g. by the job controller (say 5). Each pod in the job requests the majority of the node's memory, so only one such pod fits on the node at a time.
The scheduler schedules the first pod to the node; the rest (4 pods) stay Pending.
The kubelet accepts the first pod and runs it to completion.
The next pod gets scheduled to the node.
The kubelet rejects that pod during pod admission, e.g. with OutOfMemory, despite the fact that the first pod has already completed successfully.
This results in many of the pods backing the job ending up in the Failed phase and having to be recreated by the job controller; the job can then fail outright once its backoffLimit is exceeded.
The root issue seems to be a mismatch between when the kubelet actually considers the pod complete and what it reports to the API server (which causes the follow-up pods to be scheduled to the node). I'm still investigating further, but this may be related to the pod lifecycle changes in 1.22 -- #102344.
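A minimal sketch (illustrative only; not the real kubelet code, and all names here are invented for the example) of why terminated pods must be excluded from admission-time resource accounting:

```python
# Sketch of kubelet-style pod admission accounting. The reported bug
# amounts to terminated pods still being counted as holding resources,
# so a freshly scheduled pod is rejected (OutOfMemory/OutOfcpu) even
# though its predecessor already completed.

TERMINAL_PHASES = {"Succeeded", "Failed"}

def cpu_in_use(pods):
    """Sum CPU requests (millicores) of pods still holding node resources."""
    return sum(p["cpu_m"] for p in pods if p["phase"] not in TERMINAL_PHASES)

def can_admit(allocatable_m, pods, candidate):
    """Admit the candidate only if its request fits in the remaining CPU."""
    return cpu_in_use(pods) + candidate["cpu_m"] <= allocatable_m

pods = [
    {"name": "test--1-a", "phase": "Succeeded", "cpu_m": 7000},  # finished; must not count
    {"name": "test--1-b", "phase": "Running",   "cpu_m": 1000},
]
candidate = {"name": "test--1-c", "phase": "Pending", "cpu_m": 7000}

# With the Succeeded pod's 7000m correctly freed, the candidate fits.
print(can_admit(8000, pods, candidate))  # True
```

If terminated pods were included in `cpu_in_use`, the same candidate would be rejected, which matches the Failed/OutOf* pods described above.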

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a Job whose pods each request the majority of a node's allocatable resources, so only one pod fits on the node at a time.
2. Wait for the first pod to run to completion.
3. Watch the next pod get scheduled to the same node.

Actual results:
The kubelet rejects the follow-up pod at admission (e.g. OutOfMemory) even though the first pod has already completed, and the job's pods pile up in the Failed phase.

Expected results:
Resources held by completed pods are treated as freed, and follow-up pods are admitted and run to completion.

Additional info:

Comment 32 Jiří Mencák 2021-09-14 14:07:06 UTC
*** Bug 1998193 has been marked as a duplicate of this bug. ***

Comment 33 Sunil Choudhary 2021-09-14 16:33:40 UTC
Verified on 4.9.0-0.nightly-2021-09-10-170926

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-10-170926   True        False         7h21m   Cluster version is 4.9.0-0.nightly-2021-09-10-170926

$ cat pod.yaml 
apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  parallelism: 30
  completions: 100
  template:
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["echo",  "ok"]
      restartPolicy: Never

$ oc create -f pod.yaml 
job.batch/test created

$ oc get jobs
NAME   COMPLETIONS   DURATION   AGE
test   30/100        17s        18s

$ oc get jobs
NAME   COMPLETIONS   DURATION   AGE
test   100/100       34s        2m2s

$ oc get pods
NAME            READY   STATUS      RESTARTS   AGE
test--1-27gmm   0/1     Completed   0          113s
test--1-2ppj8   0/1     Completed   0          99s
test--1-2w25n   0/1     Completed   0          113s
test--1-2zv5v   0/1     Completed   0          114s
test--1-4d444   0/1     Completed   0          113s
test--1-5zrf2   0/1     Completed   0          106s
test--1-62w5k   0/1     Completed   0          98s
test--1-67xzk   0/1     Completed   0          2m6s
test--1-6vfzg   0/1     Completed   0          2m6s
test--1-6vjq4   0/1     Completed   0          107s
test--1-6xxgk   0/1     Completed   0          2m5s
test--1-7b99h   0/1     Completed   0          112s
test--1-7d89d   0/1     Completed   0          2m6s
test--1-7mfch   0/1     Completed   0          113s
test--1-7vzns   0/1     Completed   0          113s
test--1-8ml7r   0/1     Completed   0          2m5s
test--1-8xwkw   0/1     Completed   0          104s
test--1-9xxkh   0/1     Completed   0          113s
test--1-b8qpm   0/1     Completed   0          100s
test--1-bg2qg   0/1     Completed   0          2m5s
test--1-cn79j   0/1     Completed   0          112s
test--1-ctkwx   0/1     Completed   0          2m5s
test--1-drwkh   0/1     Completed   0          113s
test--1-f8lm7   0/1     Completed   0          114s
test--1-fg2cq   0/1     Completed   0          113s
test--1-fhrlr   0/1     Completed   0          2m6s
test--1-fswv5   0/1     Completed   0          112s
test--1-fw4bx   0/1     Completed   0          100s
test--1-gdsjv   0/1     Completed   0          2m5s
test--1-gjc2x   0/1     Completed   0          2m6s
test--1-h2kwr   0/1     Completed   0          104s
test--1-h67gh   0/1     Completed   0          113s
test--1-hcb2h   0/1     Completed   0          113s
test--1-hfm97   0/1     Completed   0          106s
test--1-hjmxt   0/1     Completed   0          98s
test--1-hrv5f   0/1     Completed   0          102s
test--1-j478c   0/1     Completed   0          106s
test--1-jpd6v   0/1     Completed   0          2m5s
test--1-jrnvf   0/1     Completed   0          113s
test--1-k6pd6   0/1     Completed   0          113s
test--1-k8fmf   0/1     Completed   0          2m6s
test--1-k8tvn   0/1     Completed   0          113s
test--1-khz9r   0/1     Completed   0          2m6s
test--1-ksn9j   0/1     Completed   0          105s
test--1-l9k77   0/1     Completed   0          107s
test--1-lfvzv   0/1     Completed   0          105s
test--1-lkn6n   0/1     Completed   0          103s
test--1-mb2ng   0/1     Completed   0          2m5s
test--1-mksrl   0/1     Completed   0          2m6s
test--1-mn9gn   0/1     Completed   0          104s
test--1-mpq5x   0/1     Completed   0          2m5s
test--1-mqct8   0/1     Completed   0          112s
test--1-n2fz5   0/1     Completed   0          2m6s
test--1-n4h4r   0/1     Completed   0          113s
test--1-ndsck   0/1     Completed   0          103s
test--1-nnj5r   0/1     Completed   0          104s
test--1-nx7vw   0/1     Completed   0          113s
test--1-p2v2h   0/1     Completed   0          2m6s
test--1-p7kr7   0/1     Completed   0          113s
test--1-pclmw   0/1     Completed   0          103s
test--1-phv4x   0/1     Completed   0          2m6s
test--1-qc269   0/1     Completed   0          113s
test--1-qzswc   0/1     Completed   0          2m6s
test--1-r4sh4   0/1     Completed   0          107s
test--1-r6hpk   0/1     Completed   0          2m5s
test--1-rxht2   0/1     Completed   0          102s
test--1-s486h   0/1     Completed   0          105s
test--1-sh9gh   0/1     Completed   0          100s
test--1-sk2hm   0/1     Completed   0          2m5s
test--1-skk84   0/1     Completed   0          2m5s
test--1-sknzc   0/1     Completed   0          2m5s
test--1-slfzt   0/1     Completed   0          113s
test--1-slpm8   0/1     Completed   0          98s
test--1-svnpd   0/1     Completed   0          102s
test--1-t7k62   0/1     Completed   0          106s
test--1-t7psm   0/1     Completed   0          2m5s
test--1-t7wkz   0/1     Completed   0          99s
test--1-tdzf8   0/1     Completed   0          2m6s
test--1-tk69g   0/1     Completed   0          113s
test--1-tmzjp   0/1     Completed   0          100s
test--1-tnlpb   0/1     Completed   0          104s
test--1-tsxh8   0/1     Completed   0          103s
test--1-twj4q   0/1     Completed   0          103s
test--1-twvj2   0/1     Completed   0          103s
test--1-v7xwx   0/1     Completed   0          2m6s
test--1-vbw9r   0/1     Completed   0          105s
test--1-vcp5l   0/1     Completed   0          103s
test--1-vfsmk   0/1     Completed   0          113s
test--1-vk6px   0/1     Completed   0          104s
test--1-vnr58   0/1     Completed   0          2m6s
test--1-vpjrb   0/1     Completed   0          103s
test--1-wcggt   0/1     Completed   0          104s
test--1-wlxdj   0/1     Completed   0          2m6s
test--1-wmktk   0/1     Completed   0          113s
test--1-wprk7   0/1     Completed   0          112s
test--1-xv8vx   0/1     Completed   0          99s
test--1-zcfdx   0/1     Completed   0          2m6s
test--1-zckbf   0/1     Completed   0          104s
test--1-zl7fv   0/1     Completed   0          113s
test--1-zp62p   0/1     Completed   0          113s

Comment 34 Jiří Mencák 2021-10-01 18:22:44 UTC
I'm re-opening this BZ as the issue reproduces again on 4.9.0-0.nightly-2021-10-01-123414

$ oc version
Client Version: 4.9.0-rc.3
Server Version: 4.9.0-0.nightly-2021-10-01-123414
Kubernetes Version: v1.22.0-rc.0+8719299

Testing on an 8 vCPU box with a slightly modified (limits) pods.yaml:

$ cat pods.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  parallelism: 30
  completions: 100
  template:
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["echo",  "ok"]
        resources:
          limits:
            cpu: "1"
            memory: "40Mi"
      restartPolicy: Never

$ oc create -f pods.yaml

$ oc get jobs
NAME   COMPLETIONS   DURATION   AGE
test   6/100         7m58s      7m58s

$ oc get po
NAME            READY   STATUS      RESTARTS   AGE
test--1-66qlg   0/1     OutOfcpu    0          7m27s
test--1-6htll   0/1     Completed   0          7m32s
test--1-8g8mh   0/1     Completed   0          7m32s
test--1-9bbvz   0/1     OutOfcpu    0          7m27s
test--1-c9ldg   0/1     Completed   0          7m32s
test--1-cspws   0/1     Completed   0          7m32s
test--1-d5thk   0/1     Completed   0          7m32s
test--1-htdjz   0/1     OutOfcpu    0          7m27s
test--1-nskmx   0/1     OutOfcpu    0          7m27s
test--1-pq7cs   0/1     OutOfcpu    0          7m27s
test--1-q55wv   0/1     OutOfcpu    0          7m27s
test--1-wnm4n   0/1     OutOfcpu    0          7m27s
test--1-xn2cg   0/1     Completed   0          7m32s
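Back-of-the-envelope arithmetic for the run above (assumed numbers: 8000m allocatable on the 8 vCPU box, ignoring system reservations) showing how counting completed pods' CPU limits makes the node look full:

```python
# Assumed figures for illustration; real allocatable CPU is lower once
# kube/system reservations are subtracted.
allocatable_m = 8000   # 8 vCPUs in millicores
per_pod_m = 1000       # cpu: "1" limit from pods.yaml
completed = 6          # pods that already ran to completion

# Buggy accounting: completed pods still "hold" their CPU, so only a
# couple more pods appear to fit and the rest are rejected OutOfcpu.
in_use_buggy = completed * per_pod_m
fits_buggy = (allocatable_m - in_use_buggy) // per_pod_m

# Correct accounting: completed pods free their CPU, so the node can
# keep cycling pods through indefinitely.
fits_fixed = allocatable_m // per_pod_m

print(fits_buggy, fits_fixed)  # 2 8
```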

Note that this issue was fixed for me by
https://github.com/openshift/kubernetes/pull/920
but broken again by
https://github.com/openshift/kubernetes/pull/949

In other words, the last commit that still fixes the issue for me in
openshift/kubernetes is 75ee3073266f07baaba5db004cde0636425737cf 

sh-4.4# ./kubelet --version
Kubernetes v1.22.1-1672+75ee3073266f07-dirty

A kubelet custom-compiled at that commit passes the above tests without issues, and all 100 pods successfully reach "Completed".

Comment 35 Jiří Mencák 2021-10-01 18:28:29 UTC
*** Bug 1998193 has been marked as a duplicate of this bug. ***

Comment 47 errata-xmlrpc 2021-10-18 17:49:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

