Bug 1753067 - [IPI][OSP] Kubelet fails to admit pods with insufficient memory because coredns, keepalived and mdns-publisher have no pods in kube-apiserver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Yossi Boaron
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks: 1757390
 
Reported: 2019-09-18 02:01 UTC by weiwei jiang
Modified: 2020-01-23 11:06 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1757390
Environment:
Last Closed: 2020-01-23 11:06:16 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub openshift/machine-config-operator pull 1122 (closed): Bug 1753067: Create a namespace for OpenStack infra static pods (last updated 2020-08-05 20:41:05 UTC)
- GitHub openshift/machine-config-operator pull 1127 (closed): Bug 1753953: Temporarily put OpenStack infra pods into non-existing namespace (last updated 2020-08-05 20:41:05 UTC)
- GitHub openshift/machine-config-operator pull 1131 (closed): Bug 1753067: Revert: "Temporarily put OpenStack infra pods into non-existing namespace" (last updated 2020-08-05 20:41:05 UTC)
- Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:06:39 UTC)

Description weiwei jiang 2019-09-18 02:01:26 UTC
Description of problem:
When creating a deployment with memory requests and scaling it up, pods fail with "OutOfmemory", and the kubelet logs the following:

Sep 17 09:39:29 share-0916c-8vp8z-worker-rdtdw hyperkube[1264]: I0917 09:39:29.406474    1264 predicate.go:136] Predicate failed on Pod: test-56cf6cdb48-5nlrn_default(0b067e0c-d92f-11e9-89a5-fa163eb3bdb0), for reason: Node didn't have enough resource: memory, requested: 2147483648, used: 15223226368, capacity: 16185528320

But the node should have enough memory for the pod: the kubelet counts 15223226368 bytes (~14.2 GiB) as already requested, roughly 3 GiB more than the ~11.2 GiB (11446Mi) of requests the API server reports below, so adding the pod's 2147483648 bytes pushes it past the 16185528320-byte capacity.
➜  ~ oc describe nodes -l node-role.kubernetes.io/worker= | grep -i -A 7 allocate                                                                                                                                        
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        570m (7%)      0 (0%)
  memory                     11446Mi (74%)  512Mi (3%)
  ephemeral-storage          0 (0%)         0 (0%)
  attachable-volumes-cinder  0              0
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        2150m (28%)    700m (9%)
  memory                     11369Mi (73%)  687Mi (4%)
  ephemeral-storage          0 (0%)         0 (0%)
  attachable-volumes-cinder  0              0
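
One way to see the gap (a sketch, not from the original report; the node name comes from the kubelet log above and jq is assumed to be available) is to compare the pods the API server has scheduled to the node with the pods the kubelet itself is tracking, which include static pods:

# Pods the API server knows about on the node:
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=share-0916c-8vp8z-worker-rdtdw

# Pods the kubelet is actually running, fetched through the node proxy;
# static pods without a mirror pod appear here but not in the list above:
oc get --raw /api/v1/nodes/share-0916c-8vp8z-worker-rdtdw/proxy/pods | jq -r '.items[].metadata.name'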


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-16-151155

How reproducible:
unknown

Steps to Reproduce:
1. Create a deployment with spec.containers[*].resources.requests.memory set
2. Scale the deployment up until the total requests exceed the node's allocatable memory (see the sketch below)
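
A minimal reproduction sketch (the image, deployment name, and sizes are illustrative, not taken from the original report):

# Deployment whose replicas each request 2Gi of memory:
oc create deployment test --image=k8s.gcr.io/pause:3.1
oc set resources deployment/test --requests=memory=2Gi
# Scale beyond what the workers can hold; surplus pods should stay
# Pending, but on affected clusters some are admitted and then fail:
oc scale deployment/test --replicas=20
oc get pods -o wide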

Actual results:
Pods end up in OutOfmemory status:
test-56cf6cdb48-k85kv   0/1     OutOfmemory   0          28m    <none>        share-0916c-8vp8z-worker-sc4nc   <none>           <none>
test-56cf6cdb48-lmptq   0/1     OutOfmemory   0          28m    <none>        share-0916c-8vp8z-worker-sc4nc   <none>           <none>
test-56cf6cdb48-lq4jz   0/1     OutOfmemory   0          28m    <none>        share-0916c-8vp8z-worker-sc4nc   <none>           <none>
test-56cf6cdb48-m2rw8   0/1     OutOfmemory   0          28m    <none>        share-0916c-8vp8z-worker-sc4nc   <none>           <none>
test-56cf6cdb48-m6cvf   0/1     OutOfmemory   0          28m    <none>        share-0916c-8vp8z-worker-sc4nc   <none>           <none>
test-56cf6cdb48-mmzz9   0/1     OutOfmemory   0          28m    <none>        



Expected results:
The node should not fail to admit the pod.

Additional info:

Comment 2 weiwei jiang 2019-09-18 03:17:11 UTC
Found the root cause: the coredns, keepalived and mdns-publisher static pods request an additional 3Gi of memory on the worker, but they have no mirror pods in the kube-apiserver, so the scheduler's view of the node diverges from the kubelet's and the admission calculation fails.


sh-4.4# cat /etc/kubernetes/manifests/* | grep -A 3  resources:
    resources: {}
    volumeMounts:
    - name: kubeconfig
      mountPath: "/etc/kubernetes/kubeconfig"
--
    resources:
      requests:
        cpu: 150m
        memory: 1Gi
--
    resources: {}
    volumeMounts:
    - name: resource-dir
      mountPath: "/config"
--
    resources:
      requests:
        cpu: 150m
        memory: 1Gi
--
    resources: {}
    volumeMounts:
    - name: kubeconfig
      mountPath: "/etc/kubernetes/kubeconfig"
--
    resources:
      requests:
        cpu: 150m
        memory: 1Gi
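
Those three 1Gi requests account for the ~3 GiB gap between the kubelet's accounting and the API server's. Static pods normally get a mirror pod in the API server; a quick check (a sketch, using the namespace the fix later created for these pods):

# On an affected cluster this comes back empty, because the namespace
# the manifests reference does not exist in the API server:
oc get pods -n openshift-openstack-infra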

Comment 3 weiwei jiang 2019-09-18 03:32:45 UTC
Found a workaround: after creating the namespace manually, everything works.

oc adm new-project openshift-kni-infra
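
For reference, a declarative equivalent (a sketch; the namespace name is taken from the command above, and oc adm new-project additionally sets project annotations, so the two are not strictly identical):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-kni-infra
EOF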

Comment 5 weiwei jiang 2019-09-19 09:44:10 UTC
Verified on 4.2.0-0.nightly-2019-09-19-040356

➜  ~ oc get pods -n openshift-openstack-infra
NAME                                            READY   STATUS    RESTARTS   AGE
coredns-share-0919a-qn4pn-master-0              1/1     Running   2          2m56s
coredns-share-0919a-qn4pn-master-2              1/1     Running   3          118m
coredns-share-0919a-qn4pn-worker-9mw25          1/1     Running   3          118m
coredns-share-0919a-qn4pn-worker-gzt8w          0/1     Pending   0          2s
coredns-share-0919a-qn4pn-worker-v7g8v          1/1     Running   3          2m56s
haproxy-share-0919a-qn4pn-master-0              2/2     Running   0          2m56s
haproxy-share-0919a-qn4pn-master-2              2/2     Running   2          118m
keepalived-share-0919a-qn4pn-master-0           1/1     Running   0          2m56s
keepalived-share-0919a-qn4pn-master-2           1/1     Running   1          118m
keepalived-share-0919a-qn4pn-worker-9mw25       1/1     Running   1          118m
keepalived-share-0919a-qn4pn-worker-gzt8w       0/1     Pending   0          2s
keepalived-share-0919a-qn4pn-worker-v7g8v       1/1     Running   1          2m56s
mdns-publisher-share-0919a-qn4pn-master-0       1/1     Running   0          2m56s
mdns-publisher-share-0919a-qn4pn-master-2       1/1     Running   1          118m
mdns-publisher-share-0919a-qn4pn-worker-9mw25   1/1     Running   1          118m
mdns-publisher-share-0919a-qn4pn-worker-gzt8w   0/1     Pending   0          2s
mdns-publisher-share-0919a-qn4pn-worker-v7g8v   1/1     Running   1          2m56s

➜  ~ oc get pods  -o wide 
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                             NOMINATED NODE   READINESS GATES
h-1-265hh    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-4sjmn    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-678m8    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-8tsgh    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-9md7j    1/1     Running   0          17s   10.128.2.20   share-0919a-qn4pn-worker-v7g8v   <none>           <none>
h-1-c957g    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-cj6mk    1/1     Running   0          17s   10.128.2.22   share-0919a-qn4pn-worker-v7g8v   <none>           <none>
h-1-ctpd8    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-deploy   1/1     Running   0          29s   10.131.0.28   share-0919a-qn4pn-worker-9mw25   <none>           <none>
h-1-h7rzz    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-hvh9v    1/1     Running   0          17s   10.131.0.29   share-0919a-qn4pn-worker-9mw25   <none>           <none>
h-1-jvnjw    1/1     Running   0          17s   10.131.0.30   share-0919a-qn4pn-worker-9mw25   <none>           <none>
h-1-nwdlx    1/1     Running   0          17s   10.131.0.31   share-0919a-qn4pn-worker-9mw25   <none>           <none>
h-1-pkmmm    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-pppsl    1/1     Running   0          17s   10.128.2.23   share-0919a-qn4pn-worker-v7g8v   <none>           <none>
h-1-px7ls    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-r7cbl    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-rn5tg    1/1     Running   0          17s   10.128.2.21   share-0919a-qn4pn-worker-v7g8v   <none>           <none>
h-1-rz2x2    1/1     Running   0          17s   10.131.0.32   share-0919a-qn4pn-worker-9mw25   <none>           <none>
h-1-tnxlb    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>
h-1-x8bxq    0/1     Pending   0          17s   <none>        <none>                           <none>           <none>

Comment 8 weiwei jiang 2019-10-09 07:43:10 UTC
Since this is targeted at 4.3.0, we need to wait for a 4.3 nightly build to try it.

Comment 10 weiwei jiang 2019-10-15 08:22:41 UTC
Checked with 4.3.0-0.nightly-2019-10-15-021732, and the issue is fixed.

➜  ~ oc get pods -n openshift-openstack-infra -o wide                                      
NAME                                      READY   STATUS    RESTARTS   AGE   IP             NODE                       NOMINATED NODE   READINESS GATES
coredns-qe-wj-6lx69-master-0              1/1     Running   0          45m   192.168.0.29   qe-wj-6lx69-master-0       <none>           <none>
coredns-qe-wj-6lx69-master-1              1/1     Running   0          45m   192.168.0.15   qe-wj-6lx69-master-1       <none>           <none>
coredns-qe-wj-6lx69-master-2              1/1     Running   0          46m   192.168.0.20   qe-wj-6lx69-master-2       <none>           <none>
coredns-qe-wj-6lx69-worker-64svj          1/1     Running   0          29m   192.168.0.35   qe-wj-6lx69-worker-64svj   <none>           <none>
coredns-qe-wj-6lx69-worker-g7pvh          1/1     Running   0          37m   192.168.0.12   qe-wj-6lx69-worker-g7pvh   <none>           <none>
coredns-qe-wj-6lx69-worker-hdgql          1/1     Running   0          37m   192.168.0.41   qe-wj-6lx69-worker-hdgql   <none>           <none>
haproxy-qe-wj-6lx69-master-0              2/2     Running   0          45m   192.168.0.29   qe-wj-6lx69-master-0       <none>           <none>
haproxy-qe-wj-6lx69-master-1              2/2     Running   0          45m   192.168.0.15   qe-wj-6lx69-master-1       <none>           <none>
haproxy-qe-wj-6lx69-master-2              2/2     Running   0          45m   192.168.0.20   qe-wj-6lx69-master-2       <none>           <none>
keepalived-qe-wj-6lx69-master-0           1/1     Running   0          45m   192.168.0.29   qe-wj-6lx69-master-0       <none>           <none>
keepalived-qe-wj-6lx69-master-1           1/1     Running   0          45m   192.168.0.15   qe-wj-6lx69-master-1       <none>           <none>
keepalived-qe-wj-6lx69-master-2           1/1     Running   0          45m   192.168.0.20   qe-wj-6lx69-master-2       <none>           <none>
keepalived-qe-wj-6lx69-worker-64svj       1/1     Running   0          29m   192.168.0.35   qe-wj-6lx69-worker-64svj   <none>           <none>
keepalived-qe-wj-6lx69-worker-g7pvh       1/1     Running   0          37m   192.168.0.12   qe-wj-6lx69-worker-g7pvh   <none>           <none>
keepalived-qe-wj-6lx69-worker-hdgql       1/1     Running   0          37m   192.168.0.41   qe-wj-6lx69-worker-hdgql   <none>           <none>
mdns-publisher-qe-wj-6lx69-master-0       1/1     Running   0          45m   192.168.0.29   qe-wj-6lx69-master-0       <none>           <none>
mdns-publisher-qe-wj-6lx69-master-1       1/1     Running   0          46m   192.168.0.15   qe-wj-6lx69-master-1       <none>           <none>
mdns-publisher-qe-wj-6lx69-master-2       1/1     Running   0          45m   192.168.0.20   qe-wj-6lx69-master-2       <none>           <none>
mdns-publisher-qe-wj-6lx69-worker-64svj   1/1     Running   0          29m   192.168.0.35   qe-wj-6lx69-worker-64svj   <none>           <none>
mdns-publisher-qe-wj-6lx69-worker-g7pvh   1/1     Running   0          37m   192.168.0.12   qe-wj-6lx69-worker-g7pvh   <none>           <none>
mdns-publisher-qe-wj-6lx69-worker-hdgql   1/1     Running   0          37m   192.168.0.41   qe-wj-6lx69-worker-hdgql   <none>           <none>
➜  ~ oc get pods -o wide                                                                             
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                       NOMINATED NODE   READINESS GATES
h-1-8vsk7    1/1     Running   0          40s   10.131.0.34   qe-wj-6lx69-worker-hdgql   <none>           <none>
h-1-ctdnp    1/1     Running   0          40s   10.129.2.25   qe-wj-6lx69-worker-64svj   <none>           <none>
h-1-deploy   1/1     Running   0          48s   10.129.2.24   qe-wj-6lx69-worker-64svj   <none>           <none>
h-1-dkbzb    0/1     Pending   0          40s   <none>        <none>                     <none>           <none>
h-1-fhckn    1/1     Running   0          40s   10.128.2.27   qe-wj-6lx69-worker-g7pvh   <none>           <none>
h-1-gxj98    1/1     Running   0          40s   10.128.2.28   qe-wj-6lx69-worker-g7pvh   <none>           <none>
h-1-mhddx    1/1     Running   0          40s   10.131.0.35   qe-wj-6lx69-worker-hdgql   <none>           <none>
h-1-njdrm    1/1     Running   0          40s   10.128.2.29   qe-wj-6lx69-worker-g7pvh   <none>           <none>
h-1-w477k    1/1     Running   0          40s   10.129.2.27   qe-wj-6lx69-worker-64svj   <none>           <none>
h-1-x27zn    0/1     Pending   0          40s   <none>        <none>                     <none>           <none>
h-1-z5vwf    1/1     Running   0          40s   10.129.2.26   qe-wj-6lx69-worker-64svj   <none>           <none>
➜  ~ oc version          
Client Version: v4.3.0
Server Version: 4.3.0-0.nightly-2019-10-15-021732
Kubernetes Version: v1.16.0-beta.2+a6ff814

Comment 12 errata-xmlrpc 2020-01-23 11:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

