Bug 1771016 - Pod fails to start because the node didn't have enough resources, but the UI console shows it has enough
Keywords:
Status: CLOSED DUPLICATE of bug 1801826
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-11 17:38 UTC by Filip Brychta
Modified: 2020-02-25 15:48 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-25 15:48:49 UTC
Target Upstream Version:
Embargoed:


Attachments
node overview screenshot (110.40 KB, image/png)
2019-11-11 17:38 UTC, Filip Brychta

Description Filip Brychta 2019-11-11 17:38:59 UTC
Created attachment 1634963 [details]
node overview screenshot

Description of problem:
Pod is failing to start with the following error:
Successfully assigned istio-system/istio-policy-659bc7b88c-4cs4l to fbr-42-m-6psqn-worker-24sfj
Pod istio-policy-659bc7b88c-4cs4l
Node didn't have enough resource: memory, requested: 268435456, used: 7632243200, capacity: 7730569216

It says: used: 7632243200
But the UI console (Compute -> Nodes -> fbr-42-m-6psqn-worker-24sfj) shows that the consumed memory on the host is only ~3.5 GB. See the attached screenshot.
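
For reference, converting the byte values from the error message (my arithmetic, not part of the original output):

requested: 268435456 B  =  256 MiB
used:      7632243200 B ≈ 7279 MiB (~7.6 GB)
capacity:  7730569216 B ≈ 7372 MiB (~7.7 GB)

used + requested = 7900678656 B, which exceeds capacity, hence the rejection.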

oc also shows the node is not consuming 7632243200 bytes:
oc adm top node
NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
fbr-42-m-6psqn-master-0       1660m        22%    4961Mi          32%       
fbr-42-m-6psqn-master-1       1109m        14%    4438Mi          28%       
fbr-42-m-6psqn-master-2       779m         10%    3183Mi          20%       
fbr-42-m-6psqn-worker-24sfj   1962m        56%    3648Mi          49%       
fbr-42-m-6psqn-worker-8dm7p   753m         21%    3379Mi          45%       
fbr-42-m-6psqn-worker-g6n65   2078m        59%    4434Mi          60%       
fbr-42-m-6psqn-worker-js4fm   591m         16%    3228Mi          43%
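
To cross-check the kubelet's own accounting against these numbers, the kubelet Summary API can be queried through the apiserver proxy (a sketch; stats/summary is the standard kubelet endpoint):

oc get --raw /api/v1/nodes/fbr-42-m-6psqn-worker-24sfj/proxy/stats/summary

The node-level "memory" section in the returned JSON shows what the kubelet itself believes is in use.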

Version-Release number of selected component (if applicable):
OCP 4.2.2

How reproducible:
Always

Steps to Reproduce:
1. install OCP 4.2 on OpenStack with 3 masters (16 GB, 8 CPUs) and 4 workers (8 GB, 4 CPUs)
2. install OpenShift Service mesh with two control planes


Actual results:
Pods failing to start because of:
Node didn't have enough resource: memory, requested: 268435456, used: 7632243200, capacity: 7730569216

But the UI shows that there is enough free memory on the given host.

Expected results:
The UI should show the correct memory usage on hosts.

Additional info:
I'm not sure whether the error message is incorrect or the UI shows incorrect values.

Comment 2 Filip Brychta 2019-11-12 10:12:01 UTC
I don't have the original environment, but I reproduced it in a new one:

Events on pod which failed to start:
Generated from default-scheduler
Successfully assigned bookinfo2/reviews-v3-6595c9dcb-8lr9r to fbr-42-s-c2nmq-worker-7km6z
Generated from kubelet on fbr-42-s-c2nmq-worker-7km6z
Node didn't have enough resource: memory, requested: 134217728, used: 7622098944, capacity: 7730569216

Node stats:
oc adm top node
NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
fbr-42-s-c2nmq-master-0       1388m        18%    4031Mi          26%       
fbr-42-s-c2nmq-master-1       697m         9%     2761Mi          17%       
fbr-42-s-c2nmq-master-2       1207m        16%    4302Mi          27%       
fbr-42-s-c2nmq-worker-7km6z   1006m        28%    3433Mi          46%       
fbr-42-s-c2nmq-worker-hw4d4   1138m        32%    3374Mi          45%       
fbr-42-s-c2nmq-worker-pt9p7   1497m        42%    3782Mi          51%

Node details:
oc describe node fbr-42-s-c2nmq-worker-7km6z
Name:               fbr-42-s-c2nmq-worker-7km6z
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=ci.w1.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=nova
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=fbr-42-s-c2nmq-worker-7km6z
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machine.openshift.io/machine: openshift-machine-api/fbr-42-s-c2nmq-worker-7km6z
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-7d0c404aee63b69d895dd1bf28a8cda7
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-7d0c404aee63b69d895dd1bf28a8cda7
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 12 Nov 2019 09:07:39 +0100
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 12 Nov 2019 11:06:34 +0100   Tue, 12 Nov 2019 09:28:40 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 12 Nov 2019 11:06:34 +0100   Tue, 12 Nov 2019 09:28:40 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 12 Nov 2019 11:06:34 +0100   Tue, 12 Nov 2019 09:28:40 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 12 Nov 2019 11:06:34 +0100   Tue, 12 Nov 2019 09:29:00 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  Hostname:    fbr-42-s-c2nmq-worker-7km6z
  InternalIP:  192.168.0.35
Capacity:
 attachable-volumes-cinder:  256
 cpu:                        4
 hugepages-1Gi:              0
 hugepages-2Mi:              0
 memory:                     8163784Ki
 pods:                       250
Allocatable:
 attachable-volumes-cinder:  256
 cpu:                        3500m
 hugepages-1Gi:              0
 hugepages-2Mi:              0
 memory:                     7549384Ki
 pods:                       250
System Info:
 Machine ID:                              cc0990d2e47544e48514d5092cd824fd
 System UUID:                             cc0990d2-e475-44e4-8514-d5092cd824fd
 Boot ID:                                 1ab7a161-be81-4ee5-8b84-79fc176bf15b
 Kernel Version:                          4.18.0-80.11.2.el8_0.x86_64
 OS Image:                                Red Hat Enterprise Linux CoreOS 42.80.20191022.0 (Ootpa)
 Operating System:                        linux
 Architecture:                            amd64
 Container Runtime Version:               cri-o://1.14.11-0.23.dev.rhaos4.2.gitc41de67.el8
 Kubelet Version:                         v1.14.6+7e13ab9a7
 Kube-Proxy Version:                      v1.14.6+7e13ab9a7
ProviderID:                               openstack://cc0990d2-e475-44e4-8514-d5092cd824fd
Non-terminated Pods:                      (28 in total)
  Namespace                               Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                        ------------  ----------  ---------------  -------------  ---
  bookinfo                                reviews-v3-6595c9dcb-j99fp                  10m (0%)      0 (0%)      128Mi (1%)       0 (0%)         11m
  bookinfo2                               details-v1-5b6d97f647-m7bfc                 10m (0%)      0 (0%)      128Mi (1%)       0 (0%)         6m21s
  bookinfo2                               reviews-v1-5bb5b76576-scswl                 10m (0%)      0 (0%)      128Mi (1%)       0 (0%)         6m20s
  bookinfo2                               reviews-v3-6595c9dcb-87wlp                  10m (0%)      0 (0%)      128Mi (1%)       0 (0%)         1s
  istio-operator                          istio-node-p45k8                            10m (0%)      0 (0%)      100Mi (1%)       0 (0%)         24m
  istio-system                            3scale-istio-adapter-585bbcb595-6h8zx       0 (0%)        0 (0%)      0 (0%)           0 (0%)         20m
  istio-system                            istio-ingressgateway-8657d8cfff-tbfw9       10m (0%)      0 (0%)      128Mi (1%)       0 (0%)         21m
  istio-system                            istio-pilot-8d85c5ddd-c76m4                 20m (0%)      0 (0%)      256Mi (3%)       0 (0%)         22m
  istio-system                            istio-pilot-8d85c5ddd-sr4wl                 20m (0%)      0 (0%)      256Mi (3%)       0 (0%)         2s
  istio-system                            istio-sidecar-injector-bb8b5554b-26k6x      10m (0%)      0 (0%)      128Mi (1%)       0 (0%)         21m
  kiali-test-mesh-operator                kiali-test-mesh-operator-69c5b8bb8-fqpmq    0 (0%)        0 (0%)      0 (0%)           0 (0%)         12m
  openshift-cluster-node-tuning-operator  tuned-pl72p                                 10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         93m
  openshift-console                       downloads-df59f64db-h97lw                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         97m
  openshift-dns                           dns-default-dcqlq                           110m (3%)     0 (0%)      70Mi (0%)        512Mi (6%)     119m
  openshift-image-registry                image-registry-69bcb5c874-vxfbw             100m (2%)     0 (0%)      256Mi (3%)       0 (0%)         97m
  openshift-image-registry                node-ca-2xw44                               10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         106m
  openshift-ingress                       router-default-64f68cd7b7-7qf4v             100m (2%)     0 (0%)      256Mi (3%)       0 (0%)         97m
  openshift-machine-config-operator       machine-config-daemon-v8829                 20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         119m
  openshift-monitoring                    alertmanager-main-1                         100m (2%)     100m (2%)   225Mi (3%)       25Mi (0%)      97m
  openshift-monitoring                    grafana-69f4f95645-gwgt4                    100m (2%)     0 (0%)      100Mi (1%)       0 (0%)         97m
  openshift-monitoring                    node-exporter-tbdjt                         10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         106m
  openshift-monitoring                    openshift-state-metrics-7f4bdfbdf9-wv9xl    120m (3%)     0 (0%)      190Mi (2%)       0 (0%)         97m
  openshift-monitoring                    prometheus-adapter-5668d4848f-7gdck         10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         97m
  openshift-monitoring                    prometheus-k8s-1                            430m (12%)    200m (5%)   1134Mi (15%)     50Mi (0%)      97m
  openshift-monitoring                    telemeter-client-7bf667c5-g2hg4             10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         97m
  openshift-multus                        multus-xtthk                                10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         119m
  openshift-sdn                           ovs-w964g                                   200m (5%)     0 (0%)      400Mi (5%)       0 (0%)         119m
  openshift-sdn                           sdn-lbbj4                                   100m (2%)     0 (0%)      200Mi (2%)       0 (0%)         119m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests      Limits
  --------                   --------      ------
  cpu                        1560m (44%)   300m (8%)
  memory                     4581Mi (62%)  587Mi (7%)
  ephemeral-storage          0 (0%)        0 (0%)
  attachable-volumes-cinder  0             0
Events:
  Type    Reason                   Age                From                                  Message
  ----    ------                   ----               ----                                  -------
  Normal  NodeNotSchedulable       101m               kubelet, fbr-42-s-c2nmq-worker-7km6z  Node fbr-42-s-c2nmq-worker-7km6z status is now: NodeNotSchedulable
  Normal  Starting                 98m                kubelet, fbr-42-s-c2nmq-worker-7km6z  Starting kubelet.
  Normal  NodeAllocatableEnforced  98m                kubelet, fbr-42-s-c2nmq-worker-7km6z  Updated Node Allocatable limit across pods
  Normal  NodeHasSufficientMemory  98m (x8 over 98m)  kubelet, fbr-42-s-c2nmq-worker-7km6z  Node fbr-42-s-c2nmq-worker-7km6z status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    98m (x8 over 98m)  kubelet, fbr-42-s-c2nmq-worker-7km6z  Node fbr-42-s-c2nmq-worker-7km6z status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     98m (x7 over 98m)  kubelet, fbr-42-s-c2nmq-worker-7km6z  Node fbr-42-s-c2nmq-worker-7km6z status is now: NodeHasSufficientPID
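
Two conversions worth noting from the output above (my arithmetic, not in the original): Allocatable memory 7549384Ki * 1024 = 7730569216 bytes, which is exactly the "capacity" in the kubelet error, so the admission check runs against Allocatable, not the raw Capacity (8163784Ki). Meanwhile the error's used value, 7622098944 B = 7269 MiB, is roughly double the 3433Mi that oc adm top node reports for this node.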

Comment 3 Ryan Phillips 2019-11-12 17:17:33 UTC
Are you sure the workers have 8 GB and 4 CPUs? The attached screenshot shows the node with 4 GB and 2 CPUs.

Comment 4 Ryan Phillips 2019-11-12 17:18:35 UTC
The filesystem appears to be 8 GB as well, which would be extremely small for OpenShift 4.

Comment 5 Filip Brychta 2019-11-13 08:39:56 UTC
Yes, I'm sure the worker VM has 8 GB of memory and 4 vCPUs. It's also visible in the oc describe node fbr-42-s-c2nmq-worker-7km6z output in comment 2:
memory:                     8163784Ki
cpu:                        4

I guess the maximum y-axis values in the graphs on the attached screenshot do NOT show the maximum available capacity but scale with the currently consumed value; e.g. the Network Out graph tops out at 800 KBps, which is definitely NOT the maximum possible value for the network.

The root of this bug is that the UI graph for memory usage shows only ~3.5 GB of memory consumed, but the pod is failing to start with "Node didn't have enough resource: memory, requested: 134217728, used: 7622098944, capacity: 7730569216", which basically says that ~7.6 GB of memory is already used on the node.

Comment 6 Ryan Phillips 2020-02-12 20:48:09 UTC
This ticket is likely a duplicate of another issue whose fix is going into the tree: https://github.com/openshift/machine-config-operator/pull/1459

Duplicate BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1801826
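
For background (a general note on kubelet node-allocatable accounting, not a claim about what the linked PR changes):

Allocatable = Capacity - kube-reserved - system-reserved - hard eviction thresholds

so a fix on the reserved/accounting side also shifts the "capacity" figure that the kubelet's admission check compares against.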

Comment 7 Ryan Phillips 2020-02-25 15:48:49 UTC

*** This bug has been marked as a duplicate of bug 1801826 ***

