Bug 2029794

Summary: Worker node shows 99% memory utilisation even though the node is in the scheduling-disabled state and few pods are running.
Product: OpenShift Container Platform
Reporter: sphoorthi <skanniha>
Component: Node
Assignee: Peter Hunt <pehunt>
Sub Component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: unspecified    
Priority: unspecified
CC: aos-bugs
Version: 4.8   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-08 16:41:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description sphoorthi 2021-12-07 10:28:43 UTC
Description of problem:

The node ocppre-5vg6q-worker-vf25k shows high memory utilisation (99%). The node has been put into the scheduling-disabled state and only a few pods are running at the moment, yet free -m in the sos report still shows 99% of memory in use. We could not identify which process is responsible using ps aux.

Allocated memory on the node - 26GB

We have not observed the memory leak described in https://access.redhat.com/solutions/6304881.

Restarting the node fixes the issue temporarily.
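
For reference, a minimal sketch of how the gap can be cross-checked on the node; this uses only standard free/ps//proc/meminfo tooling and is not taken from this sos report:

~~~
# Overall picture: how much memory is actually unavailable vs. cached/reclaimable
free -m
grep -E 'MemTotal|MemAvailable|Slab|SReclaimable|SUnreclaim|Shmem|HugePages_Total' /proc/meminfo

# Sum of resident set sizes of all processes, for comparison with MemTotal - MemAvailable
ps -eo rss= | awk '{sum += $1} END {printf "total process RSS: %.0f MB\n", sum/1024}'
~~~

A large gap between (MemTotal - MemAvailable) and the total process RSS usually points at memory held outside process address spaces, such as kernel slab, tmpfs, or hugepages.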

Version-Release number of selected component (if applicable):
4.8.11

Actual results:
High memory utilisation is observed on the node.

Expected results:
Need to understand what is causing the high memory consumption.

Additional info:

We observed OOM-killer errors in the sos report; at a later point in time the node was put into the scheduling-disabled state:

~~~
[3887727.113054] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=crio-f89e32200cdb4d2bd2a2ba09846578457f23b89dc5f849ea84e66bbbb22983e8.scope,mems_allowed=0,global_oom,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod35582eb4_db90_45b1_ba76_17380fdb1c02.slice/crio-d7c67d08d21c41c4acce464ab2fd9a8473aa7c1130e3e2c0a097430976749b4b.scope,task=node_exporter,pid=3190554,uid=65534
[3887727.118377] Out of memory: Killed process 3190554 (node_exporter) total-vm:719248kB, anon-rss:10128kB, file-rss:0kB, shmem-rss:0kB, UID:65534 pgtables:136kB oom_score_adj:999
[3887727.155553] oom_reaper: reaped process 3190554 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[3887734.687547] crio invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-999
[3887734.689160] CPU: 3 PID: 2526 Comm: crio Not tainted 4.18.0-305.19.1.el8_4.x86_64 #1

~~~
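
For completeness, OOM events like the ones above can be listed from the node's kernel log (or from the journal captured in the sos report) with something along these lines; the pattern simply matches the standard oom-kill/oom_reaper messages shown above:

~~~
# Kernel messages from the journal
journalctl -k --no-pager | grep -iE 'oom-kill|out of memory|oom_reaper'

# Or directly from the kernel ring buffer, if it has not rotated
dmesg -T | grep -iE 'oom-kill|out of memory|oom_reaper'
~~~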

The output below shows crio as the top memory-consuming process, but its usage does not come close to accounting for the total memory consumption:

~~~
[root@ocppre-5vg6q-worker-vf25k ~]# ps aux | head -1 ;  ps aux| sort -nk4 -r| head -5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1915  2.1  0.9 5445496 272704 ?      Ssl  Oct05 1696:44 /usr/bin/crio
root     1747361  1.6  0.8 1016784 248212 ?      Ssl  Nov26  76:04 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump --no-supervisor -r /usr/local/share/gems/gems/fluent-plugin-elasticsearch-5.0.5/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root     2948132 16.9  0.6 7408792 191752 ?      Ssl  Nov12 4088:22 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=rhcos --node-ip=2.170.3.56 --address=2.170.3.56 --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider= --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2abf882de887664b3bd74bbdf3c6fbde22296e1a5f87510eb225954045d4cb99 --system-reserved=cpu=500m,memory=1Gi --v=2
dtuser   1750458  0.1  0.6 436768 170088 ?       Sl   Nov26   5:10 oneagentnetwork -Dcom.compuware.apm.WatchDogPort=50001 -Dcom.compuware.apm.WatchDogTimeout=900
dtuser   1750443  2.7  0.5 1558404 144712 ?      Sl   Nov26 122:23 oneagentos -Dcom.compuware.apm.WatchDogPort=50000 -Dcom.compuware.apm.WatchDogTimeout=900
~~~
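
When the per-process view does not add up, the per-cgroup view can show where the memory is actually being charged. A minimal sketch, assuming the cgroup v1 layout used by OCP 4.8 (the standard kubepods.slice/system.slice hierarchy; these paths are not taken from this sos report):

~~~
# Memory charged to container workloads vs. system services (cgroup v1)
cat /sys/fs/cgroup/memory/kubepods.slice/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/system.slice/memory.usage_in_bytes

# Kernel slab usage, which does not show up in any process RSS
slabtop -o -s c | head -20
~~~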

# crictl stats
CONTAINER           CPU %               MEM                 DISK                INODES
02260fb627f97       0.10                37.37MB             308B                16
04bed85ae3784       0.03                21.84MB             291B                15
0561bd4a506e6       0.00                2.245MB             372B                19
1e68e1ba31cdb       0.00                16.84MB             334B                17
2324aa72d1512       0.26                17.09MB             309B                16
25018e596f2ca       3.48                4.829MB             277B                16
342ca2ce0a756       0.11                36.99MB             308B                16
3b62e25e138e0       0.00                27.93MB             361B                19
3bc493bec13f1       0.18                28.68MB             270B                14
3f6bc244763af       2.12                245.2MB             779B                36
420addd1f3ae7       0.01                19.01MB             334B                17
45d77abdade27       0.14                101.5MB             299B                17
47a57a0f47f3d       0.81                30.54MB             291B                15
49b4900d286b7       0.00                24.58MB             312B                16
51e0d9ce048ab       0.00                19.98MB             382B                20
5d276f47a6ad1       0.62                46.8MB              312B                16
5dac41eefdf6b       0.25                56.92MB             291B                15
5e6990c1665b9       6.49                25.71MB             446B                23
642d7d8edb5e2       0.02                97.42MB             62B                 4
6a2695e88ec7a       0.00                23.52MB             329B                17
6f165d1600194       0.13                40.34MB             804B                33
825988c8aff5f       0.97                83.92MB             312B                16
9bd19d626e4e0       0.33                58.31MB             409B                21
a727fb90cea49       0.11                56.64MB             7.821kB             27
af8cb49a6d1a4       0.16                65.53MB             435B                23
b617b02407f61       0.02                16.26MB             334B                17
b716e98455de0       0.00                1.409MB             291B                15
c0866f895bcc1       0.02                76.5MB              12.52kB             13
ca8ff85ba3239       0.02                20.06MB             137B                8
e70e4c320875d       0.11                35.87MB             804B                33
e9a5814c13f08       22.33               430.7MB             17.82kB             37
f10655554e25b       0.00                31.9MB              5.609kB             17
f51348fc09195       0.00                5.825MB             368B                4
f89e32200cdb4       7.86                138.2MB             82.82kB             23
fc1684a5e72d2       0.15                41.71MB             2.157kB             24
fe85b3eedaf4f       0.05                44.71MB             2.286kB             31
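
Summing the MEM column above comes to roughly 2 GB, far below the ~26 GB the node reports as used. A quick way to total it, assuming the MEM column stays in MB as in this output:

~~~
# Total container memory as reported by CRI-O (assumes the MEM column is in MB, as above)
crictl stats | awk 'NR>1 {gsub("MB","",$3); sum += $3} END {printf "containers total: %.1f MB\n", sum}'
~~~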

Comment 1 Peter Hunt 2021-12-08 16:41:03 UTC
This looks like a dup

*** This bug has been marked as a duplicate of bug 2014136 ***