This bug was initially created as a copy of Bug #1757314.

I am copying this bug because: reasons

Description of problem:
The kibana-proxy container of the kibana pods was OOM-killed by nothing more than a simple HTTP GET to the kibana route or to the kibana svc:

Sep 30 11:38:17 node-3 kernel: oauth-proxy invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=968
Sep 30 11:38:17 node-3 kernel: oauth-proxy cpuset=docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope mems_allowed=0
Sep 30 11:38:17 node-3 kernel: CPU: 3 PID: 43442 Comm: oauth-proxy Kdump: loaded Tainted: G W ------------ T 3.10.0-957.el7.x86_64 #1
Sep 30 11:38:17 node-3 kernel: Hardware name: Red Hat OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
Sep 30 11:38:17 node-3 kernel: Call Trace:
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d61dc1>] dump_stack+0x19/0x1b
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d5c7ea>] dump_header+0x90/0x229
Sep 30 11:38:17 node-3 kernel: [<ffffffff997b9dc6>] ? find_lock_task_mm+0x56/0xc0
Sep 30 11:38:17 node-3 kernel: [<ffffffff997ba274>] oom_kill_process+0x254/0x3d0
Sep 30 11:38:17 node-3 kernel: [<ffffffff99900a2c>] ? selinux_capable+0x1c/0x40
Sep 30 11:38:17 node-3 kernel: [<ffffffff99834f16>] mem_cgroup_oom_synchronize+0x546/0x570
Sep 30 11:38:17 node-3 kernel: [<ffffffff99834390>] ? mem_cgroup_charge_common+0xc0/0xc0
Sep 30 11:38:17 node-3 kernel: [<ffffffff997bab04>] pagefault_out_of_memory+0x14/0x90
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d5acf2>] mm_fault_error+0x6a/0x157
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6f7a8>] __do_page_fault+0x3c8/0x500
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6f9c6>] trace_do_page_fault+0x56/0x150
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6ef42>] do_async_page_fault+0x22/0xf0
Sep 30 11:38:17 node-3 kernel: [<ffffffff99d6b788>] async_page_fault+0x28/0x30
Sep 30 11:38:17 node-3 kernel: Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc5d931a_dfa3_11e9_aa40_fa163ea64b17.slice/docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope killed as a result of limit of /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc5d931a_dfa3_11e9_aa40_fa163ea64b17.slice/docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope
Sep 30 11:38:17 node-3 kernel: memory: usage 524288kB, limit 524288kB, failcnt 1834
Sep 30 11:38:17 node-3 kernel: memory+swap: usage 524288kB, limit 524288kB, failcnt 0
Sep 30 11:38:17 node-3 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Sep 30 11:38:17 node-3 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddc5d931a_dfa3_11e9_aa40_fa163ea64b17.slice/docker-a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c.scope: cache:1584KB rss:522704KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:522704KB inactive_file:844KB active_file:740KB unevictable:0KB
Sep 30 11:38:17 node-3 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 30 11:38:17 node-3 kernel: [43367] 1000080000 43367 246541 132083 299 0 968 oauth-proxy
Sep 30 11:38:17 node-3 kernel: Memory cgroup out of memory: Kill process 13784 (oauth-proxy) score 1977 or sacrifice child
Sep 30 11:38:17 node-3 kernel: Killed process 43367 (oauth-proxy) total-vm:986164kB, anon-rss:521632kB, file-rss:6700kB, shmem-rss:0kB
Sep 30 11:38:18 node-3 atomic-openshift-node: I0930 11:38:18.444147 691 kubelet.go:1865] SyncLoop (PLEG): "logging-kibana-6-bx8qz_openshift-logging(dc5d931a-dfa3-11e9-aa40-fa163ea64b17)", event: &pleg.PodLifecycleEvent{ID:"dc5d931a-dfa3-11e9-aa40-fa163ea64b17", Type:"ContainerDied", Data:"a65ce7857b306c3eef1d020828f8705adf9ef84c256d011ad0ea9725c881473c"}
Sep 30 11:38:18 node-3 atomic-openshift-node: I0930 11:38:18.853633 691 kuberuntime_manager.go:513] Container {Name:kibana-proxy Image:registry.redhat.io/openshift3/oauth-proxy:v3.11.117 Command:[] Args:[--upstream-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt --https-address=:3000 -provider=openshift -client-id=kibana-proxy -client-secret-file=/secret/oauth-secret -cookie-secret-file=/secret/session-secret -upstream=http://localhost:5601 -scope=user:info user:check-access user:list-projects --tls-cert=/secret/server-cert --tls-key=/secret/server-key -pass-access-token -skip-provider-button] WorkingDir: Ports:[{Name:oaproxy HostPort:0 ContainerPort:3000 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:OAP_DEBUG Value:true ValueFrom:nil} {Name:OCP_AUTH_PROXY_MEMORY_LIMIT Value: ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:&ResourceFieldSelector{ContainerName:kibana-proxy,Resource:limits.memory,Divisor:0,},ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:268435456 scale:0} d:{Dec:<nil>} s: Format:BinarySI}]} VolumeMounts:[{Name:kibana-proxy ReadOnly:true MountPath:/secret SubPath: MountPropagation:<nil>} {Name:aggregated-logging-kibana-token-pc9zf ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[KILL MKNOD SETGID SETUID],},Privileged:nil,SELinuxOptions:nil,RunAsUser:*1000080000,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 30 11:38:18 node-3 atomic-openshift-node: I0930 11:38:18.854051 691 kuberuntime_manager.go:757] checking backoff for container "kibana-proxy" in pod "logging-kibana-6-bx8qz_openshift-logging(dc5d931a-dfa3-11e9-aa40-fa163ea64b17)"

Version-Release number of selected component (if applicable):
OCPv3.11.117; kibana oauth-proxy image: registry.redhat.io/openshift3/oauth-proxy:v3.11.117

How reproducible:
Always:
while true; do curl <kibana route or kibana svc>; sleep 1; done

Steps to Reproduce:
1. while true; do curl <kibana route or kibana svc>; sleep 1; done

Actual results:
The kibana-proxy container was OOM-killed after a few minutes; increasing the requests/limits only delays the point at which RAM is exhausted.

Expected results:
kibana-proxy is not OOM-killed and memory is freed.

Additional info:
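Note that the container memory limit of 536870912 bytes (512Mi) in the spec above is exactly the 524288kB cgroup limit the OOM killer reports. For reference, a fleshed-out form of the reproducer loop, as a sketch: the route name and namespace here are assumptions and may differ per deployment.

    # Resolve the Kibana route host; route name and namespace are assumed.
    KIBANA_HOST=$(oc get route logging-kibana -n openshift-logging -o jsonpath='{.spec.host}')
    # Hammer the proxy with simple GETs until memory climbs toward the limit.
    while true; do
      curl -sk "https://${KIBANA_HOST}/" -o /dev/null
      sleep 1
    done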
The memory increased, could you check it?

Readings for pod kibana-598df8b964-gq5lm (2/2 Running, 0 restarts, AGE 10h throughout), taken on Fri Nov 8 2019 (CST):

TIME       CPU(cores)   MEMORY(bytes)
18:48:23   140m         492Mi
18:48:49   140m         540Mi
18:49:15   136m         576Mi
18:49:24   136m         576Mi
18:49:51   137m         628Mi
18:50:17   137m         673Mi
18:50:45   137m         717Mi
18:50:55   137m         717Mi
18:50:55   129m         717Mi   (second reading at same timestamp)
18:51:21   141m         148Mi
18:51:30   141m         148Mi
18:51:39   131m         148Mi
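For context, output like the above can be produced with a simple watch loop along these lines (a sketch; the namespace is an assumption):

    # Sample pod CPU/memory and status roughly every 10 seconds.
    while true; do
      date
      oc adm top pod -n openshift-logging
      oc get pod -n openshift-logging
      sleep 10
    done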
Could you please answer these questions so that I can understand what happened:

The memory increased when? When you ran the reproducer?

Is the first report already from when the reproducer ran? Which version of the cluster did you use?

The bug was about a permanent memory increase with each request, but there is an obvious memory usage drop between the reports from "Fri Nov 8 18:50:55 CST 2019" and "Fri Nov 8 18:51:21 CST 2019", which I assume is when you stopped the reproducer, which would therefore mean that the bug was fixed?
(In reply to Standa Laznicka from comment #3)
> Could you please answer these questions so that I can understand what
> happened:
>
> The memory increased when? When you ran the reproducer?

Yes, the memory increased when I ran the reproducer.

> Is the first report already from when the reproducer ran? Which version of
> cluster did you use?

It is v4.3.

> The bug was about permanent memory increase with each request but there is
> an obvious memory usage drop in reports from "Fri Nov 8 18:50:55 CST 2019"
> and "Fri Nov 8 18:51:21 CST 2019", which I assume is when you stopped the
> reproducer, which would therefore mean that the bug was fixed?

When the memory usage dropped, the reproducer was still running, so there may be some improvement. But it seems that is not enough, as the 717Mi reading is close to the memory limit of 760Mi. I have observed the pod being restarted.
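The restart count and the configured limit can be confirmed directly from the pod spec; a sketch, where the pod name is taken from the earlier reports and may differ:

    # Restart count and memory limit of the kibana-proxy container.
    oc get pod kibana-598df8b964-gq5lm -n openshift-logging \
      -o jsonpath='{.status.containerStatuses[?(@.name=="kibana-proxy")].restartCount}'
    oc get pod kibana-598df8b964-gq5lm -n openshift-logging \
      -o jsonpath='{.spec.containers[?(@.name=="kibana-proxy")].resources.limits.memory}'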
Anping, can you also specify the version of the oauth-proxy image you tested against?

I tested the oauth-proxy that is distributed with the 4.3 cluster and has the fix. I bombed the container with some 2k requests in 10 seconds and did not see any unreasonable memory consumption increase, if any at all.

If you are indeed testing the fixed oauth-proxy version, can you add the --containers flag to `oc adm top` to confirm that it is actually the proxy that consumes the resources?
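For example (pod name and namespace are taken from the earlier reports and may differ):

    # Per-container usage inside the Kibana pod; shows whether kibana or
    # kibana-proxy is the one holding the memory.
    oc adm top pod kibana-598df8b964-gq5lm -n openshift-logging --containers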
Verified in v4.3.0-201911080552. The kibana pods weren't restarted in 4 hours.

[43]$ head check.logs
Already on project "openshift-logging" on server "https://api.qe-anlia3.qe.gcp.devcluster.openshift.com:6443".
Tue Nov 12 10:01:46 EST 2019
NAME                      READY   STATUS    RESTARTS   AGE
kibana-766bf67db5-rcqj6   2/2     Running   0          8m8s
POD                       NAME           CPU(cores)   MEMORY(bytes)
kibana-766bf67db5-rcqj6   kibana         8m           101Mi
kibana-766bf67db5-rcqj6   kibana-proxy   0m           11Mi
Tue Nov 12 10:01:52 EST 2019
NAME                      READY   STATUS    RESTARTS   AGE
kibana-766bf67db5-rcqj6   2/2     Running   0          8m14s

[anli@preserve-docker-slave 43]$ tail check.logs
kibana-766bf67db5-rcqj6   2/2     Running   0          16h
POD                       NAME           CPU(cores)   MEMORY(bytes)
kibana-766bf67db5-rcqj6   kibana         10m          95Mi
kibana-766bf67db5-rcqj6   kibana-proxy   24m          12Mi
Wed Nov 13 02:07:52 EST 2019
NAME                      READY   STATUS    RESTARTS   AGE
kibana-766bf67db5-rcqj6   2/2     Running   0          16h
POD                       NAME           CPU(cores)   MEMORY(bytes)
kibana-766bf67db5-rcqj6   kibana-proxy   24m          12Mi
kibana-766bf67db5-rcqj6   kibana         10m          95Mi
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062