Description of problem:
The OSE environment becomes very slow and can't even be accessed after deploying the EFK stack.

top - 15:41:41 up  6:10,  8 users,  load average: 10.21, 9.36, 5.57
Tasks: 157 total,   2 running, 155 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  1.0 sy,  0.0 ni, 24.7 id, 73.6 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3882328 total,   103860 free,  3629244 used,   149224 buff/cache
KiB Swap:   511996 total,   134716 free,   377280 used.    59416 avail Mem

  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 3566 etcd      20   0  197276  86228  1900 D   1.3  2.2   7:13.72 etcd
14518 root      20   0 4244672 3.164g  1344 D   1.0 85.5   3:07.55 openshift
 8884 root      20   0  856296  33132  5812 S   0.7  0.9   3:20.28 openshift
   34 root      20   0       0      0     0 S   0.3  0.0   3:28.39 kswapd0
  390 root      20   0       0      0     0 S   0.3  0.0   0:03.11 xfsaild/dm-0
14965 root      20   0  146444   1904   940 S   0.3  0.0   0:00.99 top
15033 root      20   0  126564   4832  3812 D   0.3  0.1   0:00.01 oc
    1 root      20   0  191360   2768  1336 S   0.0  0.1   0:04.89 systemd
    2 root      20   0       0      0     0 S   0.0  0.0   0:00.05 kthreadd

Version-Release number of selected component (if applicable):
oc v3.0.2.901-61-g568adb6
kubernetes v1.1.0-alpha.1-653-g86b4e77
1 master + 2 nodes

How reproducible:
Always (I tried twice).

Steps to Reproduce:
1. Deploy the EFK stack according to this doc: https://github.com/openshift/origin-aggregated-logging/tree/master/deployment
2. Wait for the pods to become ready; the pods are terminated after being pending for more than 600s:

[root@dhcp-128-16 sample-app]# oc get pods
NAME                           READY     STATUS        RESTARTS   AGE
logging-deployer-2noyz         0/1       Error         0          19m
logging-es-kctfvi63-1-deploy   0/1       Error         0          16m
logging-es-kctfvi63-1-gozaq    0/1       Terminating   0          16m
logging-kibana-1-deploy        0/1       Error         0          16m
logging-kibana-1-k53x2         0/2       Terminating   0          16m

3. Run 'oc get pods' from the oc client.

Actual results:
[root@dhcp-128-16 sample-app]# oc get pods
error: couldn't read version from server: Get https://openshift-157.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout

Expected results:
The pod list should be returned correctly.

Additional info:
After restarting the OpenShift services and deleting the logging project, everything went back to normal and OSE worked fine.
Related image info:
rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-kibana
rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-auth-proxy
rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-deployment
rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-elasticsearch
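The top output in the description already points at memory pressure on the master: load average above 10, 73.6% iowait, swap nearly exhausted, and a single openshift process at 85.5% of memory. For future occurrences, a minimal sketch of standard commands to confirm this on the master host (nothing here is specific to the EFK deployment):

# overall memory and swap usage
free -m
# one-second samples: watch the si/so (swap in/out) and wa (iowait) columns
vmstat 1 5
# largest memory consumers, to identify the process driving the swapping
ps aux --sort=-%mem | head -n 10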
Seems you're running out of RAM, and the culprit is this openshift process:

14518 root      20   0 4244672 3.164g  1344 D   1.0 85.5   3:07.55 openshift

I'm not sure how our deployment could cause that and haven't seen it before, but I'll look into it.

Can I ask how you deployed your test system? AMI, Ansible with RPMs, etc.?
(In reply to Luke Meyer from comment #3)
> Seems you're running out of RAM, and the culprit is this openshift process:
>
> 14518 root      20   0 4244672 3.164g  1344 D   1.0 85.5   3:07.55 openshift
>
> I'm not sure how our deployment could cause that and haven't seen it before,
> but I'll look into it.
>
> Can I ask how you deploy your test system? AMI, ansible with RPMs, etc...

It's Ansible with RPMs.
I deployed a 3-node cluster with the latest OSE puddle today and I didn't see this.

Please do the following:
1. Add this line to your master sysconfig file (/etc/sysconfig/atomic-openshift-master or origin-master):
   OPENSHIFT_PROFILE=web
2. Restart the master.

Then, if you see this happen again, run the following on the master to get a heap dump:

curl 'http://127.0.0.1:6060/debug/pprof/heap?debug=1' > master.pprof

... and attach the file to this ticket.
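For reference, the same steps as shell commands; this is a sketch that assumes the master runs as the atomic-openshift-master systemd service (use the origin-master unit and sysconfig file on an Origin install):

# enable the pprof web endpoint on the master
echo 'OPENSHIFT_PROFILE=web' >> /etc/sysconfig/atomic-openshift-master
systemctl restart atomic-openshift-master

# once the slowdown recurs, capture the heap profile locally on the master
curl 'http://127.0.0.1:6060/debug/pprof/heap?debug=1' > master.pprof

The debug=1 form is human-readable text. A profile captured without debug=1 can instead be inspected offline with 'go tool pprof /usr/bin/openshift master.pprof', assuming a Go toolchain is available and /usr/bin/openshift is the master binary.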
(In reply to Luke Meyer from comment #5)
> I deployed a 3-node cluster with the latest OSE puddle today and I didn't
> see this.
>
> Please do the following:
> 1. Add this line to your master sysconfig file
> (/etc/sysconfig/atomic-openshift-master or origin-master):
> OPENSHIFT_PROFILE=web
> 2. Restart the master
>
> Then, if you see this happen again, run the following on the master to get a
> heap dump:
> curl 'http://127.0.0.1:6060/debug/pprof/heap?debug=1' > master.pprof
>
> ... and attach the file to this ticket.

This didn't happen on a new OSE setup; I will keep paying attention to it.
Going to close this now to keep it off the bug list as it doesn't appear to be blocking testing, but feel free to re-open if encountered again.
It seems this issue reproduced today on a two-node (one master, one node) OSE setup:

[root@openshift-140 ~]# openshift version
openshift v3.0.2.903-29-g49953d6
kubernetes v1.2.0-alpha.1-1107-g4c8e6f4
etcd 2.1.2

Reproduce steps:

1. Deploy according to this doc: https://github.com/openshift/origin-aggregated-logging/tree/master/deployment
   but found that the es and kibana deployments failed:

[root@dhcp-128-16 ~]# oc get pods
NAME                           READY     STATUS      RESTARTS   AGE
logging-deployer-moxck         0/1       Completed   0          23m
logging-deployer-zqbfr         0/1       Error       0          29m
logging-es-0rqpoiqm-1-deploy   0/1       Error       0          18m
logging-kibana-1-deploy        0/1       Error       0          1

[root@dhcp-128-16 ~]# oc logs logging-kibana-1-deploy
I1027 03:55:32.582019       1 deployer.go:196] Deploying logging/logging-kibana-1 for the first time (replicas: 1)
I1027 03:55:32.640204       1 recreate.go:113] Scaling logging/logging-kibana-1 to 1 before performing acceptance check
I1027 03:55:34.694625       1 recreate.go:118] Performing acceptance check of logging/logging-kibana-1
I1027 03:55:34.694752       1 lifecycle.go:290] Waiting 600 seconds for pods owned by deployment "logging/logging-kibana-1" to become ready (checking every 1 seconds; 0 pods previously accepted)
F1027 04:05:34.695147       1 deployer.go:65] update acceptor rejected logging/logging-kibana-1: pods for deployment "logging/logging-kibana-1" took longer than 600 seconds to become ready

[root@dhcp-128-16 ~]# oc get logs logging-es-0rqpoiqm-1-deploy
error: no resource "logs" has been defined
[root@dhcp-128-16 ~]# oc logs logging-es-0rqpoiqm-1-deploy
error: couldn't read version from server: Get https://openshift-140.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout

2. Then I found I hadn't set OPENSHIFT_PROFILE=web, so I set it according to your comment, restarted OpenShift, and redeployed es with 'oc deploy':

[root@openshift-140 ~]# oc get pods
Unable to connect to the server: net/http: TLS handshake timeout
[root@openshift-140 ~]# oc get nodes
Unable to connect to the server: EOF

3. Then got master.pprof as attached.
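When the API starts returning "TLS handshake timeout" like this, it can also help to check the master directly from the master host before capturing the heap dump; a sketch, assuming the atomic-openshift-master unit name used earlier in this report:

# is the master service still running?
systemctl status atomic-openshift-master
# does the API answer locally? (-k skips certificate verification for a quick check)
curl -k https://localhost:8443/healthz
# recent master log entries
journalctl -u atomic-openshift-master -n 200 --no-pager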
Created attachment 1086764 [details] master.pprof
I later found that my environment had been deleted by someone else, so I am not sure whether the bug actually reproduced or the broken environment caused the problem this time. Please ignore my comments above.
Reproduced this time. Steps:

1. Deploy the EFK stack according to: https://github.com/openshift/origin-aggregated-logging/tree/master/deployment
2. Wait for the deployments to complete; both deployments end in error (both fail because the pods took longer than 600 seconds to become ready):

[root@dhcp-128-16 ~]# oc get pods
NAME                           READY     STATUS      RESTARTS   AGE
logging-deployer-6hb70         0/1       Completed   0          23m
logging-es-36n5hs16-1-deploy   0/1       Error       0          19m
logging-kibana-1-deploy        0/1       Error       0          19m

3. Deploy logging-es-36n5hs16 with 'oc deploy logging-es-36n5hs16 --latest'; the deploy pod starts. Then try to deploy kibana with 'oc deploy logging-kibana --latest' and get:
Error from server: 502: (unhandled http status [Request Entity Too Large] with body []) [0]

4. After a while, logging-es-36n5hs16 is running:

[root@dhcp-128-16 ~]# oc get pods
NAME                           READY     STATUS      RESTARTS   AGE
logging-deployer-6hb70         0/1       Completed   0          33m
logging-es-36n5hs16-1-deploy   0/1       Error       0          29m
logging-es-36n5hs16-2-deploy   0/1       Error       0          17m
logging-es-36n5hs16-3-b73t1    1/1       Running     0          3m
logging-kibana-1-deploy        0/1       Error       0          29m

5. Wait a while longer, then get:

[root@dhcp-128-16 ~]# oc get pods
error: couldn't read version from server: Get https://openshift-158.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout
[root@dhcp-128-16 ~]# oc get node
error: couldn't read version from server: Get https://openshift-158.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout
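For anyone retrying this sequence, a sketch of the redeploy-and-watch loop from steps 3-4, using the deployment config name taken from the listing above (the hash suffix differs per install, and the deployer pod name changes with each deployment attempt):

# retrigger the failed elasticsearch deployment
oc deploy logging-es-36n5hs16 --latest
# follow the new deployer pod while its 600-second acceptance check runs
oc get pods
oc logs -f logging-es-36n5hs16-2-deploy   # use plain 'oc logs' if -f is not supported by this oc version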
I will investigate today and see what I can find. It sounds like a system problem.
Created attachment 1087645 [details]
Another heap dump from today

This one is against the OSE openshift v3.0.2.903-74-gf49cee6 binary.
Created attachment 1087649 [details] Heap dump 2 from 2015-10-29
Created attachment 1087650 [details] crash dump (SIGABRT)
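For context on how a dump like this can be captured: sending SIGQUIT or SIGABRT to a Go binary makes the runtime print goroutine stack traces before it exits. A sketch of one way to do it, assuming the master runs as a single 'openshift start master' process under the atomic-openshift-master unit and that killing and restarting it is acceptable:

# ask the Go runtime for a stack dump (this terminates the master process)
kill -ABRT $(pgrep -f 'openshift start master')
# the dump goes to the service's stderr, i.e. the journal
journalctl -u atomic-openshift-master --no-pager | tail -n 1000 > crash-dump.txt
# bring the master back up
systemctl restart atomic-openshift-master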
I see the problem occurring. I have no idea why it happens for you and not me. I took some more dumps and Clayton looked at them. It seems to be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1273149 though.
I don't think we can spend any more time debugging this until the fix for 1273149 is merged. I will mark it a duplicate of that for now and we can resurrect it later if it turns out to be something different. *** This bug has been marked as a duplicate of bug 1273149 ***