Bug 1273749

Summary: | [intservice_public_91] OSE environment going to die after deploying EFK stack | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | wyue
Component: | Logging | Assignee: | Luke Meyer <lmeyer>
Status: | CLOSED DUPLICATE | QA Contact: | chunchen <chunchen>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 3.1.0 | CC: | aos-bugs, sdodson, wsun, wyue, xtian
Target Milestone: | --- | Keywords: | Reopened, TestBlocker
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-10-30 12:06:21 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | master.pprof; Another heap dump from today; Heap dump 2 from 2015-10-29; crash dump (SIGABRT) | |
Description
wyue, 2015-10-21 07:45:57 UTC
Related images info:

    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-kibana
    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-auth-proxy
    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-deployment
    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-elasticsearch

Comment #3 (Luke Meyer):
Seems you're running out of RAM, and the culprit is this openshift process:

    14518 root 20 0 4244672 3.164g 1344 D 1.0 85.5 3:07.55 openshift

I'm not sure how our deployment could cause that and haven't seen it before, but I'll look into it. Can I ask how you deploy your test system? AMI, ansible with RPMs, etc.?

(In reply to Luke Meyer from comment #3)
It's ansible with RPMs.

Comment #5 (Luke Meyer):
I deployed a 3-node cluster with the latest OSE puddle today and I didn't see this. Please do the following:

1. Add this line to your master sysconfig file (/etc/sysconfig/atomic-openshift-master or origin-master): OPENSHIFT_PROFILE=web
2. Restart the master.

Then, if you see this happen again, run the following on the master to get a heap dump:

    curl 'http://127.0.0.1:6060/debug/pprof/heap?debug=1' > master.pprof

... and attach the file to this ticket.

(In reply to Luke Meyer from comment #5)
This didn't happen on the new OSE setup; I will continue to keep an eye on it.

Going to close this now to keep it off the bug list, as it doesn't appear to be blocking testing, but feel free to reopen if it is encountered again.

It seems this issue reproduced today on a two-node (one master, one node) OSE setup:

    [root@openshift-140 ~]# openshift version
    openshift v3.0.2.903-29-g49953d6
    kubernetes v1.2.0-alpha.1-1107-g4c8e6f4
    etcd 2.1.2

Below are the reproduce steps:
1. Deploy according to this doc: https://github.com/openshift/origin-aggregated-logging/tree/master/deployment, but found that es and kibana failed to deploy:

    [root@dhcp-128-16 ~]# oc get pods
    NAME                          READY   STATUS      RESTARTS   AGE
    logging-deployer-moxck        0/1     Completed   0          23m
    logging-deployer-zqbfr        0/1     Error       0          29m
    logging-es-0rqpoiqm-1-deploy  0/1     Error       0          18m
    logging-kibana-1-deploy       0/1     Error       0          1

    [root@dhcp-128-16 ~]# oc logs logging-kibana-1-deploy
    I1027 03:55:32.582019 1 deployer.go:196] Deploying logging/logging-kibana-1 for the first time (replicas: 1)
    I1027 03:55:32.640204 1 recreate.go:113] Scaling logging/logging-kibana-1 to 1 before performing acceptance check
    I1027 03:55:34.694625 1 recreate.go:118] Performing acceptance check of logging/logging-kibana-1
    I1027 03:55:34.694752 1 lifecycle.go:290] Waiting 600 seconds for pods owned by deployment "logging/logging-kibana-1" to become ready (checking every 1 seconds; 0 pods previously accepted)
    F1027 04:05:34.695147 1 deployer.go:65] update acceptor rejected logging/logging-kibana-1: pods for deployment "logging/logging-kibana-1" took longer than 600 seconds to become ready

    [root@dhcp-128-16 ~]# oc get logs logging-es-0rqpoiqm-1-deploy
    error: no resource "logs" has been defined
    [root@dhcp-128-16 ~]# oc logs logging-es-0rqpoiqm-1-deploy
    error: couldn't read version from server: Get https://openshift-140.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout

2. Then I found I had not set OPENSHIFT_PROFILE=web, so I set it according to your comment, restarted openshift, and redeployed es with 'oc deploy':

    [root@openshift-140 ~]# oc get pods
    Unable to connect to the server: net/http: TLS handshake timeout
    [root@openshift-140 ~]# oc get nodes
    Unable to connect to the server: EOF

3. Then got master.pprof as attached.

Created attachment 1086764 [details]
master.pprof
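For anyone picking up the triage: a heap profile captured from the pprof endpoint as above can normally be inspected against the matching binary with the standard Go tooling. This is only a sketch; it assumes the Go toolchain is available on the master and that /usr/bin/openshift is the same build that produced the dump (both are assumptions, not stated in the report).

```bash
# Sketch: inspect a heap dump such as master.pprof (binary path is an assumption).
# The profile was fetched from http://127.0.0.1:6060/debug/pprof/heap?debug=1,
# which go tool pprof can read together with the binary that produced it.
go tool pprof /usr/bin/openshift master.pprof
# At the interactive (pprof) prompt:
#   top          # functions retaining the most live memory
#   list <func>  # annotated source for a suspicious function
#   web          # call-graph visualization (requires graphviz)
```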
I found my env was deleted by others later, so I am not sure whether the bug reproduced or a broken env caused the problem this time; please ignore my above comments.

Reproduced this time. Steps:

1. Deploy the EFK stack according to: https://github.com/openshift/origin-aggregated-logging/tree/master/deployment
2. Wait for the deployments to complete; found both deployments in error (both fail because pods took longer than 600 seconds to become ready):

    [root@dhcp-128-16 ~]# oc get pods
    NAME                          READY   STATUS      RESTARTS   AGE
    logging-deployer-6hb70        0/1     Completed   0          23m
    logging-es-36n5hs16-1-deploy  0/1     Error       0          19m
    logging-kibana-1-deploy       0/1     Error       0          19m

3. Redeploy logging-es-36n5hs16 with 'oc deploy logging-es-36n5hs16 --latest'; the deploy pod starts. Then try to deploy kibana with 'oc deploy logging-kibana --latest' and get:

    Error from server: 502: (unhandled http status [Request Entity Too Large] with body []) [0]

4. Wait for a while; logging-es-36n5hs16 is running:

    [root@dhcp-128-16 ~]# oc get pods
    NAME                          READY   STATUS      RESTARTS   AGE
    logging-deployer-6hb70        0/1     Completed   0          33m
    logging-es-36n5hs16-1-deploy  0/1     Error       0          29m
    logging-es-36n5hs16-2-deploy  0/1     Error       0          17m
    logging-es-36n5hs16-3-b73t1   1/1     Running     0          3m
    logging-kibana-1-deploy       0/1     Error       0          29m

5. Wait a while longer, then get:

    [root@dhcp-128-16 ~]# oc get pods
    error: couldn't read version from server: Get https://openshift-158.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout
    [root@dhcp-128-16 ~]# oc get node
    error: couldn't read version from server: Get https://openshift-158.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout

I will investigate today and see what I can find. It sounds like a system problem.

Created attachment 1087645 [details]
Another heap dump from today
This one is against the OSE openshift v3.0.2.903-74-gf49cee6 binary.
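When `oc` calls start failing with TLS handshake timeouts as in the steps above, it is worth confirming on the master itself whether the openshift process is again the one exhausting memory, as in comment #3. A rough sketch; the systemd unit name is assumed to match the sysconfig file mentioned earlier (atomic-openshift-master or origin-master, depending on the install).

```bash
# Sketch: check whether the openshift master is consuming the host's RAM.
# Unit name (atomic-openshift-master vs. origin-master) depends on the install.
free -m                                        # overall memory and swap pressure
ps -eo pid,rss,vsz,comm --sort=-rss | head     # largest processes by resident memory
systemctl status atomic-openshift-master       # is the master alive or restart-looping?
journalctl -u atomic-openshift-master -n 100   # recent master logs (OOM, panics, etc.)
```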
Created attachment 1087649 [details]
Heap dump 2 from 2015-10-29
Created attachment 1087650 [details]
crash dump (SIGABRT)
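The attachment name suggests the stack dump was obtained by signaling the master process. For reference, a Go binary such as openshift exits with a dump of all goroutine stacks when it receives SIGQUIT or SIGABRT; below is only a sketch of how such a dump could be captured, with the unit name and use of the journal as assumptions.

```bash
# Sketch: capture a stack dump from the running master (relies on Go runtime
# signal handling). Note: this terminates the master process.
pid=$(pidof openshift)
kill -ABRT "$pid"
# The dump goes wherever the master's stderr goes; with systemd that is the journal:
journalctl -u atomic-openshift-master --no-pager | tail -n 500 > crash-dump.txt
```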
I see the problem occurring. I have no idea why it happens for you and not me. I took some more dumps and Clayton looked at them. It seems to be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1273149 though.

I don't think we can spend any more time debugging this until the fix for 1273149 is merged. I will mark it a duplicate of that for now and we can resurrect it later if it turns out to be something different.

*** This bug has been marked as a duplicate of bug 1273149 ***