Bug 1273749

Summary: | [intservice_public_91] OSE environment going to die after deploying EFK stack | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | wyue
Component: | Logging | Assignee: | Luke Meyer <lmeyer>
Status: | CLOSED DUPLICATE | QA Contact: | chunchen <chunchen>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 3.1.0 | CC: | aos-bugs, sdodson, wsun, wyue, xtian
Target Milestone: | --- | Keywords: | Reopened, TestBlocker
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-10-30 12:06:21 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | master.pprof; Another heap dump from today; Heap dump 2 from 2015-10-29; crash dump (SIGABRT) | |
Description
wyue, 2015-10-21 07:45:57 UTC
Related images info:

    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-kibana
    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-auth-proxy
    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-deployment
    rcm-img-docker01.build.eng.bos.redhat.com:5001/openshift3/logging-elasticsearch

Comment #3 (Luke Meyer):
Seems you're running out of RAM, and the culprit is this openshift process:

    14518 root 20 0 4244672 3.164g 1344 D 1.0 85.5 3:07.55 openshift

I'm not sure how our deployment could cause that and haven't seen it before, but I'll look into it. Can I ask how you deploy your test system? AMI, ansible with RPMs, etc.?

(In reply to Luke Meyer from comment #3)
It's ansible with RPMs.

Comment #5 (Luke Meyer):
I deployed a 3-node cluster with the latest OSE puddle today and I didn't see this. Please do the following:

1. Add this line to your master sysconfig file (/etc/sysconfig/atomic-openshift-master or origin-master): OPENSHIFT_PROFILE=web
2. Restart the master.

Then, if you see this happen again, run the following on the master to get a heap dump:

    curl 'http://127.0.0.1:6060/debug/pprof/heap?debug=1' > master.pprof

... and attach the file to this ticket.

(In reply to Luke Meyer from comment #5)
This didn't happen on the new OSE setup; I will continue to keep an eye on it.

Going to close this now to keep it off the bug list, as it doesn't appear to be blocking testing, but feel free to reopen if it is encountered again.

It seems this issue reproduced today on a two-node (one master, one node) OSE setup:

    [root@openshift-140 ~]# openshift version
    openshift v3.0.2.903-29-g49953d6
    kubernetes v1.2.0-alpha.1-1107-g4c8e6f4
    etcd 2.1.2

Below are the reproduce steps:
1. Deploy according to this doc: https://github.com/openshift/origin-aggregated-logging/tree/master/deployment, but found that es and kibana failed to deploy:

    [root@dhcp-128-16 ~]# oc get pods
    NAME                          READY   STATUS      RESTARTS   AGE
    logging-deployer-moxck        0/1     Completed   0          23m
    logging-deployer-zqbfr        0/1     Error       0          29m
    logging-es-0rqpoiqm-1-deploy  0/1     Error       0          18m
    logging-kibana-1-deploy       0/1     Error       0          1

    [root@dhcp-128-16 ~]# oc logs logging-kibana-1-deploy
    I1027 03:55:32.582019 1 deployer.go:196] Deploying logging/logging-kibana-1 for the first time (replicas: 1)
    I1027 03:55:32.640204 1 recreate.go:113] Scaling logging/logging-kibana-1 to 1 before performing acceptance check
    I1027 03:55:34.694625 1 recreate.go:118] Performing acceptance check of logging/logging-kibana-1
    I1027 03:55:34.694752 1 lifecycle.go:290] Waiting 600 seconds for pods owned by deployment "logging/logging-kibana-1" to become ready (checking every 1 seconds; 0 pods previously accepted)
    F1027 04:05:34.695147 1 deployer.go:65] update acceptor rejected logging/logging-kibana-1: pods for deployment "logging/logging-kibana-1" took longer than 600 seconds to become ready

    [root@dhcp-128-16 ~]# oc get logs logging-es-0rqpoiqm-1-deploy
    error: no resource "logs" has been defined
    [root@dhcp-128-16 ~]# oc logs logging-es-0rqpoiqm-1-deploy
    error: couldn't read version from server: Get https://openshift-140.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout

2. Then I found I had not set OPENSHIFT_PROFILE=web, so I set it according to your comment, restarted openshift, and redeployed es with 'oc deploy':

    [root@openshift-140 ~]# oc get pods
    Unable to connect to the server: net/http: TLS handshake timeout
    [root@openshift-140 ~]# oc get nodes
    Unable to connect to the server: EOF

3. Then got master.pprof as attached.

Created attachment 1086764 [details]
master.pprof
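For anyone picking up the triage: a heap profile captured from the pprof endpoint as above can normally be inspected against the matching binary with the standard Go tooling. This is only a sketch; it assumes the Go toolchain is available on the master and that /usr/bin/openshift is the same build that produced the dump (both are assumptions, not stated in the report).

```bash
# Sketch: inspect a heap dump such as master.pprof (binary path is an assumption).
# The profile was fetched from http://127.0.0.1:6060/debug/pprof/heap?debug=1,
# which go tool pprof can read together with the binary that produced it.
go tool pprof /usr/bin/openshift master.pprof
# At the interactive (pprof) prompt:
#   top          # functions retaining the most live memory
#   list <func>  # annotated source for a suspicious function
#   web          # call-graph visualization (requires graphviz)
```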
I found my env was deleted by others later, so I am not sure whether the bug reproduced or a broken env caused the problem this time; please ignore my above comments.

Reproduced this time. Steps:

1. Deploy the EFK stack according to: https://github.com/openshift/origin-aggregated-logging/tree/master/deployment
2. Wait for the deployments to complete; found both deployments in error (both fail because pods took longer than 600 seconds to become ready):

    [root@dhcp-128-16 ~]# oc get pods
    NAME                          READY   STATUS      RESTARTS   AGE
    logging-deployer-6hb70        0/1     Completed   0          23m
    logging-es-36n5hs16-1-deploy  0/1     Error       0          19m
    logging-kibana-1-deploy       0/1     Error       0          19m

3. Redeploy logging-es-36n5hs16 with 'oc deploy logging-es-36n5hs16 --latest'; the deploy pod starts. Then try to deploy kibana with 'oc deploy logging-kibana --latest' and get:

    Error from server: 502: (unhandled http status [Request Entity Too Large] with body []) [0]

4. Wait for a while; logging-es-36n5hs16 is running:

    [root@dhcp-128-16 ~]# oc get pods
    NAME                          READY   STATUS      RESTARTS   AGE
    logging-deployer-6hb70        0/1     Completed   0          33m
    logging-es-36n5hs16-1-deploy  0/1     Error       0          29m
    logging-es-36n5hs16-2-deploy  0/1     Error       0          17m
    logging-es-36n5hs16-3-b73t1   1/1     Running     0          3m
    logging-kibana-1-deploy       0/1     Error       0          29m

5. Wait a while longer, then get:

    [root@dhcp-128-16 ~]# oc get pods
    error: couldn't read version from server: Get https://openshift-158.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout
    [root@dhcp-128-16 ~]# oc get node
    error: couldn't read version from server: Get https://openshift-158.lab.eng.nay.redhat.com:8443/api: net/http: TLS handshake timeout

I will investigate today and see what I can find. It sounds like a system problem.

Created attachment 1087645 [details]
Another heap dump from today
This one is against the OSE openshift v3.0.2.903-74-gf49cee6 binary.
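When `oc` calls start failing with TLS handshake timeouts as in the steps above, it is worth confirming on the master itself whether the openshift process is again the one exhausting memory, as in comment #3. A rough sketch; the systemd unit name is assumed to match the sysconfig file mentioned earlier (atomic-openshift-master or origin-master, depending on the install).

```bash
# Sketch: check whether the openshift master is consuming the host's RAM.
# Unit name (atomic-openshift-master vs. origin-master) depends on the install.
free -m                                        # overall memory and swap pressure
ps -eo pid,rss,vsz,comm --sort=-rss | head     # largest processes by resident memory
systemctl status atomic-openshift-master       # is the master alive or restart-looping?
journalctl -u atomic-openshift-master -n 100   # recent master logs (OOM, panics, etc.)
```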
Created attachment 1087649 [details]
Heap dump 2 from 2015-10-29
Created attachment 1087650 [details]
crash dump (SIGABRT)
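The attachment name suggests the stack dump was obtained by signaling the master process. For reference, a Go binary such as openshift exits with a dump of all goroutine stacks when it receives SIGQUIT or SIGABRT; below is only a sketch of how such a dump could be captured, with the unit name and use of the journal as assumptions.

```bash
# Sketch: capture a stack dump from the running master (relies on Go runtime
# signal handling). Note: this terminates the master process.
pid=$(pidof openshift)
kill -ABRT "$pid"
# The dump goes wherever the master's stderr goes; with systemd that is the journal:
journalctl -u atomic-openshift-master --no-pager | tail -n 500 > crash-dump.txt
```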
I see the problem occurring. I have no idea why it happens for you and not me. I took some more dumps and Clayton looked at them. It seems to be a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1273149 though.

I don't think we can spend any more time debugging this until the fix for 1273149 is merged. I will mark it a duplicate of that for now and we can resurrect it later if it turns out to be something different.

*** This bug has been marked as a duplicate of bug 1273149 ***