Bug 1439451 - Kibana pod is CrashLoopBackOff after logging v3.6 was deployed
Summary: Kibana pod is CrashLoopBackOff after logging v3.6 was deployed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
URL:
Whiteboard: aos-scalability-36
Duplicates: 1449858 1452807 (view as bug list)
Depends On:
Blocks: 1446217
 
Reported: 2017-04-06 03:26 UTC by Xia Zhao
Modified: 2020-09-10 10:26 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A change in the auth proxy kept it from finding its dependent files. Consequence: The auth proxy would terminate because it could not find those files. Fix: Add ENV vars to the deploymentconfig with the correct path to the files. Result: The openshift-auth-proxy finds its dependent files and starts.
Clone Of:
Environment:
Last Closed: 2017-08-10 05:20:02 UTC
Target Upstream Version:
Embargoed:


Attachments
full ansible execution logs (1.40 MB, text/plain)
2017-04-06 03:32 UTC, Xia Zhao
inventory file for logging 3.6.0 stacks' deployment (1.18 KB, text/plain)
2017-04-06 03:35 UTC, Xia Zhao
screenshot when visiting the kibana route from a web browser (54.79 KB, image/png)
2017-04-06 03:38 UTC, Xia Zhao
es_log (3.96 KB, text/plain)
2017-04-06 06:24 UTC, Xia Zhao
fluentd_log (48.31 KB, text/plain)
2017-04-06 06:24 UTC, Xia Zhao
kibana log of container kibana (86.54 KB, text/plain)
2017-04-06 06:25 UTC, Xia Zhao
kibana log of container kibana-proxy (283 bytes, text/plain)
2017-04-06 06:25 UTC, Xia Zhao
info you wanted, there is no logging-kibana configmap (11.14 KB, text/plain)
2017-04-07 02:40 UTC, Junqi Zhao
kibana dc info (7.05 KB, text/plain)
2017-04-07 03:13 UTC, Junqi Zhao
the kibana,es logs on May 10, 2017 (9.48 KB, text/plain)
2017-05-10 09:20 UTC, Xia Zhao
logging 3.5 kibana dc info (4.45 KB, text/plain)
2017-05-11 08:36 UTC, Junqi Zhao
logging 3.6 kibana dc info (3.80 KB, text/plain)
2017-05-11 08:44 UTC, Junqi Zhao
kibana dc, pods info (77.90 KB, text/plain)
2017-05-27 01:52 UTC, Junqi Zhao
kibana and kibana-proxy logs from 3.6.116 (790 bytes, application/x-gzip)
2017-06-20 06:23 UTC, Mike Fiedler
kibana and kibana-proxy logs from 3.6.122 internal build (1.20 KB, application/x-gzip)
2017-06-22 03:26 UTC, Mike Fiedler


Links
Red Hat Product Errata RHEA-2017:1716 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.6 RPM Release Advisory, last updated 2017-08-10 09:02:50 UTC

Description Xia Zhao 2017-04-06 03:26:21 UTC
Description of problem:
Kibana pod is CrashLoopBackOff after logging 3.6.0 was deployed. The ansible script execution finished successfully and ES is in green status, but the kibana route is not accessible.

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-0v14q       1/1       Running            0          18h
logging-es-8r3vyszi-1-gz5hf   1/1       Running            0          18h
logging-fluentd-4twb2         1/1       Running            0          18h
logging-kibana-1-4v736        1/2       CrashLoopBackOff   138        18h

# oc logs -f logging-kibana-1-4v736 -c kibana-proxy 
Could not read TLS opts from secret/server-tls.json; error was: Error: ENOENT: no such file or directory, open 'secret/server-tls.json'
Starting up the proxy with auth mode "oauth2" and proxy transform "user_header,token_header".

Error message when visiting the kibana route (at the same time, the openshift router is running fine, and a second route for applications other than logging worked fine):
Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.
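
A quick, hedged way to confirm whether the proxy's dependent files are actually mounted (a sketch only; the pod and container names are taken from the listing above, and /secret as the mount path is an assumption):

# oc exec logging-kibana-1-4v736 -c kibana-proxy -- ls /secret

If server-tls.json and the other secret files show up under /secret while the error above refers to the relative path 'secret/server-tls.json', that would suggest the proxy is looking in the wrong place.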

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.15-1.git.0.d2b88f8.el7.noarch
openshift-ansible-playbooks-3.6.15-1.git.0.d2b88f8.el7.noarch
openshift-ansible-roles-3.6.15-1.git.0.d2b88f8.el7.noarch

logging images on ops registry:
openshift3/logging-auth-proxy                              3.6.0               5cd70d92d4ef
openshift3/logging-kibana                                  3.6.0               925583fe8c13 

# openshift version
openshift v3.6.16
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:
always

Steps to Reproduce:
1. Deploy logging 3.6.0 stacks on OCP 3.6.0 by running ansible scripts
2. Check EFK pods' status
3. Check kibana route

Actual results:
Kibana pod is in CrashLoopBackOff; kibana route is not accessible

Expected results:
EFK pods are in Running status and the logging UI works fine

Additional info:
ansible inventory file and execution logs attached
EFK logs attached
logging UI screenshot attached

Comment 1 Xia Zhao 2017-04-06 03:32:41 UTC
Created attachment 1269161 [details]
full ansible execution logs

Comment 2 Xia Zhao 2017-04-06 03:35:00 UTC
Created attachment 1269162 [details]
inventory file for logging 3.6.0 stacks' deployment

Comment 3 Xia Zhao 2017-04-06 03:38:40 UTC
Created attachment 1269163 [details]
screenshot when visiting the kibana route from a web browser

Comment 4 Xia Zhao 2017-04-06 04:01:48 UTC
Blocks 3.6.0 logging tests

Comment 5 Xia Zhao 2017-04-06 06:24:04 UTC
Created attachment 1269168 [details]
es_log

Comment 6 Xia Zhao 2017-04-06 06:24:27 UTC
Created attachment 1269169 [details]
fluentd_log

Comment 7 Xia Zhao 2017-04-06 06:25:00 UTC
Created attachment 1269170 [details]
kibana log of container kibana

Comment 8 Xia Zhao 2017-04-06 06:25:23 UTC
Created attachment 1269171 [details]
kibana log of container kibana-proxy

Comment 9 Rich Megginson 2017-04-07 01:19:05 UTC
can you provide
oc get secret logging-kibana -o yaml
and
oc get configmap logging-kibana -o yaml

Comment 10 Junqi Zhao 2017-04-07 02:40:43 UTC
Created attachment 1269539 [details]
info you wanted, there is no logging-kibana configmap

Comment 11 Rich Megginson 2017-04-07 03:00:17 UTC
sorry, I meant 

oc get secret logging-kibana -o yaml

instead of configmap

Comment 12 Rich Megginson 2017-04-07 03:00:47 UTC
(In reply to Rich Megginson from comment #11)
> sorry, I meant 
> 
> oc get secret logging-kibana -o yaml
> 
> instead of configmap

I meant

oc get dc logging-kibana -o yaml

instead of configmap

Comment 13 Junqi Zhao 2017-04-07 03:13:22 UTC
Created attachment 1269543 [details]
kibana dc info

Comment 14 Xia Zhao 2017-04-12 08:15:25 UTC
Issue reproduced with 

# openshift version
openshift v3.5.5.5
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Images tested with (ops registry):
logging-kibana          3.6.0               925583fe8c13
logging-auth-proxy      3.6.0               5cd70d92d4ef

# oc get po -n logging
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-274nh       1/1       Running            1          3m
logging-es-np7xeu0t-1-j970v   1/1       Running            0          3m
logging-fluentd-h54ds         1/1       Running            0          3m
logging-kibana-1-2rn4z        1/2       CrashLoopBackOff   4          3m

Comment 15 Junqi Zhao 2017-04-19 05:32:44 UTC
@Rich,

This defect is still not fixed; Kibana pods are in CrashLoopBackOff status.

Comment 16 Noriko Hosoi 2017-04-21 01:40:35 UTC
It may not be the same symptom found by Junqi and Xia, since my kibana log is a bit different and your ES looks to be running healthy, but I also duplicated the CrashLoopBackOff on beaker.
logging-kibana-1-dscf1         1/2       CrashLoopBackOff   6          12m

How to reproduce:
$ git clone http://git.app.eng.bos.redhat.com/srv/git/ViaQ.git
BRANCH: 3.6
CMDLINE: bkr job-submit el7-ocp-36.xml

Kibana log:
# oc logs logging-kibana-1-dscf1 -c=kibana
.....

{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["warning","elasticsearch"],"pid":9,"message":"Unable to revive connection: https://logging-es:9200/"}
{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["warning","elasticsearch"],"pid":9,"message":"No living connections"}
{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["status","plugin:elasticsearch.0","error"],"pid":9,"state":"red","message":"Status changed from red to red - Unable to connect to Elasticsearch at https://logging-es:9200.","prevState":"red","prevMsg":"Request Timeout after 3000ms"}
.....
{"type":"log","@timestamp":"2017-04-20T22:49:58Z","tags":["warning","elasticsearch"],"pid":9,"message":"Unable to revive connection: https://logging-es:9200/"}
{"type":"log","@timestamp":"2017-04-20T22:49:58Z","tags":["warning","elasticsearch"],"pid":9,"message":"No living connections"}
.....

The pair of messages "Unable to revive connection: https://logging-es:9200/" and "No living connections" are repeated endlessly.  In the beginning, the status of kibana was Error:
    logging-kibana-1-dscf1       1/2     Error              0    2m
but eventually, it turned into CrashLoopBackOff:
    logging-kibana-1-dscf1       1/2     CrashLoopBackOff   6    12m

I'm wondering whether this Timeout message could change the status of kibana from Error to CrashLoopBackOff?  The same message is found in https://bugzilla.redhat.com/show_bug.cgi?id=1439451#c7 as well.
{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["status","plugin:elasticsearch.0","error"],"pid":9,"state":"red","message":"Status changed from red to red - Unable to connect to Elasticsearch at https://logging-es:9200.","prevState":"red","prevMsg":"Request Timeout after 3000ms"}

More seriously, my ES pod eventually disappears.  It looks to me like this is the root cause of the above connection errors on Kibana.
# oc get pods
NAME                           READY     STATUS             RESTARTS   AGE
logging-curator-1-c571w        1/1       Running            3          13m
logging-es-1vszhw7i-1-deploy   0/1       Error              0          12m
logging-fluentd-02d8t          1/1       Running            0          12m
logging-kibana-1-dscf1         1/2       CrashLoopBackOff   6          12m
logging-mux-1-pt15q            1/1       Running            0          10m

Please note that the pods were running fine in the beginning.
logging-es-1vszhw7i-1-deploy   1/1       Running             0          2m
logging-es-1vszhw7i-1-x4fgh    0/1       Running             0          2m

I restarted the ES and tried to find out what crashed the ES.  Here's my observation.

First, I would like to double-check whether it is correct or not that only logging-es is created (i.e., no logging-es-ops)

NAME                           READY     STATUS              RESTARTS   AGE
logging-curator-1-c571w        1/1       Running             0          3m
logging-es-1vszhw7i-1-deploy   1/1       Running             0          2m
logging-es-1vszhw7i-1-x4fgh    0/1       Running             0          2m
logging-fluentd-02d8t          1/1       Running             0          2m
logging-kibana-1-deploy        1/1       Running             0          2m
logging-kibana-1-dscf1         2/2       Running             0          2m
logging-mux-1-deploy           1/1       Running             0          2m
logging-mux-1-pt15q            0/1       ContainerCreating   0          51s

If I run a search against "https://logging-es-ops:9200", it terminated the ES.
logging-es-1vszhw7i-3-14pfq    0/1       Terminating        0          10m

In our test cases/health checks, we could issue such an operation, which would put the rest of the pods in the error state, i.e., they fail to connect to the ES.

Comment 17 Rich Megginson 2017-04-21 02:30:03 UTC
Noriko, what's in the pod log for the logging-es-1vszhw7i-1-deploy   0/1       Error pod?

Comment 18 Noriko Hosoi 2017-04-22 23:50:08 UTC
(In reply to Rich Megginson from comment #17)
> Noriko, what's in the pod log for the logging-es-1vszhw7i-1-deploy   0/1    
> Error pod?

I restarted a new beaker job.
https://beaker.engineering.redhat.com/jobs/1820081

Prior to the status Error:
  logging-es-hjrv78go-1-crqvh    0/1       Running            0          8m
  logging-es-hjrv78go-1-deploy   1/1       Running            0          8m
==> pod log of logging-es-hjrv78go-1-deploy <==
--> Scaling logging-es-hjrv78go-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-hjrv78go-1 to become ready

After the status turned to Error:
  logging-es-hjrv78go-1-deploy   0/1       Error              0          11m
==> pod log of logging-es-hjrv78go-1-deploy <==
--> Scaling logging-es-hjrv78go-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-hjrv78go-1 to become ready
error: update acceptor rejected logging-es-hjrv78go-1: pods for rc "logging-es-hjrv78go-1" took longer than 600 seconds to become ready

This time, this command line terminated logging-es-hjrv78go-1-crqvh.  (I.e., my previous comment that accessing via logging-es-ops might have terminated the ES was incorrect.)
# oc exec logging-es-hjrv78go-1-crqvh -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://logging-es:9200/_cat/indices?v'
Error from server (NotFound): pods "logging-es-hjrv78go-1-crqvh" not found



Regarding the status of Kibana, it's "running" when it's started.  Then, it goes to CrashLoopBackOff.
  logging-kibana-1-1c83p         2/2       Running   3          4m
===>
  logging-kibana-1-1c83p         1/2       CrashLoopBackOff   3          5m

Following is the early part of the log of logging-kibana-1-1c83p.  Even when the status is Running, the log already has an error message (the 3rd one, with EHOSTUNREACH).  Then it takes some time to turn into CrashLoopBackOff.  By that time, the log just repeats "Unable to revive connection: https://logging-es:9200/".

{"type":"log","@timestamp":"2017-04-22T20:42:01Z","tags":["status","plugin:table_vis.0","info"],"pid":9,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
 <== the last "green" message

{"type":"log","@timestamp":"2017-04-22T20:42:01Z","tags":["listening","info"],"pid":9,"message":"Server running at http://0.0.0.0:5601"}

{"type":"log","@timestamp":"2017-04-22T20:42:02Z","tags":["error","elasticsearch"],"pid":9,"message":"Request error, retrying -- connect EHOSTUNREACH 172.30.224.19:9200"}
 <== This error looks strange. [1]

{"type":"log","@timestamp":"2017-04-22T20:42:04Z","tags":["status","plugin:elasticsearch.0","error"],"pid":9,"state":"red","message":"Status changed from yellow to red - Request Timeout after 3000ms","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
{"type":"log","@timestamp":"2017-04-22T20:42:05Z","tags":["warning","elasticsearch"],"pid":9,"message":"Unable to revive connection: https://logging-es:9200/"}

[1] I cannot find the IP address 172.30.224.19 anywhere (not in the source code, either); I'm wondering where it came from...
Ping from the beaker box fails:
# ping 172.30.224.19
PING 172.30.224.19 (172.30.224.19) 56(84) bytes of data.
From 10.128.0.1 icmp_seq=1 Destination Host Unreachable

I failed to get the IP address of the ES pod (it terminated again), but curator, fluentd, and kibana are configured as follows.  So I'd guess the ES is also on 10.128.0.##?  (It cannot be 172.30.224.19?)  But I could be wrong...
 10.128.0.6	logging-curator-1-rx3vp
 10.128.0.11	logging-fluentd-c3l66
 10.128.0.8	logging-kibana-1-1c83p
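
A hedged way to check where that address comes from (a sketch; it assumes the logging project/namespace used elsewhere in this report): if the cluster uses the default OpenShift service network (172.30.0.0/16), 172.30.224.19 is more likely the ClusterIP of the logging-es service than a pod IP, which the service listing would show:

# oc get svc -n logging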

Comment 19 Rich Megginson 2017-04-24 14:40:39 UTC
use oc describe:

oc describe pod logging-es-hjrv78go-1-deploy

oc describe pod logging-es-hjrv78go-1-crqvh

Comment 20 Noriko Hosoi 2017-04-24 17:35:14 UTC
Thanks, Rich.  

Since logging-es-hjrv78go-1-crqvh had died, I restarted -2...

# oc describe pod logging-es-hjrv78go-2-deploy
<snip>
Status:         Running
IP:         10.128.0.16
Controllers:        <none>
Containers:
  deployment:
    Container ID:   docker://dabeff64f515417a92f54ec7f391d51f310481e3bfbbc86091820577e17a2349
    Image:      openshift3/ose-deployer:v3.6.42
    Image ID:       docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-dep
loyer@sha256:19525d83a631241a11a728de2db6b50e982a1898f50a4beb3597ddabf61df008
<snip>

# oc describe pod logging-es-hjrv78go-2-fb4hq
<snip>
Status:         Running
IP:         10.128.0.17
Controllers:        ReplicationController/logging-es-hjrv78go-2
Containers:
  elasticsearch:
    Container ID:   docker://7b9530a887b1cd75c23671dd9b8c0360a1585a3fb6c75e086212fb94164a1b6a
    Image:      brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:v3.6
<snip>
  1m        4s      13  {kubelet cisco-c210-01.rhts.eng.bos.redhat.com} spec.containers{elasticsearch}  Warning     Unhealthy   Readiness probe failed: rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"/usr/share/elasticsearch/probe/readiness.sh\\\": stat /usr/share/elasticsearch/probe/readiness.sh: no such file or directory\"\n"

The last error event in the output from "oc describe pod logging-es-hjrv78go-2-fb4hq" does not look good...

Note: indeed, there is no 'probe' directory on the ES pod...
# oc exec logging-es-hjrv78go-3-0rkcx -- ls /usr/share/elasticsearch
elasticsearch
index_patterns
index_templates

Another note: the questionable "172.30.224.19" is not found in the output from "oc describe pod ...", either.

Comment 21 Rich Megginson 2017-04-24 17:38:15 UTC
Are you running 3.6?  Looks like it is missing the readiness probe.  Perhaps a rebuild of 3.6 is in order?

Comment 27 Jeff Cantrill 2017-04-28 13:52:43 UTC
This change came in as part of the merge for https://github.com/fabric8io/openshift-auth-proxy/pull/13.  It can temporarily be resolved by setting ENV VARS in the DC as in https://github.com/jcantrill/origin-aggregated-logging/blob/0386569cacb132827cad125ccdc1eb56f4ff8a8d/deployer/templates/kibana.yaml#L127-L140

The issue is there is a missing leading slash.
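
A minimal sketch of that temporary workaround (illustrative only; the variable names and /secret/... paths are the ones from the linked template and from comment 43 below, and the exact oc invocation is an assumption):

# oc set env dc/logging-kibana -c kibana-proxy \
    OAP_OAUTH_SECRET_FILE=/secret/oauth-secret \
    OAP_SERVER_CERT_FILE=/secret/server-cert \
    OAP_SERVER_KEY_FILE=/secret/server-key \
    OAP_SERVER_TLS_FILE=/secret/server-tls.json \
    OAP_SESSION_SECRET_FILE=/secret/session-secret

With a default ConfigChange trigger on the dc, this should roll out a new deployment with the leading slash in place.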

Comment 28 Jeff Cantrill 2017-04-28 14:33:49 UTC
Fixed in https://github.com/openshift/openshift-ansible/pull/4035

Comment 29 Jeff Cantrill 2017-04-28 17:20:20 UTC
Moving back to assigned as I believe this is unrelated to what's referenced in comment #27.  Looking at the configmap, I do not see any of the auth-related secrets the auth proxy uses.

Comment 30 Jeff Cantrill 2017-05-02 15:16:04 UTC
Can you retry with 3.6 latest or a minimum of:

openshift-ansible-3.6.38-1
openshift-ansible-3.6.50-1

It looks like there was a fix related to this issue in: 

https://github.com/openshift/openshift-ansible/pull/3911

Comment 31 Noriko Hosoi 2017-05-02 17:54:35 UTC
Hi Jeff,

Unfortunately, I still see CrashLoopBackOff.

# oc get pods -l component=kibana
NAME                     READY     STATUS             RESTARTS   AGE
logging-kibana-1-jbnk0   1/2       CrashLoopBackOff   8          24m

The version of openshift-ansible:
# rpm -q openshift-ansible
openshift-ansible-3.6.49-1.git.0.5ba1856.el7.noarch

The good news is we don't see this Unhealthy Readiness warning on the ES pod any more.
> CMDLINE "oc describe pod $espod" still shows a warning.
> Warning Unhealthy Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]

According to the detailed kibana pod output, kibana itself looks to be running fine.
      name: kibana
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: 2017-05-02T17:14:47Z

But kibana-proxy fails to start.
      name: kibana-proxy
      ready: false
      restartCount: 8
      state:
        waiting:
          message: Back-off 5m0s restarting failed container=kibana-proxy pod=logging-kibana-1-jbnk0_logging(b04909c2-2f5a-11e7-b95c-00188b89f3f7)
          reason: CrashLoopBackOff

This is one of the errors from "oc describe pod $kibanapod"
  35m		35m		1	kubelet, dell-pe-sc1435-01.rhts.englab.brq.redhat.com		Warning		FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "kibana-proxy" with CrashLoopBackOff: "Back-off 10s restarting failed container=kibana-proxy pod=logging-kibana-1-jbnk0_logging(b04909c2-2f5a-11e7-b95c-00188b89f3f7)"

Comment 32 Xia Zhao 2017-05-03 08:02:07 UTC
--Issue reproduced with openshift-ansible-playbooks-3.6.51-1.git.0.18eb563.el7.noarch

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-785z5       0/1       CrashLoopBackOff   6          33m
logging-es-iqqip04t-1-lb9wv   0/1       Running            0          33m
logging-fluentd-c61j2         1/1       Running            0          33m
logging-fluentd-fk5nq         1/1       Running            0          33m
logging-kibana-1-gpgrx        1/2       CrashLoopBackOff   11         33m

--The above issue is caused by the current elasticsearch image with tag 3.6.0 on the brew and ops registries being too old; it doesn't contain the readiness probe script:

openshift3/logging-elasticsearch   3.6.0               32938f595638        3 weeks ago

--Worked around the readiness probe by changing ES_REST_BASEURL from
     "https://localhost:9200"
to 
     "https://logging-es:9200"
in https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/probe/readiness.sh#L20. The problem got resolved, but the kibana pod's crash is still encountered:

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-hntbv       1/1       Running            2          16m
logging-es-3y9oq0ti-2-7s6tz   1/1       Running            0          10m
logging-fluentd-3tbc4         1/1       Running            0          16m
logging-fluentd-437l4         1/1       Running            0          16m
logging-kibana-1-76f7t        1/2       CrashLoopBackOff   6          8m

Test env info:
# openshift version
openshift v3.6.61
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
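
For reference, a minimal sketch of the readiness-probe workaround described above (the variable name is taken from the linked readiness.sh; treat the exact script content as assumed):

ES_REST_BASEURL=https://logging-es:9200   # workaround; the shipped default is https://localhost:9200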

Comment 34 Rich Megginson 2017-05-03 15:24:40 UTC
I don't understand what's going on.  I'm using the latest 3.6 devenv.  The readiness probe is using https://localhost:9200.  I used set -x in the probe script, and used curl -v.  It works fine.  Here is the output:

+ ES_REST_BASEURL=https://localhost:9200
+ EXPECTED_RESPONSE_CODE=200
+ secret_dir=/etc/elasticsearch/secret
+ max_time=4
++ curl -v -s -X HEAD --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key --max-time 4 -w '%{response_code}' https://localhost:9200/
* About to connect() to localhost port 9200 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/elasticsearch/secret/admin-ca
  CApath: none
* NSS: client certificate from file
* 	subject: CN=system.admin,OU=OpenShift,O=Logging
* 	start date: May 03 14:51:15 2017 GMT
* 	expire date: May 03 14:51:15 2019 GMT
* 	common name: system.admin
* 	issuer: CN=logging-signer-test
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=logging-es,OU=OpenShift,O=Logging
* 	start date: May 03 14:51:25 2017 GMT
* 	expire date: May 03 14:51:25 2019 GMT
* 	common name: logging-es
* 	issuer: CN=logging-signer-test
> HEAD / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=UTF-8
< Content-Length: 0
< 
* Connection #0 to host localhost left intact
+ response_code=200
+ '[' 200 == 200 ']'
+ exit 0

I don't understand why in your case localhost doesn't work but logging-es does work.
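
To narrow that down, a hedged comparison one could run inside the ES pod (a sketch; <es-pod> is a placeholder, and the curl options simply mirror the probe trace above):

# oc exec <es-pod> -- curl -s -X HEAD -w '%{response_code}\n' \
    --cacert /etc/elasticsearch/secret/admin-ca \
    --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    --max-time 4 https://localhost:9200/
# oc exec <es-pod> -- curl -s -X HEAD -w '%{response_code}\n' \
    --cacert /etc/elasticsearch/secret/admin-ca \
    --cert /etc/elasticsearch/secret/admin-cert \
    --key /etc/elasticsearch/secret/admin-key \
    --max-time 4 https://logging-es:9200/

If the first call fails while the second returns 200, that would point at how ES answers on localhost inside the pod rather than at the probe script itself.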

Comment 39 Xia Zhao 2017-05-10 09:13:53 UTC
However, I noticed there are new kibana/es images with tag=v3.6 on the brew registry, and when I did a further test today with the latest images, this issue was reproduced. Could you help take a further look? Thanks.

Images tested with:
logging-kibana             v3.6                dc571aa09d26        10 hours ago
logging-elasticsearch      v3.6                d2709cc1e16a        10 hours ago
logging-fluentd            v3.6                aafaf8787b29        10 hours ago
logging-curator            v3.6                028e689a3276        6 days ago
logging-auth-proxy         v3.6                11f731349ff9        2 days ago

The ansible version is:
openshift-ansible-playbooks-3.6.58-1.git.0.f4a514a.el7.noarch

# oc get po 
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-9p8wz       1/1       Running            0          18m
logging-es-zed8jht4-1-rnjng   1/1       Running            0          18m
logging-fluentd-7fr0g         1/1       Running            0          18m
logging-fluentd-kzqfp         1/1       Running            0          18m
logging-kibana-1-990kf        1/2       CrashLoopBackOff   8          18m


Both the kibana and kibana-proxy containers looked fine (though kibana is waiting for Elasticsearch to create the .kibana index); the problem appears to be that ES somehow stopped after stating this line:

 Create index template 'com.redhat.viaq-openshift-project.template.json'

The es/kibana logs are attached for your reference.

Comment 40 Xia Zhao 2017-05-10 09:20:48 UTC
Created attachment 1277588 [details]
the kibana,es logs on May 10, 2017

Comment 42 Junqi Zhao 2017-05-11 08:36:10 UTC
Created attachment 1277777 [details]
logging 3.5 kibana dc info

Comment 43 Junqi Zhao 2017-05-11 08:43:47 UTC
@Noriko,

I checked the logging 3.5 kibana dc info on one of my machines and the logging 3.6 kibana dc info on the machine mentioned in Comment 41. I think, for this issue, there are some properties missing in the 3.6 kibana dc, such as:


        - name: OAP_OAUTH_SECRET_FILE
          value: /secret/oauth-secret
        - name: OAP_SERVER_CERT_FILE
          value: /secret/server-cert
        - name: OAP_SERVER_KEY_FILE
          value: /secret/server-key
        - name: OAP_SERVER_TLS_FILE
          value: /secret/server-tls.json
        - name: OAP_SESSION_SECRET_FILE
          value: /secret/session-secret

You can compare the two files.

Comment 44 Junqi Zhao 2017-05-11 08:44:21 UTC
Created attachment 1277786 [details]
logging 3.6 kibana dc info

Comment 45 Jeff Cantrill 2017-05-11 13:34:37 UTC
Cherry-picked fix: https://github.com/openshift/openshift-ansible/pull/4162

Comment 46 Jeff Cantrill 2017-05-11 14:48:14 UTC
*** Bug 1449858 has been marked as a duplicate of this bug. ***

Comment 47 Xia Zhao 2017-05-16 09:34:25 UTC
Bug reproduced with openshift-ansible-playbooks-3.6.68-1.git.0.9cbe2b7.el7.noarch

# openshift version
openshift v3.6.75
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# oc get po
NAME                          READY     STATUS               RESTARTS   AGE
logging-curator-1-w61zr       1/1       Running              0          1m
logging-es-f5guu218-1-w1th4   1/1       Running              0          1m
logging-fluentd-07wjn         1/1       Running              0          1m
logging-fluentd-1165z         1/1       Running              0          1m
logging-kibana-1-9trkz        1/2       CrashLoopBackOff     2          1m

Since the bug fix PR was merged, I also tested with the latest ansible playbooks from https://github.com/openshift/openshift-ansible/tree/master (head commit 15fd42020a0b5fee665c45cd23b9ba3bd152251d); the bug is still reproduced:

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-8bvtc       1/1       Running            0          4m
logging-es-nqqlsk0x-1-rh4wt   1/1       Running            0          4m
logging-fluentd-lz5vs         1/1       Running            0          4m
logging-fluentd-p9pj5         1/1       Running            0          4m
logging-kibana-1-r7mbl        1/2       CrashLoopBackOff   5          4m

Comment 48 Junqi Zhao 2017-05-26 06:11:04 UTC
Tested on the following env and openshift-ansible packages.

# openshift version
openshift v3.6.76
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# rpm -qa | grep openshift-ansible
openshift-ansible-lookup-plugins-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-playbooks-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-callback-plugins-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-roles-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-docs-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-filter-plugins-3.6.84-1.git.0.72b2d74.el7.noarch

We get the following pods; there is one logging-kibana deployer pod.
# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-w5wt7                   1/1       Running   1          9m
logging-es-data-master-v5gmycm6-1-l3wsq   0/1       Running   0          10m
logging-fluentd-brh9v                     1/1       Running   0          9m
logging-fluentd-x6xhr                     1/1       Running   0          9m
logging-kibana-1-deploy                   1/1       Running   0          9m
logging-kibana-1-qfgbt                    1/2       Running   0          9m


After a few minutes, the logging-kibana pod is gone, and the logging-kibana deployer pod's status changed to Error:
# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-w5wt7                   1/1       Running   1          15m
logging-es-data-master-v5gmycm6-1-l3wsq   1/1       Running   0          16m
logging-fluentd-brh9v                     1/1       Running   0          15m
logging-fluentd-x6xhr                     1/1       Running   0          15m
logging-kibana-1-deploy                   0/1       Error     0          15m

Comment 49 Jeff Cantrill 2017-05-26 13:44:44 UTC
Please provide:

1. kibana container
2. kibana-proxy container

Looking at the dc from the 3.6 attachment (https://bugzilla.redhat.com/attachment.cgi?id=1277786), it is missing: https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.6.84-1/roles/openshift_logging_kibana/templates/kibana.j2#L120-L133

I'm assuming kibana is starting but kibana-proxy is failing due to inability to find the secrets.

Can you please reconfirm, given that the openshift-ansible version you reference says they should be there.  My other thought is that it might be the opposite: the 3.6 images have not been synced with origin recently, which is just happening now.
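
One way to reconfirm (a sketch; <kibana-pod> is a placeholder, and the OAP_* names are the ones listed in comment 43):

# oc get dc logging-kibana -o yaml | grep -A1 OAP_
# oc exec <kibana-pod> -c kibana-proxy -- ls /secret

If the OAP_*_FILE variables are absent from the dc, or the files are not present under /secret, the proxy will not find its secrets.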

Comment 50 Mike Fiedler 2017-05-26 15:11:45 UTC
Also see https://bugzilla.redhat.com/show_bug.cgi?id=1452807.

I think the original problem described by this bz ("Could not read TLS opts from secret/server-tls.json; error was: Error: ENOENT: no such file or directory, open 'secret/server-tls.json'") is fixed.  This error no longer occurs.

We now see the issue described in bz 1452807.

Comment 51 Junqi Zhao 2017-05-27 01:51:52 UTC
Retested again, and got the following error for the kibana-proxy container:
# oc logs logging-kibana-1-qndht -c kibana-proxy
Could not read TLS opts from /secret/server-tls.json; error was: Error: ENOENT: no such file or directory, open '/secret/server-tls.json'
Starting up the proxy with auth mode "oauth2" and proxy transform "user_header,token_header".

Attached es, kibana pod info and kibana dc info

Environment info:
# openshift version
openshift v3.6.85
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# docker images | grep logging
openshift3/logging-kibana          v3.6                03bd6dfe1a53        21 hours ago        342.4 MB
openshift3/logging-elasticsearch   v3.6                6f311a1b9e0c        21 hours ago        404.6 MB
openshift3/logging-fluentd         v3.6                a0f8e4ccb888        21 hours ago        232.5 MB
openshift3/logging-auth-proxy      v3.6                a4bfb6537dcc        21 hours ago        229.6 MB
openshift3/logging-curator         v3.6                028e689a3276        3 weeks ago         211.1 MB

Comment 52 Junqi Zhao 2017-05-27 01:52:34 UTC
Created attachment 1282798 [details]
kibana dc, pods info

Comment 53 Junqi Zhao 2017-05-27 01:53:39 UTC
Add ansible info:

# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-docs-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-lookup-plugins-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-filter-plugins-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-playbooks-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-roles-3.6.85-1.git.0.109a54e.el7.noarch

Comment 55 Junqi Zhao 2017-06-05 03:11:57 UTC
Assigning back; no kibana pod is generated.
# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-jg9px                   1/1       Running   0          1h
logging-es-data-master-s89krelm-1-4j3h0   1/1       Running   0          1h
logging-fluentd-1mxvz                     1/1       Running   0          1h
logging-fluentd-73hxv                     1/1       Running   0          1h
logging-kibana-1-deploy                   0/1       Error     0          1h

Testing environments:
# openshift version
openshift v3.6.85
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Image id from brew registry
# docker images | grep logging
openshift3/logging-auth-proxy      v3.6                d043e446a08d        2 days ago          230.2 MB
openshift3/logging-kibana          v3.6                b2ee235a5512        2 days ago          342.4 MB
openshift3/logging-elasticsearch   v3.6                05cb395dd2b2        2 days ago          404.5 MB
openshift3/logging-fluentd         v3.6                67ee8da21667        2 days ago          232.5 MB
openshift3/logging-curator         v3.6                028e689a3276        4 weeks ago         211.1 MB

Comment 56 Xia Zhao 2017-06-05 06:21:10 UTC
Please feel free to move this back to ON_QA; the root cause of comment #55 (and maybe also of comment #48) is actually the kibana readiness probe failure, which is reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1458652

During the kibana pod's deploy phase, I see the kibana-proxy container can become ready now:

# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-vs9sj                   1/1       Running   0          6m
logging-es-data-master-lfw1lt94-1-8k1vg   1/1       Running   0          6m
logging-fluentd-lkv5r                     1/1       Running   0          6m
logging-fluentd-zkfxz                     1/1       Running   0          6m
logging-kibana-1-7bdgz                    1/2       Running   0          6m
logging-kibana-1-deploy                   1/1       Running   0          6m

Comment 58 Noriko Hosoi 2017-06-14 20:58:56 UTC
Thanks for the clarification, Jan.  You are right.  Sorry, I missed updating the kibana Dockerfile in the dist-git.  I'm adding it and rebuilding the kibana image.

Comment 59 Noriko Hosoi 2017-06-14 22:20:36 UTC
Building the kibana image is done:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=565287
logging-kibana-docker-v3.6.109-2

Sorry for introducing the undefined variable 'KIBANA_HOME' bug.

The change was reverted:
    Revert "Updating Dockerfile version and release v3.6.99 2"

and added the readiness probe:
    Updating Dockerfile version and release v3.6.109 2 
    Add a readiness probe to the Kibana image

Comment 60 Jeff Cantrill 2017-06-19 21:47:08 UTC
*** Bug 1452807 has been marked as a duplicate of this bug. ***

Comment 61 Wei Sun 2017-06-20 01:47:24 UTC
Adding TestBlocker, since Bug 1452807 was blocking some logging testing.

Comment 62 Mike Fiedler 2017-06-20 02:48:44 UTC
Using the logging-kibana image in comment 59, the logging-kibana pod starts and both kibana and kibana-proxy containers go into Ready condition.   Can we get this fix expedited and an image pushed to the ops mirror?  

I tried the official 3.6.116 image and it still has the issue of the kibana pod going into CrashLoopBackOff.

Comment 63 Noriko Hosoi 2017-06-20 04:50:09 UTC
(In reply to Mike Fiedler from comment #62)
> Using the logging-kibana image in comment 59, the logging-kibana pod starts
> and both kibana and kibana-proxy containers go into Ready condition.   Can
> we get this fix expedited and an image pushed to the ops mirror?  
> 
> I tried the official 3.6.116 image and it still has the issue of kibana pod
> going CrashLoopBackoff.

3.6.116 is supposed to have the fix.  Please mark FailedQA.  Sorry for the inconvenience.

Comment 64 Mike Fiedler 2017-06-20 06:21:53 UTC
Correction to comment 62.  I am seeing the same behavior with both the image from comment 59 and 3.6.116.  The kibana pod is still going into CrashLoopBackOff, but without the kibana-proxy errors seen in comment 51 and in Bug 1452807.

The kibana and kibana-proxy logs (attached) now appear to be clean, but the logging-kibana pod is cycling between fully Ready, Error, and CrashLoopBackOff.

Please let me know what further info I can gather.

logging-kibana-5-pq39j   2/2       Running   1         19s       172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       Error     1         23s       172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       CrashLoopBackOff   1         35s       172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   2/2       Running   2         37s       172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       Error     2         41s       172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       CrashLoopBackOff   2         56s       172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   2/2       Running   3         1m        172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       Error     3         1m        172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       CrashLoopBackOff   3         1m        172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   2/2       Running   4         1m        172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       Error     4         2m        172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2       CrashLoopBackOff   4         2m        172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal

Comment 65 Mike Fiedler 2017-06-20 06:23:46 UTC
Created attachment 1289415 [details]
kibana and kibana-proxy logs from 3.6.116

Comment 66 Mike Fiedler 2017-06-20 06:26:09 UTC
Tested with:

registry.ops.openshift.com/openshift3/logging-fluentd             v3.6.116                                                 6819bdde7f83        47 hours ago        233.1 MB
registry.ops.openshift.com/openshift3/logging-kibana              v3.6.116                                                 a3e7c14233be        47 hours ago        342.4 MB
registry.ops.openshift.com/openshift3/logging-auth-proxy          v3.6.116                                                 4ecd26a8e9c5        47 hours ago        229.6 MB
registry.ops.openshift.com/openshift3/logging-fluentd             v3.6.114                                                 a7e65663f572        3 days ago          233.1 MB
nhosoi/logging-kibana-docker                                      rhaos-3.6-rhel-7-docker-candidate-53963-20170614213755   03db141a6026        5 days ago          342.4 MB


Tried both the 3.6.116 and nhosoi/logging-kibana-docker images with the results reported in comment 64.

Comment 67 Jeff Cantrill 2017-06-21 20:44:30 UTC
Fixed in

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=567197
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:rhaos-3.6-rhel-7-docker-candidate-88157-20170621202522
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:v3.6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:v3.6.122
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:v3.6.122-2

Comment 68 Mike Fiedler 2017-06-22 03:25:44 UTC
Running with the logging-auth-proxy container image from comment 67 results in a logging-kibana pod which repeatedly exits (apparently normally) and restarts:

logging-kibana-5-d65st   0/2       ContainerCreating   0         0s
logging-kibana-5-d65st   1/2       Running   0         9s
logging-kibana-5-d65st   2/2       Running   0         19s
logging-kibana-5-d65st   1/2       Completed   0         24s
logging-kibana-5-d65st   2/2       Running   1         26s
logging-kibana-5-d65st   1/2       Completed   1         30s
logging-kibana-5-d65st   1/2       CrashLoopBackOff   1         44s
logging-kibana-5-d65st   2/2       Running   2         45s
logging-kibana-5-d65st   1/2       Completed   2         49s
logging-kibana-5-d65st   1/2       CrashLoopBackOff   2         1m

Logs attached.

I downloaded from brew - I can also retry when it is pushed to the ops mirror.

Comment 69 Mike Fiedler 2017-06-22 03:26:32 UTC
Created attachment 1290465 [details]
kibana and kibana-proxy logs from 3.6.122 internal build

Comment 71 Junqi Zhao 2017-06-28 01:32:52 UTC
The issue is fixed; kibana pods are in Running status and log entries can be retrieved from the kibana UI.
# oc get po
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-12sq7                       1/1       Running   0          30m
logging-curator-ops-1-wf3xz                   1/1       Running   0          30m
logging-es-data-master-ljq0gu76-1-d2kw8       1/1       Running   0          31m
logging-es-ops-data-master-5kx189jv-1-b2cmf   1/1       Running   0          31m
logging-fluentd-6tw96                         1/1       Running   0          30m
logging-fluentd-l9wr3                         1/1       Running   0          30m
logging-kibana-1-nxbjf                        2/2       Running   0          30m
logging-kibana-ops-1-4lgww                    2/2       Running   0          30m

Images from brew registry
logging-elasticsearch      v3.6                19ad6f8e4738        29 minutes ago      404.2 MB
logging-auth-proxy         v3.6                d94bddb3dcba        8 hours ago         214.8 MB
logging-kibana             v3.6                4eabc3acd717        21 hours ago        342.4 MB
logging-fluentd            v3.6                08e8a59602fe        21 hours ago        232.5 MB
logging-curator            v3.6                a0148dd96b8d        2 weeks ago         221.5 MB

Comment 73 errata-xmlrpc 2017-08-10 05:20:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

