Description of problem:
The Kibana pod is in CrashLoopBackOff after logging 3.6.0 was deployed. The ansible playbook run finished successfully and ES is in green status, but the kibana route is not accessible.

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-0v14q       1/1       Running            0          18h
logging-es-8r3vyszi-1-gz5hf   1/1       Running            0          18h
logging-fluentd-4twb2         1/1       Running            0          18h
logging-kibana-1-4v736        1/2       CrashLoopBackOff   138        18h

# oc logs -f logging-kibana-1-4v736 -c kibana-proxy
Could not read TLS opts from secret/server-tls.json; error was: Error: ENOENT: no such file or directory, open 'secret/server-tls.json'
Starting up the proxy with auth mode "oauth2" and proxy transform "user_header,token_header".

Visiting the kibana route returns the following error (at the same time the openshift router is running fine, and a second route for applications other than logging works fine):

Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.15-1.git.0.d2b88f8.el7.noarch
openshift-ansible-playbooks-3.6.15-1.git.0.d2b88f8.el7.noarch
openshift-ansible-roles-3.6.15-1.git.0.d2b88f8.el7.noarch

logging images on ops registry:
openshift3/logging-auth-proxy   3.6.0   5cd70d92d4ef
openshift3/logging-kibana       3.6.0   925583fe8c13

# openshift version
openshift v3.6.16
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.6.0 stack on OCP 3.6.0 by running the ansible playbooks
2. Check the EFK pods' status
3. Check the kibana route

Actual results:
The Kibana pod is in CrashLoopBackOff and the kibana route is not accessible.

Expected results:
The EFK pods are in Running status and the logging UI works.

Additional info:
ansible inventory file and execution logs attached
EFK logs attached
logging UI screenshot attached
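For reference, a minimal way to check whether the proxy secret is actually present and mounted where the proxy expects it (a sketch only; the pod name is taken from the listing above, and the secret name logging-kibana-proxy is an assumption about this deployment):

# oc exec logging-kibana-1-4v736 -c kibana-proxy -- ls /secret
# oc get secret logging-kibana-proxy -o yaml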
Created attachment 1269161 [details] full ansible execution logs
Created attachment 1269162 [details] inventory file for logging 3.6.0 stacks' deployment
Created attachment 1269163 [details] screenshot when visiting the kibana route from a web browser
Blocks 3.6.0 logging tests
Created attachment 1269168 [details] es_log
Created attachment 1269169 [details] fluentd_log
Created attachment 1269170 [details] kibana log of container kibana
Created attachment 1269171 [details] kibana log of container kibana-proxy
Can you provide:

oc get secret logging-kibana -o yaml

and

oc get configmap logging-kibana -o yaml
Created attachment 1269539 [details] info you wanted; there is no logging-kibana configmap
sorry, I meant oc get secret logging-kibana -o yaml instead of configmap
(In reply to Rich Megginson from comment #11)
> sorry, I meant
>
> oc get secret logging-kibana -o yaml
>
> instead of configmap

I meant

oc get dc logging-kibana -o yaml

instead of configmap
Created attachment 1269543 [details] kibana dc info
Issue reproduced with

# openshift version
openshift v3.5.5.5
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Images tested with (ops registry):
logging-kibana       3.6.0   925583fe8c13
logging-auth-proxy   3.6.0   5cd70d92d4ef

# oc get po -n logging
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-274nh       1/1       Running            1          3m
logging-es-np7xeu0t-1-j970v   1/1       Running            0          3m
logging-fluentd-h54ds         1/1       Running            0          3m
logging-kibana-1-2rn4z        1/2       CrashLoopBackOff   4          3m
@Rich, this defect is still not fixed; the Kibana pods are in CrashLoopBackOff status.
It may not be the same symptom found by Junqi and Xia, since my kibana log is a bit different and your ES looks to be running healthy, but I also duplicated the CrashLoopBackOff on beaker.

logging-kibana-1-dscf1   1/2   CrashLoopBackOff   6   12m

How to reproduce:
$ git clone http://git.app.eng.bos.redhat.com/srv/git/ViaQ.git
BRANCH: 3.6
CMDLINE: bkr job-submit el7-ocp-36.xml

Kibana log:
# oc logs logging-kibana-1-dscf1 -c=kibana
.....
{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["warning","elasticsearch"],"pid":9,"message":"Unable to revive connection: https://logging-es:9200/"}
{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["warning","elasticsearch"],"pid":9,"message":"No living connections"}
{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["status","plugin:elasticsearch.0","error"],"pid":9,"state":"red","message":"Status changed from red to red - Unable to connect to Elasticsearch at https://logging-es:9200.","prevState":"red","prevMsg":"Request Timeout after 3000ms"}
.....
{"type":"log","@timestamp":"2017-04-20T22:49:58Z","tags":["warning","elasticsearch"],"pid":9,"message":"Unable to revive connection: https://logging-es:9200/"}
{"type":"log","@timestamp":"2017-04-20T22:49:58Z","tags":["warning","elasticsearch"],"pid":9,"message":"No living connections"}
.....

The pair of messages "Unable to revive connection: https://logging-es:9200/" and "No living connections" is repeated endlessly.

In the beginning, the status of kibana was Error:
logging-kibana-1-dscf1   1/2   Error   0   2m
but eventually it turned into CrashLoopBackOff:
logging-kibana-1-dscf1   1/2   CrashLoopBackOff   6   12m

I'm wondering whether this Timeout message could change the status of kibana from Error to CrashLoopBackOff? The same message is found in https://bugzilla.redhat.com/show_bug.cgi?id=1439451#c7 as well.
{"type":"log","@timestamp":"2017-04-20T22:47:53Z","tags":["status","plugin:elasticsearch.0","error"],"pid":9,"state":"red","message":"Status changed from red to red - Unable to connect to Elasticsearch at https://logging-es:9200.","prevState":"red","prevMsg":"Request Timeout after 3000ms"}

More seriously, my ES pod eventually disappears. It looks to me like this is the root cause of the above connection errors on Kibana.

# oc get pods
NAME                           READY     STATUS             RESTARTS   AGE
logging-curator-1-c571w        1/1       Running            3          13m
logging-es-1vszhw7i-1-deploy   0/1       Error              0          12m
logging-fluentd-02d8t          1/1       Running            0          12m
logging-kibana-1-dscf1         1/2       CrashLoopBackOff   6          12m
logging-mux-1-pt15q            1/1       Running            0          10m

Please note that the pods were running fine in the beginning:
logging-es-1vszhw7i-1-deploy   1/1   Running   0   2m
logging-es-1vszhw7i-1-x4fgh    0/1   Running   0   2m

I restarted the ES and tried to find out what crashed it. Here's my observation.

First, I would like to double-check whether it is correct that only logging-es is created (i.e., no logging-es-ops):
NAME                           READY     STATUS              RESTARTS   AGE
logging-curator-1-c571w        1/1       Running             0          3m
logging-es-1vszhw7i-1-deploy   1/1       Running             0          2m
logging-es-1vszhw7i-1-x4fgh    0/1       Running             0          2m
logging-fluentd-02d8t          1/1       Running             0          2m
logging-kibana-1-deploy        1/1       Running             0          2m
logging-kibana-1-dscf1         2/2       Running             0          2m
logging-mux-1-deploy           1/1       Running             0          2m
logging-mux-1-pt15q            0/1       ContainerCreating   0          51s

If I run a search against "https://logging-es-ops:9200", it terminated the ES:
logging-es-1vszhw7i-3-14pfq   0/1   Terminating   0   10m

In our test cases/health checks, we could issue such an operation, which puts the rest of the pods in the error state, i.e., they fail to connect to the ES.
Noriko, what's in the pod log for the logging-es-1vszhw7i-1-deploy 0/1 Error pod?
(In reply to Rich Megginson from comment #17)
> Noriko, what's in the pod log for the logging-es-1vszhw7i-1-deploy 0/1
> Error pod?

I restarted a new beaker job.
https://beaker.engineering.redhat.com/jobs/1820081

Prior to the status Error:
logging-es-hjrv78go-1-crqvh    0/1   Running   0   8m
logging-es-hjrv78go-1-deploy   1/1   Running   0   8m

==> pod log of logging-es-hjrv78go-1-deploy <==
--> Scaling logging-es-hjrv78go-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-hjrv78go-1 to become ready

After the status turned to Error:
logging-es-hjrv78go-1-deploy   0/1   Error   0   11m

==> pod log of logging-es-hjrv78go-1-deploy <==
--> Scaling logging-es-hjrv78go-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-hjrv78go-1 to become ready
error: update acceptor rejected logging-es-hjrv78go-1: pods for rc "logging-es-hjrv78go-1" took longer than 600 seconds to become ready

This time, this command line terminated logging-es-hjrv78go-1-crqvh. (I.e., my previous comment saying that accessing via logging-es-ops might have terminated the ES was incorrect.)

# oc exec logging-es-hjrv78go-1-crqvh -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key 'https://logging-es:9200/_cat/indices?v'
Error from server (NotFound): pods "logging-es-hjrv78go-1-crqvh" not found

Regarding the status of Kibana, it is "Running" when started; then it goes to CrashLoopBackOff.
logging-kibana-1-1c83p   2/2   Running            3   4m
===>
logging-kibana-1-1c83p   1/2   CrashLoopBackOff   3   5m

Following is the early part of the log of logging-kibana-1-1c83p. Even when the status is Running, the log already has an error message (the 3rd one with EHOSTUNREACH). Then it takes some time to turn to CrashLoopBackOff. By that time, the log just repeats "Unable to revive connection: https://logging-es:9200/".

{"type":"log","@timestamp":"2017-04-22T20:42:01Z","tags":["status","plugin:table_vis.0","info"],"pid":9,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}   <== the last "green" message
{"type":"log","@timestamp":"2017-04-22T20:42:01Z","tags":["listening","info"],"pid":9,"message":"Server running at http://0.0.0.0:5601"}
{"type":"log","@timestamp":"2017-04-22T20:42:02Z","tags":["error","elasticsearch"],"pid":9,"message":"Request error, retrying -- connect EHOSTUNREACH 172.30.224.19:9200"}   <== This error looks strange. [1]
{"type":"log","@timestamp":"2017-04-22T20:42:04Z","tags":["status","plugin:elasticsearch.0","error"],"pid":9,"state":"red","message":"Status changed from yellow to red - Request Timeout after 3000ms","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
{"type":"log","@timestamp":"2017-04-22T20:42:05Z","tags":["warning","elasticsearch"],"pid":9,"message":"Unable to revive connection: https://logging-es:9200/"}

[1] I cannot find the IP address 172.30.224.19 anywhere (not in the source code, either; I'm wondering where it came from...)

Ping from the beaker box fails:
# ping 172.30.224.19
PING 172.30.224.19 (172.30.224.19) 56(84) bytes of data.
From 10.128.0.1 icmp_seq=1 Destination Host Unreachable

I failed to get the IP address of the ES pod (it terminated again), but curator, fluentd, and kibana are configured as follows, so I'd guess the ES is also on 10.128.0.##? (It cannot be 172.30.224.19?) But I could be wrong...
10.128.0.6    logging-curator-1-rx3vp
10.128.0.11   logging-fluentd-c3l66
10.128.0.8    logging-kibana-1-1c83p
Use oc describe:

oc describe pod logging-es-hjrv78go-1-deploy
oc describe pod logging-es-hjrv78go-1-crqvh
Thanks, Rich. Since logging-es-hjrv78go-1-crqvh had died, I restarted -2...

# oc describe pod logging-es-hjrv78go-2-deploy
<snip>
Status:         Running
IP:             10.128.0.16
Controllers:    <none>
Containers:
  deployment:
    Container ID:  docker://dabeff64f515417a92f54ec7f391d51f310481e3bfbbc86091820577e17a2349
    Image:         openshift3/ose-deployer:v3.6.42
    Image ID:      docker-pullable://brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-deployer@sha256:19525d83a631241a11a728de2db6b50e982a1898f50a4beb3597ddabf61df008
<snip>

# oc describe pod logging-es-hjrv78go-2-fb4hq
<snip>
Status:         Running
IP:             10.128.0.17
Controllers:    ReplicationController/logging-es-hjrv78go-2
Containers:
  elasticsearch:
    Container ID:  docker://7b9530a887b1cd75c23671dd9b8c0360a1585a3fb6c75e086212fb94164a1b6a
    Image:         brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:v3.6
<snip>
  1m  4s  13  {kubelet cisco-c210-01.rhts.eng.bos.redhat.com}  spec.containers{elasticsearch}  Warning  Unhealthy  Readiness probe failed: rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"/usr/share/elasticsearch/probe/readiness.sh\\\": stat /usr/share/elasticsearch/probe/readiness.sh: no such file or directory\"\n"

The last error event in the output from "oc describe pod logging-es-hjrv78go-2-fb4hq" does not look good...

Note: indeed, there is no 'probe' directory on the ES pod...
# oc exec logging-es-hjrv78go-3-0rkcx -- ls /usr/share/elasticsearch
elasticsearch
index_patterns
index_templates

Another note: the questionable "172.30.224.19" is not found in the output from "oc describe pod ...", either.
Are you running 3.6? Looks like it is missing the readiness probe. Perhaps a rebuild of 3.6 is in order?
This change came in as part of the merge for https://github.com/fabric8io/openshift-auth-proxy/pull/13. It can temporarily be resolved by setting the ENV VARS in the DC to match https://github.com/jcantrill/origin-aggregated-logging/blob/0386569cacb132827cad125ccdc1eb56f4ff8a8d/deployer/templates/kibana.yaml#L127-L140. The issue is that there is a missing leading slash.
Fixed in https://github.com/openshift/openshift-ansible/pull/4035
Moving back to assigned, as I believe this is unrelated to what's referenced in comment #27. Looking at the configmap, I do not see any of the auth-related secrets that the auth proxy uses.
Can you retry with 3.6 latest, or at a minimum:

openshift-ansible-3.6.38-1
openshift-ansible-3.6.50-1

It looks like there was a fix related to this issue in:
https://github.com/openshift/openshift-ansible/pull/3911
Hi Jeff,

Unfortunately, I still see CrashLoopBackOff.

# oc get pods -l component=kibana
NAME                     READY     STATUS             RESTARTS   AGE
logging-kibana-1-jbnk0   1/2       CrashLoopBackOff   8          24m

The version of openshift-ansible:
# rpm -q openshift-ansible
openshift-ansible-3.6.49-1.git.0.5ba1856.el7.noarch

Good news: we don't see the Unhealthy Readiness warning about the missing probe script on the ES pod any more.

> CMDLINE "oc describe pod $espod" still shows a warning:
> Warning  Unhealthy  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]

According to the detailed kibana pod output, kibana itself looks to be running fine:
  name: kibana
  ready: true
  restartCount: 0
  state:
    running:
      startedAt: 2017-05-02T17:14:47Z

But kibana-proxy fails to start:
  name: kibana-proxy
  ready: false
  restartCount: 8
  state:
    waiting:
      message: Back-off 5m0s restarting failed container=kibana-proxy pod=logging-kibana-1-jbnk0_logging(b04909c2-2f5a-11e7-b95c-00188b89f3f7)
      reason: CrashLoopBackOff

This is one of the errors from "oc describe pod $kibanapod":
  35m  35m  1  kubelet, dell-pe-sc1435-01.rhts.englab.brq.redhat.com  Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "kibana-proxy" with CrashLoopBackOff: "Back-off 10s restarting failed container=kibana-proxy pod=logging-kibana-1-jbnk0_logging(b04909c2-2f5a-11e7-b95c-00188b89f3f7)"
Issue reproduced with openshift-ansible-playbooks-3.6.51-1.git.0.18eb563.el7.noarch

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-785z5       0/1       CrashLoopBackOff   6          33m
logging-es-iqqip04t-1-lb9wv   0/1       Running            0          33m
logging-fluentd-c61j2         1/1       Running            0          33m
logging-fluentd-fk5nq         1/1       Running            0          33m
logging-kibana-1-gpgrx        1/2       CrashLoopBackOff   11         33m

The above issue is caused by the current elasticsearch image with tag 3.6.0 on the brew and ops registries being too old; it doesn't contain the readiness probe script:
openshift3/logging-elasticsearch   3.6.0   32938f595638   3 weeks ago

I worked around the readiness probe by changing ES_REST_BASEURL from "https://localhost:9200" to "https://logging-es:9200" in https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/probe/readiness.sh#L20. That resolved the probe failure, but the kibana pod's crash is still encountered:

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-hntbv       1/1       Running            2          16m
logging-es-3y9oq0ti-2-7s6tz   1/1       Running            0          10m
logging-fluentd-3tbc4         1/1       Running            0          16m
logging-fluentd-437l4         1/1       Running            0          16m
logging-kibana-1-76f7t        1/2       CrashLoopBackOff   6          8m

Test env info:
# openshift version
openshift v3.6.61
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
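For reference, the workaround above amounts to a one-line change in the probe script (a sketch only; the original assignment is the one shown in the linked readiness.sh):

# original line in readiness.sh
ES_REST_BASEURL=https://localhost:9200
# workaround
ES_REST_BASEURL=https://logging-es:9200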
I don't understand what's going on. I'm using the latest 3.6 devenv. The readiness probe is using https://localhost:9200. I used set -x in the probe script, and used curl -v. It works fine. Here is the output:

+ ES_REST_BASEURL=https://localhost:9200
+ EXPECTED_RESPONSE_CODE=200
+ secret_dir=/etc/elasticsearch/secret
+ max_time=4
++ curl -v -s -X HEAD --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key --max-time 4 -w '%{response_code}' https://localhost:9200/
* About to connect() to localhost port 9200 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/elasticsearch/secret/admin-ca
  CApath: none
* NSS: client certificate from file
*       subject: CN=system.admin,OU=OpenShift,O=Logging
*       start date: May 03 14:51:15 2017 GMT
*       expire date: May 03 14:51:15 2019 GMT
*       common name: system.admin
*       issuer: CN=logging-signer-test
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*       subject: CN=logging-es,OU=OpenShift,O=Logging
*       start date: May 03 14:51:25 2017 GMT
*       expire date: May 03 14:51:25 2019 GMT
*       common name: logging-es
*       issuer: CN=logging-signer-test
> HEAD / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=UTF-8
< Content-Length: 0
<
* Connection #0 to host localhost left intact
+ response_code=200
+ '[' 200 == 200 ']'
+ exit 0

I don't understand why in your case localhost doesn't work but logging-es does work.
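For anyone reproducing this, the debug output above was produced by roughly the following edits to the probe script (a sketch only; the curl options shown are the ones visible in the trace):

# near the top of /usr/share/elasticsearch/probe/readiness.sh
set -x
# and add -v to the existing curl invocation:
curl -v -s -X HEAD --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key --max-time 4 -w '%{response_code}' https://localhost:9200/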
However, I noticed there are new kibana/es images with tag v3.6 on the brew registry, and when I did a further test today with the latest images, this issue was reproduced. Could you help take a further look? Thanks.

Images tested with:
logging-kibana          v3.6   dc571aa09d26   10 hours ago
logging-elasticsearch   v3.6   d2709cc1e16a   10 hours ago
logging-fluentd         v3.6   aafaf8787b29   10 hours ago
logging-curator         v3.6   028e689a3276   6 days ago
logging-auth-proxy      v3.6   11f731349ff9   2 days ago

The ansible version is:
openshift-ansible-playbooks-3.6.58-1.git.0.f4a514a.el7.noarch

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-9p8wz       1/1       Running            0          18m
logging-es-zed8jht4-1-rnjng   1/1       Running            0          18m
logging-fluentd-7fr0g         1/1       Running            0          18m
logging-fluentd-kzqfp         1/1       Running            0          18m
logging-kibana-1-990kf        1/2       CrashLoopBackOff   8          18m

Both the kibana and kibana-proxy containers looked fine (though kibana is waiting for Elasticsearch to create the .kibana index). The problem appears to be that ES somehow stopped after logging this line:
Create index template 'com.redhat.viaq-openshift-project.template.json'

The es/kibana logs are attached for your reference.
Created attachment 1277588 [details] the kibana,es logs on May 10, 2017
Created attachment 1277777 [details] logging 3.5 kibana dc info
@Noriko, I checked the logging 3.5 kibana dc info on one of my machines and the logging 3.6 kibana dc info on the machine mentioned in Comment 41. I think for this issue there are some properties missing in the 3.6 kibana dc, such as:

        - name: OAP_OAUTH_SECRET_FILE
          value: /secret/oauth-secret
        - name: OAP_SERVER_CERT_FILE
          value: /secret/server-cert
        - name: OAP_SERVER_KEY_FILE
          value: /secret/server-key
        - name: OAP_SERVER_TLS_FILE
          value: /secret/server-tls.json
        - name: OAP_SESSION_SECRET_FILE
          value: /secret/session-secret

You can compare the two files.
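As a hypothetical stop-gap until the dc is regenerated from the fixed template, the missing properties could be injected directly into the proxy container (a sketch only; it assumes the proxy secret is mounted at /secret in the kibana-proxy container, as in the 3.5 dc):

# oc set env dc/logging-kibana --containers=kibana-proxy \
    OAP_OAUTH_SECRET_FILE=/secret/oauth-secret \
    OAP_SERVER_CERT_FILE=/secret/server-cert \
    OAP_SERVER_KEY_FILE=/secret/server-key \
    OAP_SERVER_TLS_FILE=/secret/server-tls.json \
    OAP_SESSION_SECRET_FILE=/secret/session-secret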
Created attachment 1277786 [details] logging 3.6 kibana dc info
Cherry-picked fix: https://github.com/openshift/openshift-ansible/pull/4162
*** Bug 1449858 has been marked as a duplicate of this bug. ***
Bug reproduced with openshift-ansible-playbooks-3.6.68-1.git.0.9cbe2b7.el7.noarch

# openshift version
openshift v3.6.75
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-w61zr       1/1       Running            0          1m
logging-es-f5guu218-1-w1th4   1/1       Running            0          1m
logging-fluentd-07wjn         1/1       Running            0          1m
logging-fluentd-1165z         1/1       Running            0          1m
logging-kibana-1-9trkz        1/2       CrashLoopBackOff   2          1m

Since the bug fix PR was merged, I also tested with the latest ansible playbooks from https://github.com/openshift/openshift-ansible/tree/master (head commit 15fd42020a0b5fee665c45cd23b9ba3bd152251d); bug still reproduced:

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-8bvtc       1/1       Running            0          4m
logging-es-nqqlsk0x-1-rh4wt   1/1       Running            0          4m
logging-fluentd-lz5vs         1/1       Running            0          4m
logging-fluentd-p9pj5         1/1       Running            0          4m
logging-kibana-1-r7mbl        1/2       CrashLoopBackOff   5          4m
Tested on the following env and openshift-ansible packages.

# openshift version
openshift v3.6.76
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# rpm -qa | grep openshift-ansible
openshift-ansible-lookup-plugins-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-playbooks-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-callback-plugins-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-roles-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-docs-3.6.84-1.git.0.72b2d74.el7.noarch
openshift-ansible-filter-plugins-3.6.84-1.git.0.72b2d74.el7.noarch

We get the following pods; there is one logging-kibana deployer pod.

# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-w5wt7                   1/1       Running   1          9m
logging-es-data-master-v5gmycm6-1-l3wsq   0/1       Running   0          10m
logging-fluentd-brh9v                     1/1       Running   0          9m
logging-fluentd-x6xhr                     1/1       Running   0          9m
logging-kibana-1-deploy                   1/1       Running   0          9m
logging-kibana-1-qfgbt                    1/2       Running   0          9m

After a few minutes, the logging-kibana pod is gone, and the logging-kibana deployer pod's status changed to Error.

# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-w5wt7                   1/1       Running   1          15m
logging-es-data-master-v5gmycm6-1-l3wsq   1/1       Running   0          16m
logging-fluentd-brh9v                     1/1       Running   0          15m
logging-fluentd-x6xhr                     1/1       Running   0          15m
logging-kibana-1-deploy                   0/1       Error     0          15m
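For reference, the deployer pod's own log usually explains why the rollout was marked failed (a sketch only; the pod name is from the listing above):

# oc logs logging-kibana-1-deploy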
Please provide the logs for:

1. the kibana container
2. the kibana-proxy container

Looking at the dc from the 3.6 attachment https://bugzilla.redhat.com/attachment.cgi?id=1277786, it is missing:
https://github.com/openshift/openshift-ansible/blob/openshift-ansible-3.6.84-1/roles/openshift_logging_kibana/templates/kibana.j2#L120-L133

I'm assuming kibana is starting but kibana-proxy is failing due to an inability to find the secrets. Can you please reconfirm, given that the openshift-ansible version you referenced says they should be there? My other thought is that it might be the opposite: the 3.6 images have not been synced with origin recently, and that sync is just happening now.
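For reference, one way to gather the requested logs and to check the dc for the proxy settings (a sketch only; the pod name is illustrative, and --previous shows the output of the last crashed container instance):

# oc logs logging-kibana-1-qfgbt -c kibana
# oc logs logging-kibana-1-qfgbt -c kibana-proxy
# oc logs --previous logging-kibana-1-qfgbt -c kibana-proxy
# oc get dc logging-kibana -o yaml | grep -A 1 'name: OAP_'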
Also see https://bugzilla.redhat.com/show_bug.cgi?id=1452807. I think the original problem described by this bz ("Could not read TLS opts from secret/server-tls.json; error was: Error: ENOENT: no such file or directory, open 'secret/server-tls.json'") is fixed. This error no longer occurs. We now see the issue described in bz 1452807.
Retested again, get the following error for the kibana-proxy container:

# oc logs logging-kibana-1-qndht -c kibana-proxy
Could not read TLS opts from /secret/server-tls.json; error was: Error: ENOENT: no such file or directory, open '/secret/server-tls.json'
Starting up the proxy with auth mode "oauth2" and proxy transform "user_header,token_header".

Attached es, kibana pod info and kibana dc info.

Environment info:
# openshift version
openshift v3.6.85
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# docker images | grep logging
openshift3/logging-kibana          v3.6   03bd6dfe1a53   21 hours ago   342.4 MB
openshift3/logging-elasticsearch   v3.6   6f311a1b9e0c   21 hours ago   404.6 MB
openshift3/logging-fluentd         v3.6   a0f8e4ccb888   21 hours ago   232.5 MB
openshift3/logging-auth-proxy      v3.6   a4bfb6537dcc   21 hours ago   229.6 MB
openshift3/logging-curator         v3.6   028e689a3276   3 weeks ago    211.1 MB
Created attachment 1282798 [details] kibana dc, pods info
Add ansible info:

# rpm -qa | grep openshift-ansible
openshift-ansible-callback-plugins-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-docs-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-lookup-plugins-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-filter-plugins-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-playbooks-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-3.6.85-1.git.0.109a54e.el7.noarch
openshift-ansible-roles-3.6.85-1.git.0.109a54e.el7.noarch
Assigning back; there is no kibana pod generated.

# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-jg9px                   1/1       Running   0          1h
logging-es-data-master-s89krelm-1-4j3h0   1/1       Running   0          1h
logging-fluentd-1mxvz                     1/1       Running   0          1h
logging-fluentd-73hxv                     1/1       Running   0          1h
logging-kibana-1-deploy                   0/1       Error     0          1h

Testing environment:
# openshift version
openshift v3.6.85
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Image IDs from brew registry:
# docker images | grep logging
openshift3/logging-auth-proxy      v3.6   d043e446a08d   2 days ago    230.2 MB
openshift3/logging-kibana          v3.6   b2ee235a5512   2 days ago    342.4 MB
openshift3/logging-elasticsearch   v3.6   05cb395dd2b2   2 days ago    404.5 MB
openshift3/logging-fluentd         v3.6   67ee8da21667   2 days ago    232.5 MB
openshift3/logging-curator         v3.6   028e689a3276   4 weeks ago   211.1 MB
Please feel free to move back to ON_QA; the root cause of comment #55 (and possibly also of comment #48) is actually the kibana readiness probe failing, which is reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1458652

During the kibana pod's deployment phase, I see the kibana-proxy container can be ready now:

# oc get po
NAME                                      READY     STATUS    RESTARTS   AGE
logging-curator-1-vs9sj                   1/1       Running   0          6m
logging-es-data-master-lfw1lt94-1-8k1vg   1/1       Running   0          6m
logging-fluentd-lkv5r                     1/1       Running   0          6m
logging-fluentd-zkfxz                     1/1       Running   0          6m
logging-kibana-1-7bdgz                    1/2       Running   0          6m
logging-kibana-1-deploy                   1/1       Running   0          6m
Checking the source code for v3.6, it looks like the Dockerfile is missing the 'KIBANA_HOME' variable, and this must lead to a failure of run.sh.

The commit from Noriko [1] updated the run.sh [2] script to match the one upstream [3], but didn't include the changes for the Dockerfile [4], where upstream defines 'KIBANA_HOME' [5].

[1] http://pkgs.devel.redhat.com/cgit/rpms/logging-kibana-docker/commit/?h=rhaos-3.6-rhel-7&id=0347227a720380ec42ea1c5d4aaaf23577475a97
[2] http://pkgs.devel.redhat.com/cgit/rpms/logging-kibana-docker/tree/run.sh?h=rhaos-3.6-rhel-7
[3] https://raw.githubusercontent.com/openshift/origin-aggregated-logging/ef5e093cda8fde7aefd3950b25eec9a98a1559d4/kibana/run.sh
[4] http://pkgs.devel.redhat.com/cgit/rpms/logging-kibana-docker/tree/Dockerfile?h=rhaos-3.6-rhel-7
[5] https://raw.githubusercontent.com/openshift/origin-aggregated-logging/ef5e093cda8fde7aefd3950b25eec9a98a1559d4/kibana/Dockerfile
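For reference, a quick way to confirm the missing variable in a running image (a sketch only; the pod name is illustrative, and the expected value is whatever the upstream Dockerfile [5] defines):

# oc exec logging-kibana-1-qndht -c kibana -- env | grep KIBANA_HOME

No output here would confirm that KIBANA_HOME is not defined in the image.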
Thanks for the clarification, Jan. You are right. Sorry, I missed updating the kibana Dockerfile in the dist-git. I'm adding it and rebuilding the kibana image.
Building the kibana image is done:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=565287
logging-kibana-docker-v3.6.109-2

Sorry for introducing the undefined variable 'KIBANA_HOME' bug. The change was reverted:
  Revert "Updating Dockerfile version and release v3.6.99 2"
and the readiness probe was added:
  Updating Dockerfile version and release v3.6.109 2
  Add a readiness probe to the Kibana image
*** Bug 1452807 has been marked as a duplicate of this bug. ***
Add TestBlocker, since Bug 1452807 was blocking some logging testing.
Using the logging-kibana image in comment 59, the logging-kibana pod starts and both the kibana and kibana-proxy containers go into the Ready condition. Can we get this fix expedited and an image pushed to the ops mirror?

I tried the official 3.6.116 image and it still has the issue of the kibana pod going into CrashLoopBackOff.
(In reply to Mike Fiedler from comment #62)
> Using the logging-kibana image in comment 59, the logging-kibana pod starts
> and both kibana and kibana-proxy containers go into Ready condition. Can
> we get this fix expedited and an image pushed to the ops mirror?
>
> I tried the official 3.6.116 image and it still has the issue of kibana pod
> going CrashLoopBackoff.

3.6.116 is supposed to have the fix. Please mark FailedQA. Sorry for the inconvenience.
Correction to comment 62. I am seeing the same behavior with the image from comment 59 and with 3.6.116. The kibana pod is still going into CrashLoopBackOff, but without the kibana-proxy errors seen in comment 51 and in Bug 1452807. The kibana and kibana-proxy logs (attached) now appear to be clean, but the logging-kibana pod is cycling between fully Ready, Error, and CrashLoopBackOff. Please let me know what further info I can gather.

logging-kibana-5-pq39j   2/2   Running            1   19s   172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   Error              1   23s   172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   CrashLoopBackOff   1   35s   172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   2/2   Running            2   37s   172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   Error              2   41s   172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   CrashLoopBackOff   2   56s   172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   2/2   Running            3   1m    172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   Error              3   1m    172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   CrashLoopBackOff   3   1m    172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   2/2   Running            4   1m    172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   Error              4   2m    172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
logging-kibana-5-pq39j   1/2   CrashLoopBackOff   4   2m    172.20.3.24   ip-172-31-18-156.us-west-2.compute.internal
Created attachment 1289415 [details] kibana and kibana-proxy logs from 3.6.116
Tested with:
registry.ops.openshift.com/openshift3/logging-fluentd      v3.6.116   6819bdde7f83   47 hours ago   233.1 MB
registry.ops.openshift.com/openshift3/logging-kibana       v3.6.116   a3e7c14233be   47 hours ago   342.4 MB
registry.ops.openshift.com/openshift3/logging-auth-proxy   v3.6.116   4ecd26a8e9c5   47 hours ago   229.6 MB
registry.ops.openshift.com/openshift3/logging-fluentd      v3.6.114   a7e65663f572   3 days ago     233.1 MB
nhosoi/logging-kibana-docker   rhaos-3.6-rhel-7-docker-candidate-53963-20170614213755   03db141a6026   5 days ago   342.4 MB

Tried both the 3.6.116 and nhosoi/logging-kibana-docker images, with the results reported in comment 64.
Fixed in koji_builds:
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=567197

repositories:
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:rhaos-3.6-rhel-7-docker-candidate-88157-20170621202522
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:latest
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:v3.6
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:v3.6.122
brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-auth-proxy:v3.6.122-2
Running with the logging-auth-proxy container image from comment 67 results in a logging-kibana pod which repeatedly exits (apparently normally) and restarts:

logging-kibana-5-d65st   0/2   ContainerCreating   0   0s
logging-kibana-5-d65st   1/2   Running             0   9s
logging-kibana-5-d65st   2/2   Running             0   19s
logging-kibana-5-d65st   1/2   Completed           0   24s
logging-kibana-5-d65st   2/2   Running             1   26s
logging-kibana-5-d65st   1/2   Completed           1   30s
logging-kibana-5-d65st   1/2   CrashLoopBackOff    1   44s
logging-kibana-5-d65st   2/2   Running             2   45s
logging-kibana-5-d65st   1/2   Completed           2   49s
logging-kibana-5-d65st   1/2   CrashLoopBackOff    2   1m

Logs attached. I downloaded the image from brew; I can also retry when it is pushed to the ops mirror.
Created attachment 1290465 [details] kibana and kibana-proxy logs from 3.6.122 internal build
The issue is fixed; the kibana pods are in Running status and log entries can be retrieved from the kibana UI.

# oc get po
NAME                                          READY     STATUS    RESTARTS   AGE
logging-curator-1-12sq7                       1/1       Running   0          30m
logging-curator-ops-1-wf3xz                   1/1       Running   0          30m
logging-es-data-master-ljq0gu76-1-d2kw8       1/1       Running   0          31m
logging-es-ops-data-master-5kx189jv-1-b2cmf   1/1       Running   0          31m
logging-fluentd-6tw96                         1/1       Running   0          30m
logging-fluentd-l9wr3                         1/1       Running   0          30m
logging-kibana-1-nxbjf                        2/2       Running   0          30m
logging-kibana-ops-1-4lgww                    2/2       Running   0          30m

Images from brew registry:
logging-elasticsearch   v3.6   19ad6f8e4738   29 minutes ago   404.2 MB
logging-auth-proxy      v3.6   d94bddb3dcba   8 hours ago      214.8 MB
logging-kibana          v3.6   4eabc3acd717   21 hours ago     342.4 MB
logging-fluentd         v3.6   08e8a59602fe   21 hours ago     232.5 MB
logging-curator         v3.6   a0148dd96b8d   2 weeks ago      221.5 MB
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716