Description of problem:

After executing

  ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-openshift-ca.yml

Kibana and Elasticsearch stop working. The problem seems to be related to contacting the kibana/elasticsearch services:

> oc logs heapster-lb0oc
E0803 12:23:06.592400       1 client.go:192] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (no host was tried)

Logs can be visualized in the console, but reaching Kibana via the archive link or directly at the site is not possible.

Latest log of the kibana-logging pod (first entry truncated):

green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2017-08-10T02:37:25+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 3000ms","prevState":"green","prevMsg":"Kibana index ready"}
{"type":"log","@timestamp":"2017-08-10T02:37:28+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 3000ms"}
{"type":"log","@timestamp":"2017-08-10T09:17:39+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 30000ms","prevState":"green","prevMsg":"Kibana index ready"}
{"type":"log","@timestamp":"2017-08-10T09:17:42+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2017-08-10T16:01:05+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 3000ms","prevState":"green","prevMsg":"Kibana index ready"}
{"type":"log","@timestamp":"2017-08-10T16:01:09+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 3000ms"}

Entering the site directly throws a "502 Bad Gateway" error.

> oc get pods
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-g0vyx       1/1       Running   0          49m
logging-es-bxgpgroo-8-e1xe9   1/1       Running   0          23m
logging-fluentd-h3bpk         1/1       Running   3          120d
logging-fluentd-idcsn         1/1       Running   3          120d
logging-fluentd-ttos3         1/1       Running   2          148d
logging-kibana-3-g1edq        2/2       Running   0          2h

> oc logs logging-fluentd-h3bpk
(the other logging-fluentd pods give the same)
....
2017-08-03 17:03:42 +0200 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) (Errno::ECONNREFUSED)
2017-08-03 17:03:42 +0200 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) (Errno::ECONNREFUSED)
2017-08-03 17:03:44 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:28 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
2017-08-03 17:03:44 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:44 +0200 [warn]: temporarily failed to flush the buffer.
next_retry=2017-08-03 17:03:33 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
2017-08-03 17:03:44 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:46 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:35 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
2017-08-03 17:03:46 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:46 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:30 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
2017-08-03 17:03:46 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:46 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:39 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
2017-08-03 17:03:46 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:34 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!"
plugin_id="object:17fe710"
2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:47 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:43 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:04:04 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:48 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:04:00 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
2017-08-03 17:03:48 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:04:01 +0200 [warn]: retry succeeded. plugin_id="object:17fe710"
2017-08-03 17:04:05 +0200 [warn]: retry succeeded.
plugin_id="object:1831a34"

Redeploying logging doesn't seem to help either:

$ oc new-app logging-deployer-template -p MODE=reinstall -p IMAGE_VERSION=3.4.1

Version-Release number of selected component (if applicable):

/root/buildinfo/Dockerfile-openshift3-logging-elasticsearch-3.4.1-34
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-3.4.1-20
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-3.4.1-20
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-3.4.1-20
/root/buildinfo/Dockerfile-openshift3-logging-kibana-3.4.1-21
/root/buildinfo/Dockerfile-openshift3-logging-auth-proxy-3.4.1-23

How reproducible:

N/A

Steps to Reproduce:
1. ansible-playbook -v -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-openshift-ca.yml
2. After the CA is updated, logging starts failing.

Actual results:

Kibana and Elasticsearch stopped working.

Expected results:

Kibana and Elasticsearch should keep working.

Additional info:

Running this script:

#!/bin/bash
_BASE=/etc/fluent/keys
_CA=$_BASE/ca
_CERT=$_BASE/cert
_KEY=$_BASE/key
ls -l $_CA $_CERT $_KEY
ES_URL='https://logging-es:9200'
curl_get="curl -X GET --cacert $_CA --cert $_CERT --key $_KEY"
$curl_get $ES_URL/?pretty

gives this output:

lrwxrwxrwx. 1 root root  9 Aug  8 11:16 /etc/fluent/keys/ca -> ..data/ca
lrwxrwxrwx. 1 root root 11 Aug  8 11:16 /etc/fluent/keys/cert -> ..data/cert
lrwxrwxrwx.
1 root root 10 Aug  8 11:16 /etc/fluent/keys/key -> ..data/key
{
  "name" : "Terrax the Tamer",
  "cluster_name" : "logging-es",
  "cluster_uuid" : "MqlZ5H4aS9mLHk0RjT5JRg",
  "version" : {
    "number" : "2.4.1",
    "build_hash" : "945a6e093cc306cec722eb0207b671962b6d8905",
    "build_timestamp" : "2016-11-17T20:39:42Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}

- `oc get configmap logging-deployer -o yaml`:

apiVersion: v1
data:
  es-cluster-size: "1"
  es-instance-ram: 2G
  kibana-hostname: kibana.foo.com
  public-master-url: https://openshift.foo.com:8443
kind: ConfigMap
metadata:
  creationTimestamp: 2017-03-08T13:03:44Z
  name: logging-deployer
  namespace: logging
  resourceVersion: "24193997"
  selfLink: /api/v1/namespaces/logging/configmaps/logging-deployer
  uid: a8d52f1b-03ff-11e7-a22c-005056b55614
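For what it's worth, the fluentd and Kibana symptoms above are consistent with the client certs in the logging secrets still being signed by the pre-redeploy CA. That trust break can be reproduced locally with nothing but openssl; all file names below are local placeholders, not cluster paths:

```shell
# Local sketch of the suspected failure mode: a client cert issued by the
# old CA stops verifying once the CA is regenerated. Placeholder files only.
tmp=$(mktemp -d) && cd "$tmp"

# Old CA, plus a client cert signed by it (analogous to what fluentd mounts).
openssl req -x509 -newkey rsa:2048 -nodes -keyout old-ca.key \
  -out old-ca.crt -subj "/CN=old-logging-ca" -days 1 2>/dev/null
openssl req -newkey rsa:2048 -nodes -keyout client.key \
  -out client.csr -subj "/CN=system.logging.fluentd" 2>/dev/null
openssl x509 -req -in client.csr -CA old-ca.crt -CAkey old-ca.key \
  -CAcreateserial -out client.crt -days 1 2>/dev/null

# Fresh CA, standing in for the one redeploy-openshift-ca.yml generates.
openssl req -x509 -newkey rsa:2048 -nodes -keyout new-ca.key \
  -out new-ca.crt -subj "/CN=new-logging-ca" -days 1 2>/dev/null

openssl verify -CAfile old-ca.crt client.crt          # client.crt: OK
openssl verify -CAfile new-ca.crt client.crt 2>&1 | grep -i error
```

The second verify fails, which would explain both the ECONNREFUSED/timeout churn from fluentd and the 502 from the Kibana auth proxy once only one side of the deployment has been re-signed.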
The logging certs to be updated are originally generated here:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging/tasks/generate_certs.yaml

The metrics certs to be updated are originally generated here:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_metrics/tasks/generate_certificates.yaml

Can we move this to an installer issue and have someone from your team address it?

Setting the target release to 3.6, since we will need it there; it should probably also be backported to 3.5.
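Until the installer handles this, a quick way to confirm the mismatch on a live cluster is comparing the SHA-256 fingerprint of the CA mounted in the fluentd pod (/etc/fluent/keys/ca) against the master's regenerated CA. A minimal local sketch of that comparison, with freshly generated placeholder CAs standing in for the two cluster files:

```shell
# Compare CA fingerprints. In the cluster the two inputs would be the CA
# from the fluentd pod and the CA on the master; here both are placeholders.
tmp=$(mktemp -d) && cd "$tmp"
openssl req -x509 -newkey rsa:2048 -nodes -keyout pod-ca.key \
  -out pod-ca.crt -subj "/CN=logging-ca" -days 1 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout master-ca.key \
  -out master-ca.crt -subj "/CN=openshift-ca" -days 1 2>/dev/null

fp() { openssl x509 -in "$1" -noout -fingerprint -sha256 | cut -d= -f2; }

if [ "$(fp pod-ca.crt)" = "$(fp master-ca.crt)" ]; then
  echo "CA matches: fluentd trusts the current CA"
else
  echo "CA mismatch: fluentd still has the pre-redeploy CA"
fi
```

Since the two placeholder CAs are generated independently, this prints the mismatch branch, which is the state the logging stack is left in after the playbook runs.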
Yeah that's fine.