Bug 1480612

Summary: After executing the redeploy-openshift-ca.yml playbook, Kibana and Elasticsearch stopped working
Product: OpenShift Container Platform
Component: Installer
Version: 3.4.1
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED WONTFIX
Type: Bug
Reporter: Joel Rosental R. <jrosenta>
Assignee: Scott Dodson <sdodson>
QA Contact: Johnny Liu <jialiu>
CC: abutcher, aos-bugs, jokerman, jrosenta, mmccomas, rmeggins, sdodson
Keywords: Unconfirmed
Last Closed: 2018-05-02 17:55:12 UTC

Description Joel Rosental R. 2017-08-11 13:38:58 UTC
Description of problem:
After executing ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-openshift-ca.yml, Kibana and Elasticsearch stop working.

The problem seems to be related to contacting the Kibana and Elasticsearch services:

> oc logs heapster-lb0oc
E0803 12:23:06.592400       1 client.go:192] Could not update tags: Hawkular returned status code 500, error message: Failed to perform operation due to an error: All host(s) tried for query failed (no host was tried)
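
The Hawkular 500 above suggests the metrics stack may be affected by the same CA rotation. A quick way to confirm (not part of the original report; pod name is a placeholder) is to look at the Hawkular Metrics pod in the metrics namespace:

# Metrics components run in openshift-infra by default in 3.4.
oc get pods -n openshift-infra
# Substitute the actual hawkular-metrics pod name from the listing:
oc logs hawkular-metrics-<id> -n openshift-infra | tail -50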

Logs can be viewed in the console, but reaching Kibana via the archive link or directly at its URL is not possible.
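
A useful first check here (not in the original report) is the auth-proxy container of the Kibana pod, since the pod runs two containers (2/2 in the pod listing below) and TLS failures against the master after a CA redeploy typically surface there. A minimal sketch, assuming the standard kibana/kibana-proxy container names used by OpenShift aggregated logging:

# Inspect both containers of the Kibana pod; the proxy container is
# the usual place TLS/CA errors show up after the cluster CA changes.
oc project logging
oc logs logging-kibana-3-g1edq -c kibana        # the Kibana app itself
oc logs logging-kibana-3-g1edq -c kibana-proxy  # the OAuth/auth proxy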

Latest log of the logging-kibana pod:

green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2017-08-10T02:37:25+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 3000ms","prevState":"green","prevMsg":"Kibana index ready"}
{"type":"log","@timestamp":"2017-08-10T02:37:28+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 3000ms"}
{"type":"log","@timestamp":"2017-08-10T09:17:39+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 30000ms","prevState":"green","prevMsg":"Kibana index ready"}
{"type":"log","@timestamp":"2017-08-10T09:17:42+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2017-08-10T16:01:05+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 3000ms","prevState":"green","prevMsg":"Kibana index ready"}
{"type":"log","@timestamp":"2017-08-10T16:01:09+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 3000ms"}

Entering the site directly throws a "502 Bad Gateway" error.
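
One way to see which CA signed the certificate the route is serving (hostname taken from the logging-deployer configmap further below; substitute the real route host) is to probe it with curl or openssl:

# Probe the Kibana route and print the server certificate details.
curl -kv https://kibana.foo.com/ 2>&1 | grep -E 'subject|issuer|HTTP'

# Or dump the certificate with openssl for closer inspection:
openssl s_client -connect kibana.foo.com:443 -servername kibana.foo.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates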

> oc get pods
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-g0vyx       1/1       Running   0          49m
logging-es-bxgpgroo-8-e1xe9   1/1       Running   0          23m
logging-fluentd-h3bpk         1/1       Running   3          120d
logging-fluentd-idcsn         1/1       Running   3          120d
logging-fluentd-ttos3         1/1       Running   2          148d
logging-kibana-3-g1edq        2/2       Running   0          2h

> oc logs logging-fluentd-h3bpk   (the other logging-fluentd pods give the same output)
....
2017-08-03 17:03:42 +0200 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) (Errno::ECONNREFUSED)
2017-08-03 17:03:42 +0200 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Connection refused - connect(2) (Errno::ECONNREFUSED)
2017-08-03 17:03:44 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:28 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
  2017-08-03 17:03:44 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:44 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:33 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
  2017-08-03 17:03:44 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:46 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:35 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
  2017-08-03 17:03:46 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:46 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:30 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
  2017-08-03 17:03:46 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:46 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:39 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
  2017-08-03 17:03:46 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:34 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
  2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:47 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
  2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:03:43 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
  2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:47 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:04:04 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:1831a34"
  2017-08-03 17:03:47 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:03:48 +0200 [warn]: temporarily failed to flush the buffer. next_retry=2017-08-03 17:04:00 +0200 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"})!" plugin_id="object:17fe710"
  2017-08-03 17:03:48 +0200 [warn]: suppressed same stacktrace
2017-08-03 17:04:01 +0200 [warn]: retry succeeded. plugin_id="object:17fe710"
2017-08-03 17:04:05 +0200 [warn]: retry succeeded. plugin_id="object:1831a34"

Redeploying logging doesn't seem to help either:

$ oc new-app logging-deployer-template -p MODE=reinstall -p IMAGE_VERSION=3.4.1
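
If the reinstall is suspected of reusing the old certificates, one way to check (not from the original report) is to inspect which CA issued the certs now stored in the logging secrets. A minimal sketch, assuming the default secret names and the "cert" data key used by the logging-kibana and logging-fluentd secrets:

#!/bin/bash
# Print the issuer and validity of the client certs held in the
# logging secrets; secret names and the "cert" data key are the
# defaults and may differ in a given deployment.
for secret in logging-kibana logging-fluentd; do
  echo "== $secret =="
  oc get secret "$secret" -n logging -o jsonpath='{.data.cert}' \
    | base64 -d | openssl x509 -noout -issuer -dates
done
# For comparison, the redeployed cluster CA on a master host:
openssl x509 -noout -subject -in /etc/origin/master/ca.crt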

Version-Release number of selected component (if applicable):
/root/buildinfo/Dockerfile-openshift3-logging-elasticsearch-3.4.1-34
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-3.4.1-20
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-3.4.1-20
/root/buildinfo/Dockerfile-openshift3-logging-fluentd-3.4.1-20
/root/buildinfo/Dockerfile-openshift3-logging-kibana-3.4.1-21
/root/buildinfo/Dockerfile-openshift3-logging-auth-proxy-3.4.1-23

How reproducible:
N/A

Steps to Reproduce:
1. ansible-playbook -v -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/redeploy-openshift-ca.yml

2. Observe that Kibana and Elasticsearch start failing once the CA has been redeployed.

Actual results:
Kibana and ES stopped working.

Expected results:
Kibana and ES should keep working.

Additional info:
Running this script (from inside a fluentd pod, which mounts the certs at /etc/fluent/keys):
#!/bin/bash
# Verify that fluentd's mounted client certs can still authenticate
# directly against the Elasticsearch service.

_BASE=/etc/fluent/keys
_CA=$_BASE/ca
_CERT=$_BASE/cert
_KEY=$_BASE/key
ls -l $_CA $_CERT $_KEY

ES_URL='https://logging-es:9200'
curl_get="curl -X GET --cacert $_CA --cert $_CERT --key $_KEY"
$curl_get $ES_URL/?pretty

gives this output:

lrwxrwxrwx. 1 root root  9 Aug  8 11:16 /etc/fluent/keys/ca -> ..data/ca
lrwxrwxrwx. 1 root root 11 Aug  8 11:16 /etc/fluent/keys/cert -> ..data/cert
lrwxrwxrwx. 1 root root 10 Aug  8 11:16 /etc/fluent/keys/key -> ..data/key
{
  "name" : "Terrax the Tamer",
  "cluster_name" : "logging-es",
  "cluster_uuid" : "MqlZ5H4aS9mLHk0RjT5JRg",
  "version" : {
    "number" : "2.4.1",
    "build_hash" : "945a6e093cc306cec722eb0207b671962b6d8905",
    "build_timestamp" : "2016-11-17T20:39:42Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}
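
Since the fluentd client certs above still validate against Elasticsearch, the ES-side chain looks intact. A follow-up check worth doing (hypothetical, not in the report) is whether the CA mounted in the pods matches the redeployed cluster CA:

# Fingerprint the CA fluentd is using versus the redeployed cluster CA
# (run the first command inside a fluentd pod, the second on a master,
# where /etc/origin/master/ca.crt is the standard location).
openssl x509 -noout -fingerprint -in /etc/fluent/keys/ca
openssl x509 -noout -fingerprint -in /etc/origin/master/ca.crt
# Differing fingerprints would mean the logging secrets still carry
# certs from the old CA chain.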

- `oc get configmap logging-deployer -o yaml`:

apiVersion: v1
data:
  es-cluster-size: "1"
  es-instance-ram: 2G
  kibana-hostname: kibana.foo.com
  public-master-url: https://openshift.foo.com:8443
kind: ConfigMap
metadata:
  creationTimestamp: 2017-03-08T13:03:44Z
  name: logging-deployer
  namespace: logging
  resourceVersion: "24193997"
  selfLink: /api/v1/namespaces/logging/configmaps/logging-deployer
  uid: a8d52f1b-03ff-11e7-a22c-005056b55614

Comment 3 Jeff Cantrill 2017-08-24 14:32:06 UTC
The logging certs to be updated are originally generated here: https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging/tasks/generate_certs.yaml

The metrics certs to be updated are originally generated here: https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_metrics/tasks/generate_certificates.yaml
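
For reference, one manual way to force those tasks to regenerate the logging certs against the new CA would be to remove the existing cert-bearing secrets and re-run the logging playbook. A sketch, assuming the default secret names and the 3.6-era playbook path; both should be verified against the installed version:

# Delete the logging secrets so generate_certs.yaml recreates them,
# then re-run the logging install.
oc delete secret logging-elasticsearch logging-kibana logging-kibana-proxy \
    logging-fluentd logging-curator -n logging
ansible-playbook -i /etc/ansible/hosts \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml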

Can we move this to an installer issue and have someone from your team address it?

Setting the target release to 3.6 since we will need it there, and it should probably be backported to 3.5.

Comment 4 Scott Dodson 2017-08-24 16:05:52 UTC
Yeah that's fine.