Bug 1469814
| Summary: | Timeouts from Kibana -> Elasticsearch, trouble keeping master | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.4.1 | CC: | aos-bugs, pportant, pweil, rmeggins, sghosh, stwalter, szobair |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-11 16:27:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Steven Walter
2017-07-11 21:48:46 UTC
@Peter, do you understand what might be going on here? It looks like the master keeps dropping out and coming back. The initial SSL negotiation error looks like it may be happening during an upgrade, as various pods are bumped?

We are also seeing the issue on a Production cluster. We have 4000 projects running and I think that is causing the issue. Is there any way (preferably a ConfigMap or environment variable) to increase the value of elasticsearch.requestTimeout: 30000 from /etc/kibana/kibana.yml?

KIBANA LOG:

{"type":"response","@timestamp":"2017-08-14T16:59:37+00:00","tags":[],"pid":1,"method":"post","statusCode":200,"req":{"url":"/elasticsearch/_mget?timeout=0&ignore_unavailable=true&preference=1502729976748","method":"post","headers":{"forwarded":"for=171.206.1.139;host=kibana.useast20.app.example.com;proto=https","x-forwarded-proto":"https","x-forwarded-port":"443","x-forwarded-host":"kibana.useast20.app.example.com","cache-control":"no-cache","content-length":"62","host":"kibana.useast20.app.example.com","user-agent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; InfoPath.3)","accept-encoding":"gzip, deflate","accept-language":"en-US","referer":"https://kibana.useast20.app.example.com/app/kibana","kbn-version":"4.5.4","accept":"application/json, text/plain, */*","content-type":"application/json;charset=utf-8","connection":"close","x-proxy-remote-user":"ZK566H0"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1","referer":"https://kibana.useast20.app.example.com/app/kibana"},"res":{"statusCode":200,"responseTime":30005,"contentLength":9},"message":"POST /elasticsearch/_mget?timeout=0&ignore_unavailable=true&preference=1502729976748 200 30005ms - 9.0B"}

{"type":"log","@timestamp":"2017-08-14T17:01:03+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 30000ms","prevState":"green","prevMsg":"Kibana index ready"}

{"type":"log","@timestamp":"2017-08-14T17:01:07+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 30000ms"}

ES CLUSTER status:

[root@lltwe5lgpisd1 ~]# oc rsh logging-es-032jt0xp-2-0jwdf
sh-4.2$ curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "logging-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 67,
  "active_shards" : 201,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
sh-4.2$ curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
    67        6.7gb    27.1gb    366.4gb    393.5gb            6 10.128.4.13   10.128.4.13   Magician
    67        6.7gb    27.1gb    366.4gb    393.5gb            6 10.128.3.147  10.128.3.147  Turac
    67        6.7gb      27gb    366.4gb    393.5gb            6 10.130.3.156  10.130.3.156  Volcana
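Given the "trouble keeping master" symptom above, a quick way to confirm whether the elected master is actually changing is to poll the read-only _cat endpoints from inside an ES pod. This is only a diagnostic sketch, assuming the same admin cert/key paths used in the session above:

    # Run from inside any ES pod (oc rsh <es-pod>); both queries are read-only.
    # Which node currently holds the elected master role:
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/master?v'
    # All nodes with heap usage and load, to spot a node dropping out:
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/nodes?v&h=name,master,heap.percent,load'

Running the first query several times in a row should keep returning the same node; if it does not, the master really is flapping rather than Kibana simply timing out.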
If you want to try experimenting with different values in kibana.yml:

- copy the existing config file out of the pod:
  oc exec $kibana_pod -c kibana -- cat /opt/kibana/config/kibana.yml > kibana.yml.orig
- copy it:
  cp kibana.yml.orig kibana.yml
- create a configmap from this:
  oc create configmap logging-kibana --from-file=kibana.yml=kibana.yml
- edit the configmap as needed:
  oc edit configmap logging-kibana
- edit the kibana dc to add a volume and volume mount for the config:
  oc edit dc logging-kibana
  In the section for the container named kibana, in the volumeMounts section, add something like this:
    - mountPath: /opt/kibana/config
      name: kibanaconfig
      readOnly: true
  In the volumes section, add something like this:
    - name: kibanaconfig
      configMap:
        defaultMode: 420
        name: logging-kibana
- you may have to redeploy kibana:
  oc rollout latest dc/logging-kibana
  oc rollout status -w dc/logging-kibana
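As an illustration of the kind of change the steps above enable, here is a minimal sketch of raising the Kibana request timeout in the copied kibana.yml before creating the configmap. The elasticsearch.requestTimeout key is the one quoted earlier in this bug; the 60000 value is only an example, not a recommended setting:

    # Assumes kibana.yml has been copied out of the pod as in the steps above.
    # Raise the Kibana -> Elasticsearch request timeout (milliseconds):
    if grep -q '^elasticsearch.requestTimeout:' kibana.yml; then
      sed -i 's/^elasticsearch\.requestTimeout:.*/elasticsearch.requestTimeout: 60000/' kibana.yml
    else
      echo 'elasticsearch.requestTimeout: 60000' >> kibana.yml
    fi
    # Then create the configmap, mount it, and redeploy exactly as described above.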
(In reply to Shah Zobair from comment #7)
> We are also seeing the issue on a Production cluster. We have 4000 projects
> running and I think it is causing the issue.
> .
> .
> .
> "number_of_nodes" : 3,
> "number_of_data_nodes" : 3,
> "active_primary_shards" : 67,
> "active_shards" : 201,

This says you only have 67 indices (each with 2 replicas, for a total of 201 shards). If you had even 3,000 active projects one would expect thousands of primary shards; with logs from 4,000 active projects, you should have roughly 4,000 primary shards and 12,000 active shards.

This is likely another problem. What do the deployment configurations for Elasticsearch look like?

(In reply to Peter Portante from comment #9)
> (In reply to Shah Zobair from comment #7)
> > We are also seeing the issue on a Production cluster. We have 4000 projects
> > running and I think it is causing the issue.
> > .
> > .
> > .
> > "number_of_nodes" : 3,
> > "number_of_data_nodes" : 3,
> > "active_primary_shards" : 67,
> > "active_shards" : 201,
>
> This says you only have 67 indices (each with 2 replicas, for a total of 201
> shards). If you had even 3,000 active projects one would expect thousands of
> primary shards; with logs from 4,000 active projects, you should have roughly
> 4,000 primary shards and 12,000 active shards.
>
> This is likely another problem. What do the deployment configurations for
> Elasticsearch look like?

Actually, I cleared the ES storage (shards) and spun up a new logging cluster with updated images after getting this error. I thought it might solve the issue, and that is why we have so few shards now.

apiVersion: v1
kind: DeploymentConfig
metadata:
  creationTimestamp: 2017-08-10T23:06:17Z
  generation: 8
  labels:
    component: es
    logging-infra: elasticsearch
    provider: openshift
  name: logging-es-032jt0xp
  namespace: logging
  resourceVersion: "36078401"
  selfLink: /oapi/v1/namespaces/logging/deploymentconfigs/logging-es-032jt0xp
  uid: 83dbb0db-7e20-11e7-ac06-1c98ec21ea40
spec:
  replicas: 1
  selector:
    component: es
    deployment: logging-es-032jt0xp
    provider: openshift
  strategy:
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: es
        deployment: logging-es-032jt0xp
        provider: openshift
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: KUBERNETES_TRUST_CERT
          value: "true"
        - name: SERVICE_DNS
          value: logging-es-cluster
        - name: CLUSTER_NAME
          value: logging-es
        - name: INSTANCE_RAM
          value: 4G
        - name: NODE_QUORUM
          value: "2"
        - name: RECOVER_AFTER_NODES
          value: "2"
        - name: RECOVER_EXPECTED_NODES
          value: "3"
        - name: RECOVER_AFTER_TIME
          value: 5m
        image: docker-registry-lab.sdi.corp.example.com:6555/openshift3/logging-elasticsearch:3.4.1
        imagePullPolicy: Always
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: restapi
          protocol: TCP
        - containerPort: 9300
          name: cluster
          protocol: TCP
        resources:
          limits:
            memory: 4Gi
          requests:
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        volumeMounts:
        - mountPath: /etc/elasticsearch/secret
          name: elasticsearch
          readOnly: true
        - mountPath: /usr/share/java/elasticsearch/config
          name: elasticsearch-config
          readOnly: true
        - mountPath: /elasticsearch/persistent
          name: elasticsearch-storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext:
        supplementalGroups:
        - 65534
      serviceAccount: aggregated-logging-elasticsearch
      serviceAccountName: aggregated-logging-elasticsearch
      terminationGracePeriodSeconds: 600
      volumes:
      - name: elasticsearch
        secret:
          defaultMode: 420
          secretName: logging-elasticsearch
      - configMap:
          defaultMode: 420
          name: logging-elasticsearch
        name: elasticsearch-config
      - name: elasticsearch-storage
        persistentVolumeClaim:
          claimName: logging-es-3
  test: false
  triggers: []
status:
  conditions:
  - lastTransitionTime: 2017-08-10T23:06:18Z
    message: Deployment config does not have minimum availability.
    status: "False"
    type: Available
  - lastTransitionTime: 2017-08-10T23:54:14Z
    message: Replication controller "logging-es-032jt0xp-2" has completed progressing
    reason: NewReplicationControllerAvailable
    status: "True"
    type: Progressing
  details:
    causes:
    - type: Manual
    message: manual change
  latestVersion: 2
  observedGeneration: 8
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
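To check the shard arithmetic above directly (67 primary shards with 2 replicas each gives 67 * 3 = 201 shards), the index and shard counts can be pulled from inside an ES pod and compared with the number of projects that should be logging. A sketch, reusing the same admin cert as earlier:

    # Number of indices (one primary shard per index in this setup):
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/indices?h=index' | wc -l
    # Total number of shards (primaries plus replicas):
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/shards?h=index' | wc -l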
There might be a limitation on the number of projects in the cluster. We have tested it in our LAB and found that if the number of projects is lower than 1800, Kibana works fine. Do we have any sizing matrix for this?

Shah,

1. Are you trying to use Kibana as an admin or a regular user?
2. Can you describe the type of query you are trying to perform?

I am trying to figure out if the issue you are experiencing is related to what is available to you in Kibana or some limitation in Elasticsearch. We are trying to improve performance of the logging stack, and this could be related to any number of issues: memory, cpu, thread usage, etc.

(In reply to Jeff Cantrill from comment #12)
> Shah,
>
> 1. Are you trying to use Kibana as an admin or a regular user?
> 2. Can you describe the type of query you are trying to perform?
>
> I am trying to figure out if the issue you are experiencing is related to
> what is available to you in Kibana or some limitation in Elasticsearch. We
> are trying to improve performance of the logging stack, and this could be
> related to any number of issues: memory, cpu, thread usage, etc.

Jeff,

I tried to access Kibana as both an admin and a regular user. I also limited the number of projects for the admin, but that did not solve it. As soon as we deleted projects down to 1800, it worked just fine. I don't think it is a CPU or memory issue, as the cluster is on bare metal with a large number of cores and plenty of memory. I have assigned 8GB for the ES pods. There might be some mapping limitation in ES such that it cannot work with a large number of projects (just a thought).

Peter,

In IRC you suggested this could be related to disk IO, or the lack thereof.

(In reply to Jeff Cantrill from comment #14)
> Peter,
>
> In IRC you suggested this could be related to disk IO, or the lack thereof.

Possible, or the nodes where ES runs are over-committed and OOM kills keep happening. Can you reproduce the issue and provide us the output of https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh so we can get a snapshot of what's going on?

Customer closed the case because they reinstalled the logging stack. We now do not have a reproducer. If you wish, we can close as INSUFFICIENT_DATA.
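For completeness, a minimal sketch of collecting that snapshot with the dump script referenced above. The raw URL is derived from the GitHub link; the script's exact options are not documented in this bug, so check its own usage output before relying on it:

    # Fetch and run the logging dump script while logged in with cluster-admin rights.
    # (Assumes the raw URL matches the GitHub link above; verify the script's
    # supported options from its usage/help output first.)
    curl -sLO https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh
    chmod +x logging-dump.sh
    ./logging-dump.sh > logging-dump.out 2>&1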