Bug 1469814
| Summary: | Timeouts from Kibana -> Elasticsearch, trouble keeping master | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.4.1 | CC: | aos-bugs, pportant, pweil, rmeggins, sghosh, stwalter, szobair |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-11 16:27:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Steven Walter
2017-07-11 21:48:46 UTC
@Peter, do you understand what might be going on here? It looks like the master keeps dropping out and coming back. The initial SSL negotiation error looks like it may be happening during an upgrade, as various pods are bumped?

We are also seeing the issue on a Production cluster. We have 4000 projects running and I think that is causing the issue. Is there any way (preferably a ConfigMap or environment variable) to increase the value of elasticsearch.requestTimeout: 30000 from /etc/kibana/kibana.yml?

KIBANA LOG:

{"type":"response","@timestamp":"2017-08-14T16:59:37+00:00","tags":[],"pid":1,"method":"post","statusCode":200,"req":{"url":"/elasticsearch/_mget?timeout=0&ignore_unavailable=true&preference=1502729976748","method":"post","headers":{"forwarded":"for=171.206.1.139;host=kibana.useast20.app.example.com;proto=https","x-forwarded-proto":"https","x-forwarded-port":"443","x-forwarded-host":"kibana.useast20.app.example.com","cache-control":"no-cache","content-length":"62","host":"kibana.useast20.app.example.com","user-agent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; InfoPath.3)","accept-encoding":"gzip, deflate","accept-language":"en-US","referer":"https://kibana.useast20.app.example.com/app/kibana","kbn-version":"4.5.4","accept":"application/json, text/plain, */*","content-type":"application/json;charset=utf-8","connection":"close","x-proxy-remote-user":"ZK566H0"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1","referer":"https://kibana.useast20.app.example.com/app/kibana"},"res":{"statusCode":200,"responseTime":30005,"contentLength":9},"message":"POST /elasticsearch/_mget?timeout=0&ignore_unavailable=true&preference=1502729976748 200 30005ms - 9.0B"}

{"type":"log","@timestamp":"2017-08-14T17:01:03+00:00","tags":["status","plugin:elasticsearch","error"],"pid":1,"name":"plugin:elasticsearch","state":"red","message":"Status changed from green to red - Request Timeout after 30000ms","prevState":"green","prevMsg":"Kibana index ready"}

{"type":"log","@timestamp":"2017-08-14T17:01:07+00:00","tags":["status","plugin:elasticsearch","info"],"pid":1,"name":"plugin:elasticsearch","state":"green","message":"Status changed from red to green - Kibana index ready","prevState":"red","prevMsg":"Request Timeout after 30000ms"}

ES CLUSTER status:

[root@lltwe5lgpisd1 ~]# oc rsh logging-es-032jt0xp-2-0jwdf
sh-4.2$ curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "logging-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 67,
  "active_shards" : 201,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
sh-4.2$ curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
    67        6.7gb    27.1gb    366.4gb    393.5gb            6 10.128.4.13   10.128.4.13   Magician
    67        6.7gb    27.1gb    366.4gb    393.5gb            6 10.128.3.147  10.128.3.147  Turac
    67        6.7gb      27gb    366.4gb    393.5gb            6 10.130.3.156  10.130.3.156  Volcana
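Given the "trouble keeping master" symptom above, a quick way to confirm whether the elected master is actually changing is to poll the read-only _cat endpoints from inside an ES pod. This is only a diagnostic sketch, assuming the same admin cert/key paths used in the session above:

    # Run from inside any ES pod (oc rsh <es-pod>); both queries are read-only.
    # Which node currently holds the elected master role:
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/master?v'
    # All nodes with heap usage and load, to spot a node dropping out:
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/nodes?v&h=name,master,heap.percent,load'

Running the first query several times in a row should keep returning the same node; if it does not, the master really is flapping rather than Kibana simply timing out.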
If you want to try experimenting with different values in kibana.yml:

- copy the existing config file out of the pod:
  oc exec $kibana_pod -c kibana -- cat /opt/kibana/config/kibana.yml > kibana.yml.orig
- copy it:
  cp kibana.yml.orig kibana.yml
- create a configmap from this:
  oc create configmap logging-kibana --from-file=kibana.yml=kibana.yml
- edit the configmap as needed:
  oc edit configmap logging-kibana
- edit the kibana dc to add a volume and volume mount for the config:
  oc edit dc logging-kibana
  In the section for the container named kibana, in the volumeMounts section, add something like this:
    - mountPath: /opt/kibana/config
      name: kibanaconfig
      readOnly: true
  In the volumes section, add something like this:
    - name: kibanaconfig
      configMap:
        defaultMode: 420
        name: logging-kibana
- you may have to redeploy kibana:
  oc rollout latest dc/logging-kibana
  oc rollout status -w dc/logging-kibana
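As an illustration of the kind of change the steps above enable, here is a minimal sketch of raising the Kibana request timeout in the copied kibana.yml before creating the configmap. The elasticsearch.requestTimeout key is the one quoted earlier in this bug; the 60000 value is only an example, not a recommended setting:

    # Assumes kibana.yml has been copied out of the pod as in the steps above.
    # Raise the Kibana -> Elasticsearch request timeout (milliseconds):
    if grep -q '^elasticsearch.requestTimeout:' kibana.yml; then
      sed -i 's/^elasticsearch\.requestTimeout:.*/elasticsearch.requestTimeout: 60000/' kibana.yml
    else
      echo 'elasticsearch.requestTimeout: 60000' >> kibana.yml
    fi
    # Then create the configmap, mount it, and redeploy exactly as described above.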
(In reply to Shah Zobair from comment #7)
> We are also seeing the issue on a Production cluster. We have 4000 projects
> running and I think it is causing the issue.
> .
> .
> .
> "number_of_nodes" : 3,
> "number_of_data_nodes" : 3,
> "active_primary_shards" : 67,
> "active_shards" : 201,

This says you only have 67 indices (each with 2 replicas, for a total of 201 shards). If you had even 3,000 active projects one would expect thousands of primary shards; with logs from 4,000 active projects, you should have roughly 4,000 primary shards and 12,000 active shards.

This is likely another problem. What do the deployment configurations for Elasticsearch look like?

(In reply to Peter Portante from comment #9)
> (In reply to Shah Zobair from comment #7)
> > We are also seeing the issue on a Production cluster. We have 4000 projects
> > running and I think it is causing the issue.
> > .
> > .
> > .
> > "number_of_nodes" : 3,
> > "number_of_data_nodes" : 3,
> > "active_primary_shards" : 67,
> > "active_shards" : 201,
>
> This says you only have 67 indices (each with 2 replicas, for a total of 201
> shards). If you had even 3,000 active projects one would expect thousands of
> primary shards; with logs from 4,000 active projects, you should have roughly
> 4,000 primary shards and 12,000 active shards.
>
> This is likely another problem. What do the deployment configurations for
> Elasticsearch look like?

Actually, I cleared the ES storage (shards) and spun up a new logging cluster with updated images after getting this error. I thought it might solve the issue, and that is why we have so few shards now.

apiVersion: v1
kind: DeploymentConfig
metadata:
  creationTimestamp: 2017-08-10T23:06:17Z
  generation: 8
  labels:
    component: es
    logging-infra: elasticsearch
    provider: openshift
  name: logging-es-032jt0xp
  namespace: logging
  resourceVersion: "36078401"
  selfLink: /oapi/v1/namespaces/logging/deploymentconfigs/logging-es-032jt0xp
  uid: 83dbb0db-7e20-11e7-ac06-1c98ec21ea40
spec:
  replicas: 1
  selector:
    component: es
    deployment: logging-es-032jt0xp
    provider: openshift
  strategy:
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: es
        deployment: logging-es-032jt0xp
        provider: openshift
    spec:
      containers:
      - env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: KUBERNETES_TRUST_CERT
          value: "true"
        - name: SERVICE_DNS
          value: logging-es-cluster
        - name: CLUSTER_NAME
          value: logging-es
        - name: INSTANCE_RAM
          value: 4G
        - name: NODE_QUORUM
          value: "2"
        - name: RECOVER_AFTER_NODES
          value: "2"
        - name: RECOVER_EXPECTED_NODES
          value: "3"
        - name: RECOVER_AFTER_TIME
          value: 5m
        image: docker-registry-lab.sdi.corp.example.com:6555/openshift3/logging-elasticsearch:3.4.1
        imagePullPolicy: Always
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: restapi
          protocol: TCP
        - containerPort: 9300
          name: cluster
          protocol: TCP
        resources:
          limits:
            memory: 4Gi
          requests:
            memory: 512Mi
        terminationMessagePath: /dev/termination-log
        volumeMounts:
        - mountPath: /etc/elasticsearch/secret
          name: elasticsearch
          readOnly: true
        - mountPath: /usr/share/java/elasticsearch/config
          name: elasticsearch-config
          readOnly: true
        - mountPath: /elasticsearch/persistent
          name: elasticsearch-storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext:
        supplementalGroups:
        - 65534
      serviceAccount: aggregated-logging-elasticsearch
      serviceAccountName: aggregated-logging-elasticsearch
      terminationGracePeriodSeconds: 600
      volumes:
      - name: elasticsearch
        secret:
          defaultMode: 420
          secretName: logging-elasticsearch
      - configMap:
          defaultMode: 420
          name: logging-elasticsearch
        name: elasticsearch-config
      - name: elasticsearch-storage
        persistentVolumeClaim:
          claimName: logging-es-3
  test: false
  triggers: []
status:
  conditions:
  - lastTransitionTime: 2017-08-10T23:06:18Z
    message: Deployment config does not have minimum availability.
    status: "False"
    type: Available
  - lastTransitionTime: 2017-08-10T23:54:14Z
    message: Replication controller "logging-es-032jt0xp-2" has completed progressing
    reason: NewReplicationControllerAvailable
    status: "True"
    type: Progressing
  details:
    causes:
    - type: Manual
    message: manual change
  latestVersion: 2
  observedGeneration: 8
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
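To check the shard arithmetic above directly (67 primary shards with 2 replicas each gives 67 * 3 = 201 shards), the index and shard counts can be pulled from inside an ES pod and compared with the number of projects that should be logging. A sketch, reusing the same admin cert as earlier:

    # Number of indices (one primary shard per index in this setup):
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/indices?h=index' | wc -l
    # Total number of shards (primaries plus replicas):
    curl -s -k --cert /etc/elasticsearch/secret/admin-cert \
      --key /etc/elasticsearch/secret/admin-key \
      'https://localhost:9200/_cat/shards?h=index' | wc -l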
There might be a limitation on the number of projects in the cluster. We have tested it in our LAB and found that if the number of projects is lower than 1800, Kibana works fine. Do we have any sizing matrix for this?

Shah,

1. Are you trying to use Kibana as an admin or a regular user?
2. Can you describe the type of query you are trying to perform?

I am trying to figure out if the issue you are experiencing is related to what is available to you in Kibana or some limitation in Elasticsearch. We are trying to improve performance of the logging stack, and this could be related to any number of issues: memory, cpu, thread usage, etc.

(In reply to Jeff Cantrill from comment #12)
> Shah,
>
> 1. Are you trying to use Kibana as an admin or a regular user?
> 2. Can you describe the type of query you are trying to perform?
>
> I am trying to figure out if the issue you are experiencing is related to
> what is available to you in Kibana or some limitation in Elasticsearch. We
> are trying to improve performance of the logging stack, and this could be
> related to any number of issues: memory, cpu, thread usage, etc.

Jeff,

I tried to access Kibana as both an admin and a regular user. I also limited the number of projects for the admin, but that did not solve it. As soon as we deleted projects down to 1800, it worked just fine. I don't think it is a CPU or memory issue, as the cluster is on bare metal with a large number of cores and plenty of memory. I have assigned 8GB for the ES pods. There might be some mapping limitation in ES such that it cannot work with a large number of projects (just a thought).

Peter,

In IRC you suggested this could be related to disk IO, or the lack thereof.

(In reply to Jeff Cantrill from comment #14)
> Peter,
>
> In IRC you suggested this could be related to disk IO, or the lack thereof.

Possible, or the nodes where ES runs are over-committed and OOM kills keep happening. Can you reproduce the issue and provide us the output of https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh so we can get a snapshot of what's going on?

Customer closed the case because they reinstalled the logging stack. We now do not have a reproducer. If you wish, we can close as INSUFFICIENT_DATA.
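For completeness, a minimal sketch of collecting that snapshot with the dump script referenced above. The raw URL is derived from the GitHub link; the script's exact options are not documented in this bug, so check its own usage output before relying on it:

    # Fetch and run the logging dump script while logged in with cluster-admin rights.
    # (Assumes the raw URL matches the GitHub link above; verify the script's
    # supported options from its usage/help output first.)
    curl -sLO https://raw.githubusercontent.com/openshift/origin-aggregated-logging/master/hack/logging-dump.sh
    chmod +x logging-dump.sh
    ./logging-dump.sh > logging-dump.out 2>&1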