Created attachment 1221509 [details]
es log of pod #1

Description of problem:
Scale up es and kibana to replicas=2 in the dc. After all pods are running, the Kibana UI reports that ES is in red status. Lots of NPEs are seen in the es log:

[2016-11-17 03:34:17,121][INFO ][cluster.routing.allocation] [Alex] Cluster health status changed from [GREEN] to [RED] (reason: [nodes joined]).
[2016-11-17 03:34:17,125][INFO ][cluster.service          ] [Alex] new_master {Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{logging-es-cluster/172.30.72.149:9300}{master=true}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-11-17 03:34:17,653][INFO ][cluster.metadata         ] [Alex] [.searchguard.logging-es-1d72mift-1-n2g61] creating index, cause [api], templates [], shards [1]/[0], mappings []
....
[2016-11-17 03:34:20,197][WARN ][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Alex] Trying to seed ACL when ES has not not yet started: 'None of the configured nodes were available: [{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{localhost/127.0.0.1:9300}{master=true}]'
....
[2016-11-17 03:48:36,704][WARN ][rest.suppressed          ] path: /_cluster/health/.kibana, params: {index=.kibana, timeout=5s}
java.lang.NullPointerException
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.<init>(TransportMasterNodeAction.java:129)
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:107)
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:51)
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:172)
        at io.fabric8.elasticsearch.plugin.KibanaUserReindexAction.apply(KibanaUserReindexAction.java:81)
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170)
        at com.floragunn.searchguard.filter.SearchGuardFilter.apply(SearchGuardFilter.java:160)
        ...

Version-Release number of selected component (if applicable):
ops registry:
openshift3/logging-deployer        84594e8a4a8d
openshift3/logging-auth-proxy      ec334b0c2669
openshift3/logging-fluentd         125ed17f96bc
openshift3/logging-elasticsearch   9b9452c0f8c2
openshift3/logging-kibana          7fc9916eea4d
openshift3/logging-curator         9af78fc06248

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 3.4.0, wait until all pods are started and Kibana is accessible with log entries.
2. Scale up es and kibana to replicas=2 (see the note at the end of this comment for how $es-dc / $kibana-dc can be resolved):
   oc scale dc $es-dc --replicas=2
   oc scale dc $kibana-dc --replicas=2
3. Wait until all pods are running:
   $ oc get po
   NAME                          READY     STATUS      RESTARTS   AGE
   logging-curator-1-dvc5q       1/1       Running     0          44m
   logging-deployer-d8sqq        0/1       Completed   0          46m
   logging-es-1d72mift-1-n2g61   1/1       Running     0          23m
   logging-es-1d72mift-1-xf345   1/1       Running     0          44m
   logging-fluentd-xaa77         1/1       Running     0          41m
   logging-kibana-1-8i8oq        2/2       Running     0          23m
   logging-kibana-1-y98bw        2/2       Running     0          44m
4. Visit the kibana route.

Actual results:
Kibana is not accessible because ES is in red status; NPEs are found in the es log.

Expected results:
Kibana and ES should still work after the scale up.

Additional info:
1. screenshot attached
2. es / kibana logs attached
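Note on step 2: a sketch of how the dc names can be resolved before scaling. This assumes the default "logging" namespace and the standard component labels used by the logging templates; es_dc / kibana_dc are just placeholder variable names:

es_dc=$(oc get dc -l component=es -o name -n logging)
kibana_dc=$(oc get dc -l component=kibana -o name -n logging)
oc scale $es_dc --replicas=2 -n logging
oc scale $kibana_dc --replicas=2 -n logging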
Created attachment 1221510 [details] es log of pod #2
Created attachment 1221511 [details] fluentd log
Created attachment 1221512 [details] kibana log of pod #1
Created attachment 1221515 [details] kibana log of pod #2
One more question for fluentd: I continuously see the following output during my tests these days. Do the developers have any thoughts on what might be causing it?

2016-11-17 03:45:42 -0500 [warn]: dump an error event: error_class=NoMethodError error="undefined method `empty?' for 200:Fixnum" tag="kubernetes.var.log.containers.logging-kibana-1-y98bw_logging_kibana-2090c31cdc5df972bfacd0fe87b3f84a5c3da206fd7a3f0ecb24b5b5efaab70f.log" time=1479390341 record={"type"=>"response", "@timestamp"=>"2016-11-17T08:45:41+00:00", "method"=>"get", "statusCode"=>200, "message"=>"GET /bundles/src/ui/public/images/elk.ico 200 11ms - 9.0B", "time"=>"2016-11-17T08:45:41.998080798Z", "docker"=>{"container_id"=>"2090c31cdc5df972bfacd0fe87b3f84a5c3da206fd7a3f0ecb24b5b5efaab70f"}, "kubernetes"=>{"namespace_name"=>"logging", "pod_id"=>"955020bc-ac9d-11e6-bc94-0eb7c647230a", "pod_name"=>"logging-kibana-1-y98bw", "labels"=>{"app"=>"logging-kibana-template", "component"=>"kibana", "deployment"=>"logging-kibana-1", "deploymentconfig"=>"logging-kibana", "provider"=>"openshift"}, "host"=>"ip-172-18-12-46.ec2.internal", "namespace_id"=>"a28fbb8a-ac9a-11e6-b5ab-0eb7c647230a", "container_name"=>"kibana"}, "hostname"=>"ip-172-18-12-46.ec2.internal", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"10.129.0.19", "ipaddr6"=>"fe80::f410:f4ff:fe0d:ed58", "inputname"=>"fluent-plugin-in_tail", "name"=>"fluentd openshift", "received_at"=>"2016-11-17T08:45:41.998080798Z", "version"=>"0.12.29 1.4.0"}}}
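Not a fix, but a possible workaround sketch while this is investigated, assuming the error is triggered by the integer "statusCode" field in the Kibana access-log records (the match pattern is my own guess and the snippet is untested):

<filter kubernetes.var.log.containers.logging-kibana-**>
  @type record_transformer
  enable_ruby true
  <record>
    # coerce the integer status code to a string so a downstream .empty? check does not fail
    statusCode ${record["statusCode"].to_s}
  </record>
</filter>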
It looks like what happened was that ES rejected the cluster join because the second ES node presented the same address as the existing node; they may have updated their code to assume that a node showing up with the same address means the previous cluster node at that location failed and restarted... I'll look into whether there is something we can configure to allow this; currently it appears to be an ES bug/limitation.

As a side question, why did we scale up the number of replicas on the ES dc? To my knowledge that is still not the recommended method for changing the ES scale (I don't think it contributed to this error).

ES logs 1 -
[WARN ][discovery.zen            ] [Psylocke] received join request from node [{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{172.30.72.149:9300}{master=true}], but found existing node {Psylocke}{LYbO3EsTSv6YD8u8I5MEog}{172.30.72.149}{logging-es-cluster/172.30.72.149:9300}{master=true} with same address, removing existing node
[INFO ][cluster.service          ] [Psylocke] removed {{Psylocke}{LYbO3EsTSv6YD8u8I5MEog}{172.30.72.149}{logging-es-cluster/172.30.72.149:9300}{master=true},}, added {{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{172.30.72.149:9300}{master=true},}, reason: zen-disco-join(join from node[{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{172.30.72.149:9300}{master=true}])

ES logs 2 -
[INFO ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Alex] master_left [null], reason [failed to perform initial connect [null]]
[ERROR][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Alex] unexpected failure during [zen-disco-master_failed (null)]
java.lang.NullPointerException
        at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:615)
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
        at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
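One thing worth checking here, just a guess on my part: both the removed node and the joining node in "ES logs 1" publish 172.30.72.149, which looks like the logging-es-cluster service IP rather than a pod IP, so the two pods may be indistinguishable to zen discovery. Something like this would confirm it (commands assume the default "logging" project):

oc get svc logging-es-cluster -o wide -n logging     # is the service IP 172.30.72.149?
oc get pods -l component=es -o wide -n logging       # compare against the actual pod IPs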
Please verify how 'scaling' is being performed for Elasticsearch. Per the description ("Scale up es and kibana to replicas=2 in dc.."), increasing the DC replica count is not a valid scenario for logging. Additional ES nodes are added to the ES cluster by adding more DCs, as documented here: https://github.com/openshift/origin-aggregated-logging/tree/master/deployer#elasticsearch-1.
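For reference, a rough sketch of the direction the linked README describes; the configmap name and key below are assumptions on my part, the README is authoritative:

# adjust the desired ES cluster size in the deployer configmap,
# then re-run the deployer so it generates the additional ES dc
oc edit configmap logging-deployer -n logging     # e.g. es-cluster-size: "2"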
This latest deployer contains the config changes that are mentioned in comment 20.

12125414 buildContainer (noarch) completed successfully
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=525227
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:rhaos-3.4-rhel-7-docker-candidate-20161118145457
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.28-2
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.28
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:3.4.0
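For anyone picking this build up for testing, the image can be pulled directly from the candidate registry listed above, e.g.:

docker pull brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.28-2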
Verified with the latest logging stack, following the steps below:

1. Deploy the logging stack.
2. Scale the es dc down to 0:
   # oc scale dc/logging-es-2m0n43nj --replicas=0
3. Scale the es dc back up to 2:
   # oc scale dc/logging-es-2m0n43nj --replicas=2
4. Check the Kibana UI.

Version:
logging-fluentd         3.4.0   7b11a29c82c1   47 hours ago
logging-deployer        3.4.0   f4395cebdb4a   2 days ago
logging-elasticsearch   3.4.0   6716a0ad8b2b   3 days ago
logging-auth-proxy      3.4.0   ec334b0c2669   10 days ago
logging-kibana          3.4.0   7fc9916eea4d   2 weeks ago
logging-curator         3.4.0   9af78fc06248   3 weeks ago
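In addition to checking the Kibana UI, the ES cluster state can be confirmed directly from one of the ES pods. A sketch, assuming the standard admin secret mount paths inside the ES container:

es_pod=$(oc get pods -l component=es -o jsonpath='{.items[0].metadata.name}' -n logging)
oc exec $es_pod -n logging -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cluster/health?pretty"
# expect "status" : "green" and "number_of_nodes" : 2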
Prerelease issue, no docs needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0066