Bug 1396019 - ES is in red status with NPEs after scale up
Summary: ES is in red status with NPEs after scale up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: ewolinet
QA Contact: Xia Zhao
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks:
 
Reported: 2016-11-17 09:20 UTC by Xia Zhao
Modified: 2017-03-08 18:43 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 12:55:25 UTC
Target Upstream Version:
Embargoed:


Attachments
es log of pod #1 (29.27 KB, text/plain) - 2016-11-17 09:20 UTC, Xia Zhao
es log of pod #2 (6.26 MB, text/plain) - 2016-11-17 09:21 UTC, Xia Zhao
fluentd log (15.61 KB, text/plain) - 2016-11-17 09:23 UTC, Xia Zhao
kibana log of pod #1 (136.76 KB, text/plain) - 2016-11-17 09:23 UTC, Xia Zhao
kibana log of pod #2 (158.57 KB, text/plain) - 2016-11-17 09:24 UTC, Xia Zhao


Links
Red Hat Product Errata RHBA-2017:0066 (SHIPPED_LIVE) - Red Hat OpenShift Container Platform 3.4 RPM Release Advisory - Last Updated 2017-01-18 17:23:26 UTC

Description Xia Zhao 2016-11-17 09:20:28 UTC
Created attachment 1221509 [details]
es log of pod #1

Description of problem:
Scale up ES and Kibana to replicas=2 in their DCs; after all pods are running, the Kibana UI reports that ES is in red status.

Many NPEs are encountered in the ES logs:
[2016-11-17 03:34:17,121][INFO ][cluster.routing.allocation] [Alex] Cluster health status changed from [GREEN] to [RED] (reason: [nodes joined]).
[2016-11-17 03:34:17,125][INFO ][cluster.service          ] [Alex] new_master {Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{logging-es-cluster/172.30.72.149:9300}{master=true}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-11-17 03:34:17,653][INFO ][cluster.metadata         ] [Alex] [.searchguard.logging-es-1d72mift-1-n2g61] creating index, cause [api], templates [], shards [1]/[0], mappings []
....
[2016-11-17 03:34:20,197][WARN ][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Alex] Trying to seed ACL when ES has not not yet started: 'None of the configured nodes were available: [{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{localhost/127.0.0.1:9300}{master=true}]'
....
[2016-11-17 03:48:36,704][WARN ][rest.suppressed          ] path: /_cluster/health/.kibana, params: {index=.kibana, timeout=5s}
java.lang.NullPointerException
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.<init>(TransportMasterNodeAction.java:129)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:107)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:51)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:172)
    at io.fabric8.elasticsearch.plugin.KibanaUserReindexAction.apply(KibanaUserReindexAction.java:81)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170)
    at com.floragunn.searchguard.filter.SearchGuardFilter.apply(SearchGuardFilter.java:160)
 ...

Version-Release number of selected component (if applicable):
ops registry:
openshift3/logging-deployer    84594e8a4a8d
openshift3/logging-auth-proxy    ec334b0c2669
openshift3/logging-fluentd    125ed17f96bc
openshift3/logging-elasticsearch    9b9452c0f8c2
openshift3/logging-kibana    7fc9916eea4d
openshift3/logging-curator    9af78fc06248

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 3.4.0 and wait until all pods are started and Kibana is accessible with log entries.
2. Scale up ES and Kibana to replicas=2:
oc scale dc $es-dc --replicas=2
oc scale dc $kibana-dc --replicas=2
3.Wait until all pods are running
$ oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-1-dvc5q       1/1       Running     0          44m
logging-deployer-d8sqq        0/1       Completed   0          46m
logging-es-1d72mift-1-n2g61   1/1       Running     0          23m
logging-es-1d72mift-1-xf345   1/1       Running     0          44m
logging-fluentd-xaa77         1/1       Running     0          41m
logging-kibana-1-8i8oq        2/2       Running     0          23m
logging-kibana-1-y98bw        2/2       Running     0          44m
4. Visit kibana route

Actual results:
5. Kibana is not accessible since ES is in red status; NPEs are found in the ES logs.

Expected results:
5. Kibana and ES should work normally after the scale-up.

Additional info:
1. screenshot attached
2. ES / Kibana logs attached
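
For reference, the cluster status that Kibana complains about can also be read directly from ES. This is a hedged sketch: the certificate paths assume the standard logging-es secret mount, and $ES_POD is a placeholder for one of the logging-es pod names.

oc exec $ES_POD -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cluster/health?pretty"
# "status" : "red" here corresponds to what the Kibana UI reports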

Comment 1 Xia Zhao 2016-11-17 09:21:21 UTC
Created attachment 1221510 [details]
es log of pod #2

Comment 2 Xia Zhao 2016-11-17 09:23:05 UTC
Created attachment 1221511 [details]
fluentd log

Comment 3 Xia Zhao 2016-11-17 09:23:37 UTC
Created attachment 1221512 [details]
kibana log of pod #1

Comment 4 Xia Zhao 2016-11-17 09:24:07 UTC
Created attachment 1221515 [details]
kibana log of pod #2

Comment 5 Xia Zhao 2016-11-17 09:26:47 UTC
One more question about fluentd:

I have been continuously seeing the output below during tests these days; do the developers have any thoughts on what the underlying issue might be?

2016-11-17 03:45:42 -0500 [warn]: dump an error event: error_class=NoMethodError error="undefined method `empty?' for 200:Fixnum" tag="kubernetes.var.log.containers.logging-kibana-1-y98bw_logging_kibana-2090c31cdc5df972bfacd0fe87b3f84a5c3da206fd7a3f0ecb24b5b5efaab70f.log" time=1479390341 record={"type"=>"response", "@timestamp"=>"2016-11-17T08:45:41+00:00", "method"=>"get", "statusCode"=>200, "message"=>"GET /bundles/src/ui/public/images/elk.ico 200 11ms - 9.0B", "time"=>"2016-11-17T08:45:41.998080798Z", "docker"=>{"container_id"=>"2090c31cdc5df972bfacd0fe87b3f84a5c3da206fd7a3f0ecb24b5b5efaab70f"}, "kubernetes"=>{"namespace_name"=>"logging", "pod_id"=>"955020bc-ac9d-11e6-bc94-0eb7c647230a", "pod_name"=>"logging-kibana-1-y98bw", "labels"=>{"app"=>"logging-kibana-template", "component"=>"kibana", "deployment"=>"logging-kibana-1", "deploymentconfig"=>"logging-kibana", "provider"=>"openshift"}, "host"=>"ip-172-18-12-46.ec2.internal", "namespace_id"=>"a28fbb8a-ac9a-11e6-b5ab-0eb7c647230a", "container_name"=>"kibana"}, "hostname"=>"ip-172-18-12-46.ec2.internal", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"10.129.0.19", "ipaddr6"=>"fe80::f410:f4ff:fe0d:ed58", "inputname"=>"fluent-plugin-in_tail", "name"=>"fluentd openshift", "received_at"=>"2016-11-17T08:45:41.998080798Z", "version"=>"0.12.29 1.4.0"}}}

Comment 7 ewolinet 2016-11-17 14:34:27 UTC
It looks like ES rejected the cluster join because the second ES node appeared at the same address as the first; upstream may have changed the code to assume that a join from the same address means the previous cluster node at that location had failed and restarted...

I'll look to see if there's something we can configure to allow this; currently it appears to be an ES bug/limitation.

As a side question, why did we scale up the number of replicas on the ES dc? To my knowledge that still is not the recommended method for changing the ES scale (I don't think it contributed to this error).

ES logs 1 - 
[WARN ][discovery.zen            ] [Psylocke] received join request from node [{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{172.30.72.149:9300}{master=true}], but found existing node {Psylocke}{LYbO3EsTSv6YD8u8I5MEog}{172.30.72.149}{logging-es-cluster/172.30.72.149:9300}{master=true} with same address, removing existing node
[INFO ][cluster.service          ] [Psylocke] removed {{Psylocke}{LYbO3EsTSv6YD8u8I5MEog}{172.30.72.149}{logging-es-cluster/172.30.72.149:9300}{master=true},}, added {{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{172.30.72.149:9300}{master=true},}, reason: zen-disco-join(join from node[{Alex}{iwHtrnYGTPuXheB4tiJn6Q}{172.30.72.149}{172.30.72.149:9300}{master=true}])


ES logs 2 - 
[INFO ][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Alex] master_left [null], reason [failed to perform initial connect [null]]
[ERROR][io.fabric8.elasticsearch.discovery.kubernetes.KubernetesDiscovery] [Alex] unexpected failure during [zen-disco-master_failed (null)]
java.lang.NullPointerException
	at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:615)
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45)
	at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:468)
	at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
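
To confirm the same-address theory, the transport address each node publishes can be inspected. This is a hedged diagnostic sketch (certificate paths assume the standard logging-es secret mount; $ES_POD is a placeholder pod name): if both nodes report the logging-es-cluster service IP as their publish_address, that matches the "found existing node ... with same address, removing existing node" warning above.

oc exec $ES_POD -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_nodes/transport?pretty"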

Comment 19 Jeff Cantrill 2016-11-18 14:47:07 UTC
Please verify how 'scaling' is being performed for Elasticsearch.  Per the description ("Scale up es and kibana to replicas=2 in dc.."), increasing the DC replica count is not a valid scenario for logging.  Additional ES nodes are added to the ES cluster by adding more DCs, as documented here: https://github.com/openshift/origin-aggregated-logging/tree/master/deployer#elasticsearch-1.
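
As a hedged illustration of the documented approach (the configmap key and label names below are assumptions based on the 3.4 deployer; the linked README is authoritative): the deployer creates one deployment config per ES cluster node, each kept at replicas=1, so growing the cluster means raising the cluster size and re-running the deployer rather than scaling an existing DC.

oc edit configmap logging-deployer      # raise es-cluster-size, e.g. from 1 to 2
# re-run the deployer per the linked README so it creates an additional logging-es-* DC
oc get dc -l component=es               # expect one DC per ES node, each at replicas=1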

Comment 23 ewolinet 2016-11-18 20:09:26 UTC
This latest deployer contains the config changes that are mentioned in comment 20:

12125414 buildContainer (noarch) completed successfully
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=525227
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:rhaos-3.4-rhel-7-docker-candidate-20161118145457
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.28-2
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4.0.28
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:v3.4
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-deployer:3.4.0

Comment 24 Peng Li 2016-11-21 06:03:33 UTC
Verified with the latest logging stack.

Verification steps:
1. Deploy the logging stack.
2. Scale the ES DC to 0:
#oc scale dc/logging-es-2m0n43nj --replicas=0

3. Scale the ES DC to 2:
#oc scale dc/logging-es-2m0n43nj --replicas=2

4. Check the Kibana UI.
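
As an additional hedged check (not part of the original steps; same curl flags and assumptions as the health-check sketch in the description), confirm that both ES pods joined a single cluster:

oc exec $ES_POD -- curl -s \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cat/nodes?v"
# expect one row per ES pod, and _cluster/health should no longer report red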


Version:
logging-fluentd            3.4.0               7b11a29c82c1        47 hours ago        
logging-deployer           3.4.0               f4395cebdb4a        2 days ago          
logging-elasticsearch      3.4.0               6716a0ad8b2b        3 days ago          
logging-auth-proxy         3.4.0               ec334b0c2669        10 days ago         
logging-kibana             3.4.0               7fc9916eea4d        2 weeks ago         
logging-curator            3.4.0               9af78fc06248        3 weeks ago

Comment 25 ewolinet 2016-12-12 15:51:03 UTC
Prerelease issue, no docs needed.

Comment 27 errata-xmlrpc 2017-01-18 12:55:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

