Bug 1420217 - Encounter ConnectException and some other exceptions in ES log after deploying logging 3.5.0 stacks
Summary: Encounter ConnectException and some other exceptions in ES log after deployin...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
: ---
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-02-08 08:37 UTC by Junqi Zhao
Modified: 2017-08-16 19:51 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
undefined
Clone Of:
Environment:
Last Closed: 2017-08-10 05:17:28 UTC
Target Upstream Version:


Attachments (Terms of Use)
ES_LOG_jsonfile.txt is es log upon json-file logging driver (56.29 KB, text/plain)
2017-02-08 08:37 UTC, Junqi Zhao
no flags Details
ES_LOG_journald.txt is es log upon journald logging driver (72.42 KB, text/plain)
2017-02-08 08:38 UTC, Junqi Zhao
no flags Details
ConnectException In ES Log (16.19 KB, text/plain)
2017-02-13 06:22 UTC, Junqi Zhao
no flags Details
es pod log (4.33 KB, text/plain)
2017-02-14 06:08 UTC, Junqi Zhao
no flags Details
es dc log (6.88 KB, text/plain)
2017-02-14 06:08 UTC, Junqi Zhao
no flags Details
es pod log (14.46 KB, text/plain)
2017-02-14 06:12 UTC, Junqi Zhao
no flags Details
es rc info - 20170223 (8.92 KB, text/plain)
2017-02-23 06:08 UTC, Junqi Zhao
no flags Details
es dc info - 20170223 (6.88 KB, text/plain)
2017-02-23 06:09 UTC, Junqi Zhao
no flags Details
es pod log - 20170223 (12.46 KB, text/plain)
2017-02-23 06:09 UTC, Junqi Zhao
no flags Details
elasticsearch.yml (3.12 KB, text/plain)
2017-02-24 07:45 UTC, Junqi Zhao
no flags Details
3.4.1_es_pod_info.txt (9.28 KB, text/plain)
2017-02-24 07:45 UTC, Junqi Zhao
no flags Details
3.5.0_es_pod_info.txt (19.57 KB, text/plain)
2017-02-24 07:46 UTC, Junqi Zhao
no flags Details
es pod log, still have java.net.ConnectException (28.23 KB, text/plain)
2017-03-08 06:59 UTC, Junqi Zhao
no flags Details
es pod log, java.net.ConnectException (29.11 KB, text/plain)
2017-03-20 06:06 UTC, Junqi Zhao
no flags Details
es pod log - 20170228 (31.37 KB, text/plain)
2017-03-28 01:46 UTC, Junqi Zhao
no flags Details
es pod log - 20170331 (903.20 KB, text/plain)
2017-03-31 01:09 UTC, Junqi Zhao
no flags Details
es-ops pod log - 20170331 (881.45 KB, text/plain)
2017-03-31 01:10 UTC, Junqi Zhao
no flags Details
es_log_latest_3.6 (19.09 KB, text/plain)
2017-07-13 02:45 UTC, Xia Zhao
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1716 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC

Description Junqi Zhao 2017-02-08 08:37:32 UTC
Created attachment 1248558 [details]
ES_LOG_jsonfile.txt is es log upon json-file logging driver

Description of problem:
1. use json-file as logging driver, and deployed logging with ansible, ConnectException, SSLException, OutOfMemoryError Exception in ES log.
See the attached ES_LOG_jsonfile.txt, first, ConnectException shows, a few minutes later,SSLException shows, the ssl connection closed, and last,
"java.lang.OutOfMemoryError: Java heap space" shows.

2. use journald as logging driver, and deployed logging with ansible, see the attached ES_LOG_Journald.txt, first, ConnectException shows, a few minutes later,CircuitBreakingException shows.


# oc get po
NAME                          READY     STATUS    RESTARTS   AGE
logging-curator-1-dslwb       1/1       Running   1          40m
logging-es-9nmvgerr-1-92ttn   1/1       Running   0          40m
logging-fluentd-qf0xc         1/1       Running   0          41m
logging-kibana-1-dmjsf        2/2       Running   0          40m


Version-Release number of selected component (if applicable):
image id:
openshift3/logging-elasticsearch    7605f043d232
openshift3/logging-kibana    e0ab09c2cbeb
openshift3/logging-fluentd    47057624ecab
openshift3/logging-auth-proxy    139f7943475e
openshift3/logging-curator    7f034fdf7702


# openshift version
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:
Always

Steps to Reproduce:
1. prepare the inventory file

[oo_first_master]
$master-public-dns ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="~/cfile/libra.pem" openshift_public_hostname=$master-public-dns

[oo_first_master:vars]
deployment_type=openshift-enterprise
openshift_release=v3.5.0
openshift_logging_install_logging=true

openshift_logging_kibana_hostname=kibana.$sub-domain
public_master_url=https://$master-public-dns:8443

openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=3.5.0

openshift_logging_namespace=juzhao
openshift_logging_fluentd_use_journal=false/true

2. Running the playbook from a control machine (my laptop) which is not oo_master:
git clone https://github.com/openshift/openshift-ansible
ansible-playbook -vvv -i ~/inventory   playbooks/common/openshift-cluster/openshift_logging.yml

Actual results:
ConnectException, SSLException, OutOfMemoryError Exception in ES log, see the attached ES_LOG_jsonfile.txt. 

ConnectException, SSLException, CircuitBreakingException in ES log, see the attached ES_LOG_Journald.txt.

Expected results:
No error in ES log

Additional info:
Attached ES log, for both json-file and journald logging driver

Comment 1 Junqi Zhao 2017-02-08 08:38:38 UTC
Created attachment 1248559 [details]
ES_LOG_journald.txt is es log upon journald logging driver

Comment 2 Jeff Cantrill 2017-02-08 15:22:52 UTC
Is this a typical size node you  normal test with?  The amount of allocated memory seems significantly small at 488m of max heap.

Comment 3 Junqi Zhao 2017-02-09 01:43:32 UTC
Tested on GCE, vm_type is n1-standard-2,memory is 7.5 GB, and this error don't happen with logging 3.4

Comment 6 Jeff Cantrill 2017-02-09 14:13:58 UTC
Closing as I believe this is related to lack of memory resources.

Comment 7 Junqi Zhao 2017-02-10 01:46:44 UTC
Jeff,

There are two scenarios in this defect.

1. use json-file as logging driver, and deployed logging with ansible, ConnectException, SSLException, OutOfMemoryError Exception in ES log.
first, ConnectException shows, a few minutes later,SSLException shows, 
the ssl connection closed, and last,
"java.lang.OutOfMemoryError: Java heap space" shows.

2. use journald as logging driver, and deployed logging with ansible, first, ConnectException shows, a few minutes later,CircuitBreakingException show


Based on your comments, I think you are talking about scenario 1, and machine memory is 7.5 GB, how can we increase heap size for ES, I always think the heap size is allocated by ES, so if customer come across this error, how to increase heap size by themselves.

for scenario 2, do we know why ConnectException and CircuitBreakingException happen?

Comment 8 Peter Portante 2017-02-10 02:54:34 UTC
In both ES log entries, at the beginning, ES emits a like saying how much HEAP is in use:

2017-02-08 06:31:12,271][INFO ][env] [Saint Elmo] heap size [471.7mb], compressed ordinary object pointers [true]

I see this in both cases.  Something is telling ES to use a tiny amount of memory.  In a 7.5 GB VM instance, I would think giving ES 3.5 GB of HEAP, maybe 4 GB of HEAP, would be okay, but not great.

If you can, run the VM with 16 GB of RAM at least so that it can use about 8 GB of HEAP for normal operation.

Watch for Field Data size growth and know that if you start seeing "monitor.jvm" log messages from Elasticsearch you are likely to be running out of HEAP soon.  Once you see Java OOM messages, usually Elasticsearch is toast.

Comment 9 Junqi Zhao 2017-02-10 06:42:53 UTC
Peter,

I tested on a 7.5 GB VM instance when this defect is filed, in both ES log entries, we can find:
 
ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms128M -Xmx488m'

If we want to increase the heap size, I think we can change -Xmx to a bigger number.

There are two questions I am confused:

1. Is -Xmx488m defined by the program? if so, we can not change this value if logging already deployed.

2. These error don't happen with logging 3.4, although -Xmx is a little bigger than logging 3.5

logging 3.4 ES log:
Comparing the specificed RAM to the maximum recommended for ElasticSearch...
Inspecting the maximum RAM available...
ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms128M -Xmx512m'
Checking if Elasticsearch is ready on https://localhost:9200 ...[2017-02-10 06:39:10,829][INFO ][node                     ] [Skull the Slayer] version[2.4.1], pid[1], build[945a6e0/2016-11-17T20:39:42Z]
[2017-02-10 06:39:10,940][INFO ][node                     ] [Skull the Slayer] initializing ...
...[2017-02-10 06:39:14,360][INFO ][plugins                  ] [Skull the Slayer] modules [reindex, lang-expression, lang-groovy], plugins [search-guard-ssl, openshift-elasticsearch, cloud-kubernetes, search-guard-2], sites []
.[2017-02-10 06:39:14,496][INFO ][env                      ] [Skull the Slayer] using [1] data paths, mounts [[/elasticsearch/persistent (/dev/mapper/rhel-root)]], net usable_space [7.8gb], net total_space [9.9gb], spins? [possibly], types [xfs]
[2017-02-10 06:39:14,496][INFO ][env                      ] [Skull the Slayer] heap size [495.3mb], compressed ordinary object pointers [true]

Comment 10 Jeff Cantrill 2017-02-10 17:52:14 UTC
Amount of memory is calculated based on: https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/run.sh#L27

Comment 11 Jeff Cantrill 2017-02-10 17:55:37 UTC
Feel free to reopen this issue if you feel its a problem.  The way memory is allocated has not changed between 3.4 and 3.5.  I think it would help to attach compariable DC, RC, or Pod JSON so we could see from what values it is making this calculation.  Since we moved to ansible in 3.4, maybe we are passing a value incorrectly.

Comment 12 Peter Portante 2017-02-10 18:30:32 UTC
So in the attached logs, I see the following: "Setting the maximum allowable RAM to 976m which is the largest amount available"

This is emitted at line: https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/run.sh#L51

Something has set the cgroup limit to 978MB, even though the configuration is probably for more.

Comment 13 Junqi Zhao 2017-02-13 06:18:22 UTC
due to the defect https://bugzilla.redhat.com/show_bug.cgi?id=1421563, only ConnectException shows in ES log now, same exception when this defect was field,
since ConnectException not fixed, I have to re-open this defect.


Error trace:
[2017-02-13 04:18:52,878][INFO ][transport                ] [Mister Machine] Using [com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] as transport, overridden by [search-guard-ssl]
[2017-02-13 04:18:52,994][INFO ][client.transport         ] [Mister Machine] failed to connect to node [{#transport#-1}{127.0.0.1}{127.0.0.1:9300}], removed from nodes list
ConnectTransportException[[][127.0.0.1:9300] connect_timeout[30s]]; nested: ConnectException[Connection refused: localhost/127.0.0.1:9300];
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:967)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:933)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:906)
	at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:267)
	at org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler.doSample(TransportClientNodesService.java:390)
	at org.elasticsearch.client.transport.TransportClientNodesService$NodeSampler.sample(TransportClientNodesService.java:336)
	at org.elasticsearch.client.transport.TransportClientNodesService.addTransportAddresses(TransportClientNodesService.java:187)
	at org.elasticsearch.client.transport.TransportClient.addTransportAddress(TransportClient.java:243)
	at io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter.<init>(DynamicACLFilter.java:166)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:50)
	at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
	at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:104)
	at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
	at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:886)
	at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
	at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
	at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:46)
	at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:201)
	at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
	at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:879)
	at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
	at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
	at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
	at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:93)
	at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
	at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:46)
	at org.elasticsearch.node.Node.<init>(Node.java:213)
	at org.elasticsearch.node.Node.<init>(Node.java:140)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:9300
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


fully ES log is attached.

Comment 14 Junqi Zhao 2017-02-13 06:22:06 UTC
Created attachment 1249753 [details]
ConnectException In ES Log

Comment 15 Rich Megginson 2017-02-13 16:43:32 UTC
I have seen connection issues like this when the cluster is coming up, or when one of the es pods is stopped while some of the other es nodes/pods in the cluster are still running.  I think this is a transient error, and should go away, and at any rate should not affect data delivery.

However, if you can correlate this error with missing data or other problems, that would be a cause for concern.

Comment 16 Peter Portante 2017-02-14 02:49:10 UTC
The log still shows that this instance is running ES with a Java heap set to about 500 MB.

That is too small to be of any use or supportable.  Can you reproduce this with a Java HEAP of at least 4 GB, perhaps even 8 GB?

Comment 17 Junqi Zhao 2017-02-14 06:07:45 UTC
@Peter,

Test on AWS with vm_type: m4.xlarge, memory is 16GB, ES log still shows the -Xmx512m.
checked ES dc, info:

        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 512Mi

from https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/run.sh#L24

the amount of RAM allocated should be half of available instance RAM.

I understand why the system set -Xmx to 512m, if we want to change the heap size, we should change the memory limits in ES dc and then scale it up.

since the defect https://bugzilla.redhat.com/show_bug.cgi?id=1421563, I will try to verify this defect later. And I want to emphasize here, this defect does not happen with logging 3.4, and we never changed the memory limits in ES dc.


Attached ES pod log and ES dc log

Comment 18 Junqi Zhao 2017-02-14 06:08:20 UTC
Created attachment 1250122 [details]
es pod log

Comment 19 Junqi Zhao 2017-02-14 06:08:53 UTC
Created attachment 1250123 [details]
es dc log

Comment 20 Junqi Zhao 2017-02-14 06:12:06 UTC
Created attachment 1250124 [details]
es pod log

Comment 21 Rich Megginson 2017-02-14 18:55:12 UTC
Not sure if ansible changed this or changes this, but here is the old deployer/templates/es.yaml dc template:

              resources:
                limits:
                  memory: "${INSTANCE_RAM}i"
                requests:
                  memory: "512Mi"

What is the INSTANCE_RAM env. var setting for the es pod?

oc env pod $espod --list

The INSTANCE_RAM setting comes from the ES_INSTANCE_RAM parameter in the old deployer.  The INSTANCE_RAM by default is 512G in the es Dockerfile.

current ansible code has this in es.j2

            limits:
              memory: "{{es_memory_limit}}"
{% if es_cpu_limit is defined and es_cpu_limit is not none %}
              cpu: "{{es_cpu_limit}}"
{% endif %}
            requests:
              memory: "512Mi"

with ansible this value comes from  the variable openshift_logging_es_memory_limit which has a default value of 1024Mi

So, to be compatible with 3.4, the new ansible code should set the default as 8Gi.

Comment 24 Jeff Cantrill 2017-02-16 14:45:32 UTC
Default 3.5 to be same as 3.4 https://github.com/openshift/openshift-ansible/pull/3381

Comment 25 openshift-github-bot 2017-02-16 16:58:36 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/de9b80326b8a9e2c600cdaa5d25fcebf38f840a3
bug 1420217. Default ES memory to be compariable to 3.4 deployer

https://github.com/openshift/openshift-ansible/commit/43330fbd7cb90491d63e54430394a309f7a41f73
Merge pull request #3381 from jcantrill/bz_1420217_default_es_memory

bug 1420217. Default ES memory to be compariable to 3.4 deployer

Comment 27 Jeff Cantrill 2017-02-20 20:57:06 UTC
Is this issue resolved or do you have additional logs you can provide?

Comment 28 Junqi Zhao 2017-02-21 01:06:33 UTC
This issue is not resolved, since the defect BZ #1420219, I don't have additional logs to provide now.

Comment 30 Jeff Cantrill 2017-02-22 19:19:02 UTC
Can you reproduce and provide additional logs or close this issue if it is resolved?

Comment 31 Junqi Zhao 2017-02-23 00:36:28 UTC
@Jeff

Besides ES pod log, what kind of other logs do you want me to provide, then I can provide all to you.

Comment 32 Peter Portante 2017-02-23 03:22:02 UTC
Have you reproduced this problem with at least 8 GB of instance RAM, or even 16 GB of instance RAM for both the ES pod and the ES ops pods?

It is not clear to me reading through all the comments that it was demonstrated how the problem was reproduced in those environments, where the logs from those reproducers were provided showing the same failure as originally reported.

Comment 33 Junqi Zhao 2017-02-23 06:06:58 UTC
@Jeff @Peter
Deployed logging stack via ansible with 8GB RAM machine, attached ES dc,rc, pod log, and I think the java.net.ConnectException does not have something to do with memory size.

ansible inventory file
[oo_first_master]
$master ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="./libra-new.pem" openshift_public_hostname=$master

[oo_first_master:vars]
deployment_type=openshift-enterprise
openshift_release=v3.5.0
openshift_logging_install_logging=true

openshift_logging_kibana_hostname=kibana.$sub-domain
openshift_logging_kibana_ops_hostname=kibana-ops.$sub-domain
public_master_url=https://$master:$port
openshift_logging_fluentd_hosts=$node
openshift_logging_fluentd_use_journal=true
openshift_logging_use_ops=false

openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=3.5.0

openshift_logging_namespace=$namespace


Please check the attached logs, if you want me to provide more info, please let me know

Comment 34 Junqi Zhao 2017-02-23 06:08:15 UTC
Created attachment 1256773 [details]
es rc info - 20170223

Comment 35 Junqi Zhao 2017-02-23 06:09:23 UTC
Created attachment 1256776 [details]
es dc info - 20170223

Comment 36 Junqi Zhao 2017-02-23 06:09:56 UTC
Created attachment 1256777 [details]
es pod log - 20170223

Comment 37 Peter Portante 2017-02-23 11:59:34 UTC
(In reply to Junqi Zhao from comment #33)
> @Jeff @Peter
> Deployed logging stack via ansible with 8GB RAM machine, attached ES dc,rc,
> pod log, and I think the java.net.ConnectException does not have something
> to do with memory size.
...
> Please check the attached logs, if you want me to provide more info, please
> let me know

Seems like it really is the case that port 9300 has nobody listening on it from address "localhost".  Can you capture the /etc/elasticsearch/elasticsearch.yml file that the ES instance is using?  And if possible, see what ports are being listen'd on in that container?

Comment 38 Junqi Zhao 2017-02-24 07:43:53 UTC
see attached elasticsearch.yml, and searched with `netstat -anp | grep 9300`, no return value, 9300 port was not used.

I also compared with 3.4.1 es pod log, it is ES_INTERNAL_IP:9300 there
# oc logs logging-es-xluc1lps-1-cjsqk
[2017-02-24 07:24:01,337][INFO ][cluster.service          ] [Micromax] new_master {Micromax}{Dt5C0vZ2QgmcgtmK5cWX8g}{10.2.2.44}{10.2.2.44:9300}{master=true}, reason: zen-disco-join(elected_as_master, [0] joins received)

# oc get po -o wide
NAME                              READY     STATUS      RESTARTS   AGE       IP          NODE
logging-es-xluc1lps-1-cjsqk       1/1       Running     0          15m       10.2.2.44   ip-172-18-9-248.ec2.internal

but in 3.5.0 log, it is 127.0.0.1:9300, I think this caused ConnectException.

[2017-02-24 06:09:36,525][INFO ][transport                ] [Madame Masque] Using [com.floragunn.searchguard.ssl.transport.SearchGuardSSLNettyTransport] as transport, overridden by [search-guard-ssl]
[2017-02-24 06:09:36,700][INFO ][client.transport         ] [Madame Masque] failed to connect to node [{#transport#-1}{127.0.0.1}{127.0.0.1:9300}], removed from nodes list

May this info could help you.

Comment 39 Junqi Zhao 2017-02-24 07:45:21 UTC
Created attachment 1257166 [details]
elasticsearch.yml

Comment 40 Junqi Zhao 2017-02-24 07:45:59 UTC
Created attachment 1257167 [details]
3.4.1_es_pod_info.txt

Comment 41 Junqi Zhao 2017-02-24 07:46:28 UTC
Created attachment 1257168 [details]
3.5.0_es_pod_info.txt

Comment 42 Luke Meyer 2017-02-27 20:03:27 UTC
Agree, the difference in logs would appear to indicate the 3.5 instance is trying to connect to the other ES instance as 127.0.0.1 instead of its kube pod IP. I wonder if our ES-kube clustering plugin needs a tweak?

Comment 44 Jeff Cantrill 2017-03-01 14:19:30 UTC
Fixed in https://github.com/openshift/origin-aggregated-logging/pull/340

Comment 45 Junqi Zhao 2017-03-03 06:14:03 UTC
@Jeff,

Tested again, ES still throws "java.net.ConnectException" error,
from your fix, OSE_ES_VER should be 2.4.4.2, but OSE_ES_VER is still 2.4.4.1
# oc rsh logging-es-r4dazx42-1-brfv7
sh-4.2$ env | grep VER
SG_VER=2.4.4.10
SG_SSL_VER=2.4.4.19
OSE_ES_VER=2.4.4.1
RECOVER_EXPECTED_NODES=1
ES_VER=2.4.4
RECOVER_AFTER_NODES=0
ES_CLOUD_K8S_VER=2.4.4
RECOVER_AFTER_TIME=5m
JAVA_VER=1.8.0

Should we need to build image?
# docker images|grep logging
openshift3/logging-curator         3.5.0               8cfcb23f26b6        2 days ago          211.1 MB
openshift3/logging-elasticsearch   3.5.0               d715f4d34ad4        3 weeks ago         399.2 MB
openshift3/logging-kibana          3.5.0               e0ab09c2cbeb        5 weeks ago         342.9 MB
openshift3/logging-fluentd         3.5.0               47057624ecab        5 weeks ago         233.1 MB
openshift3/logging-auth-proxy      3.5.0               139f7943475e        6 weeks ago         220 MB

Comment 46 Jeff Cantrill 2017-03-07 16:29:33 UTC
I'm working to move into OCP repos.  Already merged in origin

Comment 47 Jeff Cantrill 2017-03-07 19:06:11 UTC
12706216 buildContainer (noarch) completed successfully
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=542604
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:rhaos-3.5-rhel-7-docker-candidate-20170307134801
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.5.0-6
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.5.0
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:v3.5

Comment 48 Junqi Zhao 2017-03-08 06:56:29 UTC
@Jeff,

Tested with logging-elasticsearch:3.5.0-6, ES still throws out java.net.ConnectException, see the attached ES log

# oc rsh logging-es-8hkuwh00-1-372qg
sh-4.2$ env | grep VER
SG_VER=2.4.4.10
SG_SSL_VER=2.4.4.19
OSE_ES_VER=2.4.4.2
RECOVER_EXPECTED_NODES=1
ES_VER=2.4.4
RECOVER_AFTER_NODES=0
ES_CLOUD_K8S_VER=2.4.4
RECOVER_AFTER_TIME=5m
JAVA_VER=1.8.0

Comment 49 Junqi Zhao 2017-03-08 06:59:53 UTC
Created attachment 1261100 [details]
es pod log, still have java.net.ConnectException

Comment 50 Jeff Cantrill 2017-03-10 17:02:57 UTC
build in #47  did not include the plugin fix which was not tagged into the correct channels.  Awaiting build service to become available.

Comment 51 Jeff Cantrill 2017-03-13 15:02:02 UTC
Please try this one.

koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=543761
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:rhaos-3.5-rhel-7-docker-candidate-20170313104001
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.5.0-9
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:3.5.0
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-elasticsearch:v3.5

Comment 52 Junqi Zhao 2017-03-14 05:25:13 UTC
blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1431935, will verify it after BZ # 1431935 get fixed.

Comment 53 Junqi Zhao 2017-03-20 06:05:23 UTC
Verified with logging-elasticsearch:3.5.0-9, ES still throws out java.net.ConnectException, see the attached ES log

# oc rsh logging-es-h34lhvtg-1-0l536
sh-4.2$ env | grep VER
SG_VER=2.4.4.10
SG_SSL_VER=2.4.4.19
OSE_ES_VER=2.4.4.2
RECOVER_EXPECTED_NODES=1
ES_VER=2.4.4
RECOVER_AFTER_NODES=0
ES_CLOUD_K8S_VER=2.4.4
RECOVER_AFTER_TIME=5m
JAVA_VER=1.8.0

Comment 54 Junqi Zhao 2017-03-20 06:06:28 UTC
Created attachment 1264701 [details]
es pod log, java.net.ConnectException

Comment 61 Junqi Zhao 2017-03-28 01:46:10 UTC
The logging stacks are deployed in the environment mentioned in Comment 59, attached es pod logs, you can also check on these machines under project logging.

Comment 62 Junqi Zhao 2017-03-28 01:46:57 UTC
Created attachment 1266813 [details]
es pod log - 20170228

Comment 64 Junqi Zhao 2017-03-29 06:29:04 UTC
Yes, it is transient exception, it does not affect the whole logging function, but from user's perspective, it is not friendly

Comment 66 Jeff Cantrill 2017-03-29 19:42:31 UTC
Lowing the priority as per #64 the stack is functional and this does not appear to block anything.

Comment 67 Jeff Cantrill 2017-03-30 13:23:41 UTC
Junqi,

Can you edit the logging-es#logging.yml configmap to make the root logger 'DEBUG' and provide additional logs for further investigation.

Comment 68 Junqi Zhao 2017-03-31 01:08:59 UTC
log level is debug for es and es-ops, see the attached log files.

Comment 69 Junqi Zhao 2017-03-31 01:09:35 UTC
Created attachment 1267685 [details]
es pod log - 20170331

Comment 70 Junqi Zhao 2017-03-31 01:10:20 UTC
Created attachment 1267686 [details]
es-ops pod log - 20170331

Comment 74 Jeff Cantrill 2017-05-13 17:33:30 UTC
Fixed upstream with: https://github.com/fabric8io/openshift-elasticsearch-plugin/pull/77

Comment 75 openshift-github-bot 2017-06-13 04:11:08 UTC
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/e77c62bb3c0bfb046db504753c6b4d4c02ffbe85
bug 1420217. Update ES plugin that squashes stack on start

Comment 77 Xia Zhao 2017-07-12 05:41:31 UTC
blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1469918

Comment 78 Xia Zhao 2017-07-13 02:44:21 UTC
It's fixed with the latest v3.6 image v3.6.143-2. Uploaded the full_es_log_latest_3.6, no ConnectException or others observed, set to verified.

Images tested with:
logging-elasticsearch   v3.6.143-2          ca1c9074bf99        6 hours ago         404.7 MB

logging-elasticsearch   v3.6                ca1c9074bf99        6 hours ago         404.7 MB

Comment 79 Xia Zhao 2017-07-13 02:45:58 UTC
Created attachment 1297356 [details]
es_log_latest_3.6

Comment 81 errata-xmlrpc 2017-08-10 05:17:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716


Note You need to log in before you can comment on or make changes to this bug.