Bug 1654378 - Slow network response causing failures with cluster-logging
Summary: Slow network response causing failures with cluster-logging
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-28 16:01 UTC by Jeff Cantrill
Modified: 2023-12-15 16:14 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-21 14:38:20 UTC
Target Upstream Version:
Embargoed:


Attachments
sdn pod logs for infra node (16.80 MB, text/plain), 2018-11-28 19:07 UTC, Jeff Cantrill

Description Jeff Cantrill 2018-11-28 16:01:31 UTC
Description of problem:

cluster-logging Elasticsearch pods are failing to form a cluster in a timely fashion, presumably because of slow or unresponsive unicast pings. Seeing similar behavior in a number of OpenShift Online environments:

starter-us-west-1
starter-us-west-2
starter-us-east-2a


Version-Release number of selected component (if applicable):
v3.11.44


Actual results:
Unicast ping host resolution times out after 5s

Expected results:
Unicast ping host resolution completes in less than 1s


Additional info:

Seeing messages like:
[2018-11-28T15:37:24,237][WARN ][o.e.d.z.UnicastZenPing   ] [logging-es-data-master-ml8zmsl7] timed out after [5s] resolving host [logging-es-cluster]

and 

[2018-11-28T15:46:16,885][WARN ][o.e.d.z.PublishClusterStateAction] [logging-es-data-master-ml8zmsl7] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{logging-es-data-master-zolfd11v}{7bYgDC3eTLCFGpqmt-w3ow}{UPTP1EvIR8qs4L_IZ7k43Q}{10.129.0.230}{10.129.0.230:9300}])
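
For reference, the resolution delay can be timed directly from inside one of the ES pods with a sketch like the following. The namespace (openshift-logging; older installs may use logging) and the container name are assumptions, the pod name is taken from the log lines above, and getent is assumed to be present in the image:

  # Time resolution of the headless service name that UnicastZenPing is resolving.
  oc -n openshift-logging exec logging-es-data-master-ml8zmsl7 -c elasticsearch -- \
      bash -c 'time getent hosts logging-es-cluster'

Anything approaching the 5s mark would line up with the UnicastZenPing timeout above.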

Comment 1 Jeff Cantrill 2018-11-28 18:37:20 UTC
Also seems to affect:

starter-us-east-1b

Comment 2 Jeff Cantrill 2018-11-28 18:58:21 UTC
More logs from starter-us-east-1a:

[2018-11-27T00:00:12,370][WARN ][o.e.c.NodeConnectionsService] [logging-es-data-master-w04c4fyr] failed to connect to node {logging-es-data-master-t05uaomz}{u8Mg3qgqRfqMC5ZTvTx-hQ}{HdqoiYdBTAGfaxrXv99JgQ}{10.131.0.178}{10.131.0.178:9300} (tried [829] times)
org.elasticsearch.transport.ConnectTransportException: [logging-es-data-master-t05uaomz][10.131.0.178:9300] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:363) ~[?:?]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:570) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:342) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:183) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: 10.131.0.178/10.131.0.178:9300
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	... 1 more
Caused by: java.net.NoRouteToHostException: No route to host
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	... 1 more
[2018-11-27T00:00:13,397][WARN ][o.e.d.z.ZenDiscovery     ] [logging-es-data-master-w04c4fyr] not enough master nodes discovered during pinging (found [[Candidate{node={logging-es-data-master-w04c4fyr}{9jq_bSL9RIanE-15gmEj9Q}{I98P4Ml3QyS_TmkQ_dNLXw}{10.129.4.194}{10.129.4.194:9300}, clusterStateVersion=28785}]], but needed [2]), pinging again

Comment 3 Jeff Cantrill 2018-11-28 19:07:52 UTC
Created attachment 1509616 [details]
sdn pod logs for infra node

The SDN pod logs for the infra node on which one of the logging pods is running. I see messages about deleting endpoints for the 'logging-es' service, but I don't know their significance, or whether their timing corresponds to the ES pods going up and down.
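
A rough way to line the SDN endpoint events up with the ES pod restarts, assuming the 3.11 default namespaces and with the node and pod names left as placeholders:

  # Find the sdn pod running on the infra node, then pull its endpoint events.
  oc -n openshift-sdn get pods -o wide | grep <infra-node>
  oc -n openshift-sdn logs <sdn-pod> | grep -i 'logging-es'
  # Compare against when the ES pods last (re)started.
  oc -n openshift-logging get pods -o wide --sort-by=.status.startTime | grep logging-es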

Comment 4 Dan Winship 2018-11-29 14:17:06 UTC
(In reply to Jeff Cantrill from comment #0)
> Seeing messages like:
> [2018-11-28T15:37:24,237][WARN ][o.e.d.z.UnicastZenPing   ]
> [logging-es-data-master-ml8zmsl7] timed out after [5s] resolving host
> [logging-es-cluster]

"resolving host" sounds like this is a DNS problem, not a transport problem?

> Caused by: java.net.NoRouteToHostException: No route to host

In general, "no route to host" when connecting to a service IP means that there are no pods backing that service currently. Clayton and dcbw are debugging CI flakes involving seemingly-spurious "no route to host" but I don't know the current details there or if this is at all related.

> I see messages regarding deleting endpoints for the 'logging-es' service. I
> don't know the significance or timing here of whether it occurs when the ES
> pods are going up and down.

That's what it looks like. It never deletes *all* of the IPs.
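
A quick way to confirm whether the service really had no backing endpoints at the moment of a failure would be something along these lines (namespace assumed; object names per the 3.11 logging stack):

  # Does the service currently have endpoints, and do they match the running ES pods?
  oc -n openshift-logging get endpoints logging-es logging-es-cluster -o wide
  oc -n openshift-logging get pods -o wide | grep logging-es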

Comment 5 Jeff Cantrill 2018-11-29 14:30:18 UTC
(In reply to Dan Winship from comment #4)

> 
> "resolving host" sounds like this is a DNS problem, not a transport problem?
> 
> > Caused by: java.net.NoRouteToHostException: No route to host
> 
> In general, "no route to host" when connecting to a service IP means that
> there are no pods backing that service currently. Clayton and dcbw are
> debugging CI flakes involving seemingly-spurious "no route to host" but I
> don't know the current details there or if this is at all related.

This particular service is defined as headless with `publishNotReadyAddresses: true`, so the endpoints are published regardless of the readiness probe. I presume the DNS issue here is that the pod is initially unable to get the list of IPs to use for subsequent requests. The pods are up; they are just not able to see one another.
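
For reference, that setup can be double-checked with something like the following (namespace assumed; on 3.11 the same behavior may still be expressed via the older tolerate-unready-endpoints annotation):

  # A headless service shows clusterIP: None; publishNotReadyAddresses (or the older
  # service.alpha.kubernetes.io/tolerate-unready-endpoints annotation) keeps unready pods listed.
  oc -n openshift-logging get svc logging-es-cluster -o yaml | \
      grep -iE 'clusterIP|publishNotReadyAddresses|tolerate-unready'
  # How many addresses are currently being published for the peers to resolve?
  oc -n openshift-logging get endpoints logging-es-cluster -o yaml | grep -c 'ip:'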

> 
> > I see messages regarding deleting endpoints for the 'logging-es' service. I
> > don't know the significance or timing here of whether it occurs when the ES
> > pods are going up and down.
> 
> That's what it looks like. It never deletes *all* of the IPs.

Is there more I can do to help determine what's going on? I see no issues in engint, but I do see the same issue in the other environments.

Comment 6 Casey Callendrello 2018-12-05 14:22:23 UTC
From what you're describing, it sounds like the endpoints are failing health checks and are being removed from the service. Can you capture an "iptables-save" on the node of the client pod when this occurs?
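
One way to collect that, with the pod name taken from the logs above and the namespace assumed (on 3.11, ssh access to the node is assumed here):

  # Find the node the client pod is scheduled on, then dump iptables there.
  NODE=$(oc -n openshift-logging get pod logging-es-data-master-w04c4fyr \
      -o jsonpath='{.spec.nodeName}')
  ssh "$NODE" 'iptables-save' > "iptables-save.${NODE}.txt"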

Comment 7 Jeff Cantrill 2019-02-11 14:27:14 UTC
Closing since, to my knowledge, there is not currently any issue. I will reopen if needed.

Comment 13 Ben Bennett 2020-09-18 13:46:14 UTC
Can you please open a new bug?  The symptoms may be the same, but the root cause may be different given that the previous bug was closed over a year ago, and mixing the two just leads to confusion.  It does make sense to provide a link to the previous bug that you found though.

We will need much more information than what was provided in comment 8.

The service's endpoints being added and removed may just mean that the pod started and then terminated (when it could not find peers). We would need to know which SDN plugin they are using (ovs-multitenant or ovs-networkpolicy?), and once we have that there will be follow-up questions.

Can they oc exec into the pod and reach the peers?  What do the endpoints for the service look like (oc get endpoints)?
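
A sketch of those checks, with the namespace, pod name, and container name treated as assumptions (the peer IP is simply the one from the stack trace above):

  # Which SDN plugin is in use (ovs-multitenant vs ovs-networkpolicy)?
  oc get clusternetwork default -o yaml | grep pluginName
  # Current endpoints for the logging services.
  oc -n openshift-logging get endpoints logging-es logging-es-cluster
  # TCP reachability of a peer's transport port from inside an ES pod.
  oc -n openshift-logging exec <es-pod> -c elasticsearch -- \
      timeout 5 bash -c 'echo > /dev/tcp/10.131.0.178/9300'; echo "transport port exit code: $?"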

Comment 15 Jeff Cantrill 2020-09-18 14:11:11 UTC
Can we get a snapshot of the logging environment [1]? That should give us the ES logs.

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/hack/logging-dump.sh
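
Roughly, the dump would be collected along these lines (exact options and output are documented in the script itself; cluster-admin access is assumed):

  git clone -b release-3.11 https://github.com/openshift/origin-aggregated-logging.git
  cd origin-aggregated-logging
  ./hack/logging-dump.sh    # collects logging component information, including the ES logs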

