Description of problem:
cluster-logging pods are failing to form a cluster in a timely fashion, presumably because of slow or unresponsive unicast pinging. Seeing similar behavior in a number of online environments:

  starter-us-west-1
  starter-us-west-2
  starter-us-east-2a

Version-Release number of selected component (if applicable):
v3.11.44

Actual results:
Unicast ping results are 5s

Expected results:
Unicast ping results less than 1s

Additional info:
Seeing messages like:

[2018-11-28T15:37:24,237][WARN ][o.e.d.z.UnicastZenPing   ] [logging-es-data-master-ml8zmsl7] timed out after [5s] resolving host [logging-es-cluster]

and

[2018-11-28T15:46:16,885][WARN ][o.e.d.z.PublishClusterStateAction] [logging-es-data-master-ml8zmsl7] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{logging-es-data-master-zolfd11v}{7bYgDC3eTLCFGpqmt-w3ow}{UPTP1EvIR8qs4L_IZ7k43Q}{10.129.0.230}{10.129.0.230:9300}])
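For what it's worth, the "timed out after [5s] resolving host" warning is a name-resolution timeout, which can be checked independently of the ES transport. Below is a minimal sketch (assuming python is available in the pod image) that times a getaddrinfo() lookup, which is roughly the operation UnicastZenPing performs when resolving the unicast host list; the service name "logging-es-cluster" is taken from the log message above:

```python
import socket
import time

def resolve_time(host, port=9300):
    """Time a getaddrinfo() lookup and return (resolved IPs, seconds)."""
    start = time.monotonic()
    addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    elapsed = time.monotonic() - start
    return sorted({a[4][0] for a in addrs}), elapsed

# Replace "localhost" with the headless service name ("logging-es-cluster")
# when running this inside one of the ES pods.
ips, elapsed = resolve_time("localhost")
print(f"resolved {ips} in {elapsed:.3f}s")
```

If the lookup inside the pod regularly takes multiple seconds while the pods themselves are up, that would point at the cluster DNS path rather than the ES transport.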
Also seems to affect: starter-us-east-1b
more logs from starter-us-east-1a

[2018-11-27T00:00:12,370][WARN ][o.e.c.NodeConnectionsService] [logging-es-data-master-w04c4fyr] failed to connect to node {logging-es-data-master-t05uaomz}{u8Mg3qgqRfqMC5ZTvTx-hQ}{HdqoiYdBTAGfaxrXv99JgQ}{10.131.0.178}{10.131.0.178:9300} (tried [829] times)
org.elasticsearch.transport.ConnectTransportException: [logging-es-data-master-t05uaomz][10.131.0.178:9300] connect_timeout[30s]
        at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:363) ~[?:?]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:570) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:342) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:183) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.10.redhat-1.jar:5.6.10.redhat-1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: 10.131.0.178/10.131.0.178:9300
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
        ... 1 more
Caused by: java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
        ... 1 more

[2018-11-27T00:00:13,397][WARN ][o.e.d.z.ZenDiscovery     ] [logging-es-data-master-w04c4fyr] not enough master nodes discovered during pinging (found [[Candidate{node={logging-es-data-master-w04c4fyr}{9jq_bSL9RIanE-15gmEj9Q}{I98P4Ml3QyS_TmkQ_dNLXw}{10.129.4.194}{10.129.4.194:9300}, clusterStateVersion=28785}]], but needed [2]), pinging again
Created attachment 1509616 [details]
sdn pod logs for infra node

The SDN pod logs for the infra node on which one of the logging pods is running. I see messages about endpoints being deleted for the 'logging-es' service, but I don't know their significance, or whether their timing lines up with the ES pods going up and down.
(In reply to Jeff Cantrill from comment #0)
> Seeing messages like:
> [2018-11-28T15:37:24,237][WARN ][o.e.d.z.UnicastZenPing ] [logging-es-data-master-ml8zmsl7] timed out after [5s] resolving host [logging-es-cluster]

"resolving host" sounds like this is a DNS problem, not a transport problem?

> Caused by: java.net.NoRouteToHostException: No route to host

In general, "no route to host" when connecting to a service IP means that there are no pods backing that service currently. Clayton and dcbw are debugging CI flakes involving seemingly-spurious "no route to host" but I don't know the current details there or if this is at all related.

> I see messages regarding deleting endpoints for the 'logging-es' service. I
> don't know the significance or timing here of whether it occurs when the ES
> pods are going up and down.

That's what it looks like. It never deletes *all* of the IPs.
(In reply to Dan Winship from comment #4)
>
> "resolving host" sounds like this is a DNS problem, not a transport problem?
>
> > Caused by: java.net.NoRouteToHostException: No route to host
>
> In general, "no route to host" when connecting to a service IP means that
> there are no pods backing that service currently. Clayton and dcbw are
> debugging CI flakes involving seemingly-spurious "no route to host" but I
> don't know the current details there or if this is at all related.

This particular service is defined as headless with the setting `publishNotReadyAddresses: true`, so that the endpoints are published regardless of the readiness probe. I presume the DNS issue here is that it is initially unable to get the list of IPs to use for subsequent requests. The pods are up; they are just not able to see one another.

> > I see messages regarding deleting endpoints for the 'logging-es' service. I
> > don't know the significance or timing here of whether it occurs when the ES
> > pods are going up and down.
>
> That's what it looks like. It never deletes *all* of the IPs.

Is there more I can do to help determine what's going on? I see no issues in engint, but I do see the same issue in the other environments.
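For context, the service in question looks roughly like the fragment below. This is an illustrative sketch, not the exact shipped manifest; the name, selector, and port are taken from the log messages in this bug, everything else is assumed:

```yaml
# Hypothetical sketch of the headless ES discovery service.
apiVersion: v1
kind: Service
metadata:
  name: logging-es-cluster
spec:
  clusterIP: None                  # headless: DNS returns the pod IPs directly
  publishNotReadyAddresses: true   # endpoints published regardless of readiness
  selector:
    component: es                  # assumed label; check the actual manifest
  ports:
  - port: 9300                     # ES transport port used for node discovery
```

Because `clusterIP: None`, resolving `logging-es-cluster` should return the pod IPs themselves, which is why the UnicastZenPing resolution timeout matters: without that lookup succeeding quickly, the nodes never get the peer list to ping.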
From what you're describing, it sounds like the endpoints are failing healthchecks and are being removed from the service. Can you capture an "iptables-save" on the node of the client pod when this occurs?
Closing, as to my knowledge there is not currently any issue. I will reopen if needed.
Can you please open a new bug? The symptoms may be the same, but the root cause may be different, given that the previous bug was closed over a year ago, and mixing the two just leads to confusion. It does make sense to provide a link to the previous bug that you found, though.

We will need much more information than what was provided in comment 8. The service having endpoints added and removed may just mean that the pod started and then terminated (when it could not find peers). We would need to know which SDN plugin they are using (ovs-multitenant or ovs-networkpolicy?), and once we have that there will be follow-up questions. Can they `oc exec` into the pod and reach the peers? What do the endpoints for the service look like (`oc get endpoints`)?
Can we get a snapshot of the logging environment [1]? That should give us the ES logs.

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/hack/logging-dump.sh