Created attachment 1543585: tcpdump (Mar 12 2019 at 03:33 AM -07:00)
OpenShift engineering and the kernel network engineers agree that this looks like the kernel bug described in https://access.redhat.com/solutions/3827421. If you can, please try the iptables rule from that solution on a single node and check whether it addresses the problem.
> Ben, do you think the insert_failed counter as per comment 17 gives more evidence for conntrack iptable related issue?

Yes! https://bugzilla.redhat.com/show_bug.cgi?id=1648965#c34

I wish they would try the NOTRACK workaround documented in https://access.redhat.com/solutions/3827421

> Also tcpdump which shows no activity for 9 seconds does it relate to https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts as customer is on old kernel which does not have that fix?

I don't see a 9 second gap in the timestamps... is it possible that the terminal just stalled rather than the lookups stalling?

> Sorry for asking this again but customer is really stubborn to implement changes as it is their payment related environment.

Yes, but the NOTRACK change is easy to audit, easy to install, easy to test, and easy to roll-back.
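
For anyone who wants to watch the insert_failed counter before and after applying the workaround, here is a minimal sketch. It assumes the statistics are captured as text in the per-CPU `key=value` layout printed by `conntrack -S`; the sample numbers are made up for illustration and are not taken from the customer's node.

```python
import re

def sum_counter(stats_text: str, counter: str = "insert_failed") -> int:
    """Sum one per-CPU counter across all lines of `conntrack -S`-style text."""
    # \b keeps "insert" from matching inside "insert_failed",
    # and "drop" from matching inside "early_drop".
    pattern = re.compile(r"\b" + re.escape(counter) + r"=(\d+)")
    return sum(int(m.group(1)) for m in pattern.finditer(stats_text))

def counter_delta(before: str, after: str, counter: str = "insert_failed") -> int:
    """How much the counter grew between two captures of the statistics."""
    return sum_counter(after, counter) - sum_counter(before, counter)

# Hypothetical sample captures (illustrative values only).
SAMPLE_BEFORE = """\
cpu=0 found=0 invalid=3 ignore=10 insert=0 insert_failed=7 drop=7 early_drop=0 error=0 search_restart=2
cpu=1 found=0 invalid=1 ignore=4 insert=0 insert_failed=5 drop=5 early_drop=0 error=0 search_restart=0
"""
SAMPLE_AFTER = """\
cpu=0 found=0 invalid=3 ignore=12 insert=0 insert_failed=9 drop=9 early_drop=0 error=0 search_restart=2
cpu=1 found=0 invalid=1 ignore=4 insert=0 insert_failed=6 drop=6 early_drop=0 error=0 search_restart=0
"""

if __name__ == "__main__":
    print("insert_failed total:", sum_counter(SAMPLE_AFTER))
    print("insert_failed growth:", counter_delta(SAMPLE_BEFORE, SAMPLE_AFTER))
```

If the counter keeps growing while DNS lookups are timing out and stops growing once the NOTRACK rule is in place on the test node, that would be consistent with the conntrack race; it would not prove it on its own.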
*** This bug has been marked as a duplicate of bug 1648965 ***
Hi,

We are currently checking whether this issue happens only with Java applications.

Also, from the tcpdumps we looked at earlier, we see that each lookup request has the search domains from /etc/resolv.conf appended to the domain name, as per https://bugzilla.redhat.com/show_bug.cgi?id=1688069#c17. So the first few requests fail and then one succeeds. Shall we try blackholing as per https://access.redhat.com/solutions/3993581, or maybe reduce ndots for one of their apps and monitor it?

Some further questions:
Is there anything in Java that limits the lookups, so that they are attempted only for some time and then fail?
Should we reduce the number of pods per node?
Shall we take an strace of dnsmasq and the Java container?

Thanks and regards,
Miheer
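
To illustrate why the first few requests fail before one succeeds, here is a rough sketch of the usual resolver search-list behaviour: a name with fewer dots than ndots is tried with each search domain first and only then as-is. The search list, the ndots value of 5, and the name `api.example.com` are assumptions for illustration, not values read from the customer's /etc/resolv.conf.

```python
from typing import List

def candidate_queries(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """Return the fully qualified names a stub resolver would try, in order."""
    # A trailing dot marks the name as absolute: no search domains are applied.
    if name.endswith("."):
        return [name]
    as_is = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]
    # With at least `ndots` dots the name is tried as-is first;
    # otherwise every search domain is tried before the bare name.
    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]

# Hypothetical pod resolv.conf values (assumed for this example).
SEARCH = ["myproject.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for ndots in (5, 2):
        print(f"ndots:{ndots}")
        for query in candidate_queries("api.example.com", SEARCH, ndots):
            print("  ", query)
```

With ndots at 5 the external name is only tried as-is after three search-domain attempts, each of which is an extra DNS packet that can hit the conntrack race. With ndots at 2 the as-is query goes out first, which is why lowering ndots for one app and monitoring it is a reasonable experiment.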
OCP 3.9 has reached the end of full support [1]. Closing this BZ as NOTABUG. There were outstanding questions about what to collect, etc. and the customer cases were resolved with a kernel fix. If there is a customer case to be attached with a valid support exception and we still need a fix here, please post those details and reopen. [1] - https://access.redhat.com/support/policy/updates/openshift
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days