Bug 1688069 - DNS failing to resolve intermittently. Application is failing to connect with DB(hosted outside of Openshift).
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-03-13 04:07 UTC by Miheer Salunke
Modified: 2023-09-15 01:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-20 15:43:07 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Comment 6 Neelesh Agrawal 2019-03-13 11:37:23 UTC
Created attachment 1543585 [details]
tcpdump Mar 12 2019  at  03:33 AM -07:00

Comment 16 Ben Bennett 2019-03-15 18:55:19 UTC
OpenShift engineering and the kernel network engineers agree that this looks like the kernel bug described in https://access.redhat.com/solutions/3827421

Could you try the iptables rule from that solution on a single node and report whether it addresses the problem?

Comment 21 Ben Bennett 2019-03-18 17:49:03 UTC
> Ben, do you think the insert_failed counter from comment 17 gives more evidence of a conntrack/iptables-related issue?

Yes!  https://bugzilla.redhat.com/show_bug.cgi?id=1648965#c34

I wish they would try the NOTRACK workaround documented in https://access.redhat.com/solutions/3827421
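
For anyone who wants to track the insert_failed counter over time, here is a minimal Python sketch. It assumes the counter is read from /proc/net/stat/nf_conntrack (one header line, then one row of hexadecimal values per CPU); `conntrack -S` reports the same statistic. The function name is only illustrative and is not part of any tool mentioned in this bug.

```python
def sum_conntrack_counter(stat_text: str, counter: str = "insert_failed") -> int:
    """Sum one per-CPU counter from /proc/net/stat/nf_conntrack-style text.

    Assumption: the first non-empty line names the columns and every
    following line holds one CPU's values in hexadecimal.
    """
    rows = [line.split() for line in stat_text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("no conntrack statistics found")
    header, per_cpu = rows[0], rows[1:]
    if counter not in header:
        raise KeyError(f"column {counter!r} not present in header")
    col = header.index(counter)
    # Values are per CPU, so the node-wide figure is the sum of the column.
    return sum(int(row[col], 16) for row in per_cpu)
```

Calling it with the contents of /proc/net/stat/nf_conntrack on a node, a few minutes apart, shows whether insert_failed is still increasing while the DNS failures occur.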


> Also tcpdump which shows no activity for 9 seconds does it relate to https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts as customer is on old kernel which does not have that fix ?

I don't see a 9 second gap in the timestamps... is it possible that the terminal just stalled rather than the lookups stalling?
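
One way to check this independently of the terminal is to scan the capture's own timestamps. The sketch below assumes default tcpdump text output, where each packet line starts with HH:MM:SS.ffffff; the threshold and function name are illustrative choices, not something taken from the attached capture.

```python
import re
from datetime import datetime

# Default tcpdump text output starts each packet line with HH:MM:SS.ffffff
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s")


def find_gaps(lines, threshold_s=5.0):
    """Return (previous_ts, current_ts, seconds) for each gap >= threshold_s."""
    gaps = []
    prev_text, prev_time = None, None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip continuation lines and anything without a timestamp
        cur_text = match.group(1)
        cur_time = datetime.strptime(cur_text, "%H:%M:%S.%f")
        if prev_time is not None:
            delta = (cur_time - prev_time).total_seconds()
            if delta < 0:
                delta += 86400  # capture crossed midnight
            if delta >= threshold_s:
                gaps.append((prev_text, cur_text, delta))
        prev_text, prev_time = cur_text, cur_time
    return gaps
```

Feeding it the output of `tcpdump -r <capture file>` line by line lists any real pauses between packets; an empty result would suggest the terminal stalled rather than the lookups.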


> Sorry for asking this again but customer is really stubborn to implement changes as it is their payment related environment.

Yes, but the NOTRACK change is easy to audit, easy to install, easy to test, and easy to roll back.

Comment 24 Stephen Cuppett 2019-03-28 01:42:42 UTC

*** This bug has been marked as a duplicate of bug 1648965 ***

Comment 26 Miheer Salunke 2019-04-04 02:00:34 UTC
Hi,

We are currently checking whether this issue happens only with Java applications.

Also, in the tcpdumps we looked at earlier, the resolver appends the search domains from /etc/resolv.conf to the domain name on lookup requests (see https://bugzilla.redhat.com/show_bug.cgi?id=1688069#c17 ).

So the first few requests fail and then one succeeds. Shall we try blackholing as per https://access.redhat.com/solutions/3993581

or maybe reduce ndots for one of their apps and monitor it?
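
To illustrate why the first few requests fail before one succeeds, here is a simplified Python model of how a glibc-style resolver orders its queries from the search list and ndots. The search domains, the ndots value of 5, and the hostname in the usage note are assumptions for illustration (typical of a pod's /etc/resolv.conf), not values taken from the customer's environment.

```python
def candidate_queries(name, search, ndots=1):
    """Return the fully qualified names a resolver would try, in order.

    Simplified model: a trailing dot means the name is already absolute;
    a name with fewer than `ndots` dots goes through the search list
    first and is tried as-is last; otherwise it is tried as-is first.
    """
    if name.endswith("."):
        return [name]
    as_is = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]
    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]
```

With a hypothetical search list of myproject.svc.cluster.local, svc.cluster.local and cluster.local and ndots=5, a lookup of db.example.com produces three search-suffixed queries that fail before the bare name is tried. Lowering ndots to 2, or writing the name with a trailing dot, makes the bare name the first query.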

Is there anything in Java that limits DNS lookups, so that they work only for some time and fail after that?

Should we reduce the number of pods per node?

Shall we take an strace of dnsmasq and the Java container?



Thanks and regards,
Miheer

Comment 27 Stephen Cuppett 2019-11-20 15:43:07 UTC
OCP 3.9 has reached the end of full support [1]. Closing this BZ as NOTABUG. There were outstanding questions about what to collect, etc. and the customer cases were resolved with a kernel fix. If there is a customer case to be attached with a valid support exception and we still need a fix here, please post those details and reopen.

[1] - https://access.redhat.com/support/policy/updates/openshift

Comment 28 Red Hat Bugzilla 2023-09-15 01:28:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

