1903880 – resolv: Duplicate transaction ID causing timeouts


Keywords:
Status:	CLOSED DUPLICATE of bug 1904153
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	glibc
Sub Component:
Version:	8.3
Hardware:	aarch64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	glibc team
QA Contact:	qe-baseos-tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:	1868106
Blocks:

Reported:	2020-12-03 02:30 UTC by benh@amazon.com
Modified:	2023-07-18 14:30 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-11 14:52:09 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:


Description benh@amazon.com 2020-12-03 02:30:19 UTC

Description of problem:

On EC2 m6 instances, A and AAAA DNS requests are occasionally sent with the same transaction ID. When that happens, a response sometimes gets lost, which causes a 5s timeout.

Version-Release number of selected component (if applicable):

glibc-2.28-127.el8

How reproducible:

Fairly easily

On an m6 instance run:

for i in $(seq 0 5000); do
    date >> /home/$(whoami)/log.txt
    curl -s -w '%{time_namelookup}' www.google.com -o /dev/null >> /home/$(whoami)/log.txt
    echo "" >> /home/$(whoami)/log.txt
    sleep 5
done

Let it run for a while, then look at log.txt: you'll see an occasional 2s or 5s lag.
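
To avoid scanning log.txt by eye, a small filter along these lines could list only the slow lookups. This is a minimal sketch, not part of the original report: it assumes the log alternates a `date` line with a `%{time_namelookup}` value in seconds, as the loop above produces, and the function name `slow_lookups` and the 1-second default threshold are made up for illustration.

```python
import sys
from typing import Iterable, List, Tuple


def slow_lookups(lines: Iterable[str], threshold: float = 1.0) -> List[Tuple[str, float]]:
    """Return (date_line, seconds) pairs whose lookup time is at least `threshold`."""
    slow = []
    stamp = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        try:
            seconds = float(line)
        except ValueError:
            # Not a number: assume it is the output of `date`.
            stamp = line
            continue
        if seconds >= threshold:
            slow.append((stamp, seconds))
    return slow


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 slow_lookups.py ~/log.txt
    with open(sys.argv[1]) as log:
        for stamp, seconds in slow_lookups(log):
            print(f"{seconds:.3f}s  {stamp}")
```

Any entry near 2s or 5s in the output would correspond to the lag described above.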

Additional info:

This is probably the same bug addressed for z-series in #1868106. It only affects m6, as a1 is too slow to get the same value out of gettimeofday() twice in a row, and x86 uses the HP_TIMING stuff instead.
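
The effect of a coarse clock can be modeled with a toy example. This is only a sketch of the general idea, not the glibc code: the function `query_id` and its mixing formula are assumptions chosen for illustration, standing in for any scheme that derives the 16-bit transaction ID purely from a microsecond-resolution timestamp.

```python
def query_id(tv_sec: int, tv_usec: int) -> int:
    """Hypothetical ID derivation: fold a microsecond timestamp into 16 bits."""
    return ((tv_sec << 8) ^ tv_usec) & 0xFFFF


def ids_collide(first: tuple, second: tuple) -> bool:
    """True if two (tv_sec, tv_usec) readings yield the same transaction ID."""
    return query_id(*first) == query_id(*second)


# A fast CPU can build the A and AAAA queries within the same microsecond,
# so both clock readings are identical and the IDs are identical too.
same_tick = ((1606962619, 123456), (1606962619, 123456))

# A slower CPU sees the clock advance between the two readings.
next_tick = ((1606962619, 123456), (1606962619, 123457))

print("same tick collides:", ids_collide(*same_tick))
print("next tick collides:", ids_collide(*next_tick))
```

With such a scheme, whether the two parallel queries share an ID depends only on whether the clock advanced between the two readings, which matches the observation that the slower a1 is unaffected.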

I fixed it in Amazon Linux 2 differently by doing a partial backport of:

 359653aaacad463d916323f03c0ac3c47405aafa Do not use HP_TIMING_NOW for random bits

to completely avoid the duplicates by switching to clock_gettime (I dropped the mktemp part of the patch as this is getting more churn upstream and has a higher regression risk).

However, the patch that deals with the duplicates gracefully should work as well:

 f1f00c072138af90ae6da180f260111f09afe7a3 resolv: Handle transaction ID collisions in parallel queries (bug 26600)
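
For readers unfamiliar with the failure mode, the difference between the two approaches can be sketched as follows. This is not taken from the glibc patch; it is a simplified model that assumes a resolver with two pending parallel queries, and the names `Message`, `match_by_id` and `match_by_id_and_question` are hypothetical. It shows why matching a response on the transaction ID alone is ambiguous when both queries share an ID, and how also comparing the question disambiguates them.

```python
from typing import List, NamedTuple, Optional


class Message(NamedTuple):
    """Simplified DNS query or response: transaction ID plus question."""
    txid: int
    qname: str
    qtype: str


def match_by_id(resp: Message, pending: List[Message]) -> Optional[int]:
    """Index of the first pending query with the same transaction ID."""
    for i, query in enumerate(pending):
        if query.txid == resp.txid:
            return i
    return None


def match_by_id_and_question(resp: Message, pending: List[Message]) -> Optional[int]:
    """Index of the pending query matching both the ID and the question."""
    for i, query in enumerate(pending):
        if (query.txid == resp.txid
                and query.qname.lower() == resp.qname.lower()
                and query.qtype == resp.qtype):
            return i
    return None


# Parallel A and AAAA queries that happen to share transaction ID 7.
pending = [Message(7, "www.google.com", "A"), Message(7, "www.google.com", "AAAA")]
aaaa_response = Message(7, "www.google.com", "AAAA")

print("ID only:", match_by_id(aaaa_response, pending))
print("ID and question:", match_by_id_and_question(aaaa_response, pending))
```

In this model, ID-only matching attributes the AAAA response to the A query, so the AAAA query is never marked as answered and would eventually time out.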

Comment 2 Carlos O'Donell 2020-12-03 03:29:49 UTC

Ben,

Thanks for the report and feedback.

The full fix (f1f00c072138af90ae6da180f260111f09afe7a3) is already planned for release in RHEL 8.4.0 which should GA sometime in 2021Q2.

Do you have customers running into this and reporting it to Amazon?

I'm trying to determine if this has a broader scope than just the initial z-series issues we saw with the OpenShift platform.

Comment 3 benh@amazon.com 2020-12-03 03:47:07 UTC

Yes, we do. Specifically on m6 instances. I verified today that it affects RHEL 8.3 with the above test script. It was originally reported against Amazon Linux 2 but the customer who reported it is now asking about RHEL :-)

It is fundamentally the same problem as the z-series one. I haven't had a chance to build a patched RHEL8 glibc to verify the fix (in part because for some obscure reason rpmbuild is failing to find dependencies such as libstdc++-static), but I've debugged the problem on AL2 and it's basically the same thing, so I think your fix will work. If you can send me a patched aarch64 RPM I can give it a spin to confirm.

Comment 4 Carlos O'Donell 2020-12-03 04:02:40 UTC

Ben,

Thanks for the offer to test.

Create the following /etc/yum.repos.d/rhbz1903880.repo:
~~~
[rhbz1903880]
name=RHEL 8.4.0 testfix for bug 1903880
baseurl=https://people.redhat.com/codonell/rhel-8.4.0-rhbz1903880
enabled=1
gpgcheck=0
protect=1
~~~
Then dnf upgrade should just upgrade you to the testfix version.

There is no expectation of support for these RPMs, and please do not use them in production :-)

If the customer needs an immediate fix in RHEL 8.3.0 then please advise them to talk to us directly.

Please feel free to send RHEL users our way if they have questions.

Thanks again for the double-check on your end.

Comment 5 benh@amazon.com 2020-12-03 04:07:43 UTC

Thanks. I'll use the above exclusively to verify the fix on m6 & let you know the results.

Comment 11 David Duncan 2020-12-03 15:37:49 UTC

FYI, per our CCSP agreement, customers using on-demand RHEL should contact AWS Support for assistance and we will escalate and

Comment 12 Carlos O'Donell 2020-12-03 17:39:58 UTC

(In reply to David Duncan from comment #11)
> FYI, per our CCSP agreement, customers using on-demand RHEL should contact
> AWS Support for assistance and we will escalate and

David, sorry, it looks like your sentence is cut off?

Comment 13 Florian Weimer 2020-12-11 14:52:09 UTC

Thank you for your bug report.

We are now tracking delivery of the fix into Red Hat Enterprise Linux 8.3 as bug 1904153, so for process reasons, I'm closing this bug as a duplicate of the other bug.

All these bug fixes cover all architectures at the same time because we are changing generic code.

*** This bug has been marked as a duplicate of bug 1904153 ***
