Description benh@amazon.com
2020-12-03 02:30:19 UTC
Description of problem:
On EC2 m6 instances, A and AAAA DNS requests occasionally get sent with the same transaction ID. When that happens, a response is sometimes lost, which causes a 5s timeout.
Version-Release number of selected component (if applicable):
glibc-2.28-127.el8
How reproducible:
Fairly easily
On an m6 instance run:
for i in `seq 0 5000` ; do date >> /home/$(whoami)/log.txt; curl -s -w '%{time_namelookup}' www.google.com -o /dev/null >> /home/$(whoami)/log.txt; echo "" >> /home/$(whoami)/log.txt; sleep 5; done
for a while, then look at log.txt; you'll see an occasional 2s or 5s lag.
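To pick the slow lookups out of log.txt without reading it line by line, something like the sketch below may help. It assumes the log alternates between a `date` line and a bare `time_namelookup` value in seconds, which is what the loop above writes. The function name `slow_lookups` and the 1-second threshold are illustrative choices, not part of the original report.

```python
def slow_lookups(lines, threshold=1.0):
    """Return (timestamp, seconds) pairs for lookups at or above threshold.

    Assumes each sample is a `date` line followed by a line holding only
    the curl time_namelookup value in seconds.
    """
    slow = []
    timestamp = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        try:
            seconds = float(line)
        except ValueError:
            # Not a number, so treat it as the `date` line for the next sample.
            timestamp = line
            continue
        if seconds >= threshold:
            slow.append((timestamp, seconds))
    return slow
```

Calling `slow_lookups(open("log.txt"))` returns the delayed samples together with the timestamp logged just before each one.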
Additional info:
This is probably the same bug addressed for z-series in #1868106. It only affects m6 as a1 is too slow to get the same value out of gettimeofday() twice in a row and x86 uses the HP_TIMING stuff.
I fixed it in Amazon Linux 2 differently by doing a partial backport of:
359653aaacad463d916323f03c0ac3c47405aafa Do not use HP_TIMING_NOW for random bits
to completely avoid the duplicates by switching to clock_gettime (I dropped the mktemp part of the patch as this is getting more churn upstream and has a higher regression risk).
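The collision mechanism described above can be illustrated with a small sketch. It assumes the 16-bit transaction ID is derived purely from a timestamp. The mixing in `txid_from_timestamp` is hypothetical and is not glibc's actual code, and both function names are made up for this example. The point is only that two queries built within the same clock tick get the same ID, while a finer-grained clock tells them apart.

```python
def txid_from_timestamp(seconds, fraction):
    """Fold a timestamp into a 16-bit DNS transaction ID.

    Hypothetical mixing for illustration only, not glibc's code.
    """
    return ((seconds << 8) ^ fraction) & 0xFFFF


def ids_for_query_pair(t_a_ns, t_aaaa_ns, resolution_ns):
    """Return the IDs of an A/AAAA pair built at the given nanosecond times,
    as seen through a clock that only ticks every `resolution_ns`."""
    ticks_per_second = 1_000_000_000 // resolution_ns
    ids = []
    for t_ns in (t_a_ns, t_aaaa_ns):
        ticks = t_ns // resolution_ns  # what the coarse clock reports
        seconds, fraction = divmod(ticks, ticks_per_second)
        ids.append(txid_from_timestamp(seconds, fraction))
    return tuple(ids)


# Two queries built 650 ns apart.
coarse = ids_for_query_pair(1_000_000_250, 1_000_000_900, resolution_ns=1000)
fine = ids_for_query_pair(1_000_000_250, 1_000_000_900, resolution_ns=1)
```

With a microsecond-resolution clock both queries land in the same tick, so `coarse` holds two identical IDs. With nanosecond resolution the two values in `fine` differ. A 16-bit ID can still collide by chance under any scheme, so this shows only why a coarse clock makes duplicates likely.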
However the patch to deal with the duplicates gracefully should do as well:
f1f00c072138af90ae6da180f260111f09afe7a3 resolv: Handle transaction ID collisions in parallel queries (bug 26600)
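This thread does not describe how that patch works internally. As a rough illustration of what handling a collision gracefully could mean, the sketch below contrasts matching a response on the transaction ID alone with matching on the ID plus the question that a DNS response echoes back. All names here are hypothetical, and this is not the glibc implementation.

```python
from collections import namedtuple

# Minimal stand-ins for an outstanding query and a received response.
Query = namedtuple("Query", "txid qname qtype")
Response = namedtuple("Response", "txid qname qtype")


def match_by_id_only(pending, response):
    """Return the first pending query whose transaction ID matches."""
    for query in pending:
        if query.txid == response.txid:
            return query
    return None


def match_by_id_and_question(pending, response):
    """Return the pending query whose ID, name and record type all match."""
    wanted = (response.txid, response.qname.lower(), response.qtype)
    for query in pending:
        if (query.txid, query.qname.lower(), query.qtype) == wanted:
            return query
    return None


# An A and an AAAA query that happen to share transaction ID 256.
pending = [
    Query(256, "www.example.com", "A"),
    Query(256, "www.example.com", "AAAA"),
]
aaaa_reply = Response(256, "www.example.com", "AAAA")
```

Here `match_by_id_only(pending, aaaa_reply)` attributes the AAAA reply to the A query, while `match_by_id_and_question(pending, aaaa_reply)` picks the AAAA query, so neither answer is mistaken for the other.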
Ben,
Thanks for the report and feedback.
The full fix (f1f00c072138af90ae6da180f260111f09afe7a3) is already planned for release in RHEL 8.4.0 which should GA sometime in 2021Q2.
Do you have customers running into this and reporting it to Amazon?
I'm trying to determine if this has broader scope than just the initial z-series issues we saw with OpenShift platform.
Yes, we do, specifically on m6 instances. I verified today that it affects RHEL 8.3 with the above test script. It was originally reported against Amazon Linux 2, but the customer who reported it is now asking about RHEL :-)
It is fundamentally the same problem as the z-series one. I haven't had a chance to build a patched RHEL8 glibc to verify the fix (in part because for some obscure reason rpmbuild is failing to find dependencies such as libstdc++-static),
but I've debugged the problem on AL2 and it's basically the same thing, so I think your fix will work. If you can send me a patched aarch64 RPM I can give it a spin to confirm.
Ben,
Thanks for the offer to test.
Create the following /etc/yum.repos.d/rhbz1903880.repo
~~~
[rhbz1903880]
name=RHEL 8.4.0 testfix for bug 1903880
baseurl=https://people.redhat.com/codonell/rhel-8.4.0-rhbz1903880
enabled=1
gpgcheck=0
protect=1
~~~
Then dnf upgrade should just upgrade you to the testfix version.
There is no expectation of support for these RPMs, and please do not use them in production :-)
If the customer needs an immediate fix in RHEL 8.3.0 then please advise them to talk to us directly.
Please feel free to send RHEL users our way if they have questions.
Thanks again for the double-check on your end.
(In reply to David Duncan from comment #11)
> FYI, per our CCSP agreement, customers using on-demand RHEL should contact
> AWS Support for assistance and we will escalate and
David, sorry, it looks like your sentence is cut off?
Thank you for your bug report.
We are now tracking delivery of the fix into Red Hat Enterprise Linux 8.3 as bug 1904153, so for process reasons, I'm closing this bug as a duplicate of the other bug.
All these bug fixes cover all architectures at the same time because we are changing generic code.
*** This bug has been marked as a duplicate of bug 1904153 ***