Description of problem:
On EC2 m6 instances, A and AAAA DNS requests occasionally get sent with the same transaction ID. This causes the responses to sometimes get lost, causing a 5s timeout.

Version-Release number of selected component (if applicable):
glibc-2.28-127.el8

How reproducible:
Fairly easily

On an m6 instance run:

for i in `seq 0 5000` ; do date >> /home/$(whoami)/log.txt; curl -s -w '%{time_namelookup}' www.google.com -o /dev/null >> /home/$(whoami)/log.txt; echo "" >> /home/$(whoami)/log.txt; sleep 5; done

for a while, then look at log.txt; you'll see an occasional 2s or 5s lag.

Additional info:
This is probably the same bug addressed for z-series in #1868106. It only affects m6, as a1 is too slow to get the same value out of gettimeofday() twice in a row and x86 uses the HP_TIMING stuff.

I fixed it in Amazon Linux 2 differently, by doing a partial backport of:

359653aaacad463d916323f03c0ac3c47405aafa Do not use HP_TIMING_NOW for random bits

to completely avoid the duplicates by switching to clock_gettime (I dropped the mktemp part of the patch as this is getting more churn upstream and has a higher regression risk). However, the patch to deal with the duplicates gracefully should do as well:

f1f00c072138af90ae6da180f260111f09afe7a3 resolv: Handle transaction ID collisions in parallel queries (bug 26600)
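The reproducer above appends two lines per iteration to log.txt: the output of date, then the time_namelookup value from curl. Spotting the occasional 2s or 5s lag by eye gets tedious over several thousand iterations, so a small filter may help. The following is only a sketch: the function name slow_lookups and the default threshold of 1.0 seconds are assumptions for illustration, and it assumes every iteration wrote exactly two lines, so that odd lines are timestamps and even lines are lookup times.

```shell
# Hypothetical helper, not part of the original report.
# Prints "timestamp<TAB>lookup time" for every lookup at or above a threshold.
#   $1: path to the log written by the reproducer loop
#   $2: threshold in seconds (assumed default: 1.0)
slow_lookups() {
    awk -v limit="${2:-1.0}" '
        # Odd lines hold the output of date; remember it for the next line.
        NR % 2 == 1 { stamp = $0; next }
        # Even lines hold time_namelookup; report the slow ones.
        $1 + 0 >= limit { print stamp "\t" $1 }
    ' "$1"
}
```

For example, slow_lookups "/home/$(whoami)/log.txt" 2 would list only the lookups that took two seconds or longer, together with the time at which each one started.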
Ben, Thanks for the report and feedback. The full fix (f1f00c072138af90ae6da180f260111f09afe7a3) is already planned for release in RHEL 8.4.0 which should GA sometime in 2021Q2. Do you have customers running into this and reporting it to Amazon? I'm trying to determine if this has broader scope than just the initial z-series issues we saw with OpenShift platform.
Yes we do. Specifically on m6 instances. I verified today that it affects RHEL 8.3 with the above test script. It was originally reported against Amazon Linux 2 but the customer who reported it is now asking about RHEL :-) It is fundamentally the same problem as the z-series one. I haven't had a chance to build a patched RHEL 8 glibc to verify the fix (in part because for some obscure reason rpmbuild is failing to find dependencies such as libstdc++-static), but I've debugged the problem on AL2 and it's basically the same thing, so I think your fix will work. If you can send me a patched aarch64 RPM I can give it a spin to confirm.
Ben, Thanks for the offer to test.

Create the following /etc/yum.repos.d/rhbz1903880.repo

~~~
[rhbz1903880]
name=RHEL 8.4.0 testfix for bug 1903880
baseurl=https://people.redhat.com/codonell/rhel-8.4.0-rhbz1903880
enabled=1
gpgcheck=0
protect=1
~~~

Then dnf upgrade should just upgrade you to the testfix version.

There is no expectation of support for these RPMs, and please do not use them in production :-)

If the customer needs an immediate fix in RHEL 8.3.0 then please advise them to talk to us directly. Please feel free to send RHEL users our way if they have questions.

Thanks again for the double-check on your end.
Thanks. I'll use the above exclusively to verify the fix on m6 & let you know the results.
FYI, per our CCSP agreement, customers using on-demand RHEL should contact AWS Support for assistance and we will escalate and
(In reply to David Duncan from comment #11)
> FYI, per our CCSP agreement, customers using on-demand RHEL should contact
> AWS Support for assistance and we will escalate and

David, Sorry, it looks like your sentence is cut off?
Thank you for your bug report. We are now tracking delivery of the fix into Red Hat Enterprise Linux 8.3 as bug 1904153, so for process reasons, I'm closing this bug as a duplicate of the other bug. All these bug fixes cover all architectures at the same time because we are changing generic code.

*** This bug has been marked as a duplicate of bug 1904153 ***