1928773 – hostname lookup delays when master node down

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Miciah Dashiel Butler Masters
QA Contact:	jechen
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1930917
Depends On:	1928304
Blocks:	1935297

Reported:	2021-02-15 14:32 UTC by Miciah Dashiel Butler Masters
Modified:	2024-06-14 00:18 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1928304
Clones:	1935297
Environment:
Last Closed:	2021-03-16 23:22:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-dns-operator pull 236	None	closed	[release-4.6] Bug 1928773: Set CoreDNS readiness probe period and timeout each to 3 seconds	2021-03-08 21:29:19 UTC
Github	openshift sdn pull 261	None	closed	[release-4.6] Bug 1928773: Prefer local endpoint for cluster DNS service	2021-03-08 21:29:18 UTC
Red Hat Product Errata	RHBA-2021:0753	None	Closed	[bug] images can be made unusable by users	2022-05-12 10:42:00 UTC

Comment 1 Miciah Dashiel Butler Masters 2021-02-23 03:49:51 UTC

*** Bug 1930917 has been marked as a duplicate of this bug. ***

Comment 2 Miciah Dashiel Butler Masters 2021-02-25 22:20:32 UTC

Waiting for https://github.com/openshift/kubernetes/pull/581 to be merged so that I can rebase https://github.com/openshift/sdn/pull/261.

Comment 4 Miciah Dashiel Butler Masters 2021-03-03 21:00:24 UTC

Here's an update on where this backport stands.  The 4.6.z backport is waiting on two things: verification of the 4.7.z backport, and passing CI tests.  

The 4.7.z backport got delayed by a process issue unrelated to the fix itself.  CI tests are failing due to an issue with our CI infrastructure: one of the CI jobs verifies changes on GCP, and we are currently having general issues with GCE API rate limiting, again unrelated to the fix itself.  

I anticipate that the 4.7.z backport will be verified this week.  Then the 4.6.z backport can be verified next week, and shipped in the fast/stable channels approximately 2.5 weeks from now.  (There is a possibility that the 4.6.z backport will be available in the candidate-4.6 channel a little earlier.)

Comment 7 Ben Bennett 2021-03-04 14:57:41 UTC

*** Bug 1921797 has been marked as a duplicate of this bug. ***

Comment 10 jechen 2021-03-08 15:05:21 UTC

Verified in 4.6.0-0.nightly-2021-03-06-050044

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-03-06-050044   True        False         24m     Cluster version is 4.6.0-0.nightly-2021-03-06-050044

# In the first terminal, create a test pod, rsh into it, and run an infinite loop to measure DNS lookup time
$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
pod/hello-pod created
$ oc rsh hello-pod
/ # while (true); do time getent hosts kubernetes.default.svc.cluster.local; sleep 1; done
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.02s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
<---- snip----->


# In the second terminal, reboot one of the master nodes
$ oc get node
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-947k4ft-f76d1-9nnw9-master-0         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-1         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-2         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-b-drbnn   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-c-fwrs8   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-d-6jpxz   Ready    worker   49m   v1.19.0+2f3101c

$ oc debug node/ci-ln-947k4ft-f76d1-9nnw9-master-1
Starting pod/ci-ln-947k4ft-f76d1-9nnw9-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot


# Monitor DNS lookup times in the first terminal for 3-5 minutes while the master node reboots; there should be no delay
<---- snip----->

172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s
<---- snip----->
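
Watching the loop output by eye for several minutes is error-prone, so a small filter can flag slow lookups instead. The following is only a sketch, not part of the original verification: the function name `slow_lookups` and the 1-second threshold are hypothetical, and the parsing assumes the `real	0m 0.02s` format printed by the test pod's `time` above.

```shell
# Print the duration (in seconds) of every "real" timing line that exceeds
# a threshold. Assumes the "real <min>m <sec>s" layout shown in the output above.
slow_lookups() {
  threshold=$1
  awk -v t="$threshold" '
    $1 == "real" {
      min = $2; sec = $3
      # Strip the unit suffixes ("m" and "s") before doing arithmetic.
      sub(/m$/, "", min); sub(/s$/, "", sec)
      total = min * 60 + sec
      if (total > t) print total
    }'
}

# Example with two sample timing lines; only the second exceeds 1 second.
printf 'real\t0m 0.02s\nreal\t0m 5.01s\n' | slow_lookups 1
```

To apply it to a saved copy of the loop output, pipe the file through the function, for example `slow_lookups 1 < lookup.log` (the file name is hypothetical). No output means no lookup exceeded the threshold.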

$ oc -n openshift-dns get ds/dns-default -o yaml

<---- snip----->
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3           <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236
          successThreshold: 1
          timeoutSeconds: 3          <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236

<---- snip----->
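
As a rough sanity check on the probe settings above, the time before an unresponsive CoreDNS pod is marked not ready can be approximated as failureThreshold x periodSeconds + timeoutSeconds. This is a simplification for illustration, not the kubelet's exact behavior, and the helper name `probe_detection_window` is hypothetical. The 10-second values in the second call are assumed purely for comparison and are not taken from this bug.

```shell
# Approximate upper bound, in seconds, before a failing pod is marked not ready.
# Rough estimate only: actual timing depends on where the failure falls
# relative to the probe schedule.
probe_detection_window() {
  period=$1    # periodSeconds
  timeout=$2   # timeoutSeconds
  failures=$3  # failureThreshold
  echo $(( failures * period + timeout ))
}

probe_detection_window 3 3 3     # values from the DaemonSet output above
probe_detection_window 10 10 3   # hypothetical 10-second values, for comparison only
```

Under this approximation the verified settings give about 12 seconds, versus about 40 seconds for the illustrative 10-second values, which is why shortening the period and timeout narrows the window in which traffic can still be sent to a DNS pod on a node that has gone down.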

Comment 11 jechen 2021-03-08 18:46:46 UTC

Sorry, I set the wrong status earlier; changed it to VERIFIED.

Comment 13 errata-xmlrpc 2021-03-16 23:22:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.21 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0753
