Bug 1928773 - hostname lookup delays when master node down [NEEDINFO]
Summary: hostname lookup delays when master node down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Miciah Dashiel Butler Masters
QA Contact: jechen
URL:
Whiteboard:
Duplicates: 1930917
Depends On: 1928304
Blocks: 1935297
Reported: 2021-02-15 14:32 UTC by Miciah Dashiel Butler Masters
Modified: 2021-04-20 08:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1928304
Clones: 1935297
Environment:
Last Closed: 2021-03-16 23:22:18 UTC
Target Upstream Version:
skanakal: needinfo? (mmasters)



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-dns-operator pull 236 | 0 | None | closed | [release-4.6] Bug 1928773: Set CoreDNS readiness probe period and timeout each to 3 seconds | 2021-03-08 21:29:19 UTC
Github | openshift sdn pull 261 | 0 | None | closed | [release-4.6] Bug 1928773: Prefer local endpoint for cluster DNS service | 2021-03-08 21:29:18 UTC
Red Hat Product Errata | RHBA-2021:0753 | 0 | None | None | None | 2021-03-16 23:22:34 UTC

Comment 1 Miciah Dashiel Butler Masters 2021-02-23 03:49:51 UTC
*** Bug 1930917 has been marked as a duplicate of this bug. ***

Comment 2 Miciah Dashiel Butler Masters 2021-02-25 22:20:32 UTC
Waiting for https://github.com/openshift/kubernetes/pull/581 to be merged so that I can rebase https://github.com/openshift/sdn/pull/261.

Comment 4 Miciah Dashiel Butler Masters 2021-03-03 21:00:24 UTC
Here's an update on where this backport stands.  The 4.6.z backport is waiting on two things: verification of the 4.7.z backport, and passing CI tests.  

The 4.7.z backport got delayed by a process issue unrelated to the fix itself.  CI tests are failing due to an issue with our CI infrastructure: one of the CI jobs verifies changes on GCP, and we are currently having general issues with GCE API rate limiting, again unrelated to the fix itself.  

I anticipate that the 4.7.z backport will be verified this week.  Then the 4.6.z backport can be verified next week, and shipped in the fast/stable channels approximately 2.5 weeks from now.  (There is a possibility that the 4.6.z backport will be available in the candidate-4.6 channel a little earlier.)

Comment 7 Ben Bennett 2021-03-04 14:57:41 UTC
*** Bug 1921797 has been marked as a duplicate of this bug. ***

Comment 10 jechen 2021-03-08 15:05:21 UTC
Verified in 4.6.0-0.nightly-2021-03-06-050044

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-03-06-050044   True        False         24m     Cluster version is 4.6.0-0.nightly-2021-03-06-050044

# In the first terminal, create a test pod, rsh into it, and run an infinite loop to measure DNS lookup time
$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
pod/hello-pod created
$ oc rsh hello-pod
/ # while (true); do time getent hosts kubernetes.default.svc.cluster.local; sleep 1; done
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.02s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
<---- snip----->


# In the second terminal, reboot one of the master nodes
$ oc get node
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-947k4ft-f76d1-9nnw9-master-0         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-1         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-2         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-b-drbnn   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-c-fwrs8   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-d-6jpxz   Ready    worker   49m   v1.19.0+2f3101c

$ oc debug node/ci-ln-947k4ft-f76d1-9nnw9-master-1
Starting pod/ci-ln-947k4ft-f76d1-9nnw9-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot


# Monitor DNS lookup time in the first terminal for 3-5 minutes while the master node reboots; make sure there is no delay
<---- snip----->

172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s
<---- snip----->
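
# Editorial sketch, not part of the original verification: the loop above relies on reading the `time` output by eye. Assuming a shell with GNU `date` (for `%N` nanoseconds), which a minimal test image may not provide, the same check could be scripted so that only slow lookups are printed. The function names and the 1000 ms default threshold are hypothetical choices for illustration.

```shell
# Time one `getent hosts` lookup and print the elapsed milliseconds.
# Assumes GNU date; `%N` is not available in every minimal image.
lookup_ms() {
  lm_start=$(date +%s%N)
  getent hosts "$1" >/dev/null
  lm_end=$(date +%s%N)
  echo $(( (lm_end - lm_start) / 1000000 ))
}

# Classify an elapsed time (ms) against a threshold (ms).
# The 1000 ms default is an arbitrary assumption, not a value from this bug.
classify_ms() {
  if [ "$1" -gt "${2:-1000}" ]; then echo SLOW; else echo ok; fi
}

# Look up name $1 once per second, $2 times, and report only slow lookups.
monitor() {
  m_i=0
  while [ "$m_i" -lt "$2" ]; do
    m_ms=$(lookup_ms "$1")
    if [ "$(classify_ms "$m_ms" "${3:-1000}")" = SLOW ]; then
      echo "$(date -u +%T) slow lookup of $1: ${m_ms} ms"
    fi
    m_i=$((m_i + 1))
    sleep 1
  done
}
```

# For example, `monitor kubernetes.default.svc.cluster.local 300` would cover roughly five minutes of a master node reboot and stay silent if no lookup exceeds the threshold.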

[jechen@jechen ~]$  oc -n openshift-dns get ds/dns-default -oyaml

<---- snip----->
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3           <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236
          successThreshold: 1
          timeoutSeconds: 3          <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236

<---- snip----->
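
Editorial note: the probe settings shown above bound how quickly an unresponsive CoreDNS pod can be marked unready. The arithmetic below is a rough sketch under a simplifying assumption: the pod stops answering right after a successful probe, and each later probe runs for the full timeout before it counts as a failure. The helper name is hypothetical, and real timing also depends on kubelet scheduling and on how quickly the endpoint change propagates.

```shell
# Rough worst-case seconds before a pod is marked unready.
# Arguments: periodSeconds, timeoutSeconds, failureThreshold.
# Simplified model (an assumption): failureThreshold probes, one per period,
# with the last one taking the full timeout to fail.
unready_after() {
  echo $(( $1 * $3 + $2 ))
}

# Values from the verified DaemonSet: periodSeconds=3, timeoutSeconds=3, failureThreshold=3
unready_after 3 3 3   # prints 12
```

Under this model the verified settings give a window of about 12 seconds. Shortening either the period or the timeout reduces that window.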

Comment 11 jechen 2021-03-08 18:46:46 UTC
Sorry, I set the wrong status earlier; changed to VERIFIED.

Comment 13 errata-xmlrpc 2021-03-16 23:22:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.21 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0753

