Bug 1928304

Summary: hostname lookup delays when master node down
Product: OpenShift Container Platform Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Networking Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: DNS QA Contact: jechen <jechen>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aarapov, abodhe, alchan, amcdermo, aos-bugs, bjarolim, hongli, juqiao, mfuruta, mmasters, mzali, openshift-bugs-escalate, rbolling, rh-container, rhowe, sgaikwad
Version: 4.6   
Target Milestone: ---   
Target Release: 4.7.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1928773 Environment:
Last Closed: 2021-03-10 11:24:00 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1919737    
Bug Blocks: 1928773, 1930917    

Comment 1 Miciah Dashiel Butler Masters 2021-02-23 03:46:53 UTC
*** Bug 1930913 has been marked as a duplicate of this bug. ***

Comment 2 Miciah Dashiel Butler Masters 2021-02-25 22:20:26 UTC
Waiting for https://github.com/openshift/cluster-dns-operator/pull/235 to be approved.

Comment 6 jechen 2021-03-02 18:08:41 UTC
Attempted to verify with 4.7.0-0.nightly-2021-03-01-085007, but could only verify pull 259. Pull 235 is missing from that build; waiting for the next image.

Comment 7 jechen 2021-03-03 16:29:12 UTC
No new 4.7 nightly build is available today.

Comment 8 jechen 2021-03-04 03:10:34 UTC
Verified with 4.7.0-0.nightly-2021-03-04-004412; test passed.


$ oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-04-004412   True        False         20m     Cluster version is 4.7.0-0.nightly-2021-03-04-004412


## In the first terminal, create a test pod, rsh into it, and run an infinite loop that times each DNS lookup

[jechen@jechen ~]$  oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
pod/hello-pod created
[jechen@jechen ~]$ oc rsh hello-pod
/ # while (true); do time getent hosts kubernetes.default.svc.cluster.local; sleep 1; done
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.08s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s

<----snip----->

### In a second terminal, reboot a master node
$ oc get node
NAME                                            STATUS   ROLES    AGE   VERSION
hongli-47bv-hv7k6-master-0                      Ready    master   43m   v1.20.0+5fbfd19
hongli-47bv-hv7k6-master-1                      Ready    master   43m   v1.20.0+5fbfd19
hongli-47bv-hv7k6-master-2                      Ready    master   43m   v1.20.0+5fbfd19
hongli-47bv-hv7k6-worker-northcentralus-fg28q   Ready    worker   28m   v1.20.0+5fbfd19
hongli-47bv-hv7k6-worker-northcentralus-frggf   Ready    worker   34m   v1.20.0+5fbfd19
hongli-47bv-hv7k6-worker-northcentralus-wk6ls   Ready    worker   34m   v1.20.0+5fbfd19
[jechen@jechen ~]$ oc debug node/hongli-47bv-hv7k6-master-1
Starting pod/hongli-47bv-hv7k6-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.5
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot
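
To confirm that the master actually went down during the test window, the `oc get node` output can be filtered for nodes that are not Ready. The following is only a sketch: the helper name `not_ready_nodes` is hypothetical, and it assumes the default column layout (NAME STATUS ROLES AGE VERSION) shown above.

```shell
# List nodes whose STATUS column is not exactly "Ready".
# Reads `oc get node` output on stdin and skips the header row.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also listed.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Hypothetical usage against a live cluster:
#   oc get node | not_ready_nodes
```

An empty result means every node reported Ready at the moment of the query.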



### In the first terminal, confirm that DNS lookups show no delay while the master reboots (watch for about 3-5 min)
<----snip----->
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.01s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
172.30.0.1        kubernetes.default.svc.cluster.local  kubernetes.default.svc.cluster.local
real	0m 0.00s
user	0m 0.00s
sys	0m 0.00s
<----snip----->
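
Scanning several minutes of timing output by eye is error-prone, so the check can be automated if the loop output is saved to a file. The following is only a sketch: the helper name `slow_lookups`, the 1-second default threshold, and the file name `lookups.log` are all assumptions, and the parser expects the BusyBox `time` format seen above (`real<TAB>0m 0.08s`).

```shell
# Print every "real" timing (in seconds) above a threshold and
# return non-zero if at least one was found.
# $1: threshold in seconds (assumed default: 1).
slow_lookups() {
  awk -v limit="${1:-1}" '
    $1 == "real" {
      # awk keeps the numeric prefix: "0m" -> 0, "0.08s" -> 0.08
      secs = ($2 + 0) * 60 + ($3 + 0)
      if (secs > limit) { print secs; slow = 1 }
    }
    END { exit (slow ? 1 : 0) }
  '
}

# Hypothetical usage on a saved copy of the loop output:
#   slow_lookups 1 < lookups.log && echo "no slow lookups"
```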



### Verified that the DNS readiness probe was changed by pull 235
$ oc -n openshift-dns get ds/dns-default -oyaml


        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3       <------verified the change with https://github.com/openshift/cluster-dns-operator/pull/235/
          successThreshold: 1
          timeoutSeconds: 3      <------verified the change with https://github.com/openshift/cluster-dns-operator/pull/235/
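
As a rough illustration of what these values mean, a pod is marked unready after `failureThreshold` consecutive probe failures, with one probe every `periodSeconds`. The sketch below computes that approximate window; the helper name `unready_window` is hypothetical, and the estimate ignores `timeoutSeconds` and kubelet scheduling jitter.

```shell
# Approximate seconds of sustained probe failure before a pod is
# marked unready: periodSeconds * failureThreshold.
# $1: periodSeconds, $2: failureThreshold
unready_window() {
  period=$1
  threshold=$2
  echo $(( period * threshold ))
}

unready_window 3 3   # values from the readinessProbe above; prints 9
```

With the probe settings shown, a DNS pod on an unreachable node should drop out of the service endpoints after roughly 9 seconds of failed probes.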

Comment 10 errata-xmlrpc 2021-03-10 11:24:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.1 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0678