Bug 1928773
Summary: | hostname lookup delays when master node down | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Miciah Dashiel Butler Masters <mmasters> | |
Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> | |
Networking sub component: | DNS | QA Contact: | jechen <jechen> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | urgent | |||
Priority: | urgent | CC: | aarapov, abodhe, agabriel, amcdermo, aos-bugs, bjarolim, hongli, juqiao, mfuruta, mmasters, mzali, openshift-bugs-escalate, openshift-bugzilla-robot, rbolling, rh-container, rhowe, sgaikwad, skanakal | |
Version: | 4.6 | |||
Target Milestone: | --- | |||
Target Release: | 4.6.z | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Clone Of: | 1928304 | |||
: | 1935297 | Environment: | |
Last Closed: | 2021-03-16 23:22:18 UTC | Type: | --- | |
Bug Depends On: | 1928304 | |||
Bug Blocks: | 1935297 |
Comment 1
Miciah Dashiel Butler Masters
2021-02-23 03:49:51 UTC
Waiting for https://github.com/openshift/kubernetes/pull/581 to be merged so that I can rebase https://github.com/openshift/sdn/pull/261.

Here's an update on where this backport stands. The 4.6.z backport is waiting on two things: verification of the 4.7.z backport, and passing CI tests. The 4.7.z backport got delayed by a process issue unrelated to the fix itself. CI tests are failing due to an issue with our CI infrastructure: one of the CI jobs verifies changes on GCP, and we are currently having general issues with GCE API rate limiting, again unrelated to the fix itself.

I anticipate that the 4.7.z backport will be verified this week. Then the 4.6.z backport can be verified next week, and shipped in the fast/stable channels approximately 2.5 weeks from now. (There is a possibility that the 4.6.z backport will be available in the candidate-4.6 channel a little earlier.)

*** Bug 1921797 has been marked as a duplicate of this bug. ***

Verified in 4.6.0-0.nightly-2021-03-06-050044

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-03-06-050044   True        False         24m     Cluster version is 4.6.0-0.nightly-2021-03-06-050044

# In the first terminal, create a test pod, rsh into it, and run an infinite loop to measure DNS lookup time
$ oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/aosqe-pod-for-ping.json
pod/hello-pod created
$ oc rsh hello-pod
/ # while (true); do time getent hosts kubernetes.default.svc.cluster.local; sleep 1; done
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.02s
user 0m 0.00s
sys 0m 0.00s
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
<---- snip----->

# In the second terminal, reboot one of the master nodes
$ oc get node
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-947k4ft-f76d1-9nnw9-master-0         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-1         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-master-2         Ready    master   57m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-b-drbnn   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-c-fwrs8   Ready    worker   49m   v1.19.0+2f3101c
ci-ln-947k4ft-f76d1-9nnw9-worker-d-6jpxz   Ready    worker   49m   v1.19.0+2f3101c
$ oc debug node/ci-ln-947k4ft-f76d1-9nnw9-master-1
Starting pod/ci-ln-947k4ft-f76d1-9nnw9-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# reboot

# Monitor DNS lookup delay in the first terminal for 3-5 minutes while the master node is rebooting; make sure there is no delay
<---- snip----->
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s
172.30.0.1 kubernetes.default.svc.cluster.local kubernetes.default.svc.cluster.local
real 0m 0.01s
user 0m 0.00s
sys 0m 0.00s
<---- snip----->

[jechen@jechen ~]$ oc -n openshift-dns get ds/dns-default -oyaml
<---- snip----->
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 3     <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236
          successThreshold: 1
          timeoutSeconds: 3    <--- verified fix with https://github.com/openshift/cluster-dns-operator/pull/236
<---- snip----->

Sorry, I marked the wrong status; changed to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.21 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0753
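
For anyone repeating the verification above, scanning several minutes of `time` output by eye is error-prone. The following is a minimal sketch, not part of the original verification, that filters saved loop output for slow lookups. The function name `slow_lookups`, the log file name, and the 1-second threshold are all assumptions chosen for illustration; it also assumes the BusyBox-style `real 0m 0.02s` format shown in the transcript, so other `time` formats (for example bash's `real 0m0.020s`) would need a different parser.

```shell
#!/bin/sh
# slow_lookups MAX_SECONDS
# Reads BusyBox-style `time` output on stdin (lines such as "real 0m 0.02s")
# and prints, one per line, every "real" duration in seconds that exceeds
# MAX_SECONDS. The name and the threshold are hypothetical, not from the bug.
slow_lookups() {
    awk -v max="$1" '
        $1 == "real" {
            minutes = $2; sub(/m$/, "", minutes)   # "0m"    -> "0"
            seconds = $3; sub(/s$/, "", seconds)   # "0.02s" -> "0.02"
            total = minutes * 60 + seconds
            if (total > max) printf "%.2f\n", total
        }
    '
}

# Example usage (assumes the loop output was saved to lookup.log):
#   slow_lookups 1 < lookup.log
```

Empty output means no lookup in the log exceeded the threshold, which is the expected result while the master node reboots.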