Description of problem:

The dns-node-resolver script corrupts the node's /etc/hosts when the image-registry service is not being used in the cluster.

Version-Release number of selected component (if applicable):

Details in this ticket are from OpenShift 4.5.8, but the bug appears to be longstanding.

How reproducible:

Seen in all of our clusters.

Steps to Reproduce:
1. Disable the internal registry.
2. Examine /etc/hosts on any node.

Actual results:

The hosts file contains:
```
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
;; image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
Connection image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
to image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
172.29.0.10#53(172.29.0.10) image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
for image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
image-registry.openshift-image-registry.svc.cluster.local image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
failed: image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
timed image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
out. image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
```

Expected results:
```
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
```

Additional info:

https://github.com/openshift/cluster-dns-operator/blob/4cfc0735685b02ed98df323423cb390d54bd6c51/assets/dns/daemonset.yaml#L87 hardcodes the service name, and the script embedded in the same daemonset (https://github.com/openshift/cluster-dns-operator/blob/master/assets/dns/daemonset.yaml#L91-L148) makes no attempt to verify that the service is running, resulting in the corruption seen above.
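To illustrate the failure mode: the corruption happens because raw resolver output (including dig's error text) is written into /etc/hosts. A minimal defensive sketch is below -- `update_hosts` and the IPv4 regex check are hypothetical, not the actual daemonset script -- showing how validating each line before writing keeps error chatter out of the file:

```shell
#!/bin/bash
# Hypothetical sketch (NOT the cluster-dns-operator script): rebuild the
# generated section of a hosts file only from lines that parse as bare
# IPv4 addresses, so resolver timeout/retry messages are never written out.
MARKER="# openshift-generated-node-resolver"

update_hosts() {
  # $1 = hosts file, $2 = service name, $3 = FQDN,
  # stdin = raw resolver output (possibly containing error text)
  local hosts="$1" svc="$2" fqdn="$3" tmp line
  tmp=$(mktemp)
  # Keep everything except our previously generated entries.
  grep -v "$MARKER" "$hosts" > "$tmp"
  while read -r line; do
    # Accept only lines that look like bare IPv4 addresses; dig error
    # lines such as ";; connection timed out" fail this check.
    if [[ "$line" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
      echo "$line $svc $fqdn $MARKER" >> "$tmp"
    fi
  done
  # Replace in one step so readers never see a partial file.
  mv "$tmp" "$hosts"
}
```

For example, piping `172.30.188.94` plus a dig timeout message into `update_hosts` would add only the one valid address line; the error text is dropped rather than pasted into the file as happened above.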
Target set to next release version while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
Can you elaborate some more on how/when you disabled the internal registry? Was this post-install? I wanted to clear up when this bit -- "when image-registry service is not being used in the cluster" -- occurred. Was this subsequently replaced with a standalone registry?
(In reply to Andrew McDermott from comment #2) > Can you elaborate some more on how/when you disabled the internal registry? Nevermind; I now see that to remove (disable) the internal registry you have to change "ManagementState" on the config, per: https://docs.openshift.com/container-platform/4.1/registry/configuring-registry-operator.html#registry-operator-configuration-resource-overview_configuring-registry-operator > Was this post install? I wanted to clear up when this bit -- "when > image-registry service is not being used in the cluster" -- occurred?. Was > this subsequently replaced with a standalone registry?
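For reference, the ManagementState change described in the linked docs can be applied with a patch along these lines (a sketch; `cluster` is the standard singleton resource name for the registry operator config):

```shell
# Disable the internal registry by setting managementState to Removed
# on the image registry operator's cluster config.
oc patch configs.imageregistry.operator.openshift.io cluster \
  --type merge -p '{"spec":{"managementState":"Removed"}}'
```

After this, the image-registry service is deleted from the openshift-image-registry namespace, which is the state that triggers the failing lookups.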
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
I've also been hit by this issue running cluster version 4.5.16. Our /etc/hosts file corrupts in the same way; however, we do not have the internal registry disabled. For us this can (seemingly) randomly happen on new clusters or on newly joined nodes.
(In reply to tas from comment #6)
> I've also been hit by this issue running cluster version 4.5.16.
> Our /etc/hosts file corrupts in the same way however we do not have the
> internal registry disabled.
>
> For us this can (seemly) randomly happen on new clusters or newly joined
> nodes.

Can you share the contents of /etc/hosts when this happens? I want to see if it is related and/or similar to the case where the internal registry is disabled.
(In reply to Andrew McDermott from comment #8)
> Can you share the contents of /etc/hosts when this happens. I want to see if
> it is related and/or similar to the case where the internal registry is
> disabled.

Here is the output:

```
[root@infnod-1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
;; image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
Connection image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
to image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
10.xxx.xxx.10#53(10.xxx.xxx.10) image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
for image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
image-registry.openshift-image-registry.svc.cluster.local image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
failed: image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
timed image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
out. image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
```

---

For clarity, I've rewritten the dig error:

Connection to 10.xxx.xxx.10#53(10.xxx.xxx.10) for image-registry.openshift-image-registry.svc.cluster.local failed: timed out.
The dig errors suggest it's more to do with a network race condition preventing dig from connecting to the cluster DNS servers.
Using iptables I've been able to partially reproduce this behaviour.

The dig lookups using UDP immediately return a non-zero exit code when they receive no response. The dig lookups using TCP behave differently: when dig receives no response during the three-way handshake (no ACK), it times out, prints an error message, and retries.

During the retries, if dig gets a response it'll exit with code 0, but the output will contain error messages mixed in with the query result.

Adding +retry=0 or a similar parameter to the TCP lookups would make dig exit immediately with a non-zero code.
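The repro described above can be sketched roughly as follows. This is a sketch only: 172.30.0.10 stands in for the cluster DNS service IP, the service name is the one from this bug, and the commands must run as root on a node (so it is not something to run blindly):

```shell
# Drop DNS traffic to the cluster resolver so dig gets no response.
# (172.30.0.10 is a placeholder for the real cluster DNS service IP.)
iptables -I OUTPUT -d 172.30.0.10 -p udp --dport 53 -j DROP
iptables -I OUTPUT -d 172.30.0.10 -p tcp --dport 53 -j DROP

# UDP: dig exits non-zero once its retries are exhausted.
dig +short @172.30.0.10 image-registry.openshift-image-registry.svc.cluster.local
echo "udp exit: $?"

# TCP: dig times out on the handshake, prints an error line, and retries.
# If a later retry succeeds, it exits 0 with the error text mixed into
# stdout -- which is exactly what ends up pasted into /etc/hosts.
dig +tcp +short @172.30.0.10 image-registry.openshift-image-registry.svc.cluster.local
echo "tcp exit: $?"

# As noted above, disabling retries makes the TCP case fail fast instead:
dig +tcp +retry=0 +short @172.30.0.10 image-registry.openshift-image-registry.svc.cluster.local

# Clean up.
iptables -D OUTPUT -d 172.30.0.10 -p udp --dport 53 -j DROP
iptables -D OUTPUT -d 172.30.0.10 -p tcp --dport 53 -j DROP
```

The key observation is that dig's timeout diagnostics go to stdout, so any consumer that captures stdout without checking the exit code (or without validating each line) will ingest them.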
(In reply to tas from comment #11) > Using iptables I've been able to kinda reproduce its behaviour. > > The dig lookups using UDP immediately return a non-zero exit code when it > receives no response. > The dig lookups using TCP behave differently. When dig receives no response > during the three way handshake (no ACK), it times out, prints an error > message and retries. > > During the retries, if dig gets a response it'll exit with code 0 but the > output will contain error messages mixed in with query result. > > Adding +retry=0 or similar parameter to the TCP lookups would immediately > cause dig to exit with a non-zero code. This is really helpful. Thank you. Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
Dear Furuta-san,

We got the comment from Andrew. Although the script we implemented with MachineConfig has the problem he described, the situation is the same even if we don't apply it, since the node-resolver script executed by the dns-default pod has the same problem, according to Bug 1882485. (And there is no information in Bug 1882485 yet about how it will be fixed.)

So, the workaround we suggested is still the only way to avoid the issue the customer faced.

Now we have a question: on the Red Hat side, who will give the green light to our workaround? Or does Comment #12 mean that Red Hat has accepted our workaround?

Best Regards,
Masaki Hatada
Sorry, please ignore Comment #13....
Sorry, I removed the original needinfo flag. I have added it again.
Tested with 4.7.0-0.nightly-2020-12-20-031835 but it failed: after disabling the internal registry, the old entry for image-registry.openshift-image-registry.svc still exists in the /etc/hosts file.

```
# oc -n openshift-image-registry get svc
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
image-registry            ClusterIP   172.30.188.94   <none>        5000/TCP    6h10m
image-registry-operator   ClusterIP   None            <none>        60000/TCP   6h37m

# oc debug node/ip-10-0-130-27.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.30.188.94 image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
```

### disable internal registry (set configs.imageregistry.operator.openshift.io spec.ManagementState to Removed)
### ensure the service image-registry is removed

```
# oc -n openshift-image-registry get svc
NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
image-registry-operator   ClusterIP   None         <none>        60000/TCP   6h37m

# oc debug node/ip-10-0-130-27.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.30.188.94 image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
```
I'm not sure this should be considered a failure. The root issue was that /etc/hosts was getting corrupted if the dns lookup returned an error, most commonly when the internal registry was disabled. While it's not 100% correct for /etc/hosts to contain an IP for the image registry when it's disabled, the output listed shows that the fix is working, since /etc/hosts isn't corrupted. If the remaining IP is really an issue, please open a separate bug for it.
Moving to verified per Comments 18 and 19.
*** Bug 1916466 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days