Bug 1882485 - dns-node-resolver corrupts /etc/hosts if internal registry is not in use
Summary: dns-node-resolver corrupts /etc/hosts if internal registry is not in use
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Fredette
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1916466 (view as bug list)
Depends On:
Blocks: 1916907
 
Reported: 2020-09-24 17:28 UTC by Bryan Croft
Modified: 2023-09-18 00:22 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Intermittent DNS errors.
Consequence: dns-node-resolver created invalid entries in the node's /etc/hosts file.
Fix: Error messages are now filtered out of DNS lookup output that eventually returns a valid record.
Result: dns-node-resolver no longer creates invalid /etc/hosts entries.
Clone Of:
Clones: 1916907 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:21:14 UTC
Target Upstream Version:
Embargoed:
mas-hatada: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-dns-operator pull 223 0 None closed Bug 1882485: Prevent dig errors from corrupting host's /etc/hosts 2021-02-17 01:52:15 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:22:07 UTC

Description Bryan Croft 2020-09-24 17:28:02 UTC
Description of problem:

The dns-node-resolver script corrupts a node's /etc/hosts file when the image-registry service is not being used in the cluster.


Version-Release number of selected component (if applicable):

Details in this ticket are from OpenShift 4.5.8, but the bug appears to be longstanding.


How reproducible:

Seen in all of our clusters.


Steps to Reproduce:
1.  Disable the internal registry (see the sketch below).
2.  Examine /etc/hosts on any node.
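
For reference, a minimal sketch of step 1 (assuming cluster-admin access; the resource and field names follow the image registry operator's cluster config and may differ by version):
```
# Disable the internal registry by setting managementState to Removed
# (sketch; confirm the field casing against your cluster's config CRD).
oc patch configs.imageregistry.operator.openshift.io cluster \
  --type merge -p '{"spec":{"managementState":"Removed"}}'

# Then inspect /etc/hosts on any node, e.g. via a debug pod:
oc debug node/<node-name> -- chroot /host cat /etc/hosts
```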


Actual results:

Hosts file contains:
```
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
;; image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
Connection image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
to image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
172.29.0.10#53(172.29.0.10) image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
for image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
image-registry.openshift-image-registry.svc.cluster.local image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
failed: image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
timed image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
out. image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
```


Expected results:
```
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
```


Additional info:

https://github.com/openshift/cluster-dns-operator/blob/4cfc0735685b02ed98df323423cb390d54bd6c51/assets/dns/daemonset.yaml#L87 hardcodes the service name, and the script embedded in the same daemonset (https://github.com/openshift/cluster-dns-operator/blob/master/assets/dns/daemonset.yaml#L91-L148) makes no attempt to verify that the service is running, resulting in the corruption seen above.
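
To make the failure mode concrete, here is a simplified sketch of the resolver pattern (illustrative only, not the verbatim daemonset script; NAMESERVER and SERVICES are assumed values). Because the loop word-splits dig's output, any error text dig prints alongside an eventual answer, such as ";; Connection to ... failed: timed out.", is treated as a list of addresses and written into /etc/hosts one word per line, which matches the corruption shown above:
```
#!/bin/bash
# Simplified sketch of the node-resolver loop (not the real script).
NAMESERVER="172.30.0.10"                                   # assumed cluster DNS service IP
SERVICES="image-registry.openshift-image-registry.svc"     # hardcoded service name

# dig can retry, eventually succeed (exit 0), and still print the
# ";; Connection ... timed out." error on stdout. Word-splitting that output
# turns every token of the error message into a bogus "address" entry.
for ip in $(dig +short "@${NAMESERVER}" "${SERVICES}.cluster.local" A); do
    echo "${ip} ${SERVICES} ${SERVICES}.cluster.local # openshift-generated-node-resolver" >> /etc/hosts
done
```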

Comment 1 Andrew McDermott 2020-09-25 16:05:48 UTC
Target set to next release version while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

Comment 2 Andrew McDermott 2020-09-25 16:24:35 UTC
Can you elaborate some more on how/when you disabled the internal registry? Was this post install? I wanted to clear up when this bit -- "when image-registry service is not being used in the cluster" -- occurred? Was this subsequently replaced with a standalone registry?

Comment 3 Andrew McDermott 2020-09-25 17:08:05 UTC
(In reply to Andrew McDermott from comment #2)
> Can you elaborate some more on how/when you disabled the internal registry?

Nevermind; I now see that to remove (disable) the internal registry you have to change "ManagementState" on the config, per:

https://docs.openshift.com/container-platform/4.1/registry/configuring-registry-operator.html#registry-operator-configuration-resource-overview_configuring-registry-operator

> Was this post install? I wanted to clear up when this bit -- "when
> image-registry service is not being used in the cluster" -- occurred? Was
> this subsequently replaced with a standalone registry?

Comment 4 Andrew McDermott 2020-10-02 16:11:43 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 5 Andrew McDermott 2020-10-23 15:57:50 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 6 tas 2020-11-05 12:31:03 UTC
I've also been hit by this issue running cluster version 4.5.16.
Our /etc/hosts file corrupts in the same way; however, we do not have the internal registry disabled.

For us this can (seemingly) randomly happen on new clusters or newly joined nodes.

Comment 8 Andrew McDermott 2020-11-16 14:54:12 UTC
(In reply to tas from comment #6)
> I've also been hit by this issue running cluster version 4.5.16.
> Our /etc/hosts file corrupts in the same way; however, we do not have the
> internal registry disabled.
> 
> For us this can (seemingly) randomly happen on new clusters or newly joined
> nodes.

Can you share the contents of /etc/hosts when this happens. I want to see if
it is related and/or similar to the case where the internal registry is disabled.

Comment 9 tas 2020-11-17 16:45:56 UTC
(In reply to Andrew McDermott from comment #8)
> (In reply to tas from comment #6)
> > I've also been hit by this issue running cluster version 4.5.16.
> > Our /etc/hosts file corrupts in the same way; however, we do not have the
> > internal registry disabled.
> > 
> > For us this can (seemingly) randomly happen on new clusters or newly joined
> > nodes.
> 
> Can you share the contents of /etc/hosts when this happens. I want to see if
> it is related and/or similar to the case where the internal registry is
> disabled.

Here is the output

[root@infnod-1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
;; image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
Connection image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
to image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
10.xxx.xxx.10#53(10.xxx.xxx.10) image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
for image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
image-registry.openshift-image-registry.svc.cluster.local image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
failed: image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
timed image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver
out. image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver

---
For clarity, I've rewritten the dig error: Connection to 10.xxx.xxx.10#53(10.xxx.xxx.10) for image-registry.openshift-image-registry.svc.cluster.local failed: timed out.

Comment 10 tas 2020-11-18 07:29:00 UTC
The dig errors suggest it's more to do with a network race condition preventing it from connecting to the cluster DNS servers.

Comment 11 tas 2020-11-18 09:53:21 UTC
Using iptables I've been able to kinda reproduce its behaviour.

The dig lookups using UDP immediately return a non-zero exit code when they receive no response.
The dig lookups using TCP behave differently: when dig receives no response during the three-way handshake (no ACK), it times out, prints an error message and retries.

During the retries, if dig gets a response it'll exit with code 0, but the output will contain error messages mixed in with the query result.

Adding +retry=0 or a similar parameter to the TCP lookups would immediately cause dig to exit with a non-zero code.
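
A rough sketch of that reproduction (the DNS service IP and the specific iptables rule are assumptions; only the use of iptables is stated above):
```
# Simulate an unreachable cluster DNS for TCP lookups (test node, run as root).
DNS_IP="172.30.0.10"                                              # assumed DNS service IP
NAME="image-registry.openshift-image-registry.svc.cluster.local"

iptables -I OUTPUT -d "${DNS_IP}" -p tcp --dport 53 -j DROP

# TCP lookup: dig times out, prints ";; Connection to ... failed: timed out."
# and retries; if connectivity returns mid-retry, it exits 0 with the error
# text mixed into stdout.
dig +tcp +short "@${DNS_IP}" "${NAME}" A; echo "tcp exit: $?"

# Per the observation above, +retry=0 makes the lookup give up immediately
# and return a non-zero exit code instead of mixing errors into good output.
dig +tcp +retry=0 +short "@${DNS_IP}" "${NAME}" A; echo "retry=0 exit: $?"

# Remove the test rule when done.
iptables -D OUTPUT -d "${DNS_IP}" -p tcp --dport 53 -j DROP
```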

Comment 12 Andrew McDermott 2020-12-04 16:42:56 UTC
(In reply to tas from comment #11)
> Using iptables I've been able to kinda reproduce its behaviour.
> 
> The dig lookups using UDP immediately return a non-zero exit code when they
> receive no response.
> The dig lookups using TCP behave differently: when dig receives no response
> during the three-way handshake (no ACK), it times out, prints an error
> message and retries.
> 
> During the retries, if dig gets a response it'll exit with code 0, but the
> output will contain error messages mixed in with the query result.
> 
> Adding +retry=0 or a similar parameter to the TCP lookups would immediately
> cause dig to exit with a non-zero code.

This is really helpful. Thank you.

Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 13 Masaki Hatada 2020-12-10 09:57:02 UTC
Dear Furuta-san,

We got the comment from Andrew.
Although the script we implemented with MachineConfig has the problem he mentioned, the situation is the same even if we don't apply it, since the node-resolver script executed by the dns-default pod has the same problem, according to Bug 1882485.
(And there is no information in Bug 1882485 yet about how to fix it.)

So, the workaround we suggested is still the only way to avoid the issue the customer faced.

Now we have a question: on the Red Hat side, who will give the green light to our workaround?
Or does Comment #12 mean that Red Hat has accepted our workaround?

Best Regards,
Masaki Hatada

Comment 14 Masaki Hatada 2020-12-10 10:05:20 UTC
Sorry, please ignore Comment #13....

Comment 15 Masaki Hatada 2020-12-10 10:07:06 UTC
Sorry, I removed the original needinfo flag. I added it again...

Comment 18 Hongan Li 2020-12-21 08:19:02 UTC
Tested with 4.7.0-0.nightly-2020-12-20-031835 but it failed.

After disabling the internal registry, the old entry for image-registry.openshift-image-registry.svc still exists in the /etc/hosts file.


# oc -n openshift-image-registry get svc
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
image-registry            ClusterIP   172.30.188.94   <none>        5000/TCP    6h10m
image-registry-operator   ClusterIP   None         <none>        60000/TCP   6h37m

# oc debug node/ip-10-0-130-27.us-east-2.compute.internal
sh-4.4# chroot /host
sh-4.4# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.30.188.94 image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver

### disable internal registry (set configs.imageregistry.operator.openshift.io spec.managementState to Removed)
### ensure the service image-registry is removed.
# oc -n openshift-image-registry get svc
NAME                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
image-registry-operator   ClusterIP   None         <none>        60000/TCP   6h37m

# oc debug node/ip-10-0-130-27.us-east-2.compute.internal
sh-4.4# chroot /host 
sh-4.4# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.30.188.94 image-registry.openshift-image-registry.svc image-registry.openshift-image-registry.svc.cluster.local # openshift-generated-node-resolver

Comment 19 Ryan Fredette 2021-01-07 18:44:09 UTC
I'm not sure this should be considered a failure. The root issue was that /etc/hosts was getting corrupted if the DNS lookup returned an error, most commonly when the internal registry was disabled. While it's not 100% correct for /etc/hosts to contain an IP for the image registry when it's disabled, the output listed shows that the fix is working, since /etc/hosts isn't corrupted.

If the remaining IP is really an issue, please open a separate bug for it.
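
For illustration only, a guard of the kind the fix describes might look like the sketch below; this is not the actual change (that lives in the cluster-dns-operator PR 223 linked above). The idea is to keep only tokens that parse as addresses so dig's error text never reaches /etc/hosts:
```
# Sketch: filter dig output before writing /etc/hosts entries (illustrative only).
NAMESERVER="172.30.0.10"                                   # assumed cluster DNS service IP
SERVICES="image-registry.openshift-image-registry.svc"

# Keep only dotted-quad IPv4 tokens; error lines such as
# ";; Connection to ... failed: timed out." are dropped.
ips=$(dig +short "@${NAMESERVER}" "${SERVICES}.cluster.local" A \
      | grep -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' || true)

for ip in ${ips}; do
    echo "${ip} ${SERVICES} ${SERVICES}.cluster.local # openshift-generated-node-resolver" >> /etc/hosts
done
```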

Comment 20 Hongan Li 2021-01-11 03:17:19 UTC
Moving to verified per Comments 18 and 19.

Comment 21 Andrew McDermott 2021-01-15 17:12:11 UTC
*** Bug 1916466 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2021-02-24 15:21:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 26 Red Hat Bugzilla 2023-09-18 00:22:36 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

