This is the output from running the lines from the script mentioned in the first comment:
```
[@infra01 ~]# def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
[@infra01 ~]# def_route_int=$(/sbin/ip route get to ${def_route} | awk -F 'dev' '{print $2}' | head -n1 | awk '{print $1}')
[@infra01 ~]# def_route_ip=$(/sbin/ip route get to ${def_route} | awk -F 'src' '{print $2}' | head -n1 | awk '{print $1}')
[@infra01 ~]# echo $def_route_ip
192.168.52.153
```
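For completeness, here are the same steps as a single runnable snippet that prints all three derived values, not just the IP (a sketch assuming a standard iproute2 environment):
```
#!/bin/bash
# Gateway of the default route
def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3}')
# Interface and source IP the kernel would use to reach that gateway
def_route_int=$(/sbin/ip route get to "${def_route}" | awk -F 'dev' '{print $2}' | head -n1 | awk '{print $1}')
def_route_ip=$(/sbin/ip route get to "${def_route}" | awk -F 'src' '{print $2}' | head -n1 | awk '{print $1}')
echo "gateway=${def_route} interface=${def_route_int} ip=${def_route_ip}"
```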
Added the output of 'oc describe nodes' as a private attachment.
(In reply to David Caldwell from comment #17)
...
> --cluster-dns=192.168.152.153 --cluster-domain=cluster.local
...

Hi David,

I think the reason is the "--cluster-dns=192.168.152.153" kubelet option above; the nameserver of pods is controlled by this option. A workaround is to change the default value "dnsIP: 0.0.0.0" in /etc/origin/node/node-config.yaml, like:
```
dnsBindAddress: 127.0.0.1:53
dnsDomain: cluster.local
dnsIP: 192.168.52.33
dnsNameservers: null
dnsRecursiveResolvConf: /etc/origin/node/resolv.conf
```
Then you need to run "systemctl restart atomic-openshift-node.service" to make the change take effect.
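To check whether the restart picked up the new value, something like the following could be used (a sketch; <some-pod> is a placeholder for a pod created after the restart):
```
# Confirm the node process now advertises the corrected --cluster-dns value
ps -ef | grep -o -- '--cluster-dns=[0-9.]*' | head -n1
# A pod created after the restart should get that IP as its nameserver
oc exec <some-pod> -- cat /etc/resolv.conf
```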
Sorry, a typo in Comment 18: it should be "dnsIP: 192.168.52.153" for your example node.
Hi,

Thanks for the update.

> I think the reason is the above option "--cluster-dns=192.168.152.153" of kubelet

Yes, but where is this value coming from? The customer checked and found 'dnsIP: 0.0.0.0' set in both node-config.yaml and the node-config-compute configmap:
```
[root@node01 ~]# grep -i dnsIp /etc/origin/node/node-config.yaml
dnsIP: 0.0.0.0
[root@master02 ~]# oc get cm node-config-compute -n openshift-node -o yaml |grep -i dnsIP
dnsIP: 0.0.0.0
```
Do you know what is passing the dnsIP value "--cluster-dns=192.168.152.153" to hyperkube?

> Workaround is changing the default value "dnsIP: 0.0.0.0" of /etc/origin/node/node-config.yaml

Yes, but since OCP 3.11 I don't think this is stable, because if the configmap changes, node-config.yaml will be overwritten. From the 3.11 docs: "To make configuration changes to an existing node, edit the appropriate configuration map" and "When a sync pod detects configuration map change, it updates the node-config.yaml on all nodes". So any change made directly to node-config.yaml is not safe, unless I am mistaken.

To work around this issue, the customer has deleted the dnsIP section of node-config.yaml, and the pods are now picking up their DNS settings from the node's resolv.conf. Since removing dnsIP, the customer is seeing this warning in events:
```
Warning MissingClusterDNS 1m (x162839 over 12d) kubelet, node09.<domain>.com kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy
```
It looks like the main problem with this BZ is that when a node creates pods using the 'ClusterFirst' policy (i.e. when dnsIP is set in node-config.yaml), the incorrect DNS setting is passed to the pods for some reason.
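Since the sync pods propagate configmap changes back to node-config.yaml, the durable place to set dnsIP would be the node group configmap itself, e.g.:
```
# Edit the node group configmap; the sync pod pushes this change
# to /etc/origin/node/node-config.yaml on all matching nodes
oc edit cm node-config-compute -n openshift-node
```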
You are right, we should edit the configmap instead of node-config.yaml; I only updated node-config.yaml to check whether the kubelet DNS option changed. I don't recommend deleting the dnsIP section of node-config.yaml as the customer did.

@Ravi, do you have any suggestions? Do you know how dnsIP is passed to hyperkube?
The PRs below may be helpful for understanding how dnsIP is selected when using multiple NICs and the default "dnsIP: 0.0.0.0":
https://github.com/openshift/origin/pull/21737
https://github.com/openshift/origin/pull/21866
@hongli Thank you for those links. The author of https://github.com/openshift/origin/pull/21737 summarises the current situation well in his opening comment. So, it appears to be a known issue since 3.10. Do you know of any workaround other than deleting the dnsIP section?
A related issue has come up: for someone using OCP 3.11 along with HashiCorp's Consul, the dnsIP defaults to the wrong interface because the logic selects the first non-loopback IPv4 interface. Apparently, Consul requires a dummy0 interface to be configured, which is used only by Consul and isn't going to route properly anywhere else. It is possible for this interface to be picked up by OpenShift, used in the --cluster-dns flag, and eventually passed on to pods. The patch that Daein has offered would be a good solution, since the dummy0 for the customer having problems is *not* on the default route. Solutions right now involve custom node configmaps or rebuilding the initramfs to omit the dummy module. Do we think Daein's patch would be worthy of inclusion in the upcoming 3.11.z release?
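For illustration, a rough reproduction of the interface-selection problem (hypothetical address from the documentation range; requires root — and note that on the affected systems dummy0 is created early in boot, so it can precede the real NICs in scan order):
```
# Create the kind of dummy interface Consul requires
ip link add dummy0 type dummy
ip addr add 192.0.2.1/32 dev dummy0
ip link set dummy0 up
# A naive "first non-loopback IPv4" scan can now return the dummy address,
# even though it is not on the default route
ip -4 -o addr show | grep -v ' lo ' | head -n1
```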
Also, would it be feasible to backport this to 3.10.z as well? The customer having the issue mentioned in c#30 is holding their production upgrade to 3.11.z and would like to see the fix in 3.10.z before they proceed (because they have to upgrade to the latest 3.10.z and then to 3.11).
Comment #29 summarizes the issue and the supported options (although limited in some cases).

(1) If we are dealing with only a handful of nodes that have multiple NICs, an immediate workaround could be (see the sketch below):
- Create a new node group (configmap) for each node
- Set the 'dnsIP' config option to '0.0.0.0'
- Set the 'nodeIP' config option to the desired routable IPv4 address from one of the NICs

This will ensure the kubelet and the pod DNS resolver use the same address.

If (1) is not acceptable because we are dealing with many nodes or for other reasons, then we need to change the origin code to compute dnsIP the same way kubelet computes its node address. It looks like @Daein proposed the same idea here: https://github.com/openshift/origin/pull/21866. This PR needs some changes before it can be merged.
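As a concrete sketch of option (1) (names and addresses are placeholders; how a node group maps to nodes depends on the installation's openshift_node_groups inventory):
```
# Start from the existing compute configmap
oc get cm node-config-compute -n openshift-node -o yaml > node-config-multinic.yaml
# In the copy: change metadata.name, keep "dnsIP: 0.0.0.0",
# and pin the node address, e.g. "nodeIP: 192.168.52.153"
oc create -f node-config-multinic.yaml -n openshift-node
```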
Fixed by https://github.com/openshift/origin/pull/22485

Created corresponding 3.10 bug: https://bugzilla.redhat.com/show_bug.cgi?id=1696394
According to Comment 45, the bug was tested and verified in v3.11.107.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1605
I think you forgot to build the atomic-openshift-docker-excluder-3.11.117 package. When running prerequisites.yaml, it complains that the package is missing, so we can't go further. I also do not see it in the errata under packages. Is there a way to get a quick fix here?
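To check whether the package has been published at all, the enabled repos can be queried directly, e.g.:
```
# List every available version of the excluder package across enabled repos
yum --showduplicates list atomic-openshift-docker-excluder
```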
*** Bug 1719453 has been marked as a duplicate of this bug. ***