Description of problem:

/etc/resolv.conf is not updated correctly when the ip command prints output whose columns are in an unexpected order.

* "route -n" output

~~~
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.1.68       0.0.0.0         UG    100    0        0 eth0
10.0.1.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
10.128.0.0      0.0.0.0         255.252.0.0     U     0      0        0 tun0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.30.0.0      0.0.0.0         255.255.0.0     U     0      0        0 tun0
~~~

* "ip route show table all" output; note the lines with the 'local' prefix.

~~~
$ cat ./ip_route_show_table_all
default via 10.0.1.68 dev eth0 proto static metric 100
10.0.1.0/24 dev eth0 proto kernel scope link src 10.0.1.68 metric 100
10.128.0.0/14 dev tun0 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.30.0.0/16 dev tun0
broadcast 10.0.1.0 dev eth0 table local proto kernel scope link src 10.0.1.68
local 10.0.1.68 dev eth0 table local proto kernel scope host src 10.0.1.68
broadcast 10.0.1.255 dev eth0 table local proto kernel scope link src 10.0.1.68
...
~~~

* Affected code in "99-origin-dns.sh"

~~~
######################################################################
# couldn't find an existing method to determine if the interface owns the
# default route
def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
def_route_int=$(/sbin/ip route get to ${def_route} | awk '{print $3}')
def_route_ip=$(/sbin/ip route get to ${def_route} | awk '{print $5}')
if [[ ${DEVICE_IFACE} == ${def_route_int} ]]; then
  if [ ! -f /etc/dnsmasq.d/origin-dns.conf ]; then
    cat << EOF > /etc/dnsmasq.d/origin-dns.conf
~~~

def_route_int and def_route_ip are assigned from unexpected columns, as follows.

~~~
# /sbin/ip route list match 0.0.0.0/0 | awk '{print $3}'
10.0.1.68
--- def_route is assigned this value, then used in the commands below.

# /sbin/ip route get to 10.0.1.68
local 10.0.1.68 dev lo src 10.0.1.68 cache <local>

# /sbin/ip route get to 10.0.1.68 | awk '{print $3}'
dev

# /sbin/ip route get to 10.0.1.68 | awk '{print $5}'
src
~~~

Version-Release number of the following components:

rpm -q openshift-ansible
openshift-ansible-3.9.33-1.git.56.19ba16e.el7.noarch

rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch

How reproducible:
Always, whenever "ip route get to" prints output in this format.

Steps to Reproduce:
1.
2.
3.

Actual results:
/etc/resolv.conf is not updated and points to the upstream DNS.

Expected results:
/etc/resolv.conf is updated and points to the node dnsmasq.

Additional info:
To extract the expected values reliably, modify 99-origin-dns.sh as follows.

~~~
def_route_int=$(/sbin/ip route get to ${def_route} | awk -F 'dev' '{print $2}' | awk '{print $1}')
def_route_ip=$(/sbin/ip route get to ${def_route} | awk -F 'src' '{print $5}' | awk '{print $1}')
~~~
The workaround above has a typo; the following is the correct one.

~~~
def_route_ip=$(/sbin/ip route get to ${def_route} | awk -F 'src' '{print $2}' | awk '{print $1}')
~~~

And I've opened a PR here:
https://github.com/openshift/openshift-ansible/pull/9448

To avoid issues caused by depending on column order, we should filter the output by keyword to get the required values.
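The keyword-based filtering idea above can be sketched as follows. This is a minimal illustration, not code from 99-origin-dns.sh: the sample route lines are taken from this report, and the `field_after` helper name is hypothetical.

```shell
#!/bin/sh
# Sample "ip route get" output lines: the usual format and the
# problematic 'local' format reported in this bug.
usual='10.0.1.68 dev eth0 src 10.0.1.68'
local_fmt='local 10.0.1.68 dev lo src 10.0.1.68'

# field_after KEY: print the token immediately following KEY,
# wherever KEY appears in the line (position-independent).
field_after() {
  awk -v key="$1" '{ for (i = 1; i < NF; i++) if ($i == key) { print $(i + 1); exit } }'
}

echo "$usual"     | field_after dev   # eth0
echo "$local_fmt" | field_after dev   # lo
echo "$local_fmt" | field_after src   # 10.0.1.68
```

Unlike `awk '{print $3}'`, this keeps working when an extra leading token such as `local` shifts every column by one.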
Do you know what needs to be done to get a host into the condition where it fails? I'm not familiar enough with routing tables to know what introduces the problematic line, but it would be helpful for QE to verify the fix.
@Scott, that's a good point. The usual "ip route get to" output format is as follows.

~~~
# /sbin/ip route get to 10.0.1.68
10.0.1.68 dev eth0 src 10.0.1.68 cache

# /sbin/ip route get to 10.0.1.68 | awk '{print $3}'
eth0
--- returns the network interface

# /sbin/ip route get to 10.0.1.68 | awk '{print $5}'
10.0.1.68
--- returns the src ip address (the eth0 ip address)
~~~

My case is as follows.

~~~
# /sbin/ip route get to 10.0.1.68
local 10.0.1.68 dev lo src 10.0.1.68 cache <local>

# /sbin/ip route get to 10.0.1.68 | awk '{print $3}'
dev
--- does not return the device name

# /sbin/ip route get to 10.0.1.68 | awk '{print $5}'
src
--- does not return the ip address of the device
~~~
https://github.com/openshift/openshift-ansible/pull/9448
Should be in openshift-ansible-3.11.0-0.15.0
Going through all the comments, this seems related to a user-specific network environment: the user is running a node install on a host whose eth0 is the gateway of the local network. In QE's environment, the network gateway is always located elsewhere (out of my control), and we never run a node install on the network gateway. I am not a network expert, so I have no way to run a real verification exactly like the customer's environment. I only ran some regression testing against the latest playbook to avoid introducing new install issues; the results look good to me.

Version: openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch

In my env, my eth0 ip is "172.18.14.180", and the gateway is "172.18.0.1".

~~~
[root@ip-172-18-14-180 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0e:d7:a3:c0:67:28 brd ff:ff:ff:ff:ff:ff
    inet 172.18.14.180/20 brd 172.18.15.255 scope global noprefixroute dynamic eth0
       valid_lft 2651sec preferred_lft 2651sec
    inet6 fe80::cd7:a3ff:fec0:6728/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:a5:71:e3:d4 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

[root@ip-172-18-14-180 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.18.0.1      0.0.0.0         UG    100    0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.18.0.0      0.0.0.0         255.255.240.0   U     100    0        0 eth0

[root@ip-172-18-14-180 ~]# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local ec2.internal
nameserver 172.18.14.180
~~~

/etc/resolv.conf is updated and points to the node dnsmasq successfully.

On this env, run a command to add the gateway IP to the eth0 interface to emulate the customer env.

~~~
[root@ip-172-18-14-180 ~]# ip addr add 172.18.0.1 dev eth0
[root@ip-172-18-14-180 ~]# ip addr
<--snip-->
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0e:d7:a3:c0:67:28 brd ff:ff:ff:ff:ff:ff
    inet 172.18.14.180/20 brd 172.18.15.255 scope global noprefixroute dynamic eth0
       valid_lft 2561sec preferred_lft 2561sec
    inet 172.18.0.1/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::cd7:a3ff:fec0:6728/64 scope link
       valid_lft forever preferred_lft forever
<--snip-->

[root@ip-172-18-14-180 ~]# /sbin/ip route get to 172.18.0.1
local 172.18.0.1 dev lo src 172.18.0.1
    cache <local>
~~~

Now the output is the same as in the user's case. After restarting NetworkManager, /etc/resolv.conf is updated, but it does not point to the node dnsmasq as the expected result requires.

~~~
[root@ip-172-18-14-180 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search ec2.internal
nameserver 172.18.0.2
~~~

Adding some echo statements into /etc/NetworkManager/dispatcher.d/99-origin-dns.sh for debugging:

~~~
<--snip-->
def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
def_route_int=$(/sbin/ip route get to ${def_route} | awk -F 'dev' '{print $2}' | head -n1 | awk '{print $1}')
def_route_ip=$(/sbin/ip route get to ${def_route} | awk -F 'src' '{print $2}' | head -n1 | awk '{print $1}')
echo "def_route_int=${def_route_int} def_route_ip=${def_route_ip} DEVICE_IFACE=${DEVICE_IFACE}" >/tmp/test
<--snip-->

[root@ip-172-18-14-180 ~]# cat /tmp/test
def_route_int=lo def_route_ip=172.18.0.1 DEVICE_IFACE=eth0
~~~

The PR is working as expected, but ${DEVICE_IFACE} != ${def_route_int}, which causes the following code to be skipped, so /etc/resolv.conf is not updated at all.
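The skipped guard can be illustrated with a small sketch. `should_update_resolv` is a hypothetical helper mirroring the dispatcher's `[[ ${DEVICE_IFACE} == ${def_route_int} ]]` test, not actual code from 99-origin-dns.sh.

```shell
#!/bin/sh
# Hypothetical helper mirroring the dispatcher's guard: resolv.conf is
# only rewritten when the event interface owns the default route.
should_update_resolv() {
  device_iface="$1"
  def_route_int="$2"
  [ "$device_iface" = "$def_route_int" ]
}

# Normal host: the default-route device matches the event interface.
should_update_resolv eth0 eth0 && echo "update resolv.conf"

# Gateway-on-host case above: "ip route get" resolves to dev lo,
# so the guard fails and resolv.conf is left untouched.
should_update_resolv eth0 lo || echo "guard fails, resolv.conf untouched"
```

This is why the PR's parsing fix alone is not enough in the emulated environment: the values are now extracted correctly, but the correctly-extracted interface is `lo`, not `eth0`.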
I am not sure what the network environment in the customer case is; it can only be double-confirmed by the customer (according to the initial report and comment 1, the PR seems to work well against the customer env). The PR works as the reporter expected and introduces no regression, so I am moving this bug to VERIFIED. If that is not the case, feel free to move it back and provide more info about how to re-create such a special network env.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652