Bug 1613094
| Summary: | 99-origin-dns.sh cannot handle unexpected order of ip command columns | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Daein Park <dapark> |
| Component: | Installer | Assignee: | Michael Gugino <mgugino> |
| Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.9.0 | CC: | aos-bugs, dapark, jokerman, mmccomas |
| Target Milestone: | --- | | |
| Target Release: | 3.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-11 07:24:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Daein Park
2018-08-07 00:57:58 UTC
The above workaround has a typo; the following one is correct:

```
def_route_ip=$(/sbin/ip route get to ${def_route} | awk -F 'src' '{print $2}' | awk '{print $1}')
```

And I've opened a PR here: https://github.com/openshift/openshift-ansible/pull/9448

To avoid issues caused by depending on the output column order, we should filter for the required values by field name.

Do you know what needs to be done to get a host into the condition where it fails? I'm not familiar enough with routing tables to know what introduced the problematic line, but it would be helpful for QE to verify the fix.

@Scott, that's a good point. The usual "ip route get to" output format is as follows:

```
# /sbin/ip route get to 10.0.1.68
10.0.1.68 dev eth0 src 10.0.1.68
    cache

--- return the network interface
# /sbin/ip route get to 10.0.1.68 | awk '{print $3}'
eth0

--- return the src ip address (the eth0 ip address)
# /sbin/ip route get to 10.0.1.68 | awk '{print $5}'
10.0.1.68
```

My case is as follows:

```
# /sbin/ip route get to 10.0.1.68
local 10.0.1.68 dev lo src 10.0.1.68
    cache <local>

--- not returning the device name
# /sbin/ip route get to 10.0.1.68 | awk '{print $3}'
dev

--- not returning the ip address of the device
# /sbin/ip route get to 10.0.1.68 | awk '{print $5}'
src
```

Should be in openshift-ansible-3.11.0-0.15.0

Going through all the comments, this seems related to the user's specific network environment: the user is running a node install on a host whose eth0 is the gateway of the local network. In QE's environment, the network gateway is always located elsewhere (out of my control), and we never run a node install on the network gateway. I am not a network expert, so I have no way to run a real verification against an environment just like the customer's. I only ran some regression testing against the latest playbook to avoid introducing new install issues, and the results look good to me.
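The order-independent parsing adopted by the PR can be sketched as below. The sample route lines are taken from the two cases above; `parse_dev` and `parse_src` are illustrative helper names for this sketch, not the actual function names used in 99-origin-dns.sh:

```shell
# Two possible outputs of `ip route get to <gw>` (from the comments above):
normal='10.0.1.68 dev eth0 src 10.0.1.68'
local_route='local 10.0.1.68 dev lo src 10.0.1.68'

# Positional parsing breaks when the word order shifts:
echo "$normal" | awk '{print $3}'        # eth0 (correct)
echo "$local_route" | awk '{print $3}'   # dev  (wrong)

# Keyed parsing: split on the field name, then take the next token.
parse_dev() { awk -F 'dev' '{print $2}' | head -n1 | awk '{print $1}'; }
parse_src() { awk -F 'src' '{print $2}' | head -n1 | awk '{print $1}'; }

echo "$normal" | parse_dev        # eth0
echo "$local_route" | parse_dev   # lo
echo "$local_route" | parse_src   # 10.0.1.68
```

Splitting on the literal field names `dev` and `src` yields the token that follows them regardless of where they appear in the line, which is exactly why both output variants parse correctly.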
Version: openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch

In my env, my eth0 ip is "172.18.14.180", and the Gateway is "172.18.0.1":

```
[root@ip-172-18-14-180 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0e:d7:a3:c0:67:28 brd ff:ff:ff:ff:ff:ff
    inet 172.18.14.180/20 brd 172.18.15.255 scope global noprefixroute dynamic eth0
       valid_lft 2651sec preferred_lft 2651sec
    inet6 fe80::cd7:a3ff:fec0:6728/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:a5:71:e3:d4 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

[root@ip-172-18-14-180 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.18.0.1      0.0.0.0         UG    100    0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.18.0.0      0.0.0.0         255.255.240.0   U     100    0        0 eth0

[root@ip-172-18-14-180 ~]# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local ec2.internal
nameserver 172.18.14.180
```

/etc/resolv.conf is updated and pointing to the node dnsmasq successfully. On this env, run some commands to add the Gateway IP to the eth0 interface to emulate the customer env:
```
[root@ip-172-18-14-180 ~]# ip addr add 172.18.0.1 dev eth0
[root@ip-172-18-14-180 ~]# ip addr
<--snip-->
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0e:d7:a3:c0:67:28 brd ff:ff:ff:ff:ff:ff
    inet 172.18.14.180/20 brd 172.18.15.255 scope global noprefixroute dynamic eth0
       valid_lft 2561sec preferred_lft 2561sec
    inet 172.18.0.1/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::cd7:a3ff:fec0:6728/64 scope link
       valid_lft forever preferred_lft forever
<--snip-->

[root@ip-172-18-14-180 ~]# /sbin/ip route get to 172.18.0.1
local 172.18.0.1 dev lo src 172.18.0.1
    cache <local>
```

Now the output is the same as in the user's case. After restarting NetworkManager, /etc/resolv.conf is updated, but it does not point to the node dnsmasq as the expected result:

```
[root@ip-172-18-14-180 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search ec2.internal
nameserver 172.18.0.2
```

Adding some echo lines into /etc/NetworkManager/dispatcher.d/99-origin-dns.sh to do some debugging:

```
<--snip-->
def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
def_route_int=$(/sbin/ip route get to ${def_route} | awk -F 'dev' '{print $2}' | head -n1 | awk '{print $1}')
def_route_ip=$(/sbin/ip route get to ${def_route} | awk -F 'src' '{print $2}' | head -n1 | awk '{print $1}')
echo "def_route_int=${def_route_int} def_route_ip=${def_route_ip} DEVICE_IFACE=${DEVICE_IFACE}" >/tmp/test
<--snip-->

[root@ip-172-18-14-180 ~]# cat /tmp/test
def_route_int=lo def_route_ip=172.18.0.1 DEVICE_IFACE=eth0
```

The PR is working as expected, but ${DEVICE_IFACE} != ${def_route_int}, which causes the following code to be skipped, so /etc/resolv.conf is not updated at all.
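The skip QE observed comes from a guard of roughly the following shape (a simplified sketch for illustration, not the verbatim dispatcher script): the script only rewrites resolv.conf when the interface NetworkManager dispatched the event for matches the default-route interface.

```shell
# Simplified sketch of the interface guard (illustrative, not verbatim
# 99-origin-dns.sh code). NetworkManager sets DEVICE_IFACE to the interface
# that triggered the event; def_route_int is parsed from `ip route get`.
DEVICE_IFACE=eth0   # set by NetworkManager in the real script
def_route_int=lo    # what `ip route get` returned in the emulated env

if [ "${DEVICE_IFACE}" = "${def_route_int}" ]; then
    echo "would update /etc/resolv.conf"
else
    echo "skipping: ${DEVICE_IFACE} != ${def_route_int}"
fi
```

In the emulated environment the route lookup resolves to `lo`, so the comparison fails and the resolv.conf update branch never runs, which matches the debug output above.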
I am not sure what the network environment is in the customer case; it can only be double-confirmed by the customer (according to the initial report and comment 1, the PR seems to be working well against the customer env). The PR is working as the reporter expects, and no regression is introduced, so I am moving this bug to VERIFIED. If that is not the case, feel free to move it back and provide more info about how to re-create such a special network env.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652