Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
The FDP team is no longer accepting new bugs in Bugzilla. Please report your issues under the FDP project in Jira. Thanks.

Bug 2012932

Summary: VM unable to contact metadata due to ARP problem
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Eduardo Olivares <eolivare>
Component: ovn-2021
Assignee: Ihar Hrachyshka <ihrachys>
Status: CLOSED ERRATA
QA Contact: Jianlin Shi <jishi>
Severity: high
Docs Contact:
Priority: urgent
Version: FDP 21.I
CC: ctrautma, ihrachys, jiji, kfida, mmichels
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovn-2021-21.09.0-12
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-09 15:37:29 UTC
Type: Bug
Regression: ---
Embargoed:

Description Eduardo Olivares 2021-10-11 16:19:34 UTC
Description of problem:
This issue was initially reproduced by the OSP job that is triggered with intermediate OVN versions (i.e., builds that have not been officially released yet).
The job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.2_director-rhel-virthost-3cont_2comp_3net-ipv4-geneve-composable-vlan-provider-network/10/
The intermediate ovn version: ovn-2021-21.09.0-5

I have reproduced this issue twice in an OpenStack environment using the same method, although some details are still unclear to me:
- I ran two very similar tempest tests in parallel. Each test created a VM connected to a tenant network and assigned it a FIP. I added a breakpoint before the tests tried to connect to the VMs via SSH through their FIPs.
- Both times, one of the VMs could be accessed via SSH and the other could not. The problem lies with the metadata service: the faulty VM could not reach it and therefore never obtained the authorized public key.
- Some commands executed on the VM that works fine (169.254.169.254 is the metadata IP):
[root@tempest-broadcast-sender-258843568 ~]# ip r
default via 10.100.0.1 dev eth0 proto dhcp metric 100
10.100.0.0/28 dev eth0 proto kernel scope link src 10.100.0.14 metric 100
169.254.169.254 via 10.100.0.2 dev eth0 proto dhcp metric 100
[root@tempest-broadcast-sender-258843568 ~]# ping 10.100.0.2 -c1                                                                                                                                                                            
PING 10.100.0.2 (10.100.0.2) 56(84) bytes of data.
64 bytes from 10.100.0.2: icmp_seq=1 ttl=64 time=3.90 ms

--- 10.100.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.899/3.899/3.899/0.000 ms
[root@tempest-broadcast-sender-258843568 ~]# arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
10.100.0.2               ether   fa:16:3e:63:dd:56   C                     eth0
10.100.0.1               ether   fa:16:3e:80:81:fc   C                     eth0

- The same commands, executed on the faulty VM (169.254.169.254 is the metadata IP):
[root@localhost ~]# ip r
default via 10.100.0.1 dev eth0 proto dhcp metric 100
10.100.0.0/28 dev eth0 proto kernel scope link src 10.100.0.9 metric 100
169.254.169.254 via 10.100.0.2 dev eth0 proto dhcp metric 100
[root@localhost ~]# ping 10.100.0.2 -c1                                                                                                                                                                                                      
PING 10.100.0.2 (10.100.0.2) 56(84) bytes of data.
From 10.100.0.9 icmp_seq=1 Destination Host Unreachable

--- 10.100.0.2 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

[root@localhost ~]# arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
10.100.0.2                       (incomplete)                              eth0
10.100.0.1               ether   fa:16:3e:7d:b1:a4   C                     eth0

- The ARP table in the "successful" ovnmeta namespace includes an entry for the working VM:
[root@compute-0 ~]# ip netns e ovnmeta-cba5f3c1-397c-47cb-b1af-da45a73e1b22 arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
10.100.0.14              ether   fa:16:3e:4a:0b:9c   C                     tapcba5f3c1-31

- The ARP table in the "faulty" ovnmeta namespace is empty:
[root@compute-1 ~]# ip netns e ovnmeta-12ff7d70-03a1-49c5-8bcd-98e28f0a4858 arp -n
[root@compute-1 ~]# 

- I don't understand why only one VM/ovnmeta namespace is affected; maybe a race condition? I worked around the problem by sending a ping from the "faulty" ovnmeta namespace to the faulty VM (see the sketch below).
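
A sketch of the failing key fetch and of the workaround just described, using the addresses from the transcripts above (the metadata path shown is the standard endpoint and is an assumption; it does not appear in this report):

# On the faulty VM: the metadata fetch that fails while the ARP entry for
# 10.100.0.2 stays incomplete (hypothetical check, standard metadata path)
curl -s http://169.254.169.254/latest/meta-data/public-keys

# On compute-1: the workaround from the report - pinging the faulty VM
# (10.100.0.9) from its ovnmeta namespace populated the ARP entries on both
# sides and restored metadata connectivity
ip netns exec ovnmeta-12ff7d70-03a1-49c5-8bcd-98e28f0a4858 ping -c 1 10.100.0.9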

Version-Release number of selected component (if applicable):
ovn-2021-21.09.0-5

How reproducible:
Not always reproducible, but it affected 26 tempest tests in the job run.

Steps to Reproduce:
1. Run tempest tests. I did not reproduce it manually, but I expect that creating several VMs on different tenant networks in parallel would also trigger the issue (see the sketch below).
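
A hypothetical manual equivalent (untested; network names, image, and flavor are illustrative only): create VMs on separate tenant networks in parallel, then check whether each one obtained its public key from the metadata service.

for i in 1 2; do
  openstack network create net$i
  openstack subnet create --network net$i --subnet-range 10.100.$i.0/28 subnet$i
  openstack server create --image cirros --flavor m1.tiny --network net$i vm$i &
done
wait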


Additional info:

Comment 2 Ihar Hrachyshka 2021-10-14 23:05:57 UTC
A better version (fixed test case): https://patchwork.ozlabs.org/project/ovn/patch/20211014230511.1412921-1-ihrachys@redhat.com/

Comment 5 Jianlin Shi 2021-10-20 01:51:50 UTC
Tested with the following script:

systemctl start openvswitch                                                                           
systemctl start ovn-northd                                                                            
ovn-nbctl set-connection ptcp:6641                                                                    
ovn-sbctl set-connection ptcp:6642                                                                    
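# Register this host as OVN chassis hv1 and point ovn-controller at the SB DB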
ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.40.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.40.25
systemctl restart ovn-controller


ovn-nbctl ls-add ls1
ovn-nbctl set logical_switch ls1 other_config:vlan-passthru=true

ovn-nbctl lsp-add ls1 ls1p1
ovn-nbctl lsp-set-addresses ls1p1 "00:00:00:01:01:11 192.168.1.11"

ovn-nbctl lsp-add ls1 ls1lp
ovn-nbctl lsp-set-type ls1lp localport
ovn-nbctl lsp-set-addresses ls1lp "00:00:00:01:01:02 192.168.1.2"

ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
ip netns add ls1p1
ip link set ls1p1 netns ls1p1
ip netns exec ls1p1 ip link set ls1p1 address 00:00:00:01:01:11
ip netns exec ls1p1 ip link set ls1p1 up
ip netns exec ls1p1 ip addr add 192.168.1.11/24 dev ls1p1

ovs-vsctl add-port br-int ls1lp -- set interface ls1lp type=internal external_ids:iface-id=ls1lp
ip netns add ls1lp
ip link set ls1lp netns ls1lp
ip netns exec ls1lp ip link set ls1lp address 00:00:00:01:01:02
ip netns exec ls1lp ip link set ls1lp up
ip netns exec ls1lp ip addr add 192.168.1.2/24 dev ls1lp

ip netns exec ls1p1 ping 192.168.1.2 -c 2
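
A hypothetical follow-up check (not part of the original script): if ovn-controller installed the ARP-responder flows for the localport address, they should appear as arp_tpa matches on the integration bridge.

ovs-ofctl dump-flows br-int | grep arp_tpa=192.168.1.2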

Reproduced on ovn21.09-21.09.0-13:

[root@dell-per740-12 bz2021932]# rpm -qa | grep -E "openvswitch2.16|ovn21.09"
ovn21.09-21.09.0-13.el8fdp.x86_64                                                                     
ovn21.09-central-21.09.0-13.el8fdp.x86_64                                                             
python3-openvswitch2.16-2.16.0-16.el8fdp.x86_64                                                       
ovn21.09-host-21.09.0-13.el8fdp.x86_64                                                                
openvswitch2.16-2.16.0-16.el8fdp.x86_64                                                               
openvswitch2.16-test-2.16.0-16.el8fdp.noarch

+ ip netns exec ls1p1 ping 192.168.1.2 -c 2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1038ms

<=== ping failed

Verified on ovn-2021-21.09.0-12:

[root@dell-per740-12 bz2021932]# rpm -qa | grep -E "openvswitch2.16|ovn-2021"
ovn-2021-21.09.0-12.el8fdp.x86_64
ovn-2021-central-21.09.0-12.el8fdp.x86_64
ovn-2021-host-21.09.0-12.el8fdp.x86_64
python3-openvswitch2.16-2.16.0-16.el8fdp.x86_64
openvswitch2.16-2.16.0-16.el8fdp.x86_64
openvswitch2.16-test-2.16.0-16.el8fdp.noarch

+ ip netns exec ls1p1 ping 192.168.1.2 -c 2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=1010 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=1.23 ms

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1009ms
rtt min/avg/max/mdev = 1.234/505.615/1009.996/504.381 ms, pipe 2

<=== ping passed

Comment 7 errata-xmlrpc 2021-12-09 15:37:29 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:5059