Bug 1994427

Summary: [OVN] SR-IOV port cannot connect to metadata
Product: Red Hat OpenStack Reporter: Eduardo Olivares <eolivare>
Component: python-networking-ovnAssignee: Elvira <egarciar>
Status: ON_DEV --- QA Contact: Eran Kuris <ekuris>
Severity: high Docs Contact:
Priority: medium    
Version: 16.2 (Train)CC: apevec, jamsmith, jlibosva, lhh, lmartins, majopela, mlavalle, mtomaska, scohen
Target Milestone: z6Keywords: TestOnly, Triaged
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovn22.03-22.03.0-95.el8fdp ovn22.03-22.03.0-95.el9fdp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2196286, 1974898, 2189267    
Bug Blocks:    

Description Eduardo Olivares 2021-08-17 09:12:26 UTC
Description of problem:
This issue is similar to BZ1985717, which was verified recently. I think it is a different issue because it happens very infrequently. While BZ1985717 seemed to be a deterministic issue, this bug seems to be related to some race condition.

This warning message from the VM console logs shows that the metadata service could not be contacted:
[   38.974877] cloud-init[812]: Cloud-init v. 18.5 running 'init' at Mon, 16 Aug 2021 11:15:24 +0000. Up 28.88 seconds.
[   38.977105] cloud-init[812]: ci-info: ++++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++++
[   38.979302] cloud-init[812]: ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+
[   38.981469] cloud-init[812]: ci-info: | Device |  Up  |           Address            |       Mask      | Scope  |     Hw-Address    |
[   38.983646] cloud-init[812]: ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+
[   38.985801] cloud-init[812]: ci-info: |  eth0  | True |         10.46.21.241         | 255.255.255.192 | global | fa:16:3e:a2:5c:c9 |
[   38.987963] cloud-init[812]: ci-info: |  eth0  | True | fe80::f816:3eff:fea2:5cc9/64 |        .        |  link  | fa:16:3e:a2:5c:c9 |
[   38.990136] cloud-init[812]: ci-info: |   lo   | True |          127.0.0.1           |    255.0.0.0    |  host  |         .         |
[   38.992279] cloud-init[812]: ci-info: |   lo   | True |           ::1/128            |        .        |  host  |         .         |
[   38.994445] cloud-init[812]: ci-info: +--------+------+------------------------------+-----------------+--------+-------------------+
[   38.996601] cloud-init[812]: ci-info: ++++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++++
[   38.998586] cloud-init[812]: ci-info: +-------+-----------------+--------------+-----------------+-----------+-------+
[   39.000690] cloud-init[812]: ci-info: | Route |   Destination   |   Gateway    |     Genmask     | Interface | Flags |
[   39.002692] cloud-init[812]: ci-info: +-------+-----------------+--------------+-----------------+-----------+-------+
[   39.004696] cloud-init[812]: ci-info: |   0   |     0.0.0.0     | 10.46.21.254 |     0.0.0.0     |    eth0   |   UG  |
[   39.006704] cloud-init[812]: ci-info: |   1   |   10.46.21.192  |   0.0.0.0    | 255.255.255.192 |    eth0   |   U   |
[   39.008691] cloud-init[812]: ci-info: |   2   | 169.254.169.254 | 10.46.21.213 | 255.255.255.255 |    eth0   |  UGH  |
[   39.010681] cloud-init[812]: ci-info: +-------+-----------------+--------------+-----------------+-----------+-------+
[   39.012669] cloud-init[812]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   39.014339] cloud-init[812]: ci-info: +-------+-------------+---------+-----------+-------+
[   39.015997] cloud-init[812]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   39.017661] cloud-init[812]: ci-info: +-------+-------------+---------+-----------+-------+
[   39.019341] cloud-init[812]: ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
[   39.021013] cloud-init[812]: ci-info: |   3   |   ff00::/8  |    ::   |    eth0   |   U   |
[   39.022677] cloud-init[812]: ci-info: +-------+-------------+---------+-----------+-------+
[   39.024349] cloud-init[812]: 2021-08-16 11:15:34,391 - util.py[WARNING]: No active metadata service found



The corresponding neutron port is 2bdd32d4-efea-477c-92d3-ec84cbdaa61c:
$ openstack port list --fixed-ip ip-address=10.46.21.241 -c ID -f value
2bdd32d4-efea-477c-92d3-ec84cbdaa61c


For some reason, that port was bound and unbound from the chassis for several times. That took more than a minute and I think this is the reason why the metadata information for this port was not ready yet:
2021-08-16 11:14:12.734 111507 INFO networking_ovn.agent.metadata.agent [-] Port 14f0abf0-4704-40e7-8a02-2b58bf584fa0 in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 unbound from our chassis                                              
2021-08-16 11:14:13.473 111507 INFO networking_ovn.agent.metadata.agent [-] Port 65446967-02b4-4a76-991d-a1c2ec1410d2 in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 unbound from our chassis                                              
2021-08-16 11:14:13.483 111507 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-7e3c462c-7771-49ff-afb9-e6d7ecd620c6 namespace which is not needed anymore                                                                   
2021-08-16 11:14:46.288 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 bound to our chassis
2021-08-16 11:14:48.616 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 unbound from our chassis
2021-08-16 11:14:48.802 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 bound to our chassis
2021-08-16 11:14:48.984 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 unbound from our chassis
2021-08-16 11:14:49.155 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 bound to our chassis
2021-08-16 11:14:49.330 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 unbound from our chassis
...
2021-08-16 11:15:36.888 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 unbound from our chassis
2021-08-16 11:15:37.056 111507 INFO networking_ovn.agent.metadata.agent [-] Port 2bdd32d4-efea-477c-92d3-ec84cbdaa61c in datapath 7e3c462c-7771-49ff-afb9-e6d7ecd620c6 bound to our chassis



When I rebooted the VM, the metadata information was obtained successfully:
[   42.184307] cloud-init[787]: Cloud-init v. 18.5 running 'init' at Mon, 16 Aug 2021 11:29:40 +0000. Up 30.23 seconds.
...
[[0;32m  OK  [0m] Started Initial cloud-init job (metadata service crawler).

As we can see in the ovn-metadata-agent logs, several requests were received from the VM IP:
2021-08-16 11:29:41.426 111911 INFO eventlet.wsgi.server [-] 10.46.21.241,<local> "GET /openstack HTTP/1.1" status: 200  len: 230 time: 0.4844251                                                                                            
2021-08-16 11:29:41.907 111911 INFO eventlet.wsgi.server [-] 10.46.21.241,<local> "GET /openstack HTTP/1.1" status: 200  len: 230 time: 0.4769990                                                                                            
2021-08-16 11:29:43.014 111911 INFO eventlet.wsgi.server [-] 10.46.21.241,<local> "GET /openstack/2018-08-27/meta_data.json HTTP/1.1" status: 200  len: 2187 time: 1.1038783
2021-08-16 11:29:43.514 111911 INFO eventlet.wsgi.server [-] 10.46.21.241,<local> "GET /openstack/2018-08-27/user_data HTTP/1.1" status: 200  len: 135 time: 0.4969020
2021-08-16 11:29:44.043 111911 INFO eventlet.wsgi.server [-] 10.46.21.241,<local> "GET /openstack/2018-08-27/vendor_data.json HTTP/1.1" status: 200  len: 128 time: 0.5247002
2021-08-16 11:29:45.139 111911 INFO eventlet.wsgi.server [-] 10.46.21.241,<local> "GET /openstack/2018-08-27/network_data.json HTTP/1.1" status: 200  len: 451 time: 1.0909605
...





I will provide the full console logs from two VMs that were spawned almost simultaneously and failed to contact the metadata service and the ovn metadata agent log file.






Version-Release number of selected component (if applicable):
RHOS-16.2-RHEL-8-20210811.n.1
[root@controller-1 ~]# podman exec -ituroot ovn_metadata_agent rpm -qa | grep ovn
python3-networking-ovn-7.4.2-2.20210601204824.el8ost.10.noarch
python3-networking-ovn-metadata-agent-7.4.2-2.20210601204824.el8ost.10.noarch
[root@controller-1 ~]# podman exec -ituroot ovn_controller rpm -qa | grep ovn                                                                                                                                                                
ovn-2021-21.06.0-17.el8fdp.x86_64
rhosp-ovn-2021-4.el8ost.1.noarch
ovn-2021-host-21.06.0-17.el8fdp.x86_64
rhosp-ovn-host-2021-4.el8ost.1.noarch


How reproducible:
It doesn't happen often. I had to rerun all our tempest SR-IOV tests several times in order to reproduce it once.

Steps to Reproduce:
1. Create a direct type port, connected to the external network
2. Create a VM with that port
3. Check the VM could not connect to the metadata service (hostname is wrong, authorized_keys file is empty)

Actual results:
Metadata service is not accessible from a VM so SSH key can not be obtained. It is not possible to connect to VM using SSH.

Expected results:
Metadata service is accessible from VM. It is possible to connect to VM using SSH.