Bug 2103149 - compute tempest tests are failing with ssh connection timed out
Summary: compute tempest tests are failing with ssh connection timed out
Keywords:
Status: CLOSED DUPLICATE of bug 1846393
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 17.0 (Wallaby)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Jakub Libosvar
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-01 14:49 UTC by myadla
Modified: 2022-08-17 18:08 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-17 18:08:17 UTC
Target Upstream Version:
Embargoed:
elicohen: needinfo+




Links:
Red Hat Issue Tracker OSP-16207 (last updated 2022-07-01 14:56:16 UTC)

Description myadla 2022-07-01 14:49:33 UTC
Description of problem:

2022-06-28 21:29:31,655 366044 INFO     [tempest.lib.common.ssh] Creating ssh connection to '10.46.47.191:22' as 'cloud-user' with public key authentication
2022-06-28 21:29:34,730 366044 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.46.47.191 ([Errno None] Unable to connect to port 22 on 10.46.47.191). Number attempts: 1. Retry after 2 seconds
...
2022-06-28 21:34:34,058 366044 ERROR    [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.46.47.191 after 20 attempts. Proxy client: no proxy client
2022-06-28 21:34:34.058 366044 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2022-06-28 21:34:34.058 366044 ERROR tempest.lib.common.ssh   File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 131, in _get_ssh_connection
2022-06-28 21:34:34.058 366044 ERROR tempest.lib.common.ssh     ssh.connect(self.host, port=self.port, username=self.username,
2022-06-28 21:34:34.058 366044 ERROR tempest.lib.common.ssh   File "/usr/lib/python3.9/site-packages/paramiko/client.py", line 368, in connect
2022-06-28 21:34:34.058 366044 ERROR tempest.lib.common.ssh     raise NoValidConnectionsError(errors)
2022-06-28 21:34:34.058 366044 ERROR tempest.lib.common.ssh paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 10.46.47.191
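
For context, tempest's ssh helper wraps paramiko in a retry loop before giving up, which is what produces the "Number attempts: 1. Retry after 2 seconds" lines above. A minimal sketch of that pattern (the retry/backoff values and function name below are illustrative, not tempest's exact code or defaults):

import time
import paramiko

def get_ssh_connection(host, username, pkey, attempts=20, backoff=2):
    # Build the client the same way tempest does: no known_hosts checks.
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    for attempt in range(1, attempts + 1):
        try:
            ssh.connect(host, port=22, username=username, pkey=pkey,
                        timeout=10, look_for_keys=False)
            return ssh  # authenticated connection established
        except (paramiko.SSHException,
                paramiko.ssh_exception.NoValidConnectionsError):
            if attempt == attempts:
                raise  # mirrors the "after 20 attempts" failure above
            time.sleep(backoff)

The failure mode in the log is the final re-raise: every attempt got "Unable to connect to port 22", which points at the guest never becoming reachable rather than an authentication problem.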

Tempest log:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/enterprise/view/scenario/job/DFG-enterprise-baremetal-scenario-17.0-3control_2compute_1freeipavm_externalceph-anycluster_tls/3/testReport/tempest.api.compute.admin.test_create_server/ServersWithSpecificFlavorTestJSON/test_verify_created_server_ephemeral_disk_id_b3c7bcfc_bb5b_4e22_b517_c7f686b802ca_/

Logs:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/enterprise/view/scenario/job/DFG-enterprise-baremetal-scenario-17.0-3control_2compute_1freeipavm_externalceph-anycluster_tls/3/testReport/

Failed tests:
http://pastebin.test.redhat.com/1062420

Version-Release number of selected component (if applicable):
core_puddle: RHOS-17.0-RHEL-9-20220623.n.1

How reproducible:
100%

Steps to Reproduce:
1. Deploy OpenStack 17.0 via the Jenkins job:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/enterprise/view/scenario/job/DFG-enterprise-baremetal-scenario-17.0-3control_2compute_1freeipavm_externalceph-anycluster_tls/
2. At the "Second Tempest Run" stage, tempest tests fail

Actual results: 25 compute tempest tests are failing


Expected results: All tempest tests should pass


Additional info:

Comment 7 smooney 2022-07-15 15:08:24 UTC
I'm moving this to the networking DFG and the neutron component.
I believe what is happening here is caused by a delta in behaviour between how ml2/ovs works and how ml2/ovn works.

I think there is a regression in how OVN works with regard to when the metadata proxy is created.

For ml2/ovs or ml2/linuxbridge the metadata proxy is configurable and can be provided either via the L3 agent or
the DHCP agent. When a neutron network or router is created, the metadata proxy for that network/subnet is
also created.

As a result, with ml2/ovs, ml2/linuxbridge and most other backends, the metadata proxy is created before
the port is created and before it is ever bound.

For ml2/ovn I believe this is not the case: the metadata proxy is only instantiated after the port is created and bound to
a host. As such there is no guarantee that it is provisioned before the VM is booted, because the OVN mech driver does
not ensure the metadata proxy is operational before it sends the network-vif-plugged event.

This would introduce a race between the metadata proxy being provisioned and the VM being unpaused.

I say "would" because, while I have observed some bug reports and had IRC conversations suggesting this is in fact what is happening,
I have not proved it from the logs in this case.

Can someone from the networking DFG look at this case and review the provisioning blocks code in neutron, to ensure that when
neutron sends network-vif-plugged the OVN mech driver has ensured the metadata proxy is provisioned and functional,
so that we cannot race on its creation when nova starts the VM? (A sketch of the provisioning-blocks pattern is below.)

The contract between nova and neutron is that neutron must not send network-vif-plugged until all networking for the port is configured.
That includes the metadata proxy, even if that was previously an implicit requirement due to the architecture of the other drivers.
For OVN we may need to make this an explicit dependency to ensure there is no race.
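
A minimal sketch of the provisioning-blocks pattern referred to above. The add_provisioning_component/provisioning_complete helpers are neutron's real API for holding back a port's transition to ACTIVE; the METADATA_ENTITY name and the two callbacks are hypothetical, shown only to illustrate making the metadata proxy an explicit dependency of network-vif-plugged:

from neutron.db import provisioning_blocks
from neutron_lib.callbacks import resources

METADATA_ENTITY = 'metadata-proxy'  # hypothetical entity name

def on_port_bound(context, port):
    # Register an extra provisioning component on the port. Neutron only
    # transitions the port to ACTIVE (and emits network-vif-plugged)
    # once every registered component has called provisioning_complete.
    provisioning_blocks.add_provisioning_component(
        context, port['id'], resources.PORT, METADATA_ENTITY)

def on_metadata_proxy_ready(context, port_id):
    # Called once the proxy serving the port's network is up; clearing
    # the last outstanding block lets nova unpause the VM safely.
    provisioning_blocks.provisioning_complete(
        context, port_id, resources.PORT, METADATA_ENTITY)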

If you think we have overlooked something, feel free to send this back to us.

Comment 8 Jakub Libosvar 2022-07-19 14:29:34 UTC
I can't see anything wrong that would prevent the DHCP server from working for the given port.

2022-06-28T21:29:25.720Z|18232|binding|INFO|Claiming lport 6770bdf5-5aaf-45d7-9051-5aeea2eda8a2 for this chassis.
2022-06-28T21:29:25.720Z|18233|binding|INFO|6770bdf5-5aaf-45d7-9051-5aeea2eda8a2: Claiming fa:16:3e:b1:98:e7 10.100.0.7
2022-06-28T21:29:25.762Z|18245|binding|INFO|Setting lport 6770bdf5-5aaf-45d7-9051-5aeea2eda8a2 ovn-installed in OVS
2022-06-28T21:29:25.762Z|18246|binding|INFO|Setting lport 6770bdf5-5aaf-45d7-9051-5aeea2eda8a2 up in Southbound

The port was claimed and the OpenFlow rules were installed. I'll probably need to see this live to check whether there is an issue with the flows.

I noticed the environment has a small misconfiguration on the OVN side. The bridge mappings on the compute nodes are set for br-ex, but the br-ex bridge is not created and DVR is not enabled. If this is a non-DVR environment, the bridge mappings shouldn't be set on the compute nodes.
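
A quick way to spot that kind of mismatch on a compute node is to compare the configured ovn-bridge-mappings against the bridges that actually exist. A minimal sketch using the standard ovs-vsctl commands (the helper itself is illustrative, not an existing tool):

import subprocess

def get_external_id(key):
    # Read a key from the local Open_vSwitch row's external_ids column.
    try:
        out = subprocess.run(
            ["ovs-vsctl", "get", "Open_vSwitch", ".", f"external_ids:{key}"],
            capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return None  # key is not set
    return out.stdout.strip().strip('"')

mappings = get_external_id("ovn-bridge-mappings")  # e.g. "datacentre:br-ex"
if mappings:
    for mapping in mappings.split(","):
        physnet, bridge = mapping.split(":")
        # ovs-vsctl br-exists exits 0 if the bridge exists, non-zero otherwise.
        present = subprocess.run(["ovs-vsctl", "br-exists", bridge]).returncode == 0
        print(f"{physnet} -> {bridge}: {'present' if present else 'MISSING'}")

A mapping pointing at a missing bridge, as described above, would show up here as "br-ex: MISSING".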

Comment 17 Jakub Libosvar 2022-08-17 18:08:17 UTC
I executed the failing test with another RHEL 8.2 image that contains the fix for bug 1846393 - rhel-guest-image-8.2-326.x86_64.qcow2 - and it passes. I'm closing this as a duplicate.

*** This bug has been marked as a duplicate of bug 1846393 ***

