Bug 2123273

Summary: Low probability metadata+connection failure
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 17.1 (Wallaby)
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: unspecified
Reporter: Attila Fazekas <afazekas>
Assignee: Elvira <egarciar>
QA Contact: Eran Kuris <ekuris>
CC: chrisw, scohen, twilson
Hardware: Unspecified
OS: Unspecified
Last Closed: 2023-07-28 16:05:21 UTC
Type: Bug

Description Attila Fazekas 2022-09-01 09:25:28 UTC
Description of problem:
The instance console log below shows the guest failing to retrieve its metadata from 169.254.169.254 during boot:
currently loaded modules: 8139cp 8390 9pnet 9pnet_virtio ahci drm drm_kms_helper e1000 failover fb_sys_fops hid hid_generic ip_tables isofs libahci mii ne2k_pci net_failover nls_ascii nls_iso8859_1 nls_utf8 pcnet32 qemu_fw_cfg syscopyarea sysfillrect sysimgblt ttm usbhid virtio_blk virtio_gpu virtio_input virtio_net virtio_rng virtio_scsi x_tables 
info: copying initramfs to /dev/vda1
info: initramfs loading root from /dev/vda1
info: /etc/init.d/rc.sysinit: up at 8.19
info: container: none
currently loaded modules: 8139cp 8390 9pnet 9pnet_virtio ahci drm drm_kms_helper e1000 failover fb_sys_fops hid hid_generic ip_tables isofs libahci mii ne2k_pci net_failover nls_ascii nls_iso8859_1 nls_utf8 pcnet32 qemu_fw_cfg syscopyarea sysfillrect sysimgblt ttm usbhid virtio_blk virtio_gpu virtio_input virtio_net virtio_rng virtio_scsi x_tables 
Initializing random number generator... done.
Starting acpid: OK
Starting network: udhcpc: started, v1.29.3
udhcpc: sending discover
udhcpc: sending select for 10.100.0.10
udhcpc: lease of 10.100.0.10 obtained, lease time 43200
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 8.40. request failed
failed 2/20: up 57.44. request failed
failed 3/20: up 106.47. request failed
failed 4/20: up 155.49. request failed
failed 5/20: up 204.52. request failed
failed 6/20: up 253.54. request failed

Version-Release number of selected component (if applicable):
17.1

How reproducible:
Roughly 0.01% to 0.1% of runs.

Steps to Reproduce:
1. Deploy OpenStack 17.1 with OVN/Geneve.
2. Run tempest (a possible invocation is sketched below).
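
A minimal sketch of step 2, assuming an already configured tempest workspace and using the test name from the "Additional info" section below; the --regex and --concurrency values are only illustrative:

    # Run the affected scenario test with a few workers
    $ tempest run --regex 'test_cross_tenant_traffic' --concurrency 4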

Actual results:
Tempest fails to SSH into the VM.

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/tempest/lib/common/ssh.py", line 131, in _get_ssh_connection
    ssh.connect(self.host, port=self.port, username=self.username,
  File "/usr/lib/python3.9/site-packages/paramiko/client.py", line 368, in connect
    raise NoValidConnectionsError(errors)
paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 10.0.0.180

Expected results:
Tempest can SSH into the VM.


Additional info:
In this example the tempest output indicates a connection issue, not just a key problem.
The deployment uses OVN with Geneve tunnelling.
Failing test: test_cross_tenant_traffic[compute,id-e79f879e-debb-440c-a7e4-efeda05b6848,network]

Comment 2 Elvira 2022-09-20 12:51:41 UTC
Hi, from the logs I can see that the requests reach the nova metadata API as expected and the responses come back to the neutron metadata agent, but it seems the response never reaches the VM itself, because this sequence repeats until timeout:

In compute-0/var/log/containers/neutron/ovn-metadata-agent.log:

2022-08-17 13:42:04.456 26777 DEBUG neutron.agent.ovn.metadata.server [-] Request: GET /2009-04-04/user-data HTTP/1.0
Accept: */*                                                                
Connection: close                                                     
Content-Type: text/plain                                                  
Host: 169.254.169.254                                                    
User-Agent: curl/7.64.1                                                    
X-Forwarded-For: 10.100.0.7                                                 
X-Ovn-Network-Id: ee6d7961-8d65-463e-8ac8-c62df9ca0f65 __call__ /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/server.py:84
2022-08-17 13:42:04.592 26777 DEBUG neutron.agent.ovn.metadata.server [-] <Response [200]> _proxy_request /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/server.py:162
2022-08-17 13:42:04.597 26777 INFO eventlet.wsgi.server [-] 10.100.0.7,<local> "GET /2009-04-04/user-data HTTP/1.1" status: 200  len: 231 time: 0.1411240
2022-08-17 13:42:04.603 26776 DEBUG eventlet.wsgi.server [-] (26776) accepted '' server /usr/lib/python3.9/site-packages/eventlet/wsgi.py:992
2022-08-17 13:42:04.604 26776 DEBUG neutron.agent.ovn.metadata.server [-] Request: GET /2009-04-04/meta-data/block-device-mapping HTTP/1.0

The reason this is happening is either that the neutron metadata agent cannot find the way back to the VM (which doesn't make much sense, since it does receive the requests), or that something goes wrong inside the VM itself when processing the metadata.

I think it would be useful to have a live environment for this, if possible.
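
If someone hits this again on a live environment, one possible check (a sketch, not something verified on this setup) is to look at the metadata namespace on the affected compute node; the namespace name below is assumed from the usual ovnmeta-<network-id> convention and the X-Ovn-Network-Id seen in the log above:

    # List the metadata namespaces created by the OVN metadata agent on compute-0
    $ sudo ip netns list | grep ovnmeta

    # Check that haproxy is listening for 169.254.169.254 traffic inside the namespace
    $ sudo ip netns exec ovnmeta-ee6d7961-8d65-463e-8ac8-c62df9ca0f65 ss -tlnp

    # Watch whether the HTTP responses actually leave the namespace while the guest retries
    $ sudo ip netns exec ovnmeta-ee6d7961-8d65-463e-8ac8-c62df9ca0f65 tcpdump -ni any port 80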

Comment 3 Attila Fazekas 2022-09-20 14:22:58 UTC
I can create a similar deployment, but it is unlikely I can preserve one where the issue actually happened.
I can also show how to run tempest-stress with the same test on multiple threads (roughly along the lines of the sketch below).
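
Not the exact tempest-stress setup, just an assumed approximation of the idea: repeat the single failing test in parallel until one iteration fails, here using plain tempest run in a loop (worker count and regex are illustrative):

    # Re-run the cross-tenant scenario test with several workers until an
    # iteration fails (approximation of the tempest-stress idea)
    while tempest run --regex 'test_cross_tenant_traffic' --concurrency 8; do
        echo "iteration passed, retrying"
    done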

Comment 4 Elvira 2022-10-13 07:51:24 UTC
Hi Attila!
I'm interested in learning how to run tempest-stress with the same test on multiple threads, as you mentioned, so that I can reproduce it. I sent you an email about it. Is there any new run where you have seen this happen again?