Bug 1966593 - Connection timed out in paramiko client on the second instance
Summary: Connection timed out in paramiko client on the second instance
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tempest
Version: 16.1 (Train)
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lukas Piwowarski
QA Contact: Martin Kopec
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-01 13:15 UTC by Martin Kopec
Modified: 2023-08-21 08:36 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-21 08:34:58 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-4329 0 None None None 2022-06-08 21:05:45 UTC

Description Martin Kopec 2021-06-01 13:15:32 UTC
Description of problem:
Two tempest scenario tests fail on a baremetal deployment with the same timeout issue.

Failing tests:
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern[compute,id-557cd2c2-4eb8-4dce-98be-f86765ff311b,image,slow,volume]
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern[compute,id-608e604b-1d63-4a82-8e3e-91bc665c90b4,image,network,slow]

Version-Release number of selected component (if applicable):
[stack@sealusa39 tempest-dir]$ rpm -qa | grep tempest
python3-tempest-24.0.0-1.20201113224606.c73e6b1.el8ost.noarch
python3-cinder-tests-tempest-0.3.1-1.20201113142636.6f0b4fc.el8ost.noarch
openstack-tempest-24.0.0-1.20201113224606.c73e6b1.el8ost.noarch
python3-tempestconf-3.0.2-1.20201114013033.c377168.el8ost.noarch

NOTE: I have also reproduced the error with tempest master.


tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern[compute,id-557cd2c2-4eb8-4dce-98be-f86765ff311b,image,slow,volume] 
-----------------------------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 112, in _get_ssh_connection
    sock=proxy_chan)
  File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 343, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python3.6/site-packages/paramiko/util.py", line 280, in retry_on_signal
    return function()
  File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 343, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_volume_boot_pattern.py", line 103, in test_volume_boot_pattern
    server=instance_2nd)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 770, in get_timestamp
    server=server)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 501, in get_remote_client
    linux_client.validate_authentication()
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper
    six.reraise(*original_exception)
  File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper
    return function(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 116, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 216, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 128, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.1.162.153 via SSH timed out.
User: cloud-user, Password: None


tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern[compute,id-608e604b-1d63-4a82-8e3e-91bc665c90b4,image,network,slow]
----------------------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "/home/stack/tempest-master/tempest/lib/common/ssh.py", line 112, in _get_ssh_connection
    sock=proxy_chan)
  File "/home/stack/tempest-master/.tox/py36/lib/python3.6/site-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/home/stack/tempest-master/.tox/py36/lib/python3.6/site-packages/paramiko/util.py", line 283, in retry_on_signal
    return function()
  File "/home/stack/tempest-master/.tox/py36/lib/python3.6/site-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/stack/tempest-master/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/home/stack/tempest-master/tempest/scenario/test_snapshot_pattern.py", line 83, in test_snapshot_pattern
    server=server_from_snapshot)
  File "/home/stack/tempest-master/tempest/scenario/manager.py", line 1100, in get_timestamp
    username=username)
  File "/home/stack/tempest-master/tempest/scenario/manager.py", line 726, in get_remote_client
    linux_client.validate_authentication()
  File "/home/stack/tempest-master/tempest/lib/common/utils/linux/remote_client.py", line 31, in wrapper
    return function(self, *args, **kwargs)
  File "/home/stack/tempest-master/tempest/lib/common/utils/linux/remote_client.py", line 114, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/home/stack/tempest-master/tempest/lib/common/ssh.py", line 216, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/home/stack/tempest-master/tempest/lib/common/ssh.py", line 128, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.1.162.178 via SSH timed out.
User: cloud-user, Password: None


Both tests create two instances, and the second one is the instance the test cannot SSH to. From the test's perspective, the failure happens during the self.get_timestamp call:
https://opendev.org/openstack/tempest/src/commit/3e127640657b294c9d4a941b6a6bb287aa496539/tempest/scenario/test_volume_boot_pattern.py#L101
https://opendev.org/openstack/tempest/src/commit/3e127640657b294c9d4a941b6a6bb287aa496539/tempest/scenario/test_snapshot_pattern.py#L81

The self.get_timestamp call ultimately results in a call to the _get_ssh_connection method in tempest/lib/common/ssh.py:
https://opendev.org/openstack/tempest/src/commit/3e127640657b294c9d4a941b6a6bb287aa496539/tempest/lib/common/ssh.py#L85
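The two stacked tracebacks above are Python's implicit exception chaining: the socket.timeout raised inside paramiko is caught, and a tempest SSHTimeout is raised while handling it. A minimal stand-alone sketch of that pattern (the SSHTimeout class and helper functions here are illustrative stand-ins, not tempest's actual code):

```python
import socket

class SSHTimeout(Exception):
    """Stand-in for tempest.lib.exceptions.SSHTimeout (illustration only)."""

def _connect(host):
    # Stand-in for paramiko's sock.connect() hitting the TCP timeout.
    raise socket.timeout("timed out")

def _get_ssh_connection(host):
    try:
        _connect(host)
    except socket.timeout:
        # Raising a new exception while handling socket.timeout chains the
        # two, which is exactly what produces the double traceback
        # ("During handling of the above exception, another exception
        # occurred") seen in the test output.
        raise SSHTimeout("Connection to the %s via SSH timed out." % host)

try:
    _get_ssh_connection("10.1.162.153")
except SSHTimeout as exc:
    original = exc.__context__   # the underlying socket.timeout
    message = str(exc)
```

The chained original exception stays reachable via __context__, which is why the report can show both the low-level socket error and the tempest-level message.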

The log shows that the _get_ssh_connection method retried the connection several times:
2021-05-31 14:50:38,749 744490 INFO     [tempest.lib.common.ssh] Creating ssh connection to '10.1.162.178:22' as 'cloud-user' with public key authentication
2021-05-31 14:52:49,976 744490 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 1. Retry after 2 seconds.
2021-05-31 14:55:01,048 744490 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 2. Retry after 3 seconds.
2021-05-31 14:57:16,216 744490 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 3. Retry after 4 seconds.
2021-05-31 14:59:31,384 744490 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 4. Retry after 5 seconds.
2021-05-31 15:01:46,552 744490 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 5. Retry after 6 seconds.
2021-05-31 15:04:01,720 744490 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 6. Retry after 7 seconds.
2021-05-31 15:06:20,984 744490 ERROR    [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 after 6 attempts. Proxy client: no proxy client
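The backoff visible in these lines can be sketched as a small stand-alone loop. The attempt count and the linearly growing delays (attempt N, then a pause of N + 1 seconds) are read off the log above; this is an illustration of the observed behaviour, not tempest's actual implementation:

```python
import socket
import time

def connect_with_retries(connect, host, max_attempts=6, sleep=time.sleep):
    """Call connect(host) up to max_attempts times, pausing a little
    longer after each failure, as the WARNING lines in the log show."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect(host)
        except socket.timeout:
            # Mirrors the log: "Number attempts: N. Retry after N+1 seconds."
            sleep(attempt + 1)
    raise socket.timeout("failed after %d attempts" % max_attempts)

delays = []

def unreachable(host):
    # Stand-in for paramiko's sock.connect() against the dead floating IP.
    raise socket.timeout("timed out")

try:
    # Inject a recording function instead of time.sleep so the example
    # runs instantly; delays collects the retry intervals.
    connect_with_retries(unreachable, "10.1.162.178", sleep=delays.append)
except socket.timeout:
    pass
```

With the always-failing connect, the recorded intervals come out as 2 through 7 seconds, matching the six WARNING lines before the final ERROR.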



How reproducible:
Run the tests on a baremetal deployment, e.g.:
$ tempest run --regex tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern
$ tempest run --regex tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern


Additional info:
While debugging the test_volume_boot_pattern test, I noticed that when I put a pudb breakpoint before the second instance creation, the test sometimes passed. I think the breakpoint slowed down the test execution, which would suggest there is a race condition somewhere. However, it is not consistent: the breakpoint did not help every time, and a 15-second sleep before the second instance creation did not help at all.

The fact that the timeout occurs only for the second instance also supports the race-condition theory. However, when I ran the test_volume_boot_pattern steps manually, I was not able to reproduce the issue.

Semi-automated script I used:
openstack keypair create my_private_keypair > MY_KEY.pem
chmod 600 MY_KEY.pem
openstack security group create my_sec_group
openstack security group rule create --protocol icmp my_sec_group
openstack security group rule create my_sec_group --protocol tcp --dst-port 22:22 --remote-ip 0.0.0.0/0

# create volume from image , compute.image_ref from tempest.conf
# should wait until it's available
openstack volume create --size 40 --image 580a52a1-14c5-4790-ad98-a8880b12effb my_vol

# create private network
openstack network create my_net
openstack subnet create --network my_net --subnet-range 192.0.2.0/28 my_subnet

# create a router
openstack router create my_router
openstack router set --external-gateway public my_router
openstack router add subnet my_router my_subnet

# create instance1
openstack server create --flavor 200 --key-name my_private_keypair --security-group my_sec_group --volume my_vol my_instance1 --network my_net --wait
# create floating ip on network.public_network_id from tempest.conf
my_floating_1=$(openstack floating ip create --subnet fdf4386e-8db8-4ca2-9afd-3f19098faf92 f61780f0-b78a-41cd-960d-f69810fdeeaf -f value -c name)
openstack server add floating ip my_instance1 $my_floating_1

# ping the instance1
ping $my_floating_1 -c3
while true; do ssh cloud-user@$my_floating_1 -i ./MY_KEY.pem exit && break; sleep 1; done

# delete the instance1
openstack server delete my_instance1

# create instance2
openstack server create --flavor 200 --key-name my_private_keypair --security-group my_sec_group --volume my_vol my_instance2 --network my_net --wait
my_floating_2=$(openstack floating ip create --subnet fdf4386e-8db8-4ca2-9afd-3f19098faf92 f61780f0-b78a-41cd-960d-f69810fdeeaf -f value -c name)
openstack server add floating ip my_instance2 $my_floating_2

# ping the instance2
ping $my_floating_2 -c3
while true; do ssh cloud-user@$my_floating_2 -i ./MY_KEY.pem exit && break; sleep 1; done

# delete instance2
openstack server delete my_instance2

# teardown
openstack floating ip delete $my_floating_1
openstack floating ip delete $my_floating_2
openstack router delete my_router
openstack subnet delete my_subnet
openstack network delete my_net
openstack volume delete my_vol
openstack security group delete my_sec_group
openstack keypair delete my_private_keypair
rm ./MY_KEY.pem

Comment 4 Martin Kopec 2021-06-07 07:49:25 UTC
I paused the test execution right after the second instance was created, before the test failed on the timeout, and discovered that the instance is not pingable even from the CLI.

I checked the floating IP and compared it to the floating IP of the first instance and to that of an instance I created from the CLI, and I did not find anything unusual: no errors, no odd or missing properties.
I did the same with the instance itself, its port, the router, and the security group.

Routing also checked out; the floating IP of the instance I created manually was 10.1.162.138 and the unreachable one was 10.1.162.149:
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.9.76.254     0.0.0.0         UG    100    0        0 eno1
10.1.162.0      0.0.0.0         255.255.255.0   U     0      0        0 vlan320
10.9.76.0       0.0.0.0         255.255.255.0   U     100    0        0 eno1
169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eno2
169.254.0.0     0.0.0.0         255.255.0.0     U     1007   0        0 br-ctlplane
192.168.24.0    0.0.0.0         255.255.255.0   U     0      0        0 br-ctlplane

So the next question, and the area to focus on, is why the second instance's floating IP is not pingable. Or perhaps the instance did not start correctly and therefore did not reply to pings?
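One way to narrow this down is to probe the TCP layer directly: if port 22 never completes a TCP handshake, the problem is in the network path (floating IP wiring, security group, router) rather than in sshd or authentication inside the guest. A small stand-alone probe (a hypothetical helper for manual debugging, not part of tempest):

```python
import socket

def tcp_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port completes within
    `timeout` seconds; False on timeout or refusal. A timeout suggests
    the packet never reaches the instance; a quick refusal means the
    host answered but nothing is listening on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this against the unreachable floating IP right after the second instance boots (and comparing against the first instance's floating IP) would show whether the traffic is dropped before reaching the guest at all.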

Comment 5 qibin 2022-10-08 09:13:11 UTC
I have the same issue.

Comment 6 Martin Kopec 2023-08-21 08:34:58 UTC
Considering this was reported a long time ago and tempest has changed since then, plus the fact that we were not able to reliably reproduce it, I am going to close this as Insufficient Data. If you come across this issue again, feel free to reopen this bug or report a new one, providing all the details: OpenStack version, package versions, the job where it happened (if there was one), etc.

