Description of problem:

2 tempest scenario tests fail on a baremetal deployment with the same timeout issue.

Failing tests:
tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern[compute,id-557cd2c2-4eb8-4dce-98be-f86765ff311b,image,slow,volume]
tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern[compute,id-608e604b-1d63-4a82-8e3e-91bc665c90b4,image,network,slow]

Version-Release number of selected component (if applicable):

[stack@sealusa39 tempest-dir]$ rpm -qa | grep tempest
python3-tempest-24.0.0-1.20201113224606.c73e6b1.el8ost.noarch
python3-cinder-tests-tempest-0.3.1-1.20201113142636.6f0b4fc.el8ost.noarch
openstack-tempest-24.0.0-1.20201113224606.c73e6b1.el8ost.noarch
python3-tempestconf-3.0.2-1.20201114013033.c377168.el8ost.noarch

NOTE: I've reproduced the error also with tempest master.

tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern[compute,id-557cd2c2-4eb8-4dce-98be-f86765ff311b,image,slow,volume]
-----------------------------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 112, in _get_ssh_connection
    sock=proxy_chan)
  File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 343, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python3.6/site-packages/paramiko/util.py", line 280, in retry_on_signal
    return function()
  File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 343, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_volume_boot_pattern.py", line 103, in test_volume_boot_pattern
    server=instance_2nd)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 770, in get_timestamp
    server=server)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 501, in get_remote_client
    linux_client.validate_authentication()
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper
    six.reraise(*original_exception)
  File "/usr/lib/python3.6/site-packages/six.py", line 675, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper
    return function(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 116, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 216, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 128, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.1.162.153 via SSH timed out.
User: cloud-user, Password: None

tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern[compute,id-608e604b-1d63-4a82-8e3e-91bc665c90b4,image,network,slow]
----------------------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "/home/stack/tempest-master/tempest/lib/common/ssh.py", line 112, in _get_ssh_connection
    sock=proxy_chan)
  File "/home/stack/tempest-master/.tox/py36/lib/python3.6/site-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/home/stack/tempest-master/.tox/py36/lib/python3.6/site-packages/paramiko/util.py", line 283, in retry_on_signal
    return function()
  File "/home/stack/tempest-master/.tox/py36/lib/python3.6/site-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stack/tempest-master/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/home/stack/tempest-master/tempest/scenario/test_snapshot_pattern.py", line 83, in test_snapshot_pattern
    server=server_from_snapshot)
  File "/home/stack/tempest-master/tempest/scenario/manager.py", line 1100, in get_timestamp
    username=username)
  File "/home/stack/tempest-master/tempest/scenario/manager.py", line 726, in get_remote_client
    linux_client.validate_authentication()
  File "/home/stack/tempest-master/tempest/lib/common/utils/linux/remote_client.py", line 31, in wrapper
    return function(self, *args, **kwargs)
  File "/home/stack/tempest-master/tempest/lib/common/utils/linux/remote_client.py", line 114, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/home/stack/tempest-master/tempest/lib/common/ssh.py", line 216, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/home/stack/tempest-master/tempest/lib/common/ssh.py", line 128, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.1.162.178 via SSH timed out.
User: cloud-user, Password: None

Both tests create two instances, and the second one is the one the test cannot SSH to. From the test's perspective it happens during the self.get_timestamp call:
https://opendev.org/openstack/tempest/src/commit/3e127640657b294c9d4a941b6a6bb287aa496539/tempest/scenario/test_volume_boot_pattern.py#L101
https://opendev.org/openstack/tempest/src/commit/3e127640657b294c9d4a941b6a6bb287aa496539/tempest/scenario/test_snapshot_pattern.py#L81

The self.get_timestamp call ultimately results in calling the _get_ssh_connection method from tempest/lib/common/ssh.py:
https://opendev.org/openstack/tempest/src/commit/3e127640657b294c9d4a941b6a6bb287aa496539/tempest/lib/common/ssh.py#L85

I can see in the log that the _get_ssh_connection method retried the connection several times (a simplified sketch of this retry loop is included further below, after the Additional info section):

2021-05-31 14:50:38,749 744490 INFO [tempest.lib.common.ssh] Creating ssh connection to '10.1.162.178:22' as 'cloud-user' with public key authentication
2021-05-31 14:52:49,976 744490 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 1. Retry after 2 seconds.
2021-05-31 14:55:01,048 744490 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 2. Retry after 3 seconds.
2021-05-31 14:57:16,216 744490 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 3. Retry after 4 seconds.
2021-05-31 14:59:31,384 744490 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 4. Retry after 5 seconds.
2021-05-31 15:01:46,552 744490 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 5. Retry after 6 seconds.
2021-05-31 15:04:01,720 744490 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 ([Errno 110] Connection timed out). Number attempts: 6. Retry after 7 seconds.
2021-05-31 15:06:20,984 744490 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cloud-user@10.1.162.178 after 6 attempts. Proxy client: no proxy client

How reproducible:

Run the tests on a baremetal deployment, e.g. by:

$ tempest run --regex tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern
$ tempest run --regex tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern

Additional info:

While debugging the issue with the test_volume_boot_pattern test, I noticed that when I put a pudb breakpoint before the second instance creation, the test sometimes passed - I think that's because the breakpoint slowed down the test execution, which would suggest there is a race condition somewhere. However, it's not a rule; it didn't help the test every time, and a 15 second sleep before the second instance creation didn't help at all. The fact that the timeout error occurs only for the second instance also supports the theory of a race condition. However, when I tried to reproduce the test steps (test_volume_boot_pattern) manually, I wasn't able to reproduce the issue.
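For reference, the backoff pattern visible in the WARNING lines above (six attempts, each blocking for roughly two minutes, with a retry delay growing from 2 to 7 seconds) corresponds to a loop roughly like the one below. This is only an illustrative sketch - the real logic lives in tempest/lib/common/ssh.py, and the function and parameter names here are made up:

# Illustrative sketch only, not the actual tempest code.
import socket
import time

import paramiko


def ssh_connect_with_retries(host, username, key_file,
                             max_attempts=6, conn_timeout=120):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt can block for up to conn_timeout seconds, which
            # matches the ~2 minute gaps between the log entries above.
            client.connect(host, username=username, key_filename=key_file,
                           timeout=conn_timeout, look_for_keys=False,
                           allow_agent=False)
            return client
        except (socket.error, socket.timeout) as exc:
            if attempt == max_attempts:
                raise  # tempest raises SSHTimeout at this point instead
            delay = attempt + 1  # 2 s, 3 s, 4 s, ... as in the WARNING lines
            print("Attempt %d failed (%s), retrying in %d s"
                  % (attempt, exc, delay))
            time.sleep(delay)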
Semi-automated script I used:

openstack keypair create my_private_keypair > MY_KEY.pem
chmod 600 MY_KEY.pem

openstack security group create my_sec_group
openstack security group rule create --protocol icmp my_sec_group
openstack security group rule create my_sec_group --protocol tcp --dst-port 22:22 --remote-ip 0.0.0.0/0

# create volume from image, compute.image_ref from tempest.conf
# should wait until it's available
openstack volume create --size 40 --image 580a52a1-14c5-4790-ad98-a8880b12effb my_vol

# create private network
openstack network create my_net
openstack subnet create --network my_net --subnet-range 192.0.2.0/28 my_subnet

# create a router
openstack router create my_router
openstack router set --external-gateway public my_router
openstack router add subnet my_router my_subnet

# create instance1
openstack server create --flavor 200 --key-name my_private_keypair --security-group my_sec_group --volume my_vol my_instance1 --network my_net --wait

# create floating ip on network.public_network_id from tempest.conf
my_floating_1=$(openstack floating ip create --subnet fdf4386e-8db8-4ca2-9afd-3f19098faf92 f61780f0-b78a-41cd-960d-f69810fdeeaf -f value -c name)
openstack server add floating ip my_instance1 $my_floating_1

# ping the instance1
ping $my_floating_1 -c3
while true; do ssh cloud-user@$my_floating_1 -i ./MY_KEY.pem exit && break; sleep 1; done

# delete the instance1
openstack server delete my_instance1

# create instance2
openstack server create --flavor 200 --key-name my_private_keypair --security-group my_sec_group --volume my_vol my_instance2 --network my_net --wait
my_floating_2=$(openstack floating ip create --subnet fdf4386e-8db8-4ca2-9afd-3f19098faf92 f61780f0-b78a-41cd-960d-f69810fdeeaf -f value -c name)
openstack server add floating ip my_instance2 $my_floating_2

# ping the instance2
ping $my_floating_2 -c3
while true; do ssh cloud-user@$my_floating_2 -i ./MY_KEY.pem exit && break; sleep 1; done

# delete instance2
openstack server delete my_instance2

# teardown
openstack floating ip delete $my_floating_1
openstack floating ip delete $my_floating_2
openstack router delete my_router
openstack subnet delete my_subnet
openstack network delete my_net
openstack volume delete my_vol
openstack security group delete my_sec_group
openstack keypair delete my_private_keypair
rm ./MY_KEY.pem
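One way to fill in the "should wait until it's available" step from the script above, sketched in Python against the openstack CLI (the helper name and timeouts are made up, not part of the original script):

import subprocess
import time


def wait_for_volume_status(volume, wanted="available", timeout=300, interval=5):
    # Poll `openstack volume show` until the volume reaches the wanted status.
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = subprocess.check_output(
            ["openstack", "volume", "show", volume,
             "-f", "value", "-c", "status"],
            universal_newlines=True).strip()
        if status == wanted:
            return
        if status == "error":
            raise RuntimeError("volume %s went into error state" % volume)
        time.sleep(interval)
    raise RuntimeError("volume %s did not become %s within %s seconds"
                       % (volume, wanted, timeout))


# e.g. wait_for_volume_status("my_vol") right after `openstack volume create`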
I paused the test execution right after the second instance was created and before it failed on the timeout, and discovered that the instance is not even pingable from the CLI. I checked its floating IP and compared it to the floating IP of the first instance and to the floating IP of an instance I created from the CLI, and I found nothing out of the ordinary - no errors, no unusual properties set or missing. I did the same with the instance itself, the port, the router, the security group ... The routing also checked out; the floating IP of the instance I created manually was 10.1.162.138 and the unreachable one was 10.1.162.149.

$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.9.76.254     0.0.0.0         UG    100    0        0 eno1
10.1.162.0      0.0.0.0         255.255.255.0   U     0      0        0 vlan320
10.9.76.0       0.0.0.0         255.255.255.0   U     100    0        0 eno1
169.254.0.0     0.0.0.0         255.255.0.0     U     1005   0        0 eno2
169.254.0.0     0.0.0.0         255.255.0.0     U     1007   0        0 br-ctlplane
192.168.24.0    0.0.0.0         255.255.255.0   U     0      0        0 br-ctlplane

So the next question and area to focus on is: why is the second instance's floating IP not pingable? Or maybe the instance wasn't started correctly and therefore didn't reply to the pings?
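To help tell "the second instance is just slow to come up" apart from "the floating IP never becomes reachable at all", a simple probe like the following can be left running against the floating IP while the test is paused. This is an illustrative sketch only (the helper name and timeouts are made up), not part of the test or the debugging done so far:

import socket
import time


def probe_ssh_port(ip, port=22, overall_timeout=600, interval=5):
    # Repeatedly try to open a TCP connection to port 22 on the floating IP
    # and report how long it took for the port to start answering, if ever.
    start = time.time()
    while time.time() - start < overall_timeout:
        try:
            with socket.create_connection((ip, port), timeout=5):
                return time.time() - start
        except OSError:
            time.sleep(interval)
    return None  # never became reachable within overall_timeout


# e.g. probe_ssh_port("10.1.162.149") for the unreachable floating IP above,
# compared against probe_ssh_port("10.1.162.138") for the manually created instance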
I have the same issue too.
Considering this was reported a long time ago and tempest has changed since then, plus the fact that we weren't able to clearly reproduce it, I'm going to mark this as Closed due to insufficient data. If you come across this issue again, feel free to reopen or report a new bug providing all the details - such as the OpenStack version, package versions, the job where it happened (if there was one), etc.