Description of problem:

This is an OSP 16.2 ML2/OVN environment. The Perf&Scale team is running scale tests and observes intermittent SSH failures to VMs.

SSH to the VM floating IP from the undercloud fails:

(overcloud) [stack@undercloud browbeat]$ ssh -i /home/stack/browbeat/privkey.pem centos@172.31.12.47
ssh: connect to host 172.31.12.47 port 22: Connection timed out

tcpdump on the VM's tap interface (run from the compute node) does capture the packets:

[root@compute1029utn10rt-1 heat-admin]# tcpdump -vv -n -e -i tapdb53886c-48
dropped privs to tcpdump
tcpdump: listening on tapdb53886c-48, link-type EN10MB (Ethernet), capture size 262144 bytes
12:02:22.286060 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 49308, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.13.0.41380 > 10.2.54.239.ssh: Flags [S], cksum 0x6cfd (correct), seq 1931361656, win 29200, options [mss 1460,sackOK,TS val 90904358 ecr 0,nop,wscale 7], length 0
12:02:22.286188 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0xbd20), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
12:02:22.296793 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 102: (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto ICMP (1), length 88)
    172.31.12.28 > 10.2.54.239: ICMP time exceeded in-transit, length 68
        (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xbd20 (correct), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
12:02:23.290309 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 49309, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.13.0.41380 > 10.2.54.239.ssh: Flags [S], cksum 0x6910 (correct), seq 1931361656, win 29200, options [mss 1460,sackOK,TS val 90905363 ecr 0,nop,wscale 7], length 0
12:02:23.290431 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0x4551), seq 4279144173, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4102952 ecr 90905363,nop,wscale 4], length 0
12:02:24.292441 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0x4167), seq 4279144173, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4103954 ecr 90905363,nop,wscale 4], length 0

SSH to the VM using the private key and the VM's private IP from the OVN metadata namespace works without any issues:

[root@compute1029utn10rt-1 heat-admin]# ip netns exec ovnmeta-305f4f43-1e41-44e8-a317-b3da4f939057 ssh -i /home/heat-admin/privkey.pem centos@10.2.54.239
The authenticity of host '10.2.54.239 (10.2.54.239)' can't be established.
ECDSA key fingerprint is SHA256:Mcp64sCCT1Tcr7spxTgJbXuE6YRpALYy21xoYiCIsIg.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.2.54.239' (ECDSA) to the list of known hosts.
[centos@s-rally-5b9c3f7b-r9quf60w ~]$
Created attachment 1858865. The link in the attachment has the sosreports from the 3 controller hosts and a tcpdump.pcap file from the compute host where the instance is running.
I have added the sosreport from the compute host to the link in the attachment. http://rdu-storage01.scalelab.redhat.com/schari/ssh_timeout_multiext/
Hi Anil, Is it possible to stop the test before the resources are deleted? We would like to access the environment so that we can get a better look at the actual problem in this case. Thanks a lot.
Hi Elvira,

Cleanup of resources has been disabled in the current run. The undercloud IP is 10.1.40.7. We are seeing this issue with the following server:

(overcloud) [stack@undercloud browbeat]$ openstack server list --all --long | grep 10.2.31.205
| 5f546ef7-3e08-42b4-a812-bcd1c9c11f60 | s_rally_3b7fb8d4_wxiAbO3c | ACTIVE | None | Running | s_rally_3b7fb8d4_VWCvkE6Z=10.2.31.205, 172.31.1.102 | centos7 | d3ad9c3b-7f16-4946-bcbd-9cbea1f17dba | | | nova | compute1029p-3.redhat.local

(overcloud) [stack@undercloud browbeat]$ ssh -i privkey.pem centos@172.31.1.102
ssh: connect to host 172.31.1.102 port 22: Connection timed out

[root@compute1029p-3 heat-admin]# ip netns exec ovnmeta-fb03cb4e-b8b4-4797-9663-56b28cb11414 ssh -i privkey.pem centos@10.2.31.205
The authenticity of host '10.2.31.205 (10.2.31.205)' can't be established.
ECDSA key fingerprint is SHA256:pg5bAkfcY/d086e5NhdNKhOuA6N3h1xzrMUHe3aJiPg.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.2.31.205' (ECDSA) to the list of known hosts.
[centos@s-rally-3b7fb8d4-wxiabo3c ~]$ uname
Linux

I will share the credentials to access the undercloud by mail.
Arnau, Jakub and I looked into this problem. The main reason for the failure is a subnet in the environment that contains the undercloud IP (172.31.1.0). Because a floating IP with the same address has been allocated from that subnet, two identical IPs co-exist, which causes the undesired behavior.

(overcloud) [stack@undercloud ~]$ openstack subnet show 98f0f517-0a2c-4843-ace7-1c673b5d1368
| Field            | Value                   |
| allocation_pools | 172.31.0.2-172.31.1.254 |
| cidr             | 172.31.0.0/23           |

(overcloud) [stack@undercloud ~]$ openstack server list --all | grep 172.31.1.0
| 789ea459-7e5c-496d-b56a-42d5157ccc83 | s_rally_3b7fb8d4_s9LdDndi | ACTIVE | s_rally_3b7fb8d4_VWCvkE6Z=10.2.31.145, 172.31.1.0 | centos7 | |

Closing this as not a bug. Please reach out again if there is anything else you think we need to check!
When the allocation pool range is 172.31.0.2-172.31.1.254, why is neutron allocating a floating IP with the address 172.31.1.0 (which is not within the allocation range)?
Please ignore my previous comment (I misread 172.31.1.0 as 172.31.0.0).
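For anyone else tripping over the same point: 172.31.1.0 looks like a network address, but in a /23 it is an ordinary host address in the middle of the range. A minimal sketch of that check with Python's standard ipaddress module follows. The CIDR and pool bounds are copied from the `openstack subnet show` output in this report; the helper names `in_pool` and `is_usable_host` are made up for illustration and are not part of any OpenStack tooling.

```python
import ipaddress

# Values copied from the `openstack subnet show` output in this report.
cidr = ipaddress.ip_network("172.31.0.0/23")
pool_start = ipaddress.ip_address("172.31.0.2")
pool_end = ipaddress.ip_address("172.31.1.254")


def in_pool(addr):
    """Return True if addr lies inside the allocation pool (hypothetical helper)."""
    ip = ipaddress.ip_address(addr)
    return pool_start <= ip <= pool_end


def is_usable_host(addr):
    """Return True if addr is in the CIDR and is neither its network nor its broadcast address."""
    ip = ipaddress.ip_address(addr)
    return ip in cidr and ip not in (cidr.network_address, cidr.broadcast_address)


if __name__ == "__main__":
    for candidate in ("172.31.0.0", "172.31.1.0", "172.31.1.255"):
        print(candidate, "in pool:", in_pool(candidate),
              "usable host:", is_usable_host(candidate))
```

Only 172.31.0.0 (network) and 172.31.1.255 (broadcast) are special in 172.31.0.0/23, so 172.31.1.0 is a legitimate allocation from the pool and can collide with any external host that already uses that address.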
ovn-2021-21.12.0-82.el8fdp.x86_64 is now available in puddle RHOS-16.2-RHEL-8-20220902.n.1. Moving to MODIFIED.