Bug 2050171 - ssh to VM intermittently failing on scale environment
Summary: ssh to VM intermittently failing on scale environment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.2 (Train)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: z5
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: Elvira
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 2069783
Blocks:
Reported: 2022-02-03 12:07 UTC by anil venkata
Modified: 2023-07-28 16:12 UTC (History)
CC List: 9 users

Fixed In Version: ovn-2021-21.12.0-82.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2069783 (view as bug list)
Environment:
Last Closed: 2023-07-28 16:09:09 UTC
Target Upstream Version:
Embargoed:
schari: needinfo-


Attachments
The link in the attachment has the sosreports from the 3 controller hosts, and a tcpdump.pcap file from the compute host where the instance is present. (69 bytes, text/plain)
2022-02-03 12:39 UTC, schari


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-12456 0 None None None 2022-02-03 12:13:50 UTC

Description anil venkata 2022-02-03 12:07:39 UTC
Description of problem:
This is an OSP 16.2 ML2/OVN environment.
The Perf&Scale team is running scale tests and observes intermittent SSH failures to VMs.

SSH to the VM's floating IP from the undercloud fails:
(overcloud) [stack@undercloud browbeat]$ ssh -i /home/stack/browbeat/privkey.pem centos@172.31.12.47

ssh: connect to host 172.31.12.47 port 22: Connection timed out



tcpdump on the VM's tap interface (running from compute node) is able to capture the packets
[root@compute1029utn10rt-1 heat-admin]# tcpdump -vv -n -e -i tapdb53886c-48
dropped privs to tcpdump
tcpdump: listening on tapdb53886c-48, link-type EN10MB (Ethernet), capture size 262144 bytes
12:02:22.286060 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 49308, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.13.0.41380 > 10.2.54.239.ssh: Flags [S], cksum 0x6cfd (correct), seq 1931361656, win 29200, options [mss 1460,sackOK,TS val 90904358 ecr 0,nop,wscale 7], length 0
12:02:22.286188 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0xbd20), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
12:02:22.296793 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 102: (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto ICMP (1), length 88)
    172.31.12.28 > 10.2.54.239: ICMP time exceeded in-transit, length 68
        (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xbd20 (correct), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
12:02:23.290309 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 49309, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.13.0.41380 > 10.2.54.239.ssh: Flags [S], cksum 0x6910 (correct), seq 1931361656, win 29200, options [mss 1460,sackOK,TS val 90905363 ecr 0,nop,wscale 7], length 0
12:02:23.290431 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0x4551), seq 4279144173, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4102952 ecr 90905363,nop,wscale 4], length 0
12:02:24.292441 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0x4167), seq 4279144173, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4103954 ecr 90905363,nop,wscale 4], length 0
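
For anyone triaging a similar capture, here is a minimal Python sketch that tallies what the tap interface saw: incoming SYNs, outgoing SYN-ACKs, and ICMP "time exceeded" messages. An ICMP time exceeded that quotes the VM's SYN-ACK suggests the reply is expiring in transit rather than being lost inside the VM. The helper name `summarize` and the regex are illustrative only, not part of tcpdump or any OpenStack tool, and the sketch assumes `tcpdump -vv` text where each header and detail line is on a single, unwrapped line. The sample is the first three packets of the capture above.

```python
import re
from collections import Counter

# Matches the TCP flags field that tcpdump prints, e.g. "Flags [S.]".
FLAGS_RE = re.compile(r"Flags \[([A-Z.]+)\]")


def summarize(capture_text):
    """Count TCP flag combinations and ICMP time-exceeded messages.

    Hypothetical helper: header lines are assumed to start at column 0
    (with the timestamp) and detail lines are assumed to be indented.
    """
    counts = Counter()
    in_icmp = False
    for line in capture_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Packet header line: remember whether this packet is ICMP.
            in_icmp = "proto ICMP" in line
            continue
        if in_icmp:
            # Detail lines of an ICMP packet. The quoted inner TCP header
            # is skipped so it is not counted as a second SYN-ACK.
            if "ICMP time exceeded" in line:
                counts["icmp_time_exceeded"] += 1
            continue
        match = FLAGS_RE.search(line)
        if match:
            counts["tcp_" + match.group(1)] += 1
    return counts


SAMPLE = """\
12:02:22.286060 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 49308, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.13.0.41380 > 10.2.54.239.ssh: Flags [S], cksum 0x6cfd (correct), seq 1931361656, win 29200, options [mss 1460,sackOK,TS val 90904358 ecr 0,nop,wscale 7], length 0
12:02:22.286188 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0xbd20), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
12:02:22.296793 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 102: (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto ICMP (1), length 88)
    172.31.12.28 > 10.2.54.239: ICMP time exceeded in-transit, length 68
        (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xbd20 (correct), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
"""

if __name__ == "__main__":
    for key, value in sorted(summarize(SAMPLE).items()):
        print(key, value)
```

The ICMP detail lines are handled separately because tcpdump prints the quoted inner packet in the same format as a real TCP segment.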



SSH to the VM using the private key and the VM's private IP from the OVN metadata namespace works without any issues.
[root@compute1029utn10rt-1 heat-admin]# ip netns exec ovnmeta-305f4f43-1e41-44e8-a317-b3da4f939057 ssh -i /home/heat-admin/privkey.pem centos@10.2.54.239
The authenticity of host '10.2.54.239 (10.2.54.239)' can't be established.                                                             
ECDSA key fingerprint is SHA256:Mcp64sCCT1Tcr7spxTgJbXuE6YRpALYy21xoYiCIsIg.                                                           
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.2.54.239' (ECDSA) to the list of known hosts.                                                           
[centos@s-rally-5b9c3f7b-r9quf60w ~]$

Comment 1 schari 2022-02-03 12:39:44 UTC
Created attachment 1858865 [details]
The link in the attachment has the sosreports from the 3 controller hosts, and a tcpdump.pcap file from the compute host where the instance is present.

Comment 2 schari 2022-02-03 13:17:26 UTC
I have added the sosreport from the compute host to the link in the attachment.
http://rdu-storage01.scalelab.redhat.com/schari/ssh_timeout_multiext/

Comment 3 Elvira 2022-02-15 14:12:03 UTC
Hi Anil,

Is it possible to stop the test before the resources are deleted?
We would like to access the environment so that we can better see what the real problem is in this case.
Thanks a lot.

Comment 4 schari 2022-02-16 06:07:33 UTC
Hi Elvira,

Cleanup of resources has been disabled in the current run. The undercloud IP is 10.1.40.7.

We are seeing this issue with the following server.
(overcloud) [stack@undercloud browbeat]$ openstack server list --all --long | grep 10.2.31.205                                                   
| 5f546ef7-3e08-42b4-a812-bcd1c9c11f60 | s_rally_3b7fb8d4_wxiAbO3c | ACTIVE | None       | Running     | s_rally_3b7fb8d4_VWCvkE6Z=10.2.31.205, 172.31.1.102   | centos7    | d3ad9c3b-7f16-4946-bcbd-9cbea1f17dba |             |           | nova              | compute1029p-3.redhat.local

(overcloud) [stack@undercloud browbeat]$ ssh -i privkey.pem centos@172.31.1.102
ssh: connect to host 172.31.1.102 port 22: Connection timed out

[root@compute1029p-3 heat-admin]# ip netns exec ovnmeta-fb03cb4e-b8b4-4797-9663-56b28cb11414 ssh -i privkey.pem centos@10.2.31.205
The authenticity of host '10.2.31.205 (10.2.31.205)' can't be established.                                                                       
ECDSA key fingerprint is SHA256:pg5bAkfcY/d086e5NhdNKhOuA6N3h1xzrMUHe3aJiPg.                                                                     
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.2.31.205' (ECDSA) to the list of known hosts.                                                                     
[centos@s-rally-3b7fb8d4-wxiabo3c ~]$ uname
Linux

I will share the credentials to access the undercloud through mail.

Comment 5 Elvira 2022-02-17 17:04:56 UTC
Arnau, Jakub and I looked into this problem. The main reason for the failure is a subnet in the environment whose range contains the undercloud IP (172.31.1.0). As a result, the same IP address exists twice (on the undercloud and as a floating IP assigned to a VM), which causes the undesired behavior.



(overcloud) [stack@undercloud ~]$ openstack subnet show 98f0f517-0a2c-4843-ace7-1c673b5d1368
| Field             | Value
| allocation_pools  | 172.31.0.2-172.31.1.254
| cidr              | 172.31.0.0/23

(overcloud) [stack@undercloud ~]$ openstack server list --all | grep 172.31.1.0
| 789ea459-7e5c-496d-b56a-42d5157ccc83 | s_rally_3b7fb8d4_s9LdDndi | ACTIVE | s_rally_3b7fb8d4_VWCvkE6Z=10.2.31.145, 172.31.1.0     | centos7 |        |

Closing this as not a bug. Please reach out if there's something you think we need to check!
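
A quick way to spot this kind of overlap before a test run is to check whether an address already used outside Neutron falls inside the subnet's allocation pool. The following is a minimal sketch using only the Python standard library. The function name `in_allocation_pool` is illustrative, and the pool string is assumed to be in the `start-end` form printed by `openstack subnet show`.

```python
import ipaddress


def in_allocation_pool(address, pool):
    """Return True if `address` lies inside a 'start-end' pool string.

    Hypothetical helper, not an OpenStack API.
    """
    start, end = (ipaddress.ip_address(part.strip()) for part in pool.split("-"))
    return start <= ipaddress.ip_address(address) <= end


# Values taken from this comment.
POOL = "172.31.0.2-172.31.1.254"  # allocation_pools of the subnet
UNDERCLOUD_IP = "172.31.1.0"      # address already in use on the undercloud

if __name__ == "__main__":
    if in_allocation_pool(UNDERCLOUD_IP, POOL):
        print(f"{UNDERCLOUD_IP} is inside {POOL}: Neutron may hand it out as a floating IP")
    else:
        print(f"{UNDERCLOUD_IP} is outside {POOL}")
```

If the check reports an overlap, shrinking the allocation pool so that it excludes the externally used address is one way to avoid the duplicate.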

Comment 6 anil venkata 2022-02-21 05:06:15 UTC
When the allocation pool range is 172.31.0.2-172.31.1.254, why is Neutron allocating a floating IP with the address 172.31.1.0 (which is not within the allocation range)?

Comment 7 anil venkata 2022-02-21 05:30:23 UTC
Please ignore my comment (I misread 172.31.1.0 as 172.31.0.0)
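
For reference, 172.31.1.0 looks like a network address but is an ordinary host address in 172.31.0.0/23: only the first address (172.31.0.0) and the last (172.31.1.255) are reserved as network and broadcast. A minimal sketch with Python's standard `ipaddress` module follows. The function name `is_usable_host` is illustrative.

```python
import ipaddress


def is_usable_host(address, cidr):
    """Return True if `address` is in `cidr` and is neither the network
    nor the broadcast address. Hypothetical helper for illustration."""
    net = ipaddress.ip_network(cidr)
    addr = ipaddress.ip_address(address)
    return addr in net and addr not in (net.network_address, net.broadcast_address)


CIDR = "172.31.0.0/23"  # cidr of the subnet shown in comment 5

if __name__ == "__main__":
    for candidate in ("172.31.0.0", "172.31.1.0", "172.31.1.255"):
        print(candidate, is_usable_host(candidate, CIDR))
```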

Comment 34 Elvira 2022-09-07 13:50:57 UTC
ovn-2021-21.12.0-82.el8fdp.x86_64 is now available in puddle RHOS-16.2-RHEL-8-20220902.n.1. Moving to MODIFIED.

