Bug 2050171

Summary: ssh to VM intermittently failing on scale environment
Product: Red Hat OpenStack
Reporter: anil venkata <vkommadi>
Component: openstack-neutron
Assignee: Elvira <egarciar>
Status: CLOSED CURRENTRELEASE
QA Contact: Eran Kuris <ekuris>
Severity: high
Docs Contact:
Priority: high
Version: 16.2 (Train)
CC: ccamposr, chrisw, egarciar, jraju, jschluet, mblue, scohen, twilson, ykarel
Target Milestone: z5
Keywords: Reopened, Scale, TestOnly, Triaged
Target Release: 16.2 (Train on RHEL 8.4)
Flags: schari: needinfo-
Hardware: All
OS: All
Whiteboard:
Fixed In Version: ovn-2021-21.12.0-82.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2069783 (view as bug list)
Environment:
Last Closed: 2023-07-28 16:09:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2069783
Bug Blocks:
Attachments:
Description: The link in the attachment has the sosreports from the 3 controller hosts, and a tcpdump.pcap file from the compute host where the instance is present.
Flags: none

Description anil venkata 2022-02-03 12:07:39 UTC
Description of problem:
This is an OSP 16.2 ML2/OVN environment.
The Perf & Scale team is running scale tests and observes intermittent ssh failures to VMs.

ssh to the VM's floating IP from the undercloud fails:
(overcloud) [stack@undercloud browbeat]$ ssh -i /home/stack/browbeat/privkey.pem centos@172.31.12.47

ssh: connect to host 172.31.12.47 port 22: Connection timed out



tcpdump on the VM's tap interface (run from the compute node) captures the packets:
[root@compute1029utn10rt-1 heat-admin]# tcpdump -vv -n -e -i tapdb53886c-48
dropped privs to tcpdump
tcpdump: listening on tapdb53886c-48, link-type EN10MB (Ethernet), capture size 262144 bytes
12:02:22.286060 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 49308, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.13.0.41380 > 10.2.54.239.ssh: Flags [S], cksum 0x6cfd (correct), seq 1931361656, win 29200, options [mss 1460,sackOK,TS val 90904358 ecr 0,nop,wscale 7], length 0
12:02:22.286188 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0xbd20), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
12:02:22.296793 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 102: (tos 0x0, ttl 254, id 0, offset 0, flags [DF], proto ICMP (1), length 88)
    172.31.12.28 > 10.2.54.239: ICMP time exceeded in-transit, length 68
        (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xbd20 (correct), seq 4263452647, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4101947 ecr 90904358,nop,wscale 4], length 0
12:02:23.290309 fa:16:3e:0c:2c:cc > fa:16:3e:e5:54:15, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 49309, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.13.0.41380 > 10.2.54.239.ssh: Flags [S], cksum 0x6910 (correct), seq 1931361656, win 29200, options [mss 1460,sackOK,TS val 90905363 ecr 0,nop,wscale 7], length 0
12:02:23.290431 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0x4551), seq 4279144173, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4102952 ecr 90905363,nop,wscale 4], length 0
12:02:24.292441 fa:16:3e:e5:54:15 > fa:16:3e:0c:2c:cc, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.2.54.239.ssh > 172.31.13.0.41380: Flags [S.], cksum 0xfa3e (incorrect -> 0x4167), seq 4279144173, ack 1931361657, win 27800, options [mss 1402,sackOK,TS val 4103954 ecr 90905363,nop,wscale 4], length 0



ssh to the VM using the private key and the VM's private IP from the OVN metadata namespace works without any issues.
[root@compute1029utn10rt-1 heat-admin]# ip netns exec ovnmeta-305f4f43-1e41-44e8-a317-b3da4f939057 ssh -i /home/heat-admin/privkey.pem centos@10.2.54.239
The authenticity of host '10.2.54.239 (10.2.54.239)' can't be established.                                                             
ECDSA key fingerprint is SHA256:Mcp64sCCT1Tcr7spxTgJbXuE6YRpALYy21xoYiCIsIg.                                                           
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.2.54.239' (ECDSA) to the list of known hosts.                                                           
[centos@s-rally-5b9c3f7b-r9quf60w ~]$

Comment 1 schari 2022-02-03 12:39:44 UTC
Created attachment 1858865 [details]
The link in the attachment has the sosreports from the 3 controller hosts, and a tcpdump.pcap file from the compute host where the instance is present.

Comment 2 schari 2022-02-03 13:17:26 UTC
I have added the sosreport from the compute host to the link in the attachment.
http://rdu-storage01.scalelab.redhat.com/schari/ssh_timeout_multiext/

Comment 3 Elvira 2022-02-15 14:12:03 UTC
Hi Anil,

Is it possible to stop the test before the resources are deleted?
We would like to access the environment so that we can better see what the real problem is in this case.
Thanks a lot.

Comment 4 schari 2022-02-16 06:07:33 UTC
Hi Elvira,

Cleanup of resources has been disabled in the current run. The undercloud IP is 10.1.40.7.

We are seeing this issue with the following server.
(overcloud) [stack@undercloud browbeat]$ openstack server list --all --long | grep 10.2.31.205                                                   
| 5f546ef7-3e08-42b4-a812-bcd1c9c11f60 | s_rally_3b7fb8d4_wxiAbO3c | ACTIVE | None       | Running     | s_rally_3b7fb8d4_VWCvkE6Z=10.2.31.205, 172.31.1.102   | centos7    | d3ad9c3b-7f16-4946-bcbd-9cbea1f17dba |             |           | nova              | compute1029p-3.redhat.local

(overcloud) [stack@undercloud browbeat]$ ssh -i privkey.pem centos@172.31.1.102
ssh: connect to host 172.31.1.102 port 22: Connection timed out

[root@compute1029p-3 heat-admin]# ip netns exec ovnmeta-fb03cb4e-b8b4-4797-9663-56b28cb11414 ssh -i privkey.pem centos@10.2.31.205
The authenticity of host '10.2.31.205 (10.2.31.205)' can't be established.                                                                       
ECDSA key fingerprint is SHA256:pg5bAkfcY/d086e5NhdNKhOuA6N3h1xzrMUHe3aJiPg.                                                                     
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.2.31.205' (ECDSA) to the list of known hosts.                                                                     
[centos@s-rally-3b7fb8d4-wxiabo3c ~]$ uname
Linux

I will share the credentials to access the undercloud through mail.

Comment 5 Elvira 2022-02-17 17:04:56 UTC
Arnau, Jakub, and I looked into this problem. The main reason for the failure is that a subnet in the environment contains the undercloud IP (172.31.1.0). Neutron can therefore hand out that address as a floating IP, so the same IP exists in two places and causes the undesired behavior.



(overcloud) [stack@undercloud ~]$ openstack subnet show 98f0f517-0a2c-4843-ace7-1c673b5d1368
| Field             | Value
| allocation_pools  | 172.31.0.2-172.31.1.254
| cidr              | 172.31.0.0/23

(overcloud) [stack@undercloud ~]$ openstack server list --all | grep 172.31.1.0
| 789ea459-7e5c-496d-b56a-42d5157ccc83 | s_rally_3b7fb8d4_s9LdDndi | ACTIVE | s_rally_3b7fb8d4_VWCvkE6Z=10.2.31.145, 172.31.1.0     | centos7 |        |

Closing this as not a bug. Please reach out if there's something you think we need to check!

Comment 6 anil venkata 2022-02-21 05:06:15 UTC
If the allocation pool range is 172.31.0.2-172.31.1.254, why is Neutron allocating a floating IP with the address 172.31.1.0 (which is not within the allocation range)?

Comment 7 anil venkata 2022-02-21 05:30:23 UTC
Please ignore my comment (I misread 172.31.1.0 as 172.31.0.0)
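
For anyone re-checking the range question from comments 6 and 7, here is a minimal sketch using Python's standard ipaddress module. It only reproduces the arithmetic for the CIDR and allocation pool shown in comment 5; the helper name in_pool is made up for illustration and is not part of Neutron.

```python
import ipaddress

# Values copied from the "openstack subnet show" output in comment 5.
cidr = ipaddress.ip_network("172.31.0.0/23")
pool_start = ipaddress.ip_address("172.31.0.2")
pool_end = ipaddress.ip_address("172.31.1.254")


def in_pool(addr: str) -> bool:
    """Hypothetical helper: True if addr lies in the subnet and its allocation pool."""
    ip = ipaddress.ip_address(addr)
    return ip in cidr and pool_start <= ip <= pool_end


for candidate in ("172.31.1.0", "172.31.0.0"):
    print(candidate, in_pool(candidate))
```

In a /23 the address 172.31.1.0 is an ordinary host address in the middle of the range, so it falls inside the pool, while 172.31.0.0 is the network address and sits below the pool start.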

Comment 34 Elvira 2022-09-07 13:50:57 UTC
ovn-2021-21.12.0-82.el8fdp.x86_64 is now available in puddle RHOS-16.2-RHEL-8-20220902.n.1. Moving to MODIFIED.