Bug 1925925 - rhel79 worker cannot communicate with rhcos worker on ovn ipsec cluster
Summary: rhel79 worker cannot communicate with rhcos worker on ovn ipsec cluster
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-07 12:34 UTC by zhaozhanqi
Modified: 2024-04-30 18:04 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-04-30 18:04:53 UTC
Target Upstream Version:
Embargoed:



Description zhaozhanqi 2021-02-07 12:34:42 UTC
Description of problem:
Set up an OVN cluster with IPsec enabled, then scaled up 2 RHEL 7.9 workers. Pods on the RHEL workers cannot communicate with pods on the other workers.

Version-Release number of selected component (if applicable):
rhcos ovs version: openvswitch2.13-2.13.0-79.el8fdp.x86_64
rhel 7 ovs version: openvswitch2.13-2.13.0-72.el7fdp.x86_64

4.7.0-0.nightly-2021-02-06-084550

How reproducible:
always

Steps to Reproduce:
1. Set up an OVN IPsec cluster and then scale up RHEL 7.9 workers.

$ oc get node -o wide 
NAME                                        STATUS   ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-51-242.us-east-2.compute.internal   Ready    worker   6h13m   v1.20.0+ba45583   10.0.51.242   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-52-216.us-east-2.compute.internal   Ready    worker   4h59m   v1.20.0+ba45583   10.0.52.216   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.15.2.el7.x86_64    cri-o://1.20.0-0.rhaos4.7.git78527db.el7.49
ip-10-0-55-236.us-east-2.compute.internal   Ready    master   6h28m   v1.20.0+ba45583   10.0.55.236   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-57-74.us-east-2.compute.internal    Ready    worker   4h59m   v1.20.0+ba45583   10.0.57.74    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.15.2.el7.x86_64    cri-o://1.20.0-0.rhaos4.7.git78527db.el7.49
ip-10-0-59-63.us-east-2.compute.internal    Ready    worker   6h14m   v1.20.0+ba45583   10.0.59.63    <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-60-90.us-east-2.compute.internal    Ready    master   6h28m   v1.20.0+ba45583   10.0.60.90    <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-71-122.us-east-2.compute.internal   Ready    worker   6h14m   v1.20.0+ba45583   10.0.71.122   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-74-54.us-east-2.compute.internal    Ready    master   6h28m   v1.20.0+ba45583   10.0.74.54    <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49

2. Create a test pod on every worker.
3. From a pod on an RHCOS worker, access a pod on a RHEL worker.

hello-564r8     1/1     Running   0          3h53m   10.131.2.26    ip-10-0-57-74.us-east-2.compute.internal     ---> pod on RHEL worker

hello-lzmwq     1/1     Running   0          3h53m   10.128.2.51    ip-10-0-71-122.us-east-2.compute.internal    ---> pod on RHCOS worker


##### RHEL pod cannot be accessed from RHCOS pod #####
$ oc exec -n default hello-lzmwq -- curl --connect-timeout 10 10.131.2.26:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10001 milliseconds
command terminated with exit code 28

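The following captures suggest that the RHCOS worker never receives the reply packets from the RHEL worker: the RHEL side sees the SYN and answers with SYN-ACK, while the RHCOS side shows only retransmitted SYNs.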

##### Packet capture on RHCOS worker #####

sh-4.4# tcpdump -i genev_sys_6081 -nn host 10.131.2.26
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
12:29:26.957209 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427203229 ecr 0,nop,wscale 7], length 0
12:29:28.000335 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427204273 ecr 0,nop,wscale 7], length 0
12:29:30.048377 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427206321 ecr 0,nop,wscale 7], length 0
12:29:34.081320 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427210354 ecr 0,nop,wscale 7], length 0

##### Packet capture on RHEL worker #####

sh-4.4#  tcpdump -i genev_sys_6081 -nn host 10.131.2.26
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
12:27:26.902372 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427083166 ecr 0,nop,wscale 7], length 0
12:27:26.903072 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18221576 ecr 2427083166,nop,wscale 7], length 0
12:27:27.936740 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427084201 ecr 0,nop,wscale 7], length 0
12:27:27.937093 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18222611 ecr 2427083166,nop,wscale 7], length 0
12:27:29.104961 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18223779 ecr 2427083166,nop,wscale 7], length 0
12:27:29.985655 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427086250 ecr 0,nop,wscale 7], length 0
12:27:29.985721 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18224659 ecr 2427083166,nop,wscale 7], length 0
12:27:32.104954 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18226779 ecr 2427083166,nop,wscale 7], length 0
12:27:34.016848 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427090281 ecr 0,nop,wscale 7], length 0
12:27:34.017112 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18228691 ecr 2427083166,nop,wscale 7], length 0
12:27:38.104953 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18232779 ecr 2427083166,nop,wscale 7], length 0
12:27:46.104952 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18240779 ecr 2427083166,nop,wscale 7], length 0
12:28:02.106311 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18256780 ecr 2427083166,nop,wscale 7], length 0
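
The difference between the two captures can be summarized by counting handshake packets. The following is a minimal sketch, assuming the tcpdump text output has been saved to a file or is piped in; the function name `count_handshake` is hypothetical and not part of any tool used in this bug.

```shell
#!/bin/sh
# count_handshake [FILE...]
# Counts TCP SYN ("Flags [S],") and SYN-ACK ("Flags [S.],") lines in
# tcpdump text output. Reads stdin when no file is given.
count_handshake() {
    awk '
        /Flags \[S\],/   { syn++ }      # initial SYN or a retransmission
        /Flags \[S\.\],/ { synack++ }   # SYN-ACK reply
        END { printf "syn=%d synack=%d\n", syn, synack }
    ' "$@"
}

# Demo on two lines copied from the RHEL worker capture above.
count_handshake <<'EOF'
12:27:26.902372 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427083166 ecr 0,nop,wscale 7], length 0
12:27:26.903072 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18221576 ecr 2427083166,nop,wscale 7], length 0
EOF
```

Applied to the full captures above, the RHCOS side contains only SYN lines, while the RHEL side contains both SYN and SYN-ACK lines. Note that the two captures were taken about two minutes apart and show different source ports, so they are separate connection attempts rather than two views of one attempt.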






Actual results: 


Expected results:


Additional info:

Comment 1 Mark Gray 2021-02-08 08:12:59 UTC
What kernel version is running on the node?

Comment 2 Mark Gray 2021-02-08 08:19:41 UTC
Never mind, I have just seen the kernel version in the output above:

3.10.0-1160.15.2.el7.x86_64 

This requires the following kernel version, which includes a patch that fixes an issue in RHEL 7's Geneve implementation:

kernel-3.10.0-1160.18.1.el7
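
A node's kernel can be compared against this version with a small script. This is a rough sketch only: it relies on GNU `sort -V`, which approximates but is not identical to RPM version comparison, and it assumes the kernel string has the form reported by `uname -r` or the KERNEL-VERSION column of `oc get node -o wide`. The function name `kernel_at_least` is hypothetical.

```shell
#!/bin/sh
# Kernel version named above as containing the Geneve fix.
MIN_KERNEL="3.10.0-1160.18.1.el7"

# kernel_at_least MIN CURRENT
# Returns 0 when CURRENT sorts at or after MIN under GNU `sort -V`.
kernel_at_least() {
    kal_min="$1"
    # Drop a trailing architecture suffix such as ".x86_64".
    kal_cur="${2%.x86_64}"
    # If MIN is the smaller (or equal) of the two, CURRENT is new enough.
    [ "$(printf '%s\n%s\n' "$kal_min" "$kal_cur" | sort -V | head -n 1)" = "$kal_min" ]
}

# Check the kernel from the node listing and the version named above.
for k in 3.10.0-1160.15.2.el7.x86_64 3.10.0-1160.18.1.el7.x86_64; do
    if kernel_at_least "$MIN_KERNEL" "$k"; then
        echo "$k: has the required version"
    else
        echo "$k: older than $MIN_KERNEL"
    fi
done
```

On a RHEL worker itself, the same check could be run as `kernel_at_least "$MIN_KERNEL" "$(uname -r)"`.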

Comment 3 zhaozhanqi 2021-02-08 10:11:19 UTC
OK, thanks Mark.

Since the currently released kernel version is 3.10.0-1160.15.2.el7.x86_64, we need to document this issue in the 4.7 release notes.

Comment 4 zhaozhanqi 2021-02-09 08:15:15 UTC
Tested with the 3.10.0-1160.18.1.el7.x86_64 kernel; it works well.

Comment 6 zhaozhanqi 2021-02-22 09:51:21 UTC
Yes, Mark Gray, this issue has been verified on the 3.10.0-1160.18.1.el7.x86_64 kernel.

Moving this bug to 'verified'.

Comment 9 Rory Thrasher 2024-04-30 18:04:53 UTC
OCP is no longer using Bugzilla and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary

