Description of problem:
Set up an OVN cluster with IPsec and then scaled up two RHEL 7.9 workers. Pods on the RHEL workers cannot communicate with pods on the other workers.

Version-Release number of selected component (if applicable):
RHCOS OVS version: openvswitch2.13-2.13.0-79.el8fdp.x86_64
RHEL 7 OVS version: openvswitch2.13-2.13.0-72.el7fdp.x86_64
4.7.0-0.nightly-2021-02-06-084550

How reproducible:
Always

Steps to Reproduce:
1. Set up an OVN IPsec cluster and then scale up the RHEL 7.9 workers.

$ oc get node -o wide
NAME                                        STATUS   ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-51-242.us-east-2.compute.internal   Ready    worker   6h13m   v1.20.0+ba45583   10.0.51.242   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-52-216.us-east-2.compute.internal   Ready    worker   4h59m   v1.20.0+ba45583   10.0.52.216   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.15.2.el7.x86_64    cri-o://1.20.0-0.rhaos4.7.git78527db.el7.49
ip-10-0-55-236.us-east-2.compute.internal   Ready    master   6h28m   v1.20.0+ba45583   10.0.55.236   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-57-74.us-east-2.compute.internal    Ready    worker   4h59m   v1.20.0+ba45583   10.0.57.74    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.15.2.el7.x86_64    cri-o://1.20.0-0.rhaos4.7.git78527db.el7.49
ip-10-0-59-63.us-east-2.compute.internal    Ready    worker   6h14m   v1.20.0+ba45583   10.0.59.63    <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-60-90.us-east-2.compute.internal    Ready    master   6h28m   v1.20.0+ba45583   10.0.60.90    <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-71-122.us-east-2.compute.internal   Ready    worker   6h14m   v1.20.0+ba45583   10.0.71.122   <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49
ip-10-0-74-54.us-east-2.compute.internal    Ready    master   6h28m   v1.20.0+ba45583   10.0.74.54    <none>        Red Hat Enterprise Linux CoreOS 47.83.202102060438-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.git78527db.el8.49

2. Create a test pod on every worker.
3. From a pod on an RHCOS worker, access a pod on a RHEL worker.

hello-564r8   1/1   Running   0   3h53m   10.131.2.26   ip-10-0-57-74.us-east-2.compute.internal    ----> this is rhel worker pod
hello-lzmwq   1/1   Running   0   3h53m   10.128.2.51   ip-10-0-71-122.us-east-2.compute.internal   ---> this is rhcos worker pod

##### pod cannot be accessed from rhcos --> rhel pod ####
$ oc exec -n default hello-lzmwq -- curl --connect-timeout 10 10.131.2.26:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
curl: (28) Connection timed out after 10001 milliseconds
command terminated with exit code 28

From the following capture, it seems the RHCOS worker does not receive the packets sent by the RHEL worker.

### capture the packet on rhcos worker ####
sh-4.4# tcpdump -i genev_sys_6081 -nn host 10.131.2.26
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
12:29:26.957209 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427203229 ecr 0,nop,wscale 7], length 0
12:29:28.000335 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427204273 ecr 0,nop,wscale 7], length 0
12:29:30.048377 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427206321 ecr 0,nop,wscale 7], length 0
12:29:34.081320 IP 10.128.2.51.43868 > 10.131.2.26.8080: Flags [S], seq 675932780, win 26445, options [mss 8815,sackOK,TS val 2427210354 ecr 0,nop,wscale 7], length 0

##### capture the packet on rhel worker ###
sh-4.4# tcpdump -i genev_sys_6081 -nn host 10.131.2.26
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on genev_sys_6081, link-type EN10MB (Ethernet), capture size 262144 bytes
12:27:26.902372 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427083166 ecr 0,nop,wscale 7], length 0
12:27:26.903072 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18221576 ecr 2427083166,nop,wscale 7], length 0
12:27:27.936740 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427084201 ecr 0,nop,wscale 7], length 0
12:27:27.937093 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18222611 ecr 2427083166,nop,wscale 7], length 0
12:27:29.104961 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18223779 ecr 2427083166,nop,wscale 7], length 0
12:27:29.985655 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427086250 ecr 0,nop,wscale 7], length 0
12:27:29.985721 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18224659 ecr 2427083166,nop,wscale 7], length 0
12:27:32.104954 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18226779 ecr 2427083166,nop,wscale 7], length 0
12:27:34.016848 IP 10.128.2.51.41568 > 10.131.2.26.8080: Flags [S], seq 4027051775, win 26445, options [mss 8815,sackOK,TS val 2427090281 ecr 0,nop,wscale 7], length 0
12:27:34.017112 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18228691 ecr 2427083166,nop,wscale 7], length 0
12:27:38.104953 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18232779 ecr 2427083166,nop,wscale 7], length 0
12:27:46.104952 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18240779 ecr 2427083166,nop,wscale 7], length 0
12:28:02.106311 IP 10.131.2.26.8080 > 10.128.2.51.41568: Flags [S.], seq 1276875272, ack 4027051776, win 26409, options [mss 8815,sackOK,TS val 18256780 ecr 2427083166,nop,wscale 7], length 0

Actual results:

Expected results:

Additional info:
What kernel version is running on the node?
Never mind, I have just seen the kernel version in the output above: 3.10.0-1160.15.2.el7.x86_64. This will require the following kernel version, as it has a patch that fixes an issue with RHEL7's Geneve implementation: kernel-3.10.0-1160.18.1.el7
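A rough way to check whether a RHEL 7 worker is already running at least the kernel named above is to compare the output of `uname -r` with 3.10.0-1160.18.1.el7. The snippet below is only a sketch: the function name `kernel_at_least` is hypothetical (it is not part of any OpenShift or RHEL tooling), and it assumes GNU `sort` with `-V` and a RHEL-style release string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when kernel release $1 is at least the
# required release $2, and non-zero otherwise.
kernel_at_least() {
    local cur req
    # Drop everything from ".el" onward (e.g. ".el7.x86_64") and turn "-" into
    # "." so both strings are plain dotted numbers such as 3.10.0.1160.18.1
    cur=$(printf '%s' "${1%%.el*}" | tr '-' '.')
    req=$(printf '%s' "${2%%.el*}" | tr '-' '.')
    # GNU sort -V orders version strings; if the required version sorts first
    # (or both are equal), the current kernel is new enough.
    [ "$(printf '%s\n%s\n' "$req" "$cur" | sort -V | head -n1)" = "$req" ]
}

required="3.10.0-1160.18.1.el7"   # kernel version named in the comment above
if kernel_at_least "$(uname -r)" "$required"; then
    echo "kernel $(uname -r) is at least $required"
else
    echo "kernel $(uname -r) is older than $required"
fi
```

On the RHEL workers listed in this bug (3.10.0-1160.15.2.el7.x86_64) the check takes the "older than" branch, which matches the requirement stated above.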
OK, thanks Mark. Since the currently released kernel version is 3.10.0-1160.15.2.el7.x86_64, we need to add this issue to the 4.7 release notes.
I ran a test using the 3.10.0-1160.18.1.el7.x86_64 kernel, and it works well.
Yes, Mark Gray. Since this issue has been verified on the 3.10.0-1160.18.1.el7.x86_64 kernel, I am moving this bug to 'verified'.
OCP is no longer using Bugzilla, and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary