Bug 1961063 - Hostnetwork pod to service backed by hostnetwork is not working with OVN Kubernetes
Summary: Hostnetwork pod to service backed by hostnetwork is not working with OVN Kube...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: OVN
Version: FDP 20.A
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Marcelo Ricardo Leitner
QA Contact: Jianlin Shi
URL:
Whiteboard:
: 1908570 (view as bug list)
Depends On: 1881824 1924608 1946986 1953278 1955136 1956740 1980532 1980537
Blocks: 1983894 2014673
 
Reported: 2021-05-17 07:01 UTC by zenghui.shi
Modified: 2023-06-15 06:51 UTC (History)
12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-21 11:51:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
full ovs flow dump (99.53 KB, text/plain), 2021-05-17 07:01 UTC, zenghui.shi
full ovs flow dump (193.85 KB, text/plain), 2021-05-25 04:05 UTC, zenghui.shi
ovs flow dump with -m (204.34 KB, text/plain), 2021-05-25 13:40 UTC, zenghui.shi


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker FD-1313 0 None None None 2021-08-10 20:24:55 UTC

Description zenghui.shi 2021-05-17 07:01:56 UTC
Created attachment 1783946 [details]
full ovs flow dump

Description of problem:

Hostnetwork pod to external traffic is not working with ovn-kubernetes; the reply packet gets dropped.


Version-Release number of selected component (if applicable):

OS: 4.18.0-305.2.1.el8_4.x86_64
OVN: ovn2.13-20.12.0-115.el8fdp.x86_64
OVS: openvswitch2.15-2.15.0-15.el8fdp.x86_64
OVN-K8s master: 19fc45c2aad19065070c4622292b5f962a245357


//------------------- ORIG DIRECTION ----------------------//

recirc_id(0),in_port(br-ex),eth_type(0x0800),ipv4(dst=172.30.0.0/255.255.0.0,frag=no), packets:197, bytes:16326, used:0.270s, actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0x1c763c)

CT Zone 64001

recirc_id(0x1c763c),in_port(br-ex),ct_state(+new-est+trk),eth(src=98:03:9b:97:38:df,dst=3c:fd:fe:a0:d7:e1),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=172.30.0.1,proto=6,ttl=64,frag=no), packets:130, bytes:9620, used:0.277s, flags:S, actions:ct_clear,set(eth(dst=98:03:9b:97:38:df)),ct(zone=21),recirc(0x1c763d)

recirc_id(0x1c763d),in_port(br-ex),ct_state(+new+trk),eth(),eth_type(0x0800),ipv4(dst=172.30.0.1,proto=6,frag=no),tcp(dst=443), packets:130, bytes:9620, used:0.277s, flags:S, actions:hash(l4(0)),recirc(0x1cf7e5)

recirc_id(0x1cf7e5),dp_hash(0xe/0xf),in_port(br-ex),eth(),eth_type(0x0800),ipv4(frag=no), packets:5, bytes:370, used:7.957s, flags:S, actions:ct(commit,zone=21,label=0x2/0x2,nat(dst=10.0.1.12:6443)),recirc(0x1c763f)

CT ZONE 21

recirc_id(0x1c763f),in_port(br-ex),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=98:03:9b:97:38:df,dst=98:03:9b:97:38:df),eth_type(0x0800),ipv4(dst=10.0.1.12,ttl=64,frag=no), packets:9, bytes:666, used:7.446s, flags:S, actions:set(eth(dst=3c:fd:fe:b5:80:ac)),set(ipv4(ttl=63)),ct(commit,nat(src=10.0.1.13)),recirc(0x1cf7e6)

CT ZONE 0

recirc_id(0x1cf7e6),in_port(br-ex),ct_state(+new-est+trk),eth(src=98:03:9b:97:38:df,dst=3c:fd:fe:b5:80:ac),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:9, bytes:666, used:7.445s, flags:S, actions:ct_clear,ct(commit,zone=64000),ens801f1


//------------------- REPLY DIRECTION ----------------------//


recirc_id(0),in_port(ens801f1),eth_type(0x0800),ipv4(proto=6,frag=no), packets:245178, bytes:115601664, used:0.000s, actions:ct(zone=64000),recirc(0x9)

CT ZONE 64000

ct_state(+est+trk),recirc_id(0x9),in_port(ens801f1),eth(src=3c:fd:fe:b5:80:ac,dst=98:03:9b:97:38:df),eth_type(0x0800),ipv4(src=10.0.1.12,dst=10.0.1.13,proto=6,ttl=64,frag=no), packets:804, bytes:60778, used:0.140s, actions:ct_clear,ct(nat),recirc(0x1bfd57)

CT ZONE 0

recirc_id(0x1bfd57),in_port(ens801f1),ct_state(-new-est-trk),eth(),eth_type(0x0800),ipv4(frag=no), packets:102977, bytes:8012828, used:0.086s, flags:SFPR., actions:ct(zone=21,nat),recirc(0x1bfd58)

CT ZONE 21

recirc_id(0x1bfd58),in_port(ens801f1),ct_state(-new-est-rel-rpl+inv+trk),ct_label(0/0x1),eth(src=3c:fd:fe:b5:18:8c,dst=98:03:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(dst=10.0.1.13,ttl=64,frag=no), packets:6422, bytes:475220, used:0.085s, flags:SF., actions:drop

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

CX-5 ovs hardware offload:

[root@sriov-worker-0 core]# ethtool -i ens801f1
driver: mlx5e_rep
version: 4.18.0-305.2.1.el8_4.x86_64
firmware-version: 16.29.2002 (MT_0000000012)
expansion-rom-version: 
bus-info: 0000:b0:00.1
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@sriov-worker-0 core]# lspci -vv -nn -mm -s 0000:b0:00.1
Slot:	b0:00.1
Class:	Ethernet controller [0200]
Vendor:	Mellanox Technologies [15b3]
Device:	MT27800 Family [ConnectX-5] [1017]
SVendor:	Mellanox Technologies [15b3]
SDevice:	Mellanox ConnectX®-5 MCX516A-CCAT [0007]
NUMANode:

Comment 1 zenghui.shi 2021-05-17 07:32:41 UTC
> 
> 
> //------------------- REPLY DIRECTION ----------------------//
> 
> 
> recirc_id(0),in_port(ens801f1),eth_type(0x0800),ipv4(proto=6,frag=no),
> packets:245178, bytes:115601664, used:0.000s,
> actions:ct(zone=64000),recirc(0x9)
> 
> CT ZONE 64000
> 
> ct_state(+est+trk),recirc_id(0x9),in_port(ens801f1),eth(src=3c:fd:fe:b5:80:
> ac,dst=98:03:9b:97:38:df),eth_type(0x0800),ipv4(src=10.0.1.12,dst=10.0.1.13,
> proto=6,ttl=64,frag=no), packets:804, bytes:60778, used:0.140s,
> actions:ct_clear,ct(nat),recirc(0x1bfd57)
> 
> CT ZONE 0
> 
> recirc_id(0x1bfd57),in_port(ens801f1),ct_state(-new-est-trk),eth(),
> eth_type(0x0800),ipv4(frag=no), packets:102977, bytes:8012828, used:0.086s,
> flags:SFPR., actions:ct(zone=21,nat),recirc(0x1bfd58)
> 
> CT ZONE 21
> 
> recirc_id(0x1bfd58),in_port(ens801f1),ct_state(-new-est-rel-rpl+inv+trk),
> ct_label(0/0x1),eth(src=3c:fd:fe:b5:18:8c,dst=98:03:00:00:00:00/ff:ff:00:00:
> 00:00),eth_type(0x0800),ipv4(dst=10.0.1.13,ttl=64,frag=no), packets:6422,
> bytes:475220, used:0.085s, flags:SF., actions:drop
> 

I pasted the wrong flow; it should be:

recirc_id(0x1bfd58),in_port(ens801f1),ct_state(-new-est-rel-rpl+inv+trk),ct_label(0/0x1),eth(src=3c:fd:fe:b5:80:ac,dst=98:03:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(dst=10.0.1.13,ttl=64,frag=no), packets:794, bytes:59234, used:0.146s, flags:SFP., actions:drop

The above flow doesn't pass the packet either.

Comment 2 Marcelo Ricardo Leitner 2021-05-19 18:56:53 UTC
Considering the impacts of https://bugzilla.redhat.com/show_bug.cgi?id=1961097, I do believe this is a dupe/side effect of that one. Needs retesting after we backport that fix.

Comment 3 zenghui.shi 2021-05-25 04:04:14 UTC
(In reply to Marcelo Ricardo Leitner from comment #2)
> Considering the impacts of
> https://bugzilla.redhat.com/show_bug.cgi?id=1961097, I do believe this is a
> dupe/side effect of that one. Needs retesting after we backport that fix.

Tested with kernel 4.18.0-305.3.1.el8_4.mr634_210522_0128.x86_64; the issue can still be reproduced.


OS: 4.18.0-305.3.1.el8_4.mr634_210522_0128.x86_64
OVN: ovn2.13-20.12.0-115.el8fdp.x86_64
OVS: openvswitch2.15-2.15.0-15.el8fdp.x86_64
OVN-K8s master: 58b09851bfd564a09d7358b552a9d60bd25a7508


Hostnetwork pod IP: 10.0.1.13
Service backend IPs: 10.0.1.10/10.0.1.11/10.0.1.12


//------------------- ORIG DIRECTION ----------------------//

recirc_id(0),in_port(br-ex),eth_type(0x0800),ipv4(dst=172.30.0.0/255.255.0.0,frag=no), packets:6, bytes:444, used:0.670s, actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0x15178)

CT ZONE 64001

ct_state(+new-est+trk),recirc_id(0x15178),in_port(br-ex),eth(src=98:03:9b:97:38:df,dst=3c:fd:fe:a0:d7:e1),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=172.30.0.1,proto=6,ttl=64,frag=no), packets:5, bytes:370, used:0.670s, actions:ct_clear,set(eth(dst=98:03:9b:97:38:df)),ct(zone=22),recirc(0x15179)

recirc_id(0x15179),in_port(br-ex),ct_state(+new+trk),eth(),eth_type(0x0800),ipv4(dst=172.30.0.1,proto=6,frag=no),tcp(dst=443), packets:6, bytes:444, used:0.686s, flags:S, actions:hash(l4(0)),recirc(0x1517a)

recirc_id(0x1517a),dp_hash(0x6/0xf),in_port(br-ex),eth(),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:ct(commit,zone=22,label=0x2/0x2,nat(dst=10.0.1.10:6443)),recirc(0x1517b)

CT ZONE 22

ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),recirc_id(0x1517b),in_port(br-ex),eth(src=98:03:9b:97:38:df,dst=98:03:9b:97:38:df),eth_type(0x0800),ipv4(dst=10.0.1.10,proto=6,ttl=64,frag=no), packets:0, bytes:0, used:4.510s, actions:set(eth(dst=3c:fd:fe:b5:18:8c)),set(ipv4(ttl=63)),ct(commit,nat(src=10.0.1.13)),recirc(0x17f30)

CT ZONE 0

ct_state(+new-est+trk),recirc_id(0x17f30),in_port(br-ex),eth(src=98:03:9b:97:38:df,dst=3c:fd:fe:b5:18:8c),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:0, bytes:0, used:4.510s, actions:ct_clear,ct(commit,zone=64000),ens801f1


//------------------- REPLY DIRECTION ----------------------//

recirc_id(0),in_port(ens801f1),eth_type(0x0800),ipv4(proto=6,frag=no), packets:383937, bytes:360298773, used:0.000s, actions:ct(zone=64000),recirc(0x8)

CT ZONE 64000

ct_state(+est+trk),recirc_id(0x8),in_port(ens801f1),eth(src=3c:fd:fe:b5:18:8c,dst=98:03:9b:97:38:df),eth_type(0x0800),ipv4(src=10.0.1.8/255.255.255.252,dst=10.0.1.13,proto=6,ttl=64,frag=no), packets:22, bytes:2783, used:0.290s, actions:ct_clear,ct(nat),recirc(0x10)

CT ZONE 0

ct_state(+new-est+trk),recirc_id(0x10),in_port(ens801f1),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:0, bytes:0, used:10.320s, actions:ct(zone=22,nat),recirc(0x11)
ct_state(-new-est+trk),recirc_id(0x10),in_port(ens801f1),eth_type(0x0800),ipv4(frag=no), packets:935292, bytes:915539357, used:0.010s, actions:ct(zone=22,nat),recirc(0x11)

CT ZONE 22

ct_state(-new-est-rel-rpl+inv+trk),ct_label(0/0x1),recirc_id(0x11),in_port(ens801f1),eth(src=3c:fd:fe:b5:18:8c,dst=98:03:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(dst=10.0.1.13,ttl=64,frag=no), packets:8, bytes:480, used:0.290s, actions:drop

Comment 4 zenghui.shi 2021-05-25 04:05:28 UTC
Created attachment 1786697 [details]
full ovs flow dump

full ovs flows for comment #3

Comment 5 zenghui.shi 2021-05-25 13:40:11 UTC
Created attachment 1786859 [details]
ovs flow dump with -m

`ovs-appctl dpctl/dump-flows -m` flow dump on the same environment as comment #3

Comment 6 zenghui.shi 2021-05-25 15:38:51 UTC
Flow analysis from attachment in comment #5

OS: 4.18.0-305.3.1.el8_4.mr634_210522_0128.x86_64
OVN: ovn2.13-20.12.0-115.el8fdp.x86_64
OVS: openvswitch2.15-2.15.0-15.el8fdp.x86_64
OVN-K8s master: 58b09851bfd564a09d7358b552a9d60bd25a7508


Hostnetwork pod IP: 10.0.1.13
Service backend IPs: 10.0.1.10/10.0.1.11/10.0.1.12


//------------------- ORIG DIRECTION ----------------------//

ufid:48cd5b5a-dde6-446f-8184-59d22e8db57b, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(br-ex),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=172.30.0.0/255.255.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:1325, bytes:104130, used:0.050s, dp:tc, actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0x9b04)

CT ZONE 64001

ufid:83fe8393-8166-47d5-b89d-a1fc585f6dcc, skb_priority(0/0),skb_mark(0/0),ct_state(0x21/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0x9b04),dp_hash(0/0),in_port(br-ex),packet_type(ns=0/0,id=0/0),eth(src=98:03:9b:97:38:df,dst=3c:fd:fe:a0:d7:e1),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=172.30.0.1,proto=6,tos=0/0,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:1050, bytes:77700, used:1.170s, dp:tc, actions:ct_clear,set(eth(dst=98:03:9b:97:38:df)),ct(zone=22),recirc(0x8c77)

CT ZONE 22

ufid:57ac030a-5800-4619-a3a4-f6c3377ee989, recirc_id(0x8c77),dp_hash(0/0),skb_priority(0/0),in_port(br-ex),skb_mark(0/0),ct_state(0x21/0x21),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=172.30.0.1,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=443),tcp_flags(0/0), packets:1052, bytes:77848, used:1.182s, flags:S, dp:ovs, actions:hash(l4(0)),recirc(0x8c78)

CT ZONE 0

ufid:ad539f8d-71a2-4a33-8e3b-6571e386d418, recirc_id(0x8c78),dp_hash(0xb/0xf),skb_priority(0/0),in_port(br-ex),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:0, bytes:0, used:never, dp:ovs, actions:ct(commit,zone=22,label=0x2/0x2,nat(dst=10.0.1.12:6443)),recirc(0x8c79)


CT ZONE 22

ufid:7c65a0c3-1d58-4553-b1bf-4f1b8e22614b, skb_priority(0/0),skb_mark(0/0),ct_state(0x21/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),recirc_id(0x8c79),dp_hash(0/0),in_port(br-ex),packet_type(ns=0/0,id=0/0),eth(src=98:03:9b:97:38:df,dst=98:03:9b:97:38:df),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=10.0.1.12,proto=6,tos=0/0,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:0, bytes:0, used:1.810s, dp:tc, actions:set(eth(dst=3c:fd:fe:b5:80:ac)),set(ipv4(ttl=63)),ct(commit,nat(src=10.0.1.13)),recirc(0xcd08)

CT ZONE 0

ufid:48b66e8f-82e7-4558-8abc-e8c9a6118833, skb_priority(0/0),skb_mark(0/0),ct_state(0x21/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0xcd08),dp_hash(0/0),in_port(br-ex),packet_type(ns=0/0,id=0/0),eth(src=98:03:9b:97:38:df,dst=3c:fd:fe:b5:80:ac),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/128.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:0, bytes:0, used:1.810s, dp:tc, actions:ct_clear,ct(commit,zone=64000),ens801f1


//------------------- REPLY DIRECTION ----------------------//

ufid:1e81a69b-0a8a-4dba-aa27-5304508c8af9, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens801f1),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=6,tos=0/0,ttl=0/0,frag=no),tcp(src=0/0,dst=0/0), packets:9974662, bytes:1548850913, used:0.010s, offloaded:yes, dp:tc, actions:ct(zone=64000),recirc(0x9)

CT ZONE 64000

ufid:35896391-d93c-4e4a-9571-cd78baff869c, skb_priority(0/0),skb_mark(0/0),ct_state(0x22/0x22),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0x9),dp_hash(0/0),in_port(ens801f1),packet_type(ns=0/0,id=0/0),eth(src=3c:fd:fe:b5:80:ac,dst=98:03:9b:97:38:df),eth_type(0x0800),ipv4(src=10.0.1.12,dst=10.0.1.13,proto=6,tos=0/0,ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:54621, bytes:56577533, used:0.490s, offloaded:yes, dp:tc, actions:ct_clear,ct(nat),recirc(0x6)


CT ZONE 0

ufid:cf5092dc-ab93-4f8d-af78-1cdf73a3893d, skb_priority(0/0),skb_mark(0/0),ct_state(0x20/0x23),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0x6),dp_hash(0/0),in_port(ens801f1),packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:3341399, bytes:756969215, used:0.070s, offloaded:yes, dp:tc, actions:ct(zone=22,nat),recirc(0x7)

CT ZONE 22

ufid:efc7f282-b9f4-4c84-9d48-e2706903f4a2, skb_priority(0/0),skb_mark(0/0),ct_state(0x30/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),recirc_id(0x7),dp_hash(0/0),in_port(ens801f1),packet_type(ns=0/0,id=0/0),eth(src=3c:fd:fe:b5:80:ac,dst=98:03:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=10.0.1.13,proto=0/0,tos=0/0,ttl=64,frag=no), packets:1049, bytes:62940, used:1.810s, dp:tc, actions:drop
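For cross-checking, the hex ct_state(value/mask) pairs in this -m dump correspond to the symbolic flags in the plain dumps above. A minimal decoding sketch, assuming the standard OVS ct_state bit layout (new=0x01, est=0x02, rel=0x04, rpl=0x08, inv=0x10, trk=0x20; verify against your OVS version before relying on it):

```python
# Decode an OVS datapath ct_state(value/mask) pair into symbolic flags.
# Bit values assumed from the usual OVS ct_state definition; this is an
# illustrative helper, not OVS code.
CT_BITS = [
    ("new", 0x01),
    ("est", 0x02),
    ("rel", 0x04),
    ("rpl", 0x08),
    ("inv", 0x10),
    ("trk", 0x20),
]

def decode_ct_state(value: int, mask: int) -> str:
    """Return flags like '+new-est+trk'; unmasked bits are omitted."""
    out = []
    for name, bit in CT_BITS:
        if mask & bit:
            out.append(("+" if value & bit else "-") + name)
    return "".join(out)

# The dropped flow above matches ct_state(0x30/0x3f):
print(decode_ct_state(0x30, 0x3F))  # -new-est-rel-rpl+inv+trk
```

Decoding 0x21/0x23 this way gives +new-est+trk and 0x22/0x22 gives +est+trk, matching the symbolic dumps in comment #3.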

Comment 7 zenghui.shi 2021-05-25 15:47:42 UTC
> 
> //------------------- REPLY DIRECTION ----------------------//
> 

> CT ZONE 0
> 
> ufid:cf5092dc-ab93-4f8d-af78-1cdf73a3893d,
> skb_priority(0/0),skb_mark(0/0),ct_state(0x20/0x23),ct_zone(0/0),ct_mark(0/
                                           ^^^^^^^^^^ -new-est+trk should not be offloaded

> 0),ct_label(0/0),recirc_id(0x6),dp_hash(0/0),in_port(ens801f1),
> packet_type(ns=0/0,id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,
> dst=00:00:00:00:00:00/00:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.
> 0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no),
> packets:3341399, bytes:756969215, used:0.070s, offloaded:yes, dp:tc,
                                                 ^^^^^^^^^^^^^ marked as offloaded:yes by ovs

> actions:ct(zone=22,nat),recirc(0x7)

Could the above flow result in a drop flow like the one below?

> 
> CT ZONE 22
> 
> ufid:efc7f282-b9f4-4c84-9d48-e2706903f4a2,
> skb_priority(0/0),skb_mark(0/0),ct_state(0x30/0x3f),ct_zone(0/0),ct_mark(0/
> 0),ct_label(0/0x1),recirc_id(0x7),dp_hash(0/0),in_port(ens801f1),
> packet_type(ns=0/0,id=0/0),eth(src=3c:fd:fe:b5:80:ac,dst=98:03:00:00:00:00/
                                     ^^^^^^^^^^^^^^^^ match the src mac

> ff:ff:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=10.0.1.13,
> proto=0/0,tos=0/0,ttl=64,frag=no), packets:1049, bytes:62940, used:1.810s,
> dp:tc, actions:drop
                 ^^^^ dropped

Comment 8 Marcelo Ricardo Leitner 2021-05-25 17:45:49 UTC
(In reply to zenghui.shi from comment #7)
> > 
> > //------------------- REPLY DIRECTION ----------------------//
> > 
> 
> > CT ZONE 0
> > 
> > ufid:cf5092dc-ab93-4f8d-af78-1cdf73a3893d,
> > skb_priority(0/0),skb_mark(0/0),ct_state(0x20/0x23),ct_zone(0/0),ct_mark(0/
>                                            ^^^^^^^^^^ -new-est+trk should
> not be offloaded

This should be fine, actually.
+new, +inv and +rel are the non-offloadable ones.
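As a quick sanity check when scanning dumps, that rule can be sketched as a tiny helper (my paraphrase of the comment above, illustrative only, not OVS code):

```python
import re

# Per the rule stated above: flows matching +new, +inv, or +rel are the
# non-offloadable ones. Hypothetical helper for eyeballing dumps.
NON_OFFLOADABLE = {"+new", "+inv", "+rel"}

def looks_offloadable(ct_state: str) -> bool:
    """ct_state is a symbolic flag string like '+est-new+trk'."""
    # Split into individual +flag/-flag tokens.
    tokens = re.findall(r"[+-][a-z]+", ct_state)
    return not any(t in NON_OFFLOADABLE for t in tokens)

print(looks_offloadable("-new-est+trk"))  # True  (new is negated, so fine)
print(looks_offloadable("+new-est+trk"))  # False
```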

Comment 9 Marcelo Ricardo Leitner 2021-05-25 18:27:38 UTC
Although I can't explain the action:drop yet, I guess Zenghui is also hitting the issues being fixed in
https://patchwork.ozlabs.org/project/ovn/patch/20210520230114.3697365-1-numans@ovn.org/
which are https://bugzilla.redhat.com/show_bug.cgi?id=1953278 and https://bugzilla.redhat.com/show_bug.cgi?id=1956740,
due to the ct zone swinging 22/0/22/0.

If it makes sense, OVN team, can we have a test package please?

Comment 10 Marcelo Ricardo Leitner 2021-05-25 18:34:30 UTC
(In reply to zenghui.shi from comment #6)
> Flow analysis from attachment in comment #5
> 
> OS: 4.18.0-305.3.1.el8_4.mr634_210522_0128.x86_64

For easy reference and FTR, this is https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/634/commits
https://gitlab.com/redhat/red-hat-ci-tools/kernel/cki-internal-pipelines/cki-internal-contributors/-/jobs/1284676077

Comment 11 zenghui.shi 2021-05-26 06:08:07 UTC
(In reply to zenghui.shi from comment #6)
> Flow analysis from attachment in comment #5
> 
> OS: 4.18.0-305.3.1.el8_4.mr634_210522_0128.x86_64
> OVN: ovn2.13-20.12.0-115.el8fdp.x86_64
> OVS: openvswitch2.15-2.15.0-15.el8fdp.x86_64
> OVN-K8s master: 58b09851bfd564a09d7358b552a9d60bd25a7508
> 
> 
> Hostnetwork pod IP: 10.0.1.13
> Service backend IPs: 10.0.1.10/10.0.1.11/10.0.1.12
> 
> 
> //------------------- ORIG DIRECTION ----------------------//
> 
> ufid:48cd5b5a-dde6-446f-8184-59d22e8db57b,
> skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),
> ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(br-ex),packet_type(ns=0/0,
> id=0/0),eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:
> 00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=172.30.0.0/255.
> 255.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:1325, bytes:104130,
> used:0.050s, dp:tc,
> actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0x9b04)
> 
> CT ZONE 64001
> 
> ufid:83fe8393-8166-47d5-b89d-a1fc585f6dcc,
> skb_priority(0/0),skb_mark(0/0),ct_state(0x21/0x23),ct_zone(0/0),ct_mark(0/
> 0),ct_label(0/0),recirc_id(0x9b04),dp_hash(0/0),in_port(br-ex),
> packet_type(ns=0/0,id=0/0),eth(src=98:03:9b:97:38:df,dst=3c:fd:fe:a0:d7:e1),
> eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=172.30.0.1,proto=6,tos=0/0,
> ttl=64,frag=no),tcp(src=0/0,dst=0/0), packets:1050, bytes:77700,
> used:1.170s, dp:tc,
> actions:ct_clear,set(eth(dst=98:03:9b:97:38:df)),ct(zone=22),recirc(0x8c77)
> 
> CT ZONE 22
> 
> ufid:57ac030a-5800-4619-a3a4-f6c3377ee989,
> recirc_id(0x8c77),dp_hash(0/0),skb_priority(0/0),in_port(br-ex),skb_mark(0/
> 0),ct_state(0x21/0x21),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=00:00:
> 00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:00:00:00),
> eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=172.30.0.1,proto=6,tos=0/0,
> ttl=0/0,frag=no),tcp(src=0/0,dst=443),tcp_flags(0/0), packets:1052,
> bytes:77848, used:1.182s, flags:S, dp:ovs, actions:hash(l4(0)),recirc(0x8c78)
> 
> CT ZONE 0
> 
> ufid:ad539f8d-71a2-4a33-8e3b-6571e386d418,
> recirc_id(0x8c78),dp_hash(0xb/0xf),skb_priority(0/0),in_port(br-ex),
> skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
> eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:
> 00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,
> proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:0, bytes:0, used:never, dp:ovs,
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ no pkt hit this rule
> actions:ct(commit,zone=22,label=0x2/0x2,nat(dst=10.0.1.12:6443)),
> recirc(0x8c79)
> 

From conntrack entries, pkt is not dnated on the original direction:

Service IP: 172.30.0.1
Backend endpoint IPs: 10.0.1.10/10.0.1.11/10.0.1.12
Node IP: 10.0.1.13


ipv4     2 tcp      6 116 SYN_SENT src=169.254.169.2 dst=172.30.0.1 sport=12346 dport=443 [UNREPLIED] src=172.30.0.1 dst=169.254.169.2 sport=443 dport=29409 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64001 use=2
                                        ^^^^^^^^^^^ It is expected that the pkt goes to zone 22 and gets dnated, but the entry shows it is still in zone 64001

                                                    The expected entry is something like:

                                                    ipv4     2 tcp      6 8 CLOSE src=169.254.169.2 dst=172.30.0.1 sport=12345 dport=443 src=10.0.1.12 dst=169.254.169.2 \
                                                    sport=6443 dport=12345 zone=22

ipv4     2 tcp      6 116 SYN_SENT src=10.0.1.13 dst=172.30.0.1 sport=12346 dport=443 [UNREPLIED] src=172.30.0.1 dst=169.254.169.2 sport=443 dport=12346 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64001 use=2

ipv4     2 tcp      6 116 SYN_SENT src=10.0.1.13 dst=172.30.0.1 sport=12346 dport=443 [UNREPLIED] src=172.30.0.1 dst=10.0.1.13 sport=443 dport=12346 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
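For reference, the missing DNAT can be spotted mechanically by pulling the zone and the reply tuple out of each conntrack line. A rough parser sketch (field layout assumed from the conntrack-tools output above; entries without an explicit zone= default to zone 0):

```python
import re

def parse_ct_line(line: str) -> dict:
    """Extract zone and the original/reply src/dst from a conntrack entry.

    The first src=/dst= pair is the original direction, the second is the
    reply direction, as printed by conntrack-tools.
    """
    srcs = re.findall(r"src=(\S+)", line)
    dsts = re.findall(r"dst=(\S+)", line)
    zone = re.search(r"zone=(\d+)", line)
    return {
        "orig": (srcs[0], dsts[0]),
        "reply": (srcs[1], dsts[1]),
        "zone": int(zone.group(1)) if zone else 0,
    }

entry = ("ipv4     2 tcp      6 116 SYN_SENT src=169.254.169.2 dst=172.30.0.1 "
         "sport=12346 dport=443 [UNREPLIED] src=172.30.0.1 dst=169.254.169.2 "
         "sport=443 dport=29409 zone=64001 use=2")
ct = parse_ct_line(entry)
# Reply src is still the service IP 172.30.0.1, i.e. no DNAT to a backend
# endpoint (10.0.1.10/11/12) ever happened for this connection.
print(ct["zone"], ct["reply"][0])  # 64001 172.30.0.1
```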

Comment 12 Marcelo Ricardo Leitner 2021-05-27 02:15:54 UTC
(In reply to zenghui.shi from comment #11)
> (In reply to zenghui.shi from comment #6)
...
> > CT ZONE 0
       ^^^^^^
> > 
> > ufid:ad539f8d-71a2-4a33-8e3b-6571e386d418,
> > recirc_id(0x8c78),dp_hash(0xb/0xf),skb_priority(0/0),in_port(br-ex),
> > skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
> > eth(src=00:00:00:00:00:00/00:00:00:00:00:00,dst=00:00:00:00:00:00/00:00:00:
> > 00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,
> > proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:0, bytes:0, used:never, dp:ovs,
>                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ no pkt
> hit this rule
> > actions:ct(commit,zone=22,label=0x2/0x2,nat(dst=10.0.1.12:6443)),
> > recirc(0x8c79)
> > 
> 
> From conntrack entries, pkt is not dnated on the original direction:
                          ^^^^^^^^^^^^^^^^^

(omitted the rest for simplicity)
Super! This matches the issue Ariel just root caused:
https://bugzilla.redhat.com/show_bug.cgi?id=1881824#c14
I'll have a new test kernel soon with his original patch.

Comment 13 Marcelo Ricardo Leitner 2021-05-27 04:06:49 UTC
New test kernel with Ariel's patch available here:
yum repo file: https://s3.upshift.redhat.com/DH-PROD-CKI/internal/310282800/repo-x86_64
Built from: https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/680

Comment 14 zenghui.shi 2021-05-27 07:03:06 UTC
(In reply to Marcelo Ricardo Leitner from comment #13)
> New test kernel with Ariel's patch available here:
> yum repo file:
> https://s3.upshift.redhat.com/DH-PROD-CKI/internal/310282800/repo-x86_64
> Built from:
> https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/680


Issue remains with the new test kernel 4.18.0-305.4.1.el8_4.mr680_210527_0238.x86_64

From conntrack entries, pkt is not dnated correctly on the original direction:

ipv4     2 tcp      6 4 SYN_RECV src=169.254.169.2 dst=172.30.0.1 sport=12346 dport=443 src=172.30.0.1 dst=169.254.169.2 sport=443 dport=62378 mark=0 secctx=system_u:object_r:unlabeled_t:s0 
                        ^^^^^^^^ pkt is not dnated correctly, but we got SYN_RECV?
zone=64001 use=2

ipv4     2 tcp      6 64 SYN_SENT src=10.0.1.13 dst=172.30.0.1 sport=12346 dport=443 [UNREPLIED] src=172.30.0.1 dst=169.254.169.2 sport=443 dport=12346 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64001 use=2

ipv4     2 tcp      6 64 SYN_SENT src=10.0.1.13 dst=172.30.0.1 sport=12346 dport=443 [UNREPLIED] src=172.30.0.1 dst=10.0.1.13 sport=443 dport=12346 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

Comment 15 Dumitru Ceara 2021-05-27 14:07:12 UTC
@zshi, as discussed in the meeting earlier, this is a scratch ovn2.13 build including Mark's and Numan's v3 [0]:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=37028480

Regards,
Dumitru

[0] https://github.com/numansiddique/ovn/commit/97574a3844527e801ad01c4f3b5a5de6ce6abfec

Comment 16 zenghui.shi 2021-05-31 04:18:21 UTC
(In reply to Dumitru Ceara from comment #15)
> @zshi, as discussed in the meeting earlier, this is a scratch
> ovn2.13 build including Mark's and Numan's v3 [0]:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=37028480
> 
> Regards,
> Dumitru
> 
> [0]
> https://github.com/numansiddique/ovn/commit/
> 97574a3844527e801ad01c4f3b5a5de6ce6abfec

Dumitru, thanks for the build!

The issue remains after upgrading ovn to ovn2.13-20.12.0-136.el8fdp.x86_64 and kernel to kernel 4.18.0-305.4.1.el8_4.mr680_210527_0238.x86_64

symptom is the same as in comment #14.

Comment 17 Adrian Chiris 2021-06-01 15:23:51 UTC
Internally we are also seeing issues with host network pod to service backed by host network. Tested with Mark's/Numan's V4 OVN patch.

running : 

# curl https://10.96.0.1/livez --insecure

The first time it works OK; the next execution takes a long time to complete.
On the wire I see packets without SNAT & DNAT applied.

10.96.0.1.https > 169.254.169.2.43752

Conntrack output:

tcp      6 src=169.254.169.2 dst=10.96.0.1 sport=37154 dport=443 src=10.11.0.10 dst=169.254.169.2 sport=6443 dport=37154 [ASSURED] mark=0 secctx=null zone=8 use=2
tcp      6 116 SYN_SENT src=10.11.0.11 dst=10.96.0.1 sport=37172 dport=443 [UNREPLIED] src=10.96.0.1 dst=169.254.169.2 sport=443 dport=37172 mark=0 secctx=null zone=64001 use=1
tcp      6 116 SYN_SENT src=10.11.0.11 dst=10.96.0.1 sport=37172 dport=443 [UNREPLIED] src=10.96.0.1 dst=10.11.0.11 sport=443 dport=37172 mark=0 secctx=null use=1
tcp      6 56 SYN_RECV src=169.254.169.2 dst=10.11.0.10 sport=42125 dport=6443 src=10.11.0.10 dst=10.11.0.11 sport=6443 dport=42125 mark=0 secctx=null use=1
tcp      6 src=169.254.169.2 dst=10.11.0.10 sport=37154 dport=6443 src=10.11.0.10 dst=10.11.0.11 sport=6443 dport=37154 [ASSURED] mark=0 secctx=null use=2
tcp      6 56 SYN_RECV src=10.11.0.11 dst=10.11.0.10 sport=42125 dport=6443 src=10.11.0.10 dst=10.11.0.11 sport=6443 dport=42125 mark=0 secctx=null zone=64000 use=1
tcp      6 56 SYN_RECV src=169.254.169.2 dst=10.96.0.1 sport=42125 dport=443 src=10.11.0.10 dst=169.254.169.2 sport=6443 dport=42125 mark=0 secctx=null zone=8 use=1
tcp      6 56 SYN_RECV src=169.254.169.2 dst=10.96.0.1 sport=37172 dport=443 src=10.96.0.1 dst=169.254.169.2 sport=443 dport=42125 mark=0 secctx=null zone=64001 use=1

Comment 18 Marcelo Ricardo Leitner 2021-06-01 15:56:03 UTC
(In reply to Adrian Chiris from comment #17)
> Internally we are also seeing issues with host network pod to service backed
> by host network . tested with Mark's/Numan's V4 OVN patch

With upstream or downstream kernel? I would assume upstream at this stage, but please confirm. Thanks

Comment 19 Adrian Chiris 2021-06-02 10:01:26 UTC
> With upstream or downstream kernel? I would assume upstream at this stage, but please confirm. Thanks

It's a downstream kernel, but not a RH one. However, it contains all the needed upstream kernel fixes/support that were mapped.

Comment 20 Alaa Hleihel (NVIDIA Mellanox) 2021-06-13 10:18:53 UTC
note to self: internal RM ticket 2648680

Comment 21 Marcelo Ricardo Leitner 2021-07-07 23:54:24 UTC
I have a theory on this.

After hours troubleshooting this with Zenghui today, one thing caught my eye:
taking tcpdumps on br-ex shows that the SYN packet already has the 1st SNAT done.
That happens because, since
95255018a83e ("ovs-tc: allow offloading TC rules to egress qdiscs"),
ovs uses egress rules to cope with the lack of representor ports.
As TC rules on egress are executed BEFORE the taps, that means some TC rule got executed.

In today's tests, we always took quite a while to re-run the test, so datapath flows would end up expiring. There was no specific reason for doing the tests like this, though. Still, some other traffic could be lighting up flows in the background.

Then, as Adrian noted:
(In reply to Adrian Chiris from comment #17)
> first time it works OK, the next execution takes a long time to complete
> on the wire I see packets without SNAT & DNAT applied.

Now my theory:
That's likely because the very first one was an upcall, handled entirely by vswitchd, while the subsequent ones trigger an unexpected situation. We didn't see un-NATed packets on the wire, but we did see a missing one on br-ex on the ingress side: the last SNAT was not undone. Because:

When the egress rules attached to br-ex by the above hit a miss (like after doing the 1st SNAT), I don't see a way for them to tell OVS "hey, I went up to chain X" like we have on the ingress side. So vswitchd handles it as if nothing ever happened in TC land!

__dev_queue_xmit
  sch_handle_egress
    tcf_classify     <--- listed below
  dev_hard_start_xmit
    xmit_one
      dev_queue_xmit_nit  <-- our tap point for tcpdump
    netdev_start_xmit
      __netdev_start_xmit
        ops->ndo_start_xmit  (which is internal_dev_xmit , from vport-internal_dev.c)
          ovs_vport_receive
            ovs_dp_process_packet
              ovs_flow_tbl_lookup_stats returns NULL, no flow in dp:ovs, so
              ovs_dp_upcall    <--- with 0 knowledge that dp:tc already SNATed this packet.


  int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
                   struct tcf_result *res, bool compat_mode)
  {
          u32 last_executed_chain = 0;

          return __tcf_classify(skb, tp, tp, res, compat_mode,
                                &last_executed_chain);
  }

This explains why we saw 2 conntrack entries for the same connection in today's tests:
ipv4     2 tcp      6 59 SYN_RECV src=169.254.169.2 dst=172.30.0.1 sport=12345 dport=443 src=172.30.0.1 dst=169.254.169.2 sport=443 dport=56804 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64001 use=2
ipv4     2 tcp      6 119 SYN_SENT src=192.168.111.25 dst=172.30.0.1 sport=12345 dport=443 [UNREPLIED] src=172.30.0.1 dst=169.254.169.2 sport=443 dport=12345 mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64001 use=2

Note that both are on zone=64001, and the 1st one has a bogus src ip.
The 2nd one is the right one. The 1st one is created by the issue above, when vswitchd handles the packet unaware of the previous processing.

Sounds like we have 2 features colliding here: tc on egress and the tc chain fallback for CT are not integrated here.

Comment 22 Marcelo Ricardo Leitner 2021-07-08 00:42:48 UTC
Adding more people for awareness. The above is likely fixable with a kernel patch in the tc/core stack (doesn't mean it's an easy fix!).
The kernel OVS datapath will already handle the chain information properly once it is available to it.

Comment 23 Alaa Hleihel (NVIDIA Mellanox) 2021-07-08 13:47:34 UTC
Hi, Marcelo.

Following your analysis, Roi suggested this fix, can you try it?

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3973,7 +3973,8 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
        qdisc_skb_cb(skb)->post_ct = false;
        mini_qdisc_bstats_cpu_update(miniq, skb);
 
-       switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
+       switch (tcf_classify_ingress(skb, miniq->block, miniq->filter_list,
+                                    &cl_res, false)) {
        case TC_ACT_OK:
        case TC_ACT_RECLASSIFY:
                skb->tc_index = TC_H_MIN(cl_res.classid);

Comment 24 Marcelo Ricardo Leitner 2021-07-08 13:50:50 UTC
Ariel shared on the mtg today that they found another bug around this.
When mirred sends a packet towards an internal port, there is no scrubbing, and thus the skb may carry a previous conntrack state on it.

tcf_mirred_forward        (skb with CT info on it)
  netif_receive_skb
    netif_receive_skb_internal
      __netif_receive_skb
        ...

This affects packets going from a representor to an internal port.

Comment 25 Marcelo Ricardo Leitner 2021-07-08 13:55:41 UTC
(In reply to Alaa Hleihel (NVIDIA Mellanox) from comment #23)
> Hi, Marcelo.
> 
> Following your analysis, Roi suggested this fix, can you try it?

Nice! Thanks. Yes. I'll build a test kernel as soon as Ariel shares the patch for the scrubbing issue.

Comment 26 Alaa Hleihel (NVIDIA Mellanox) 2021-07-08 14:05:20 UTC
He probably referred to this fix; Roi said it fixed host-to-pod traffic.
The upstream patch might be different, but it would be nice to get initial testing.



    net: sched: act_mirred: Reset ct when reinserting skb into queue
    
    When we reinsert an skb back we should reset ct for reclassification.
    
    Signed-off-by: Roi Dayan <roid>

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5ae3e3197fb5..65560032b496 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -286,6 +286,8 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
 
                /* let's the caller reinsert the packet, if possible */
                if (use_reinsert) {
+                       if (want_ingress)
+                               nf_reset_ct(skb);
                        res->ingress = want_ingress;
                        if (skb_tc_reinsert(skb, res))
                                tcf_action_inc_overlimit_qstats(&m->common);

Comment 27 Marcelo Ricardo Leitner 2021-07-08 14:12:07 UTC
Yes, this one. Thanks.

Comment 28 Marcelo Ricardo Leitner 2021-07-08 14:21:45 UTC
(In reply to Alaa Hleihel (NVIDIA Mellanox) from comment #26)
>     net: sched: act_mirred: Reset ct when reinserting skb into queue
>     
>     When we reinsert an skb back we should reset ct for reclassification.
...
> @@ -286,6 +286,8 @@ static int tcf_mirred_act(struct sk_buff *skb, const
> struct tc_action *a,
>  
>                 /* let's the caller reinsert the packet, if possible */
>                 if (use_reinsert) {
> +                       if (want_ingress)
> +                               nf_reset_ct(skb);

Btw I wonder why not just call skb_scrub_packet() here.
It will overwrite skb->pkt_type , but that's what is used today in OVS cases, at least.

Comment 29 Marcelo Ricardo Leitner 2021-07-08 14:31:21 UTC
Test kernel is being built. It will be available in the draft MR below:
Commit list: https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/942/commits

This is based on latest 8.5 and also added the fix for ct_label 0.

Comment 30 Marcelo Ricardo Leitner 2021-07-09 20:13:52 UTC
(In reply to Marcelo Ricardo Leitner from comment #29)
> Test kernel is being built. It will be available in the draft MR below:
> Commit list:
> https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/942/commits
> 
> This is based on latest 8.5 and also added the fix for ct_label 0.

Infrastructure issues are preventing this build from completing.

Comment 31 Alaa Hleihel (NVIDIA Mellanox) 2021-07-12 14:46:28 UTC
(In reply to Marcelo Ricardo Leitner from comment #28)
> (In reply to Alaa Hleihel (NVIDIA Mellanox) from comment #26)
> >     net: sched: act_mirred: Reset ct when reinserting skb into queue
> >     
> >     When we reinsert an skb back we should reset ct for reclassification.
> ...
> > @@ -286,6 +286,8 @@ static int tcf_mirred_act(struct sk_buff *skb, const
> > struct tc_action *a,
> >  
> >                 /* let's the caller reinsert the packet, if possible */
> >                 if (use_reinsert) {
> > +                       if (want_ingress)
> > +                               nf_reset_ct(skb);
> 
> Btw I wonder why not just call skb_scrub_packet() here.
> It will overwrite skb->pkt_type , but that's what is used today in OVS
> cases, at least.

I talked to Roi, he said that they started with scrub and saw that it fixed the issue.
However, they weren't sure if that was too much or not, so they thought about doing something smaller and then they switched to the reset.
Anyway, they didn't get a chance to fully test both ways and really decide about the right approach (that's why they didn't post it yet).

But if OVS does scrub, it sounds reasonable to do it here too.
It will be great if you could try it, and then you could even post the fix upstream, we'll ack it :)

Thanks for the help!
Alaa

Comment 32 zenghui.shi 2021-07-13 02:00:11 UTC
I applied the kernel builds[1] to the OpenShift worker nodes and it fixed the issue (host-networked pod unable to access a k8s service backed by a host-networked pod).

ovs: openvswitch2.15-2.15.0-24.el8fdp.x86_64
ovn: ovn2.13-20.12.0-140.el8fdp.x86_64
kernel: 4.18.0-322.el8.mr942_210708_1548.x86_64

[1]: https://s3.upshift.redhat.com/DH-PROD-CKI/internal/333962177/repo-x86_64.repo

Comment 33 Marcelo Ricardo Leitner 2021-07-15 01:50:37 UTC
(In reply to Alaa Hleihel (NVIDIA Mellanox) from comment #31)
> (In reply to Marcelo Ricardo Leitner from comment #28)
> > (In reply to Alaa Hleihel (NVIDIA Mellanox) from comment #26)
> > >     net: sched: act_mirred: Reset ct when reinserting skb into queue
> > >     
> > >     When we reinsert an skb back we should reset ct for reclassification.
> > ...
> > > @@ -286,6 +286,8 @@ static int tcf_mirred_act(struct sk_buff *skb, const
> > > struct tc_action *a,
> > >  
> > >                 /* let's the caller reinsert the packet, if possible */
> > >                 if (use_reinsert) {
> > > +                       if (want_ingress)
> > > +                               nf_reset_ct(skb);
> > 
> > Btw I wonder why not just call skb_scrub_packet() here.
> > It will overwrite skb->pkt_type , but that's what is used today in OVS
> > cases, at least.
> 
> I talked to Roi, he said that they started with scrub and saw that it fixed
> the issue.
> However, they weren't sure if that was too much or not, so they thought
> about doing something smaller and then they switched to the reset.
> Anyway, they didn't get a chance to fully test both ways and really decide
> about the right approach (that's why they didn't post it yet).
> 
> But if OVS does scrub, it sounds reasonable to do it here too.

I think it does only when crossing net namespaces, but not interfaces.
https://elixir.bootlin.com/linux/latest/source/net/openvswitch/vport.c#L443

Now I'm wondering if:
- this is really an issue
- or an expected (and weird) behavior,
- or if OvS is also affected.
- or I am missing something :D

I seem to recall many flows having a ct_clear before hitting mirred. Maybe that's why.

> It will be great if you could try it, and then you could even post the fix
> upstream, we'll ack it :)

Yup, ok. Thanks!

Comment 34 Marcelo Ricardo Leitner 2021-07-15 01:53:05 UTC
(lets move this discussion to the other bug, bz1980532)

Comment 35 Alaa Hleihel (NVIDIA Mellanox) 2021-07-15 07:08:07 UTC
what about the egress chains restore?
are we good?
can we submit it upstream?

Comment 36 Marcelo Ricardo Leitner 2021-07-15 13:17:24 UTC
we're good, as in, we're working on it :)
Davide is working on an upstreamable version of it at:
https://bugzilla.redhat.com/show_bug.cgi?id=1980537

Comment 37 Alaa Hleihel (NVIDIA Mellanox) 2021-07-15 14:43:24 UTC
Thanks a lot, guys :)

Comment 38 Marcelo Ricardo Leitner 2021-08-10 20:24:24 UTC
Folks, AFAICT this bz is only waiting for bz1980537 and bz1980532 to get applied downstream now.
I built an MR with both fixes so Zenghui can try again, now with the final fixes, and confirm that it works.
With that, I'll take this bug for now.
Zenghui, I'll share a more direct URL once the kernel is built.
https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/1127

Comment 40 zenghui.shi 2021-08-17 10:43:01 UTC
(In reply to Marcelo Ricardo Leitner from comment #39)
> There it is:
> https://s3.upshift.redhat.com/DH-PROD-CKI/internal/351141150/repo-x86_64.repo


Marcelo, the kernel (4.18.0-330.el8.mr1127_210810_2027.x86_64) doesn't fix this bug.
When I applied the above kernel, host network pod failed to communicate with service backed by host network pod.

I then reverted back to previous kernel (4.18.0-322.el8.mr942_210708_1548.x86_64) which works.

What's the difference between these two kernels?

Comment 41 Marcelo Ricardo Leitner 2021-08-17 14:04:00 UTC
That is some unexpected news, Zenghui. I'm not sure why that happened.
The only related difference is that now I used the patches that got accepted upstream.
Hmm...

Comment 42 Marcelo Ricardo Leitner 2021-08-17 22:05:30 UTC
(In reply to zenghui.shi from comment #0)
> //------------------- ORIG DIRECTION ----------------------//
...
> recirc_id(0x1cf7e6),in_port(br-ex),ct_state(+new-est+trk),eth(src=98:03:9b:
                      ^^^^^^^^^^^^^^
> 97:38:df,dst=3c:fd:fe:b5:80:ac),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,
> frag=no), packets:9, bytes:666, used:7.445s, flags:S,
> actions:ct_clear,ct(commit,zone=64000),ens801f1
                                         ^^^^^^^^

And I don't see a log on the bz, but it should simply be swapped for the reply direction.

Original patch for clearing CT info was:
@@ -303,6 +306,8 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,

                /* let's the caller reinsert the packet, if possible */
                if (use_reinsert) {
+                       if (want_ingress)
+                               nf_reset(skb);
                        res->ingress = want_ingress;
                        err = tcf_mirred_forward(res->ingress, skb);
                        if (err)

While now it is:
@@ -278,6 +278,9 @@ static int tcf_mirred_act(struct sk_buff *skb, const struct tc_action *a,
                        goto out;
        }

+       /* All mirred/redirected skbs should clear previous ct info */
+       nf_reset(skb2);
+
        want_ingress = tcf_mirred_act_wants_ingress(m_eaction);

        expects_nh = want_ingress || !m_mac_header_xmit;


Thing is, nothing should be using this information by the time mirred is done in this setup, and it doesn't affect misses from dp:tc. Yet, this is the biggest delta between the test kernels.

Checking how OVS configs mirred:
nl_msg_put_flower_acts()
                  if (i == flower->action_count - 1) {
                      if (ingress) {
                          nl_msg_put_act_mirred(request, ifindex, TC_ACT_STOLEN,
                                                TCA_INGRESS_REDIR);               <---- hit this on reply
                      } else {
                          nl_msg_put_act_mirred(request, ifindex, TC_ACT_STOLEN,
                                                TCA_EGRESS_REDIR);                <---- hit this on orig
                      }
                  } else {
                      if (ingress) {
                          nl_msg_put_act_mirred(request, ifindex, TC_ACT_PIPE,
                                                TCA_INGRESS_MIRROR);
                      } else {
                          nl_msg_put_act_mirred(request, ifindex, TC_ACT_PIPE,
                                                TCA_EGRESS_MIRROR);
                      }
                  }

because the output action is the last one on the action list. With that:
          is_redirect = tcf_mirred_is_act_redirect(m_eaction);   // true in both cases
          use_reinsert = skb_at_tc_ingress(skb) && is_redirect &&
                         tcf_mirred_can_reinsert(retval);        // false for orig, true for reply
          if (!use_reinsert) {
                  skb2 = skb_clone(skb, GFP_ATOMIC);

with the new patch:
          nf_reset(skb2);   // clears CT info on cloned packet if orig, or on the same packet if reply
                            // due to difference in skb_at_tc_ingress()

          want_ingress = tcf_mirred_act_wants_ingress(m_eaction);  // false for orig, true for reply

then with the previous patch:
                /* let's the caller reinsert the packet, if possible */
                if (use_reinsert) {                   // false for orig traffic, so the change had no effect
+                       if (want_ingress)             // true for reply traffic
+                               nf_reset(skb);
                        res->ingress = want_ingress;
                        err = tcf_mirred_forward(res->ingress, skb);
                        ...
                }


          err = tcf_mirred_forward(want_ingress, skb2);  // triggered on orig traffic

The only difference I can tell is that the new patch clears CT info in the orig direction, which the previous patch didn't. Yet, as I mentioned earlier in the comment, this shouldn't affect this test. This is puzzling.

Comment 43 Marcelo Ricardo Leitner 2021-08-18 12:54:34 UTC
I'm afraid we either need another test or more debug info. I can revert one of the patches to the previous version, the one on clearing CT info above, and see how it goes.

Or we need dumps of the datapath flows and so on, like we had in comment #0, for both the good and the bad kernel, so we can spot differences. This one is more promising.

Zenghui, please let me know which one you prefer.

Comment 44 Marcelo Ricardo Leitner 2021-08-18 16:35:16 UTC
I wrote a reproducer here as close as I could; it failed on kernel-core-4.18.0-329.el8.x86_64. Then I tested on the new test kernel, and it works.

OVS flows:
addf="ovs-ofctl add-flow ${ovs_name}"

# client -> server
$addf "table=0,ipv4,in_port=$ovs_name,actions=ct(zone=1,table=1,nat)"
$addf "table=1,ipv4,ct_state=+trk+new,actions=ct(zone=1,commit,nat(dst=12.0.0.1)),ct_clear,resubmit(,2)"
$addf "table=1,ipv4,ct_state=+trk+est,actions=ct_clear,resubmit(,2)"
$addf "table=2,ipv4,tcp,in_port=$ovs_name,actions=ct(zone=2,table=3,nat)"
$addf "table=3,ipv4,tcp,tcp_dst=3000,actions=drop"
$addf "table=3,ipv4,tcp,ct_state=+trk+new,actions=ct(zone=2,commit,nat(src=12.0.0.2)),resubmit(,4)"
$addf "table=3,ipv4,tcp,ct_state=+trk+est,actions=resubmit(,4)"
$addf "table=4,ipv4,in_port=$ovs_name,action=output:$REP"

# server -> client
$addf "table=0,ipv4,in_port=$REP,actions=ct(zone=2,table=11,nat)"
$addf "table=11,ipv4,tcp,ct_state=+trk+est,actions=ct_clear,ct(zone=1,table=12,nat)"
$addf "table=12,ipv4,in_port=$REP,action=output:$ovs_name"

$addf "table=0,ipv6,actions=drop"
$addf "table=0,actions=NORMAL"


And a test to trigger a miss after a NAT:
sleep 10 | ip netns exec $VF nc -l 5000 &
pid=$!
ip netns exec $VF tcpdump -n -i $VF -w $VF.cap &
pid="$pid $!"
tcpdump -n -i $ovs_name -w $ovs_name.cap &
pid="$pid $!"
sleep 2

echo | nc -w 1 11.0.0.1 3000  || :
ovs-appctl dpctl/dump-flows -m
sleep 10 | nc 11.0.0.1 5000 &
sleep 1
grep 'zone=[12]' /proc/net/nf_conntrack || :
ovs-appctl dpctl/dump-flows -m
sleep 1
grep 'zone=[12]' /proc/net/nf_conntrack || :
sleep 1
grep 'zone=[12]' /proc/net/nf_conntrack || :


# uname -r
4.18.0-330.el8.mr1127_210810_2027.x86_64

# tshark -r br0.cap ip
Running as user "root" and group "root". This could be dangerous.
    1   0.000000     11.0.0.2 → 11.0.0.1     TCP 74 36864 → 3000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2031774104 TSecr=0 WS=128
    2   1.012988     11.0.0.2 → 12.0.0.1     TCP 74 38066 → 5000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2031775117 TSecr=0 WS=128
    3   1.043074     11.0.0.1 → 11.0.0.2     TCP 74 5000 → 38066 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=3731674343 TSecr=2031775117 WS=128
    4   1.043106     11.0.0.2 → 12.0.0.1     TCP 66 38066 → 5000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2031775147 TSecr=3731674343

#1: warm up packet
#2: DNAT already applied by TC
#3: we got a SYN/ACK

# tshark -r enp130s0f0v0.cap ip
Running as user "root" and group "root". This could be dangerous.
    6   2.480917     12.0.0.2 → 12.0.0.1     TCP 74 38066 → 5000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=2031775117 TSecr=0 WS=128
    7   2.480941     12.0.0.1 → 12.0.0.2     TCP 74 5000 → 38066 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=3731674343 TSecr=2031775117 WS=128
    8   2.511118     12.0.0.2 → 12.0.0.1     TCP 66 38066 → 5000 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=2031775147 TSecr=3731674343


While with -330.el8:
# tshark -r br0.cap ip
Running as user "root" and group "root". This could be dangerous.
    1   0.000000     11.0.0.2 → 11.0.0.1     TCP 74 54732 → 3000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=1895223696 TSecr=0 WS=128
    2   1.012617     11.0.0.2 → 12.0.0.1     TCP 74 47548 → 5000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=1895224709 TSecr=0 WS=128
    3   2.067455     11.0.0.2 → 12.0.0.1     TCP 74 [TCP Retransmission] 47548 → 5000 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=1895225764 TSecr=0 WS=128

#1: warm up packet
#2: packet with DNAT applied by TC
#3: retrans

Comment 45 zenghui.shi 2021-08-19 04:14:27 UTC
Created attachment 1815420 [details]
bz1961063/worker-3-flows-m.txt

OVS DATAPATH FLOW comparison:

worker-2  - working

kernel: 4.18.0-322.el8.mr942_210708_1548.x86_64

//------------------- ORIG DIRECTION ----------------------//

recirc_id(0),in_port(br-ex),eth_type(0x0800),ipv4(dst=172.30.0.0/255.255.0.0,frag=no), packets:51, bytes:5447, used:2.170s, actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0x3450d)

Zone 64001

ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x3),recirc_id(0x3450d),in_port(br-ex),eth(src=0c:42:a1:00:b6:9c,dst=52:54:00:a0:9d:83),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=172.30.0.1,proto=6,ttl=64,frag=no), packets:0, bytes:0, used:2.180s, actions:ct_clear,set(eth(dst=0c:42:a1:00:b6:9c)),ct(zone=45),recirc(0x3450e)

recirc_id(0x3450e),in_port(br-ex),ct_state(+new-est+trk),eth(),eth_type(0x0800),ipv4(dst=172.30.0.1,proto=6,frag=no),tcp(dst=443), packets:0, bytes:0, used:never, actions:hash(l4(0)),recirc(0x34541)

recirc_id(0x34541),dp_hash(0x5/0xf),in_port(br-ex),eth(),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:ct(commit,zone=45,label=0x2/0x2,nat(dst=192.168.111.22:6443)),recirc(0x3450f)

Zone 45

ct_state(+new-est-rel-rpl-inv+trk),ct_label(0x2/0x3),recirc_id(0x3450f),in_port(br-ex),eth(src=0c:42:a1:00:b6:9c,dst=0c:42:a1:00:b6:9c),eth_type(0x0800),ipv4(dst=192.168.111.22,proto=6,ttl=64,frag=no), packets:0, bytes:0, used:2.180s, actions:set(eth(dst=00:8a:2e:2b:9d:e0)),set(ipv4(ttl=63)),ct(commit,nat(src=192.168.111.26)),recirc(0x34510)

Zone 0

ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x3),recirc_id(0x34510),in_port(br-ex),eth(src=0c:42:a1:00:b6:9c,dst=00:8a:2e:2b:9d:e0),eth_type(0x0800),ipv4(dst=192.0.0.0/192.0.0.0,frag=no), packets:0, bytes:0, used:2.180s, actions:ct_clear,ct(commit,zone=64000),ens8f0

Zone 64000

//------------------- REPLY DIRECTION ----------------------//

recirc_id(0),in_port(ens8f0),eth_type(0x0800),ipv4(proto=6,frag=no), packets:5532000, bytes:4473169296, used:0.000s, actions:ct(zone=64000),recirc(0x8)

Zone 64000

ct_state(+est-rel+rpl-inv+trk),ct_label(0/0x3),recirc_id(0x8),in_port(ens8f0),eth(src=00:8a:2e:2b:9d:e0,dst=0c:42:a1:00:b6:9c),eth_type(0x0800),ipv4(src=192.168.111.16/255.255.255.248,dst=192.168.111.26,proto=6,ttl=64,frag=no), packets:499, bytes:403323, used:0.490s, actions:ct_clear,ct(nat),recirc(0x3429d)

Zone 0

ct_state(+est+trk),recirc_id(0x3429d),in_port(ens8f0),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:349, bytes:310513, used:0.490s, actions:ct(zone=45,nat),recirc(0x3429e)

Zone 45

ct_state(+est-rel+rpl-inv+trk),ct_label(0x2/0x3),recirc_id(0x3429e),in_port(ens8f0),eth(src=00:8a:2e:2b:9d:e0,dst=0c:42:a1:00:b6:9c),eth_type(0x0800),ipv4(src=172.30.0.0/255.255.0.0,dst=169.254.169.2,proto=6,ttl=64,frag=no), packets:26, bytes:17159, used:2.170s, actions:ct_clear,set(eth(src=0c:42:a1:00:b6:9c,dst=52:54:00:a0:9d:83)),set(ipv4(ttl=63)),ct(zone=64001,nat),recirc(0x3450c)

Zone 64001

recirc_id(0x3450c),in_port(ens8f0),eth(src=0c:42:a1:00:b6:9c,dst=52:54:00:a0:9d:83),eth_type(0x0800),ipv4(frag=no), packets:26, bytes:17159, used:2.170s, actions:set(eth(src=52:54:00:a0:9d:83,dst=0c:42:a1:00:b6:9c)),br-ex



worker-3 - NOT working

kernel: 4.18.0-330.el8.mr1127_210810_2027.x86_64

//------------------- ORIG DIRECTION ----------------------//

recirc_id(0),in_port(br-ex),eth_type(0x0800),ipv4(dst=172.30.0.0/255.255.0.0,frag=no), packets:32, bytes:3752, used:2.840s, actions:ct(commit,zone=64001,nat(src=169.254.169.2)),recirc(0xe7)

Zone 64001

ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x3),recirc_id(0xe7),in_port(br-ex),eth(src=0c:42:a1:08:0a:da,dst=52:54:00:a0:9d:83),eth_type(0x0800),ipv4(src=128.0.0.0/192.0.0.0,dst=172.30.0.1,proto=6,ttl=64,frag=no), packets:0, bytes:0, used:2.851s, actions:ct_clear,set(eth(dst=0c:42:a1:08:0a:da)),ct(zone=53),recirc(0xe8)

recirc_id(0xe8),in_port(br-ex),ct_state(+new+trk),eth(),eth_type(0x0800),ipv4(dst=172.30.0.1,proto=6,frag=no),tcp(dst=443), packets:0, bytes:0, used:never, actions:hash(l4(0)),recirc(0x105)

recirc_id(0x105),dp_hash(0x3/0xf),in_port(br-ex),eth(),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never, actions:ct(commit,zone=53,label=0x2/0x2,nat(dst=192.168.111.20:6443)),recirc(0xea)

Zone 53

ct_state(+new-est-rel-rpl-inv+trk),ct_label(0x2/0x3),recirc_id(0xea),in_port(br-ex),eth(src=0c:42:a1:08:0a:da,dst=0c:42:a1:08:0a:da),eth_type(0x0800),ipv4(dst=192.168.111.20,proto=6,ttl=64,frag=no), packets:0, bytes:0, used:2.850s, actions:set(eth(dst=00:8a:2e:2b:9d:d8)),set(ipv4(ttl=63)),ct(commit,nat(src=192.168.111.27)),recirc(0xeb)

Zone 0

ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x3),recirc_id(0xeb),in_port(br-ex),eth(src=0c:42:a1:08:0a:da,dst=00:8a:2e:2b:9d:d8),eth_type(0x0800),ipv4(dst=192.0.0.0/192.0.0.0,frag=no), packets:0, bytes:0, used:2.850s, actions:ct_clear,ct(commit,zone=64000),ens8f0

Zone 64000


//------------------- REPLY DIRECTION ----------------------//

recirc_id(0),in_port(ens8f0),eth_type(0x0800),ipv4(proto=6,frag=no), packets:13728, bytes:11573335, used:0.000s, actions:ct(zone=64000),recirc(0x7)

Zone 64000

ct_state(+est-rel+rpl-inv+trk),ct_label(0/0x3),recirc_id(0x7),in_port(ens8f0),eth(src=00:8a:2e:2b:9d:d8,dst=0c:42:a1:08:0a:da),eth_type(0x0800),ipv4(src=192.168.111.16/255.255.255.248,dst=192.168.111.27,proto=6,ttl=64,frag=no), packets:9, bytes:8392, used:0.560s, actions:ct_clear,ct(nat),recirc(0x10)

Zone 0

ct_state(+est+trk),recirc_id(0x10),in_port(ens8f0),eth_type(0x0800),ipv4(dst=0.0.0.0/128.0.0.0,frag=no), packets:0, bytes:0, used:1.820s, actions:ct(zone=53,nat),recirc(0xe1)

Zone 53

ct_state(+est-rel+rpl-inv+trk),ct_label(0x2/0x3),recirc_id(0xe1),in_port(ens8f0),eth(src=00:8a:2e:2b:9d:d8,dst=0c:42:a1:08:0a:da),eth_type(0x0800),ipv4(src=172.30.0.0/255.255.0.0,dst=169.254.169.2,proto=6,ttl=64,frag=no), packets:8, bytes:8326, used:0.560s, actions:ct_clear,set(eth(src=0c:42:a1:08:0a:da,dst=52:54:00:a0:9d:83)),set(ipv4(ttl=63)),ct(zone=64001,nat),recirc(0xec)

Zone 64001

recirc_id(0xec),in_port(ens8f0),eth(src=0c:42:a1:08:0a:da,dst=52:54:00:a0:9d:83),eth_type(0x0800),ipv4(frag=no), packets:8, bytes:416, used:2.840s, actions:set(eth(src=52:54:00:a0:9d:83,dst=0c:42:a1:08:0a:da)),br-ex


CT comparison:

worker-2 - working

kernel: 4.18.0-322.el8.mr942_210708_1548.x86_64

Use local port 12345: curl --insecure --local-port 12345 https://172.30.0.1:443

ipv4     2 tcp      6 src=169.254.169.2 dst=192.168.111.22 sport=12345 dport=6443 src=192.168.111.22 dst=192.168.111.26 sport=6443 dport=12345 [OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3
ipv4     2 tcp      6 src=169.254.169.2 dst=172.30.0.1 sport=12345 dport=443 src=192.168.111.22 dst=169.254.169.2 sport=6443 dport=12345 [OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=45 use=3
ipv4     2 tcp      6 src=192.168.111.26 dst=192.168.111.22 sport=12345 dport=6443 src=192.168.111.22 dst=192.168.111.26 sport=6443 dport=12345 [OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64000 use=3
ipv4     2 tcp      6 src=192.168.111.26 dst=172.30.0.1 sport=12345 dport=443 src=172.30.0.1 dst=169.254.169.2 sport=443 dport=12345 [OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64001 use=3
ipv4     2 tcp      6 8 CLOSE src=192.168.111.26 dst=172.30.0.1 sport=12345 dport=443 src=172.30.0.1 dst=192.168.111.26 sport=443 dport=12345 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2


worker-3 - NOT working

kernel: 4.18.0-330.el8.mr1127_210810_2027.x86_64

Use local port 12345: curl --insecure --local-port 12345 https://172.30.0.1:443

ipv4     2 tcp      6 src=192.168.111.27 dst=172.30.0.1 sport=12345 dport=443 src=172.30.0.1 dst=169.254.169.2 sport=443 dport=12345 [HW_OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64001 use=3
ipv4     2 tcp      6 431996 ESTABLISHED src=192.168.111.27 dst=172.30.0.1 sport=12345 dport=443 src=172.30.0.1 dst=192.168.111.27 sport=443 dport=12345 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2
ipv4     2 tcp      6 src=169.254.169.2 dst=192.168.111.20 sport=12345 dport=6443 src=192.168.111.20 dst=192.168.111.27 sport=6443 dport=12345 [HW_OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3
ipv4     2 tcp      6 src=169.254.169.2 dst=172.30.0.1 sport=12345 dport=443 src=192.168.111.20 dst=169.254.169.2 sport=6443 dport=12345 [HW_OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=53 use=3
ipv4     2 tcp      6 src=192.168.111.27 dst=192.168.111.20 sport=12345 dport=6443 src=192.168.111.20 dst=192.168.111.27 sport=6443 dport=12345 [HW_OFFLOAD] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=64000 use=3

Comment 46 zenghui.shi 2021-08-19 04:19:00 UTC
(In reply to Marcelo Ricardo Leitner from comment #43)
> I'm afraid we either need another test or more debug info. I can revert one
> of the patches to the previous version, the one on clearing CT info above,
> and see how it goes.
> 
> Or we need dumps of the datapath flows and so on, like we had in comment #0,
> for both the good and the bad kernel, so we can spot differences. This one is
> more promising.
> 
> Zenghui, please let me know which one you prefer.

Added comparisons for ovs datapath and CT between working node (worker-2) and non-working node (worker-3) in comment #45.

Full ovs datapath flows are attached as tarball in comment #45:

bz1961063/worker-3-flows-m.txt          -- worker-3 ovs datapath dump with -m
bz1961063/worker-3-flows-name.txt       -- worker-3 ovs datapath dump with --names
bz1961063/worker-2-flows-m.txt          -- worker-2 ovs datapath dump with -m
bz1961063/worker-2-flows-name.txt       -- worker-2 ovs datapath dump with --names

Comment 47 Marcelo Ricardo Leitner 2021-08-20 00:32:42 UTC
The only difference I could spot so far between working and non-working setups is that on the non-working one, the CT entries are HW_OFFLOAD while on the working setup, they were just OFFLOAD (which is not in HW).

This means that despite nearly all datapath flows being installed in dp:tc and (I didn't check this one by one yet) marked offloaded:yes, they are not processed in HW because the very first ct action would already be a miss.

Between the two kernels, the differences in tc and netfilter are small and well known. The effect above reminded me of 
0cc254e5aa37 ("net/sched: act_ct: Offload connections with commit action")
but a) there are no commits in the reply direction and b) it's present in the working kernel (bz1965817).
Point being, among the changes in tc and netfilter, I can't identify one that could lead to such a difference in the CT entries above.

One thing I had overlooked is that there is also a major driver rebase between these two kernels. Maybe something in it.
I'll rebase the non-working kernel on the same branch point as the working one, to minimize all changes to just these two commits, see how it goes and then go from there. So maybe we can say that the commits work, and something else broke it again, or maybe the new commits are simply not working well.


What I don't fully understand, and don't like, in BOTH setups, is a weird CT entry on zone 0. Both seem to have an extra CT entry on it.

Working:
ipv4     2 tcp      6 8 CLOSE src=192.168.111.26 dst=172.30.0.1     sport=12345 dport=443  src=172.30.0.1 dst=192.168.111.26 sport=443  dport=12345 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

Non-working:
ipv4     2 tcp      6 431996 ESTABLISHED src=192.168.111.27 dst=172.30.0.1     sport=12345 dport=443  src=172.30.0.1 dst=192.168.111.27 sport=443  dport=12345 [ASSURED]    mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=2

It seems to be an entry created by netfilter itself. It would be best if OVN could refrain from using zone 0.

Comment 48 Marcelo Ricardo Leitner 2021-08-20 20:04:29 UTC
(In reply to Marcelo Ricardo Leitner from comment #47)
> One thing I had overlooked is that there is also a major driver rebase
> between these two kernels. Maybe something in it.
> I'll rebase the non-working kernel on the same branch point as the working
> one, to minimize all changes to just these two commits, see how it goes and
> then go from there. So maybe we can say that the commits work, and something
> else broke it again, or maybe the new commits are simply not working well.

There it goes:
https://s3.upshift.redhat.com/DH-PROD-CKI/internal/356198025/repo-x86_64.repo
https://s3.upshift.redhat.com/DH-PROD-CKI/internal/356198025/x86_64/4.18.0-333.el8.mr1196_210820_0050.x86_64

This one is based on the same branch point as the working one, and really the only difference is the 2 refreshed tc patches.

Comment 49 zenghui.shi 2021-08-23 04:31:24 UTC
(In reply to Marcelo Ricardo Leitner from comment #48)
> (In reply to Marcelo Ricardo Leitner from comment #47)
> > One thing I had overlooked is that there is also a major driver rebase
> > between these two kernels. Maybe something in it.
> > I'll rebase the non-working kernel on the same branch point as the working
> > one, to minimize all changes to just these two commits, see how it goes and
> > then go from there. So maybe we can say that the commits work, and something
> > else broke it again, or maybe the new commits are simply not working well.
> 
> There it goes:
> https://s3.upshift.redhat.com/DH-PROD-CKI/internal/356198025/repo-x86_64.repo
> https://s3.upshift.redhat.com/DH-PROD-CKI/internal/356198025/x86_64/4.18.0-
> 333.el8.mr1196_210820_0050.x86_64

This new kernel works.

By just replacing the kernel with 4.18.0-333.el8.mr1196_210820_0050.x86_64, the issue disappeared.
All other relevant components (ovn-k8s, ovn and ovs) stay unchanged in the testing.

> 
> This one is based on the same branch point as the working one, and really
> the only difference is the 2 refreshed tc patches.

Comment 50 Marcelo Ricardo Leitner 2021-08-23 19:55:52 UTC
That's good. Thanks Zenghui. So the 8.4.z backports for those bugs should still be sound.
We still need to understand what broke it in 8.5 and fix it, though.

Comment 51 Marcelo Ricardo Leitner 2021-08-26 00:22:46 UTC
I reviewed the commits between the good and bad branching points again, and the few non-driver-related ones don't even look related to me.
I know this flow is not supposed to be offloaded, but it really seems that the driver update affected it.
Alaa, WDYT?

One way of checking it is with:
git log --oneline gl8u/merge-requests/1196..gl8u/merge-requests/1127 -- net/ drivers/net/ethernet/mellanox/mlx5

with 'gl8u' being your git remote for the gitlab rhel8 repository, and:
[remote "gl8u"]
        url = git:redhat/rhel/src/kernel/rhel-8.git
        fetch = +refs/heads/*:refs/remotes/gl8u/*
        fetch = +refs/merge-requests/*/head:refs/remotes/gl8u/merge-requests/*

MR 1196: test kernel (with backported upstream patches) using the original branch point, works.
MR 1127: test kernel (with backported upstream patches) using a refreshed branch point, doesn't work.
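For illustration, here is the same `git log <good>..<bad> -- <paths>` technique demonstrated against a throwaway repository (the real command above targets the internal rhel-8 GitLab repo and its merge-request refs; the branch names and file contents here are made up):

```shell
# Demo of narrowing a commit range to the paths of interest, as done above
# with net/ and drivers/net/ethernet/mellanox/mlx5.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email t@example.com && git config user.name t
mkdir -p net drivers
echo a > net/core.c
git add . && git commit -qm "base"
git branch good                       # stand-in for the working branch point
echo b > drivers/hw.c
git add . && git commit -qm "driver: rebase hw"
echo c > net/core.c
git add . && git commit -qm "net: refresh tc patches"
git branch bad                        # stand-in for the refreshed branch point
# Only commits in the range that touch net/ are shown; the driver-only
# commit is filtered out:
git log --oneline good..bad -- net/
```

The pathspec after `--` restricts the range to commits touching those trees, which is what lets the comment above separate driver rebase churn from networking changes.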

Comment 53 Marcelo Ricardo Leitner 2021-10-14 19:12:11 UTC
Latest testing revealed that it is now working with the latest 8.5 kernel and with the 8.4.z MR kernel for bz1992230.
With that, as discussed with Zenghui and Ariel, the next steps here are to wait for the z-stream fix above to be merged into an official build, re-test, make sure it works, and close this bz as CURRENTRELEASE.
I'm keeping the bz open as a placeholder until then.

Thanks folks!

Comment 54 Marcelo Ricardo Leitner 2021-10-21 11:51:28 UTC
Zenghui confirmed that with the official build for bz1992230 this use case is working in 8.4.z.
As it is also working on 8.5.0, let's finally close this bz!
Thanks folks.

Side note: a new bug was found while testing this: https://bugzilla.redhat.com/show_bug.cgi?id=2014027
Some conntrack entries linger after the test.

Comment 55 zenghui.shi 2022-01-11 04:37:23 UTC
*** Bug 1908570 has been marked as a duplicate of this bug. ***
