Bug 1796157 - Networking not working on default namespace
Summary: Networking not working on default namespace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Dan Winship
QA Contact: Weibin Liang
URL:
Whiteboard:
Depends On:
Blocks: 1771572 1800324
 
Reported: 2020-01-29 17:57 UTC by Federico Paolinelli
Modified: 2020-05-04 11:28 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned to: 1800324
Environment:
Last Closed: 2020-05-04 11:27:55 UTC
Target Upstream Version:
Embargoed:


Attachments
flows on the RHCOS worker node (34.44 KB, text/plain)
2020-02-04 22:19 UTC, Neil Horman


Links
System ID Private Priority Status Summary Last Updated
Github openshift sdn pull 103 0 None closed Bug 1796157: Fix handling of VNID 0 with NetworkPolicy 2020-12-20 09:39:31 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:28:18 UTC

Description Federico Paolinelli 2020-01-29 17:57:22 UTC
Description of problem:
I have two pods in the default namespace, one exposing a port with SCTP protocol and the other one acting as a client. When the client attempts to connect, the connection hangs.
It works if I do the same in a custom namespace.


Version-Release number of selected component (if applicable):
oc version
Client Version: v4.2.0
Server Version: 4.4.0-0.nightly-2020-01-29-073040
Kubernetes Version: v1.17.1


How reproducible:
Always

Steps to Reproduce:
1.
Apply the machine configuration to un-blacklist the sctp module on all the workers:

cat <<EOF | oc apply -f -

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: load-sctp-module
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - contents:
            source: data:,
            verification: {}
          filesystem: root
          mode: 420
          path: /etc/modprobe.d/sctp-blacklist.conf
        - contents:
            source: data:text/plain;charset=utf-8,sctp
          filesystem: root
          mode: 420
          path: /etc/modules-load.d/sctp-load.conf

EOF

Wait for the MCP to be ready:
oc wait mcp/worker --for condition=updated
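
(To confirm the module actually loaded after the workers rebooted, something like the following can be run against a worker; the node name is a placeholder:)

oc debug node/<worker-node-name> -- chroot /host lsmod | grep sctp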


2. Apply the following manifests to create the pods:

cat <<EOF | oc apply -f -

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpclient
  labels:
    app: sctpclient
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpclient
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpclient
    spec:
      restartPolicy: Always
      containers:
        - image: quay.io/wcaban/net-toolbox:latest
          imagePullPolicy: IfNotPresent
          name: sctpclient
          command: ["sleep", "infinity"]

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpserver
  labels:
    app: sctpserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpserver
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpserver
    spec:
      restartPolicy: Always
      containers:
        - image: quay.io/wcaban/net-toolbox:latest
          imagePullPolicy: IfNotPresent
          name: sctpserver
          command: ["sleep", "infinity"]
          ports:
            - containerPort: 30102
              protocol: SCTP
EOF
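(Optionally, before step 3, wait for both pods to become Ready; the commands below are illustrative and assume the default namespace:)

oc -n default wait --for=condition=Ready pod -l app=sctpserver
oc -n default wait --for=condition=Ready pod -l app=sctpclient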


3. Fetch the server pod's address with oc get pods -o wide

4. Connect to the server pod and start sctp_test:

bash-5.0$ sctp_test -H localhost -P 30102 -l

local:addr=::, port=30102, family=10
seed = 1580297852

Starting tests...
	socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
	bind(sk=3, [a:::,p:30102])  --  attempt 1/10
	listen(sk=3,backlog=100)

5. Connect to the client pod and launch sctp_test in client mode, using the server pod's address:

bash-5.0$ sctp_test -H localhost -P 30105 -h SERVER_POD_IP -p 30102 -s
remote:addr=10.129.0.52, port=30102, family=2
local:addr=::, port=30105, family=10
seed = 1580297881

Starting tests...
	socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
	bind(sk=3, [a:::,p:30105])  --  attempt 1/10



Actual results:

Nothing happens: the connection hangs and no data is exchanged.

Expected results:

The server shows received packets and the client shows sent packets.


Additional info:

As mentioned before, if I do the same in a custom namespace:

oc create ns sctptest



cat <<EOF | oc apply -f -

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpclient
  namespace: sctptest
  labels:
    app: sctpclient
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpclient
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpclient
    spec:
      restartPolicy: Always
      containers:
        - image: quay.io/wcaban/net-toolbox:latest
          imagePullPolicy: IfNotPresent
          name: sctpclient
          command: ["sleep", "infinity"]

---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpserver
  namespace: sctptest
  labels:
    app: sctpserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpserver
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpserver
    spec:
      restartPolicy: Always
      containers:
        - image: quay.io/wcaban/net-toolbox:latest
          imagePullPolicy: IfNotPresent
          name: sctpserver
          command: ["sleep", "infinity"]
          ports:
            - containerPort: 30102
              protocol: SCTP
EOF

It works. For example, on server side I see:

Starting tests...
	socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
	bind(sk=3, [a:::,p:30102])  --  attempt 1/10
	listen(sk=3,backlog=100)
Server: Receiving packets.
	recvmsg(sk=3) Notification: SCTP_ASSOC_CHANGE(COMMUNICATION_UP)
		(assoc_change: state=0, error=0, instr=10 outstr=10)
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657582 flags=0x1 ppid=221622501
cumtsn=3181657582
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657583 flags=0x1 ppid=1895765049
cumtsn=3181657583
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657584 flags=0x1 ppid=1703631155
cumtsn=3181657584
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657585 flags=0x1 ppid=1317151144
cumtsn=3181657585
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657586 flags=0x1 ppid=83569488
cumtsn=3181657586
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657587 flags=0x1 ppid=365621801
cumtsn=3181657587
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657588 flags=0x1 ppid=610560377
cumtsn=3181657588
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657589 flags=0x1 ppid=1188056854
cumtsn=3181657589
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657590 flags=0x1 ppid=163130494
cumtsn=3181657590
	recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
	  SNDRCV(stream=0 ssn=0 tsn=3181657591 flags=0x1 ppid=1813451120
cumtsn=3181657591
...

Also, I am pretty sure this was demoed in 4.3 without any issue.

Comment 2 Dan Winship 2020-01-30 12:19:24 UTC
Do you have any NetworkPolicies configured in either namespace?

Comment 3 Federico Paolinelli 2020-01-30 12:23:05 UTC
The last attempt was on a brand new cluster created via cluster-bot, only executing the listed commands, because I wanted to be sure.
I did not explicitly create any NetworkPolicy, but I can't be sure none were created for me (at least, in the default ns).

Will spin up a cluster again to see if there are any.

Comment 4 Dan Winship 2020-01-30 12:35:36 UTC
there are no networkpolicies created by default

Comment 5 Eric Paris 2020-01-30 14:33:10 UTC
Is it a difference in the SCCs applied to pods/containers in the default vs non-default namespace?

Comment 6 Neil Horman 2020-02-02 21:06:00 UTC
My understanding is that an SCC restricts pod permissions with SELinux, but the description suggests that the applications execute and then hang. To me that suggests they are allowed to make the needed system calls and just aren't sending packets back and forth, which in turn suggests an iptables rule issue. Is it possible to run tcpdump on the pods to see if packets are egressing the client pod and ingressing to the server pod? That would at least tell us which side of the connection to focus on.

Comment 7 Federico Paolinelli 2020-02-03 11:20:26 UTC
I need to reprovision the cluster. 
I am using cluster-bot to do that and currently it's not working, but it should be the same on any just-provisioned cluster where we apply the machine config to enable the sctp kernel module.

Will try to do it locally with a libvirt-provisioned cluster and get back with the tcpdump results.

Comment 8 Dan Winship 2020-02-03 13:07:20 UTC
(In reply to Neil Horman from comment #6)
> just aren't sending packets back and forth, which in turn suggests an
> iptables rule issue to me.

Based on the above Deployment, this is non-hostNetwork-pod to non-hostNetwork-pod traffic, so it would be entirely over OVS/VXLAN, which means it would never hit iptables. (And iptables isn't aware of pod namespaces anyway so it would be surprising if it could manage to screw up in this way...)

> but the description suggests that the applications execute, but hang

ah, yeah, but look where the hang is; the good one does:

	socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
	bind(sk=3, [a:::,p:30102])  --  attempt 1/10
	listen(sk=3,backlog=100)

but the bad one ends with the "bind" line and doesn't show a "listen". Assuming that's not just a cut+paste mistake, then it looks like either the listen() call is hanging, or else the code is doing some sort of retry on error, forever.


Ah!

> cat <<EOF | oc apply -f -
> 
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: load-sctp-module

That config only applies to worker nodes, but a deployment in Namespace "default" might be allowed to schedule pods on masters too... If that happened, the test would fail, because sctp would not have been loaded on the masters.

>         - contents:
>             source: data:,
>             verification: {}
>           filesystem: root
>           mode: 420
>           path: /etc/modprobe.d/sctp-blacklist.conf

(FWIW this should be unnecessary; blacklisting only stops it from being *autoloaded*. If you add the /etc/modules-load.d/sctp-load.conf file explicitly loading it then it doesn't matter if it's blacklisted as well.)
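
(For illustration, a minimal variant along those lines, keeping only the modules-load.d entry from the reporter's MachineConfig above, would look roughly like this; untested sketch:)

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: load-sctp-module
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8,sctp
          filesystem: root
          mode: 420
          path: /etc/modules-load.d/sctp-load.conf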

Comment 9 Federico Paolinelli 2020-02-03 13:17:38 UTC
(In reply to Dan Winship from comment #8)
> (In reply to Neil Horman from comment #6)
> > just aren't sending packets back and forth, which in turn suggests an
> > iptables rule issue to me.
> 
> Based on the above Deployment, this is non-hostNetwork-pod to
> non-hostNetwork-pod traffic, so it would be entirely over OVS/VXLAN, which
> means it would never hit iptables. (And iptables isn't aware of pod
> namespaces anyway so it would be surprising if it could manage to screw up
> in this way...)
> 
> > but the description suggests that the applications execute, but hang
> 
> ah, yeah, but look where the hang is; the good one does:
> 
> 	socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
> 	bind(sk=3, [a:::,p:30102])  --  attempt 1/10
> 	listen(sk=3,backlog=100)
> 
> but the bad one ends with the "bind" line and doesn't show a "listen".
> Assuming that's not just a cut+paste mistake, then it looks like either the
> listen() call is hanging, or else the code is doing some sort of retry on
> error, forever.
> 

It might have been a copy/paste error.
I just re-ran the bad one for tcpdump's sake:

On server side I see:
[root@sctpserver-5ff87f99fc-w2nhb /]# sctp_test -H localhost -P 30102 -l
local:addr=::, port=30102, family=10
seed = 1580735765

Starting tests...
	socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
	bind(sk=3, [a:::,p:30102])  --  attempt 1/10
	listen(sk=3,backlog=100)
Server: Receiving packets.
	recvmsg(sk=3) 


On client side:
[root@sctpclient-645b994bf5-mmz88 /]# sctp_test -H localhost -P 30105 -h 10.128.2.6 -p 30102 -s
remote:addr=10.128.2.6, port=30102, family=2
local:addr=::, port=30105, family=10
seed = 1580735767

Starting tests...
	socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
	bind(sk=3, [a:::,p:30105])  --  attempt 1/10
Client: Sending packets.(1/10)
	sendmsg(sk=3, assoc=0)    1 bytes.
	  SNDRCV(stream=0 flags=0x1 ppid=1420752228



> 
> Ah!
> 
> > cat <<EOF | oc apply -f -
> > 
> > apiVersion: machineconfiguration.openshift.io/v1
> > kind: MachineConfig
> > metadata:
> >   labels:
> >     machineconfiguration.openshift.io/role: worker
> >   name: load-sctp-module
> 
> That config only applies to worker nodes, but a deployment in Namespace
> "default" might be allowed to schedule pods on masters too... If that
> happened, the test would fail, because sctp would not have been loaded on
> the masters.

Masters should have taints to prevent that. Also, we would see a different type of failure.
In any case, I just tried, and:

oc get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
sctpclient-645b994bf5-mmz88   1/1     Running   0          12m   10.128.2.5   ip-10-0-133-221.us-west-2.compute.internal   <none>           <none>
sctpserver-5ff87f99fc-w2nhb   1/1     Running   0          12m   10.128.2.6   ip-10-0-133-221.us-west-2.compute.internal   <none>           <none>


oc get nodes | grep ip-10-0-133-221.us-west-2.compute.internal
ip-10-0-133-221.us-west-2.compute.internal   Ready    worker   46m   v1.17.1




> 
> >         - contents:
> >             source: data:,
> >             verification: {}
> >           filesystem: root
> >           mode: 420
> >           path: /etc/modprobe.d/sctp-blacklist.conf
> 
> (FWIW this should be unnecessary; blacklisting only stops it from being
> *autoloaded*. If you add the /etc/modules-load.d/sctp-load.conf file
> explicitly loading it then it doesn't matter if it's blacklisted as well.)

Ah, good to know, thanks!

Comment 10 Federico Paolinelli 2020-02-03 13:19:43 UTC
TCPDUMP:

Nothing on server side:
[root@sctpserver-5ff87f99fc-w2nhb /]# tcpdump -i any sctp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes



On client side:
[root@sctpclient-645b994bf5-mmz88 /]# tcpdump -i any sctp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
13:18:32.696576 IP sctpclient-645b994bf5-mmz88.30105 > ip-10-128-2-6.us-west-2.compute.internal.30102: sctp (1) [INIT] [init tag: 3738353619] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 906454160] 
13:18:35.707623 IP sctpclient-645b994bf5-mmz88.30105 > ip-10-128-2-6.us-west-2.compute.internal.30102: sctp (1) [INIT] [init tag: 3738353619] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 906454160] 
13:18:42.107647 IP sctpclient-645b994bf5-mmz88.30105 > ip-10-128-2-6.us-west-2.compute.internal.30102: sctp (1) [INIT] [init tag: 3738353619] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 906454160]

Comment 11 Neil Horman 2020-02-03 16:39:24 UTC
The output in comment #10 would seem to indicate that the client is egressing frames, but they never arrive at the server.  That in turn would, I think, indicate that:

1) This has nothing to do with module loading (or we would have seen EPROTONOSUPPORT errors from the socket syscall)

2) This is likely about frames getting dropped in OVS between the client and the host.

I suggest tracing the packets through OVS to see how they are getting (not) forwarded

Comment 12 Federico Paolinelli 2020-02-03 16:46:37 UTC
> 2) This is likely about frames getting dropped in OVS between the client and
> host.  
> 
> I suggest tracing the packets through OVS to see how they are getting (not)
> forwarded

Can somebody from SDN take care of this?
As I wrote before, it's always reproducible and I provided the exact steps to reproduce this.

Comment 13 Ben Bennett 2020-02-04 14:07:57 UTC
Assigning this back to Federico until he can get Neil a cluster.

Comment 14 Federico Paolinelli 2020-02-04 14:21:42 UTC
Provided access to a lab machine I was using; assigning to Neil as per https://bugzilla.redhat.com/show_bug.cgi?id=1796157#c13

Comment 15 Dan Winship 2020-02-04 15:47:28 UTC
Hm... I just tested this in a 4.3 cluster and it works fine...

Comment 16 Neil Horman 2020-02-04 16:05:26 UTC
Thank you Federico for setting this up.

Notes that I can provide immediately:

1) The setup contains two RHCOS workers (available at 192.168.126.51 and .52)
2) The pods are currently both running on the .51 worker
3) Both pods are connected to the host via veth pairs (the server pod is connected on the host side via vethda6953f3, the client on the host side via veth49a4dd9c)
4) Using a toolbox container, I was able to confirm that sctp traffic from the client ingressed through veth49a4dd9c, but never arrived at the server
5) As a test, since the traffic was getting lost, I simply attempted to ping from the server pod (10.130.0.4) to the client pod (10.130.0.3), and that traffic was also lost, meaning this setup isn't just broken for sctp traffic.
6) The two veth interfaces are connected together via an OVS bridge, which I have confirmed contains both host side veth ports in its configuration

At this point I am fairly convinced this is a misconfiguration of OVS.  I'm attempting to trace the packets through the OVS bridge now to determine where they are getting lost
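
(For reference, the per-interface check in item 4 above was roughly of the following form, run from a toolbox container on the .51 worker; interface names as listed in item 3:)

tcpdump -nn -i veth49a4dd9c sctp   # client pod's host-side veth
tcpdump -nn -i vethda6953f3 sctp   # server pod's host-side veth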

Comment 17 Neil Horman 2020-02-04 16:41:24 UTC
So, I am by no means an OVS expert, but I attempted to trace an ICMP packet through the OVS bridge with the following command on the worker:
sudo ovs-appctl ofproto/trace br0 in_port=6,icmp,nw_src=10.130.0.4

and received this output:
Flow: icmp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: ct,recirc(0x403)

===============================================================================
recirc(0x403) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================

Flow: recirc_id=0x403,ct_state=new|trk,eth,icmp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip, priority 100
    goto_table:20
20. priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0x403,ct_state=+trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: drop


Dumping the flows for the ovs bridge for table 20, I see this (annotated with line numbers):
1) table_id=20, duration=11703s, n_packets=1150, n_bytes=48300, priority=100,arp,in_port=3,arp_spa=10.130.0.2,arp_sha=00:00:0a:82:00:02/00:00:ff:ff:ff:ff,actions=load:0x4df556->NXM_NX_REG0[],goto_table:21
2) table_id=20, duration=9077s, n_packets=26, n_bytes=1092, priority=100,arp,in_port=4,arp_spa=10.130.0.3,arp_sha=00:00:0a:82:00:03/00:00:ff:ff:ff:ff,actions=load:0->NXM_NX_REG0[],goto_table:21
3) table_id=20, duration=9077s, n_packets=26, n_bytes=1092, priority=100,arp,in_port=5,arp_spa=10.130.0.4,arp_sha=00:00:0a:82:00:04/00:00:ff:ff:ff:ff,actions=load:0->NXM_NX_REG0[],goto_table:21
4) table_id=20, duration=11703s, n_packets=19079, n_bytes=2167819, priority=100,ip,in_port=3,nw_src=10.130.0.2,actions=load:0x4df556->NXM_NX_REG0[],goto_table:21
5) table_id=20, duration=9077s, n_packets=46, n_bytes=3772, priority=100,ip,in_port=4,nw_src=10.130.0.3,actions=load:0->NXM_NX_REG0[],goto_table:21
6) table_id=20, duration=9077s, n_packets=226, n_bytes=22148, priority=100,ip,in_port=5,nw_src=10.130.0.4,actions=load:0->NXM_NX_REG0[],goto_table:21
7) table_id=20, duration=11712s, n_packets=0, n_bytes=0, priority=0,actions=drop

Looking at lines 2 and 3, I think we match on those properly, because the arp table for each pod contains the ip and mac for the other pod.  However, when the icmp message is sent, we would (I expect) want to match on lines 4 and 5 to forward the packet by jumping to table 21, which continues the forwarding process (I think).  However, for some reason, despite the icmp packet having an ip header, we don't seem to match on it, and instead fall through to the end rule at line 7 and drop the packet (which I think is what the trace command above is indicating)

The same happens if I trace a sctp packet:
[core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-appctl ofproto/trace br0 in_port=6,sctp
Flow: sctp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: ct,recirc(0x459)

===============================================================================
recirc(0x459) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================

Flow: recirc_id=0x459,ct_state=new|trk,eth,sctp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0

bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip, priority 100
    goto_table:20
20. priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0x459,ct_state=+trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: drop


I think at this point we need to have the Open vSwitch people take a look at this to confirm these findings and suggest next steps, to understand what the OVS db should look like here and how to correct it.  CC-ing treadelli.

Timothy, could you please take a look at this ASAP, and suggest next steps?

Comment 18 Dan Winship 2020-02-04 16:51:34 UTC
(In reply to Neil Horman from comment #17)
> So, i am by no means a OVS expert, but I attempted to trace an icmp packet
> through the ovs bridge with the following command on the worker:
> sudo ovs-appctl ofproto/trace br0 in_port=6,icmp,nw_src=10.130.0.4

> 6) table_id=20, duration=9077s, n_packets=226, n_bytes=22148,
> priority=100,ip,in_port=5,nw_src=10.130.0.4,actions=load:0->NXM_NX_REG0[],
> goto_table:21

You're passing the wrong in_port value (6 rather than 5), so the rule doesn't match.

> The same happens if I trace a sctp packet:
> [core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-appctl ofproto/trace br0
> in_port=6,sctp

In this case you're not specifying the nw_src at all so it matches even less.

openshift-sdn's OVS flows are very restrictive, to ensure that pods aren't able to spoof traffic.

Comment 19 Neil Horman 2020-02-04 18:22:06 UTC
Ok, that moves us forward, thank you.  Changing the command to:
sudo ovs-appctl ofproto/trace br0 in_port=5,icmp,nw_src=10.130.0.4

Gives us this output:
Flow: icmp,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=5,nw_frag=no
Datapath actions: ct,recirc(0x61c)

===============================================================================
recirc(0x61c) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================

Flow: recirc_id=0x61c,ct_state=new|trk,eth,icmp,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip, priority 100
    goto_table:20
20. ip,in_port=5,nw_src=10.130.0.4, priority 100
    load:0->NXM_NX_REG0[]
    goto_table:21
21. priority 0
    goto_table:30
30. ip, priority 0
    goto_table:100
100. priority 0
    goto_table:101
101. priority 0
    output:2

Final flow: unchanged
Megaflow: recirc_id=0x61c,ct_state=-rpl+trk,eth,icmp,in_port=5,nw_src=10.130.0.4,nw_dst=0.0.0.0/5,nw_frag=no
Datapath actions: 3


I think what that's saying is that we move through the tables in OVS and eventually output to port 2, based on table 101, which reads:
table_id=101, duration=18105s, n_packets=3008, n_bytes=288451, priority=0,actions=output:2

port 2 in the ovs interface list is:
[core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-vsctl -- --columns=name,ofport list Interface
name                : "vxlan0"
ofport              : 1

name                : "br0"
ofport              : 65534

name                : "vethda6953f3"
ofport              : 5

name                : "veth5c033bab"
ofport              : 3

name                : "tun0"
ofport              : 2

name                : "veth49a4dd9c"
ofport              : 4


So I would suppose that we're injecting the frame to tun0, which doesn't make sense to me, as it's veth49a4dd9c that is the host side of the veth pair that leads to the client pod.  More strangely (note this is using ICMP traffic), I can, as previously noted, see traffic ingressing via vethda6953f3, which is the server pod's host-side interface (the server pod is sending the ICMP echoes), but I can't see the frame egressing via tun0, or any of the other interfaces on the OVS bridge.


More specific to this problem, if I trace an sctp packet from the client pod (10.130.0.3, ingressing via port 5), I still see the trace end in a drop:

[core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-appctl ofproto/trace br0 in_port=5,ip,nw_src=10.130.0.3
Flow: ip,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.3,nw_dst=0.0.0.0,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=5,nw_frag=no
Datapath actions: ct,recirc(0x61c)

===============================================================================
recirc(0x61c) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================

Flow: recirc_id=0x61c,ct_state=new|trk,eth,ip,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.3,nw_dst=0.0.0.0,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0

bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip, priority 100
    goto_table:20
20. priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0x61c,ct_state=+trk,eth,ip,in_port=5,nw_src=10.130.0.3,nw_frag=no
Datapath actions: drop

So it seems, one way or the other, OVS is dropping these frames when it really shouldn't be.

Comment 20 Dan Winship 2020-02-04 20:34:51 UTC
> So I would suppose that we're injecting the frame to tun0, which doesn't make
> sense to me, as its veth49a4dd9c that is the host side of the veth pair that
> leads to the client pod.

You didn't tell it to trace sending a packet to the client pod though; you provided no nw_dst on the trace command line, so it defaulted nw_dst=0.0.0.0, and that's not a pod IP, so the OVS flows say it should be delivered to the host networking stack via tun0 so that the host can route it.

Likewise, with the SCTP packet, you were still using a mismatched in_port/nw_src; there is no rule that matches packets with a in_port of 5 and a nw_src of 10.130.0.3, so the packet gets dropped.

You need to use the correct in_port, and specify both nw_src and nw_dst.
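
(For example, an illustrative trace of the SCTP traffic, using the values from the earlier comments: client pod 10.130.0.3 behind ofport 4, server pod 10.130.0.4, server port 30102:)

sudo ovs-appctl ofproto/trace br0 in_port=4,sctp,nw_src=10.130.0.3,nw_dst=10.130.0.4,tp_dst=30102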

Or alternatively, instead of trying to fake packets, you could try "ovs-ofctl -O OpenFlow13 dump-flows br0", then running the SCTP test, then dumping the flows again and seeing which flows have had their n_packets counters increased (keeping in mind that OVS may also have processed some unrelated traffic in the same span, so not every change will be relevant).
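
(A rough sketch of that workflow; the file paths are illustrative:)

sudo ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows.before
# run the SCTP test from the client pod
sudo ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows.after
diff /tmp/flows.before /tmp/flows.after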

But also, if you're going to debug flows, can you just attach the output of that dump-flows command here? And just for completeness, can you confirm that "oc get networkpolicies -n default -o yaml" returns nothing?

Comment 21 Federico Paolinelli 2020-02-04 20:52:50 UTC
I can reply just for the last part: no networkpolicies, in any namespace:

# oc get networkpolicies -n default 
No resources found in default namespace.
# oc get networkpolicies -A
No resources found

Comment 22 Neil Horman 2020-02-04 22:19:12 UTC
The bottom line here is, no matter what's wrong with my trace commands, the fact remains that we can see packets ingressing the OVS bridge on the server pod's interface, and they never egress to the client pod's host-side interface, so the packets are getting lost somewhere in the OVS bridge.  Federico, could you please set Dan up with his public key on this cluster so that he can take a look at this directly?  For completeness I'll post the flow dump here, and run the tests requested in the morning.

Comment 23 Neil Horman 2020-02-04 22:19:44 UTC
Created attachment 1657683 [details]
flows on the RHCOS worker node

Comment 24 Federico Paolinelli 2020-02-05 08:57:16 UTC
(In reply to Neil Horman from comment #22)

> Frederico, could you please set dan up with his public key on this cluster
> so that he can take a look at this directly?  

Will do

Comment 25 Weibin Liang 2020-02-05 20:09:12 UTC
I reproduced the same issue in 4.4.0-0.nightly-2020-02-04-101225.

Comment 26 Dan Winship 2020-02-05 20:14:06 UTC
yeah, it's not just SCTP. All network traffic in the "default" namespace is broken in master. (CI didn't catch this because none of the e2e tests use "default".)

Weibin, can you make sure that there is some QE test that would have eventually caught this? I know we do tests involving the default namespace under ovs-multitenant, but I'm not 100% sure we test it under ovs-networkpolicy.

Comment 27 Weibin Liang 2020-02-05 21:00:00 UTC
Dan,

Just did some simple testing; a simple curl also does not work between pods deployed in the "default" namespace.
I will write a simple QE automation script to cover this test scenario.
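
A rough sketch of what that scenario could look like (labels, image, and sctp_test invocations reused from the reproducer in the description; illustrative only, not the actual QE script):

SERVER_POD=$(oc -n default get pods -l app=sctpserver -o jsonpath='{.items[0].metadata.name}')
SERVER_IP=$(oc -n default get pods -l app=sctpserver -o jsonpath='{.items[0].status.podIP}')
CLIENT_POD=$(oc -n default get pods -l app=sctpclient -o jsonpath='{.items[0].metadata.name}')
# start the SCTP listener in the server pod
oc -n default exec $SERVER_POD -- sctp_test -H localhost -P 30102 -l &
# drive traffic from the client pod; pass if the server output shows received packets
oc -n default exec $CLIENT_POD -- sctp_test -H localhost -P 30105 -h $SERVER_IP -p 30102 -s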

Comment 29 Weibin Liang 2020-02-06 19:22:31 UTC
Verified in 4.4.0-0.nightly-2020-02-06-131745

Both HTTP and SCTP traffic work fine in the default namespace.

Comment 30 Neil Horman 2020-04-27 11:55:58 UTC
Dan Winship fixed this; I expect he would be the appropriate person to document it.

Comment 31 Dan Winship 2020-04-27 13:00:02 UTC
The bug never existed in any released version

Comment 33 errata-xmlrpc 2020-05-04 11:27:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

