Description of problem:

I have two pods in the default namespace, one exposing a port with the SCTP protocol and the other acting as a client. When the client attempts to connect, the connection hangs. The same setup works if I do it in a custom namespace.

Version-Release number of selected component (if applicable):

oc version
Client Version: v4.2.0
Server Version: 4.4.0-0.nightly-2020-01-29-073040
Kubernetes Version: v1.17.1

How reproducible:

Always

Steps to Reproduce:

1. Apply the machine configuration to un-blacklist the sctp module on all the workers:

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: load-sctp-module
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - contents:
            source: data:,
            verification: {}
          filesystem: root
          mode: 420
          path: /etc/modprobe.d/sctp-blacklist.conf
        - contents:
            source: data:text/plain;charset=utf-8,sctp
          filesystem: root
          mode: 420
          path: /etc/modules-load.d/sctp-load.conf
EOF

Wait for the MCP to be ready:

oc wait mcp/worker --for condition=updated

2. Apply the following manifests to create the pods:

cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpclient
  labels:
    app: sctpclient
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpclient
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpclient
    spec:
      restartPolicy: Always
      containers:
      - image: quay.io/wcaban/net-toolbox:latest
        imagePullPolicy: IfNotPresent
        name: sctpclient
        command: ["sleep", "infinity"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpserver
  labels:
    app: sctpserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpserver
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpserver
    spec:
      restartPolicy: Always
      containers:
      - image: quay.io/wcaban/net-toolbox:latest
        imagePullPolicy: IfNotPresent
        name: sctpserver
        command: ["sleep", "infinity"]
        ports:
        - containerPort: 30102
          protocol: SCTP
EOF

3.
Fetch the server pod's address with:

oc get pods -o wide

4. Connect to the server pod and start sctp_test:

bash-5.0$ sctp_test -H localhost -P 30102 -l
local:addr=::, port=30102, family=10
seed = 1580297852
Starting tests...
    socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
    bind(sk=3, [a:::,p:30102])  --  attempt 1/10
    listen(sk=3,backlog=100)

5. Connect to the client pod and launch sctp_test in client mode, using the server pod's address:

bash-5.0$ sctp_test -H localhost -P 30105 -h SERVER_POD_IP -p 30102 -s
remote:addr=10.129.0.52, port=30102, family=2
local:addr=::, port=30105, family=10
seed = 1580297881
Starting tests...
    socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
    bind(sk=3, [a:::,p:30105])  --  attempt 1/10

Actual results:

Nothing happens; the connection hangs.

Expected results:

The server shows received packets and the client shows sent packets.

Additional info:

As mentioned before, it works if I do the same in a dedicated namespace:

oc create ns sctptest

cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpclient
  namespace: sctptest
  labels:
    app: sctpclient
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpclient
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpclient
    spec:
      restartPolicy: Always
      containers:
      - image: quay.io/wcaban/net-toolbox:latest
        imagePullPolicy: IfNotPresent
        name: sctpclient
        command: ["sleep", "infinity"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sctpserver
  namespace: sctptest
  labels:
    app: sctpserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sctpserver
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: sctpserver
    spec:
      restartPolicy: Always
      containers:
      - image: quay.io/wcaban/net-toolbox:latest
        imagePullPolicy: IfNotPresent
        name: sctpserver
        command: ["sleep", "infinity"]
        ports:
        - containerPort: 30102
          protocol: SCTP
EOF

For example, on the server side I see:

Starting tests...
    socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
    bind(sk=3, [a:::,p:30102])  --  attempt 1/10
    listen(sk=3,backlog=100)
Server: Receiving packets.
    recvmsg(sk=3) Notification: SCTP_ASSOC_CHANGE(COMMUNICATION_UP)
      (assoc_change: state=0, error=0, instr=10 outstr=10)
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657582 flags=0x1 ppid=221622501 cumtsn=3181657582
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657583 flags=0x1 ppid=1895765049 cumtsn=3181657583
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657584 flags=0x1 ppid=1703631155 cumtsn=3181657584
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657585 flags=0x1 ppid=1317151144 cumtsn=3181657585
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657586 flags=0x1 ppid=83569488 cumtsn=3181657586
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657587 flags=0x1 ppid=365621801 cumtsn=3181657587
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657588 flags=0x1 ppid=610560377 cumtsn=3181657588
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657589 flags=0x1 ppid=1188056854 cumtsn=3181657589
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657590 flags=0x1 ppid=163130494 cumtsn=3181657590
    recvmsg(sk=3) Data 1 bytes. First 1 bytes: <empty> text[0]=0
      SNDRCV(stream=0 ssn=0 tsn=3181657591 flags=0x1 ppid=1813451120 cumtsn=3181657591
    ...

Also, I am pretty sure this was demoed in 4.3 without any issue.
Do you have any NetworkPolicies configured in either namespace?
The last attempt was on a brand-new cluster created via cluster-bot, only executing the listed commands, because I wanted to be sure. I did not explicitly create any NetworkPolicy, but I can't be certain none were created for me (at least in the default ns). I will spin up a cluster again to see if there are any.
There are no NetworkPolicies created by default.
Is it a difference in the SCCs applied to pods/containers in the default vs. non-default namespaces?
It's my understanding that an SCC restricts permissions of pods with SELinux, but the description suggests that the applications execute but hang. To me that suggests they are allowed to make the needed system calls and just aren't sending packets back and forth, which in turn suggests an iptables rule issue. Is it possible to run tcpdump on the pods to see if packets are egressing the client pod and ingressing the server pod? That would at least tell us which side of the connection to focus on.
I need to reprovision the cluster. I am using cluster-bot to do that and it's currently not working, but it should be the same on any just-provisioned cluster where we apply the machine config to enable the sctp kernel module. I will try to do it locally with a libvirt-provisioned cluster and get back with the tcpdump results.
(In reply to Neil Horman from comment #6)
> just aren't sending packets back and forth, which in turn suggests an
> iptables rule issue to me.

Based on the above Deployment, this is non-hostNetwork-pod to non-hostNetwork-pod traffic, so it would be entirely over OVS/VXLAN, which means it would never hit iptables. (And iptables isn't aware of pod namespaces anyway, so it would be surprising if it could manage to screw up in this way...)

> but the description suggests that the applications execute, but hang

ah, yeah, but look where the hang is; the good one does:

    socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
    bind(sk=3, [a:::,p:30102])  --  attempt 1/10
    listen(sk=3,backlog=100)

but the bad one ends with the "bind" line and doesn't show a "listen". Assuming that's not just a cut+paste mistake, then it looks like either the listen() call is hanging, or else the code is doing some sort of retry on error, forever.

Ah!

> cat <<EOF | oc apply -f -
>
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: load-sctp-module

That config only applies to worker nodes, but a deployment in Namespace "default" might be allowed to schedule pods on masters too... If that happened, the test would fail, because sctp would not have been loaded on the masters.

> - contents:
>     source: data:,
>     verification: {}
>   filesystem: root
>   mode: 420
>   path: /etc/modprobe.d/sctp-blacklist.conf

(FWIW this should be unnecessary; blacklisting only stops it from being *autoloaded*. If you add the /etc/modules-load.d/sctp-load.conf file explicitly loading it, then it doesn't matter if it's blacklisted as well.)
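One quick way to rule the module-loading theory in or out would be to check lsmod on each node. A sketch (the node name is a placeholder, and this assumes cluster-admin access):

```shell
# On each node suspected of missing the module (node name is hypothetical):
#   oc debug node/<node-name> -- chroot /host lsmod | grep '^sctp'
#
# A small helper that checks lsmod-style output for a loaded sctp module:
has_sctp() {
  # lsmod lists one module per line, name in the first column
  printf '%s\n' "$1" | grep -q '^sctp[[:space:]]'
}
```

Usage: `has_sctp "$(lsmod)" && echo "sctp loaded"`.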
(In reply to Dan Winship from comment #8)
> (In reply to Neil Horman from comment #6)
> > just aren't sending packets back and forth, which in turn suggests an
> > iptables rule issue to me.
>
> Based on the above Deployment, this is non-hostNetwork-pod to
> non-hostNetwork-pod traffic, so it would be entirely over OVS/VXLAN, which
> means it would never hit iptables. (And iptables isn't aware of pod
> namespaces anyway so it would be surprising if it could manage to screw up
> in this way...)
>
> > but the description suggests that the applications execute, but hang
>
> ah, yeah, but look where the hang is; the good one does:
>
>     socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
>     bind(sk=3, [a:::,p:30102])  --  attempt 1/10
>     listen(sk=3,backlog=100)
>
> but the bad one ends with the "bind" line and doesn't show a "listen".
> Assuming that's not just a cut+paste mistake, then it looks like either the
> listen() call is hanging, or else the code is doing some sort of retry on
> error, forever.

Might have been a copy/paste error. I just re-played the bad one for tcpdump's sake.

On the server side I see:

[root@sctpserver-5ff87f99fc-w2nhb /]# sctp_test -H localhost -P 30102 -l
local:addr=::, port=30102, family=10
seed = 1580735765
Starting tests...
    socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
    bind(sk=3, [a:::,p:30102])  --  attempt 1/10
    listen(sk=3,backlog=100)
Server: Receiving packets.
    recvmsg(sk=3)

On the client side:

[root@sctpclient-645b994bf5-mmz88 /]# sctp_test -H localhost -P 30105 -h 10.128.2.6 -p 30102 -s
remote:addr=10.128.2.6, port=30102, family=2
local:addr=::, port=30105, family=10
seed = 1580735767
Starting tests...
    socket(SOCK_SEQPACKET, IPPROTO_SCTP)  ->  sk=3
    bind(sk=3, [a:::,p:30105])  --  attempt 1/10
Client: Sending packets.(1/10)
    sendmsg(sk=3, assoc=0) 1 bytes.
      SNDRCV(stream=0 flags=0x1 ppid=1420752228

> Ah!
> > cat <<EOF | oc apply -f -
> >
> > apiVersion: machineconfiguration.openshift.io/v1
> > kind: MachineConfig
> > metadata:
> >   labels:
> >     machineconfiguration.openshift.io/role: worker
> >   name: load-sctp-module
>
> That config only applies to worker nodes, but a deployment in Namespace
> "default" might be allowed to schedule pods on masters too... If that
> happened, the test would fail, because sctp would not have been loaded on
> the masters.

Masters should have taints to prevent that. Also, we would see a different type of failure. In any case, I just tried, and:

oc get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
sctpclient-645b994bf5-mmz88   1/1     Running   0          12m   10.128.2.5   ip-10-0-133-221.us-west-2.compute.internal   <none>           <none>
sctpserver-5ff87f99fc-w2nhb   1/1     Running   0          12m   10.128.2.6   ip-10-0-133-221.us-west-2.compute.internal   <none>           <none>

oc get nodes | grep ip-10-0-133-221.us-west-2.compute.internal
ip-10-0-133-221.us-west-2.compute.internal   Ready    worker   46m   v1.17.1

> > - contents:
> >     source: data:,
> >     verification: {}
> >   filesystem: root
> >   mode: 420
> >   path: /etc/modprobe.d/sctp-blacklist.conf
>
> (FWIW this should be unnecessary; blacklisting only stops it from being
> *autoloaded*. If you add the /etc/modules-load.d/sctp-load.conf file
> explicitly loading it then it doesn't matter if it's blacklisted as well.)

Ah, good to know, thanks!
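Following that FWIW, a trimmed-down MachineConfig that only adds the modules-load entry might look like this (an untested sketch; it is the original manifest with the modprobe.d file dropped, not a config anyone in this thread has verified):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: load-sctp-module
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        # only the explicit module-load file; the blacklist override is not needed
        - contents:
            source: data:text/plain;charset=utf-8,sctp
          filesystem: root
          mode: 420
          path: /etc/modules-load.d/sctp-load.conf
```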
TCPDUMP:

Nothing on the server side:

[root@sctpserver-5ff87f99fc-w2nhb /]# tcpdump -i any sctp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

On the client side:

[root@sctpclient-645b994bf5-mmz88 /]# tcpdump -i any sctp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
13:18:32.696576 IP sctpclient-645b994bf5-mmz88.30105 > ip-10-128-2-6.us-west-2.compute.internal.30102: sctp (1) [INIT] [init tag: 3738353619] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 906454160]
13:18:35.707623 IP sctpclient-645b994bf5-mmz88.30105 > ip-10-128-2-6.us-west-2.compute.internal.30102: sctp (1) [INIT] [init tag: 3738353619] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 906454160]
13:18:42.107647 IP sctpclient-645b994bf5-mmz88.30105 > ip-10-128-2-6.us-west-2.compute.internal.30102: sctp (1) [INIT] [init tag: 3738353619] [rwnd: 106496] [OS: 10] [MIS: 65535] [init TSN: 906454160]
The output in comment #10 would seem to indicate that the client is egressing frames, but they never arrive at the server. That in turn would, I think, indicate that:

1) This has nothing to do with module loading (or we would have seen EPROTONOSUPPORT errors from the socket syscall)

2) This is likely about frames getting dropped in OVS between the client and host.

I suggest tracing the packets through OVS to see how they are getting (not) forwarded.
> 2) This is likely about frames getting dropped in OVS between the client and
> host.
>
> I suggest tracing the packets through OVS to see how they are getting (not)
> forwarded

Can somebody from SDN take care of this? As I wrote before, it's always reproducible, and I provided the exact steps to reproduce it.
Assigning this back to Federico until he can get Neil a cluster.
Provided access to a lab machine I was using; assigning to Neil as per https://bugzilla.redhat.com/show_bug.cgi?id=1796157#c13
Hm... I just tested this in a 4.3 cluster and it works fine...
Thank you Federico for setting this up. Notes that I can provide immediately:

1) The setup contains two RHCOS workers (available at 192.168.126.51 and .52).

2) The pods are currently both running on the .51 worker.

3) Both pods are connected to the host via veth pairs (the server pod is connected on the host side via vethda6953f3, the client on the host side via veth49a4dd9c).

4) Using a toolbox container, I was able to confirm that sctp traffic from the client ingressed through veth49a4dd9c, but never arrived at the server.

5) As a test, since the traffic was getting lost, I simply attempted to ping from the server pod (10.130.0.4) to the client pod (10.130.0.3), and that traffic was also lost, meaning this setup isn't just broken for sctp traffic.

6) The two veth interfaces are connected together via an OVS bridge, which I have confirmed contains both host-side veth ports in its configuration.

At this point I am fairly convinced this is a misconfiguration of OVS. I'm attempting to trace the packets through the OVS bridge now to determine where they are getting lost.
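For anyone repeating step 3, one common way to map a pod's eth0 to its host-side veth (a sketch, not necessarily the method used above; the pod name is a placeholder) is to read the peer ifindex from inside the pod with `oc exec <pod> -- cat /sys/class/net/eth0/iflink`, then look that index up in `ip -o link` output on the node. A helper that does the lookup:

```shell
# find_veth INDEX "IP_LINK_OUTPUT"
# Picks the interface whose ifindex matches the pod's iflink value out of
# one-line-per-interface `ip -o link` output, stripping the "@ifN" suffix.
find_veth() {
  printf '%s\n' "$2" | awk -F': ' -v idx="$1" '$1 == idx { split($2, a, "@"); print a[1] }'
}
```

Usage on the node: `find_veth "$IFLINK" "$(ip -o link)"`.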
So, I am by no means an OVS expert, but I attempted to trace an icmp packet through the OVS bridge with the following command on the worker:

sudo ovs-appctl ofproto/trace br0 in_port=6,icmp,nw_src=10.130.0.4

and received this output:

Flow: icmp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: ct,recirc(0x403)

===============================================================================
recirc(0x403) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x403,ct_state=new|trk,eth,icmp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
    thaw
Resuming from table 0
 0. ip, priority 100
    goto_table:20
20.
    priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0x403,ct_state=+trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: drop

Dumping the flows for the OVS bridge for table 20, I see this (annotated with line numbers):

1) table_id=20, duration=11703s, n_packets=1150, n_bytes=48300, priority=100,arp,in_port=3,arp_spa=10.130.0.2,arp_sha=00:00:0a:82:00:02/00:00:ff:ff:ff:ff,actions=load:0x4df556->NXM_NX_REG0[],goto_table:21
2) table_id=20, duration=9077s, n_packets=26, n_bytes=1092, priority=100,arp,in_port=4,arp_spa=10.130.0.3,arp_sha=00:00:0a:82:00:03/00:00:ff:ff:ff:ff,actions=load:0->NXM_NX_REG0[],goto_table:21
3) table_id=20, duration=9077s, n_packets=26, n_bytes=1092, priority=100,arp,in_port=5,arp_spa=10.130.0.4,arp_sha=00:00:0a:82:00:04/00:00:ff:ff:ff:ff,actions=load:0->NXM_NX_REG0[],goto_table:21
4) table_id=20, duration=11703s, n_packets=19079, n_bytes=2167819, priority=100,ip,in_port=3,nw_src=10.130.0.2,actions=load:0x4df556->NXM_NX_REG0[],goto_table:21
5) table_id=20, duration=9077s, n_packets=46, n_bytes=3772, priority=100,ip,in_port=4,nw_src=10.130.0.3,actions=load:0->NXM_NX_REG0[],goto_table:21
6) table_id=20, duration=9077s, n_packets=226, n_bytes=22148, priority=100,ip,in_port=5,nw_src=10.130.0.4,actions=load:0->NXM_NX_REG0[],goto_table:21
7) table_id=20, duration=11712s, n_packets=0, n_bytes=0, priority=0,actions=drop

Looking at lines 2 and 3, I think we match on those properly, because the arp table for each pod contains the IP and MAC of the other pod. However, when the icmp message is sent, we would (I expect) want to match on lines 4 and 5 to forward the packet by jumping to table 21, which continues the forwarding process (I think).
However, for some reason, despite the icmp packet having an ip header, we don't seem to match on it, and instead fall through to the end rule at line 7 and drop the packet (which I think is what the trace command above is indicating).

The same happens if I trace an sctp packet:

[core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-appctl ofproto/trace br0 in_port=6,sctp
Flow: sctp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: ct,recirc(0x459)

===============================================================================
recirc(0x459) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x459,ct_state=new|trk,eth,sctp,in_port=6,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=0.0.0.0,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=0

bridge("br0")
-------------
    thaw
Resuming from table 0
 0. ip, priority 100
    goto_table:20
20. priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0x459,ct_state=+trk,eth,ip,in_port=6,nw_frag=no
Datapath actions: drop

I think at this point we need to have the Open vSwitch people take a look at this to confirm these findings and suggest next steps to understand what the OVS db should look like here and how to correct it. CC-ing treadelli. Timothy, could you please take a look at this ASAP and suggest next steps?
(In reply to Neil Horman from comment #17)
> So, i am by no means a OVS expert, but I attempted to trace an icmp packet
> through the ovs bridge with the following command on the worker:
> sudo ovs-appctl ofproto/trace br0 in_port=6,icmp,nw_src=10.130.0.4

> 6) table_id=20, duration=9077s, n_packets=226, n_bytes=22148,
> priority=100,ip,in_port=5,nw_src=10.130.0.4,actions=load:0->NXM_NX_REG0[],
> goto_table:21

You're passing the wrong in_port value (6 rather than 5), so the rule doesn't match.

> The same happens if I trace a sctp packet:
> [core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-appctl ofproto/trace br0
> in_port=6,sctp

In this case you're not specifying the nw_src at all, so it matches even less.

openshift-sdn's OVS flows are very restrictive, to ensure that pods aren't able to spoof traffic.
Ok, that moves us forward, thank you. Changing the command to:

sudo ovs-appctl ofproto/trace br0 in_port=5,icmp,nw_src=10.130.0.4

gives us this output:

Flow: icmp,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.

Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=5,nw_frag=no
Datapath actions: ct,recirc(0x61c)

===============================================================================
recirc(0x61c) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x61c,ct_state=new|trk,eth,icmp,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.4,nw_dst=0.0.0.0,nw_tos=0,nw_ecn=0,nw_ttl=0,icmp_type=0,icmp_code=0

bridge("br0")
-------------
    thaw
Resuming from table 0
 0. ip, priority 100
    goto_table:20
20. ip,in_port=5,nw_src=10.130.0.4, priority 100
    load:0->NXM_NX_REG0[]
    goto_table:21
21. priority 0
    goto_table:30
30. ip, priority 0
    goto_table:100
100. priority 0
    goto_table:101
101.
     priority 0
     output:2

Final flow: unchanged
Megaflow: recirc_id=0x61c,ct_state=-rpl+trk,eth,icmp,in_port=5,nw_src=10.130.0.4,nw_dst=0.0.0.0/5,nw_frag=no
Datapath actions: 3

I think what that's saying is that we move through the tables in OVS and eventually hit table 101, which reads:

table_id=101, duration=18105s, n_packets=3008, n_bytes=288451, priority=0,actions=output:2

Port 2 in the OVS interface list is:

[core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-vsctl -- --columns=name,ofport list Interface
name   : "vxlan0"
ofport : 1

name   : "br0"
ofport : 65534

name   : "vethda6953f3"
ofport : 5

name   : "veth5c033bab"
ofport : 3

name   : "tun0"
ofport : 2

name   : "veth49a4dd9c"
ofport : 4

So I would suppose that we're injecting the frame to tun0, which doesn't make sense to me, as it's veth49a4dd9c that is the host side of the veth pair that leads to the client pod. More strangely (note this is using icmp traffic), I can, as previously noted, see traffic ingressing via vethda6953f3, which is the server pod that is sending the icmp echoes, but I can't see the frame egressing via tun0 or any of the other interfaces in the OVS bridge.

More specific to this problem, if I trace an sctp packet from the client pod (10.130.0.3, ingressing via port 5), I still see the trace end in a drop:

[core@test1-btnvf-worker-0-m6ksl ~]$ sudo ovs-appctl ofproto/trace br0 in_port=5,ip,nw_src=10.130.0.3
Flow: ip,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.3,nw_dst=0.0.0.0,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0

bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.
Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,in_port=5,nw_frag=no
Datapath actions: ct,recirc(0x61c)

===============================================================================
recirc(0x61c) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x61c,ct_state=new|trk,eth,ip,in_port=5,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.130.0.3,nw_dst=0.0.0.0,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0

bridge("br0")
-------------
    thaw
Resuming from table 0
 0. ip, priority 100
    goto_table:20
20. priority 0
    drop

Final flow: unchanged
Megaflow: recirc_id=0x61c,ct_state=+trk,eth,ip,in_port=5,nw_src=10.130.0.3,nw_frag=no
Datapath actions: drop

So it seems, one way or another, OVS is dropping these frames when it really shouldn't be.
> So I would suppose that we're injecting the frame to tun0, which doesn't make
> sense to me, as its veth49a4dd9c that is the host side of the veth pair that
> leads to the client pod.

You didn't tell it to trace sending a packet to the client pod, though; you provided no nw_dst on the trace command line, so it defaulted to nw_dst=0.0.0.0, and that's not a pod IP, so the OVS flows say it should be delivered to the host networking stack via tun0 so that the host can route it.

Likewise, with the SCTP packet, you were still using a mismatched in_port/nw_src; there is no rule that matches packets with an in_port of 5 and a nw_src of 10.130.0.3, so the packet gets dropped. You need to use the correct in_port, and specify both nw_src and nw_dst.

Or alternatively, instead of trying to fake packets, you could try "ovs-ofctl -O OpenFlow13 dump-flows br0", then running the SCTP test, then dumping the flows again and seeing which flows have had their n_packets counters increased (keeping in mind that OVS may also have processed some unrelated traffic in the same span, so not every change will be relevant).

But also, if you're going to debug flows, can you just attach the output of that dump-flows command here? And just for completeness, can you confirm that "oc get networkpolicies -n default -o yaml" returns nothing?
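The counter-diff approach described above can be sketched as follows (file names are placeholders; the only assumption beyond the thread is that the `duration` field is the noise worth stripping so that unchanged flows compare equal):

```shell
# Strip the ever-changing duration field from a dump-flows capture so that
# diffing two captures only surfaces flows whose n_packets/n_bytes changed.
normalize() {
  sed -e 's/duration=[^,]*, //' "$1"
}
# On the node (commented out here; run these around the sctp_test):
#   ovs-ofctl -O OpenFlow13 dump-flows br0 > before.txt
#   ... run the sctp_test ...
#   ovs-ofctl -O OpenFlow13 dump-flows br0 > after.txt
#   normalize before.txt > before.norm
#   normalize after.txt  > after.norm
#   diff before.norm after.norm   # lines that differ saw traffic
```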
I can reply just for the last part: no NetworkPolicies, in any namespace:

# oc get networkpolicies -n default
No resources found in default namespace.

# oc get networkpolicies -A
No resources found
The bottom line here is, no matter what's wrong with my trace commands, the fact remains that we can see packets ingressing the OVS bridge on the server pod's interface, and they never egress to the client pod's host-side interface, so the packets are getting lost somewhere in the OVS bridge.

Federico, could you please set Dan up with his public key on this cluster so that he can take a look at this directly? For completeness I'll post the flow dump here, and run the tests requested in the morning.
Created attachment 1657683 [details] flows on the RHCOS worker node
(In reply to Neil Horman from comment #22)
> Federico, could you please set Dan up with his public key on this cluster
> so that he can take a look at this directly?

Will do
I reproduced the same issue in 4.4.0-0.nightly-2020-02-04-101225.
yeah, it's not just SCTP. All network traffic in the "default" namespace is broken in master. (CI didn't catch this because none of the e2e tests use "default".) Weibin, can you make sure that there is some QE test that would have eventually caught this? I know we do tests involving the default namespace under ovs-multitenant, but I'm not 100% sure we test it under ovs-networkpolicy.
Dan, I just did some simple testing; a simple curl also does not work between pods deployed in the "default" namespace. I will write a simple QE automation script to cover this test scenario.
Verified in 4.4.0-0.nightly-2020-02-06-131745. Both HTTP and SCTP traffic work fine in the default namespace.
Dan Winship fixed this; I expect he would be the appropriate person to document it.
The bug never existed in any released version.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581