Bug 1968629 - sdn out of sync and network policy flows slow to propagate
Summary: sdn out of sync and network policy flows slow to propagate
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: x86_64
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-07 17:17 UTC by Matthew Robson
Modified: 2021-11-23 09:54 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 14:12:42 UTC
Target Upstream Version:
Embargoed:


Attachments
sdn pods memory graph (86.53 KB, image/jpeg), 2021-06-09 16:56 UTC, Matthew Robson
sdn pods cpu graph (98.97 KB, image/jpeg), 2021-06-09 16:56 UTC, Matthew Robson
sdn pod memory sorted (113.28 KB, image/png), 2021-06-09 17:13 UTC, Matthew Robson
sdn pod cpu sorted (91.82 KB, image/png), 2021-06-09 17:13 UTC, Matthew Robson

Description Matthew Robson 2021-06-07 17:17:33 UTC
Description of problem:

Seeing inconsistent issues with pod-to-pod communication when new pods come up (scale), when pods move between nodes, or when jobs run that connect to other existing pods.

Sometimes there is no issue, sometimes the issue resolves and sometimes the flows are never updated without some form of intervention.

Intervention means either a) restarting the sdn pods on the nodes with impacted pods, or b) scaling the impacted pods down and back up again.

oc get networkpolicy --all-namespaces | wc -l
3691

Here is one example we pulled:

NAME         NETID     EGRESS IPS
8ad0ea-dev   2075391

hex: 0x1faaff
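
The hex form matters because the flow dumps below show the NETID in hex in the reg0/reg1 matches. A minimal Python sketch of the conversion (the helper name `netid_to_reg` is made up for illustration):

```python
def netid_to_reg(netid: int) -> str:
    """Format a decimal NETID the way it appears in the reg0/reg1 matches."""
    return hex(netid)


def reg_to_netid(reg: str) -> int:
    """Inverse: turn a hex register value from a flow dump back into a NETID."""
    return int(reg, 16)


print(netid_to_reg(2075391))     # 0x1faaff
print(reg_to_netid("0x1faaff"))  # 2075391
```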

The indexer crashloops because it cannot connect to the DB

# oc -n 8ad0ea-dev get pods -l app=Offline-Indexing-oli,env=dev,role=offline-indexer -o wide
NAME                          READY   STATUS             RESTARTS   AGE    IP             NODE                    NOMINATED NODE   READINESS GATES
offline-indexer-oli-3-5hh8x   0/1     CrashLoopBackOff   31         160m   10.97.14.152   app-07.dmz   <none>           <none>

# oc -n 8ad0ea-dev get pods -l app=Offline-Indexing-oli,role=db,env=dev -o wide
NAME             READY   STATUS    RESTARTS   AGE    IP              NODE                    NOMINATED NODE   READINESS GATES
db-oli-2-ktpl9   1/1     Running   0          162m   10.97.106.200   app-27.dmz   <none>           <none>

NetPol for this one is `db-oli`

Looking at the flows, there is no table 80 flow allowing 10.97.14.152 (on app-07) -> 10.97.106.200 (on app-27) on port 5432.

app-07:

 cookie=0x0, duration=10169.271s, table=20, n_packets=203, n_bytes=8526, priority=100,arp,in_port=12460,arp_spa=10.97.14.152,arp_sha=00:00:0a:61:0e:98/00:00:ff:ff:ff:ff actions=load:0x1faaff->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=10169.271s, table=20, n_packets=426, n_bytes=27744, priority=100,ip,in_port=12460,nw_src=10.97.14.152 actions=load:0x1faaff->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=10169.271s, table=25, n_packets=234, n_bytes=17316, priority=100,ip,nw_src=10.97.14.152 actions=load:0x1faaff->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=10169.272s, table=70, n_packets=189, n_bytes=13986, priority=100,ip,nw_dst=10.97.14.152 actions=load:0x1faaff->NXM_NX_REG1[],load:0x30ac->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.89.180 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.89.180 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.49.66 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.49.66 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.88.53 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.88.53 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.68.27 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.68.27 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.70.228 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.70.228 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.77.10 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.639s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.77.10 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.49.66,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.68.27,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.70.228,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.77.10,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.88.53,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.49.66,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.68.27,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.70.228,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.77.10,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.88.53,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.110.150,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.110.150,nw_dst=10.97.71.211,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.49.66,nw_dst=10.97.71.211,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.88.53,nw_dst=10.97.71.211,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.544s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1d64be,nw_src=10.97.102.45,nw_dst=10.97.110.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.544s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1d64be,nw_src=10.97.105.210,nw_dst=10.97.110.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.544s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1d64be,nw_src=10.97.117.206,nw_dst=10.97.110.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=5.544s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1d64be,nw_src=10.97.23.4,nw_dst=10.97.110.53,tp_dst=8024 actions=output:NXM_NX_REG2[]

app-27:

 cookie=0x0, duration=10225.072s, table=20, n_packets=106, n_bytes=4452, priority=100,arp,in_port=6143,arp_spa=10.97.106.200,arp_sha=00:00:0a:61:6a:c8/00:00:ff:ff:ff:ff actions=load:0x1faaff->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=10225.074s, table=20, n_packets=3054, n_bytes=209708, priority=100,ip,in_port=6143,nw_src=10.97.106.200 actions=load:0x1faaff->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=10225.075s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.97.106.200 actions=load:0x1faaff->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=10225.080s, table=70, n_packets=4378, n_bytes=299540, priority=100,ip,nw_dst=10.97.106.200 actions=load:0x1faaff->NXM_NX_REG1[],load:0x17ff->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.49.66,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.68.27,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.70.228,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.77.10,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.88.53,nw_dst=10.97.102.45,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.49.66,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.68.27,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.70.228,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.77.10,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.88.53,nw_dst=10.97.117.206,tp_dst=8080 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.110.150,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.88.225,tp_dst=8983 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.110.150,nw_dst=10.97.71.211,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.49.66,nw_dst=10.97.71.211,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.88.53,nw_dst=10.97.71.211,tp_dst=5432 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.49.66,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.739s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.740s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.740s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.88.53,tp_dst=8024 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.740s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.102.45,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.740s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.105.210,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.740s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.117.206,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.740s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.89.180,tp_dst=5672 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.49.66 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.49.66 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.88.53 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.88.53 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.68.27 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.68.27 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.70.228 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.70.228 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.77.10 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.77.10 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0,reg1=0x1faaff,nw_dst=10.97.89.180 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=3.761s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0xcecde4,reg1=0x1faaff,nw_dst=10.97.89.180 actions=output:NXM_NX_REG2[]
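
Scanning dumps like these by eye is error-prone, so here is a rough Python sketch of the same check: look for a table 80 TCP flow with a given source, destination and port. The helper names and the `SAMPLE` text are illustrative only, and the sketch assumes the dump format shown above. It only looks for an exact per-pod match, so a broader rule (for example an `ip` flow matching only `nw_dst`) that would also admit the traffic is not detected.

```python
def parse_flow(line: str) -> dict:
    """Split one flow line into its match fields; the actions are ignored."""
    match_part = line.split(" actions=")[0]
    fields = {}
    for token in match_part.split(","):
        token = token.strip()
        if not token:
            continue
        key, sep, value = token.partition("=")
        # Bare keywords such as "tcp" or "ip" carry no value.
        fields[key] = value if sep else True
    return fields


def has_allow_flow(dump: str, src: str, dst: str, port: int) -> bool:
    """True if the dump holds a table 80 TCP flow for src -> dst:port."""
    for line in dump.splitlines():
        f = parse_flow(line)
        if (f.get("table") == "80" and "tcp" in f
                and f.get("nw_src") == src
                and f.get("nw_dst") == dst
                and f.get("tp_dst") == str(port)):
            return True
    return False


# Two lines copied from the app-07 dump in this report.
SAMPLE = """\
 cookie=0x0, duration=10169.272s, table=70, n_packets=189, n_bytes=13986, priority=100,ip,nw_dst=10.97.14.152 actions=load:0x1faaff->NXM_NX_REG1[],load:0x30ac->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=5.641s, table=80, n_packets=0, n_bytes=0, priority=150,tcp,reg0=0x1faaff,reg1=0x1faaff,nw_src=10.97.23.4,nw_dst=10.97.63.127,tp_dst=5432 actions=output:NXM_NX_REG2[]
"""

print(has_allow_flow(SAMPLE, "10.97.23.4", "10.97.63.127", 5432))    # True
print(has_allow_flow(SAMPLE, "10.97.14.152", "10.97.106.200", 5432)) # False
```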

Fails to connect:

# oc -n 8ad0ea-dev rsh offline-indexer-oli-3-5hh8x
$ timeout 5 bash -c '< /dev/tcp/10.97.106.200/5432'; echo $?
124
$ 
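
Exit status 124 is what `timeout` returns when the command is still running at the deadline, so the connection attempt hung rather than being refused, which is consistent with packets being dropped. A hedged Python equivalent of the probe, for images that lack bash's /dev/tcp (the function name `tcp_probe` is hypothetical):

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a TCP connect and classify the result.

    'timeout' corresponds to the exit status 124 case above: nothing
    answered before the deadline. 'refused' means the peer was reachable
    but nothing was listening on the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Anything else, e.g. no route to host.
        return "error"
```

Run from inside the affected pod, `tcp_probe("10.97.106.200", 5432)` would be expected to return "timeout" while the allow flow is missing.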


# oc -n openshift-sdn get pods -o wide | grep -E "mcs-silver-app-07.dmz|mcs-silver-app-27.dmz"
ovs-2bcrq              1/1     Running   0          39d   142.34.151.188   app-27.dmz      <none>           <none>
ovs-9dtth              1/1     Running   0          79m   142.34.151.147   app-07.dmz      <none>           <none>
sdn-g9dx8              2/2     Running   0          78m   142.34.151.147   app-07.dmz      <none>           <none>
sdn-xf75g              2/2     Running   0          76m   142.34.151.188   app-27.dmz      <none>           <none>

We flipped the sdn pods into debug mode which triggered a restart and, as expected, resolved the communication issue.

In the debug logs you can now see the flow being added:

2021-06-04T21:58:27.064896598Z flow add table=80, priority=150, reg1=2075391, ip, nw_dst=10.97.106.200, reg0=2075391, ip, nw_src=10.97.14.152, tcp, tp_dst=5432,  actions=output:NXM_NX_REG2[]
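
Note that the debug log prints the register values in decimal (reg0=2075391) while the flow dumps earlier show them in hex (reg0=0x1faaff). A small Python sketch for rewriting a logged flow so it can be compared against a dump (the function name `regs_to_hex` is illustrative):

```python
import re


def regs_to_hex(flow: str) -> str:
    """Rewrite decimal regN=<int> matches in hex; hex values are left alone."""
    def repl(m: re.Match) -> str:
        value = int(m.group(2))
        # Zero stays as plain 0, matching the reg0=0 entries in the dumps.
        return f"{m.group(1)}={value:#x}" if value else f"{m.group(1)}=0"

    return re.sub(r"\b(reg\d+)=(\d+)\b", repl, flow)


logged = "reg1=2075391, ip, nw_dst=10.97.106.200, reg0=2075391, ip, nw_src=10.97.14.152, tcp, tp_dst=5432"
print(regs_to_hex(logged))
```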

Parsing the NetworkPolicy shows it as unchanged:

2021-06-04T21:58:26.659861311Z I0604 21:58:26.659816 1280823 networkpolicy.go:596] Parsed NetworkPolicy: &node.npPolicy{policy:v1.NetworkPolicy{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"db-oli", GenerateName:"", Namespace:"8ad0ea-dev", SelfLink:"/apis/networking.k8s.io/v1/namespaces/8ad0ea-dev/networkpolicies/db-oli", UID:"c8f08526-1f5e-407a-9942-10471871b536", ResourceVersion:"1131519476", Generation:1, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63758414551, loc:(*time.Location)(0x2ce4220)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app":"Offline-Indexing-oli", "app-group":"offline-indexing", "env":"dev", "name":"db-oli"}, Annotations:map[string]string{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"networking.k8s.io/v1\",\"kind\":\"NetworkPolicy\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"Offline-Indexing-oli\",\"app-group\":\"offline-indexing\",\"env\":\"dev\",\"name\":\"db-oli\"},\"name\":\"db-oli\",\"namespace\":\"8ad0ea-dev\"},\"spec\":{\"description\":\"Allow the api, msg queue worker, backup container, and schema spy to access the 
database.\\n\",\"ingress\":[{\"from\":[{\"namespaceSelector\":{\"matchLabels\":{\"environment\":\"dev\",\"name\":\"8ad0ea\"}},\"podSelector\":{\"matchLabels\":{\"app\":\"Offline-Indexing-oli\",\"env\":\"dev\",\"role\":\"api\"}}},{\"namespaceSelector\":{\"matchLabels\":{\"environment\":\"dev\",\"name\":\"8ad0ea\"}},\"podSelector\":{\"matchLabels\":{\"app\":\"Offline-Indexing-oli\",\"env\":\"dev\",\"role\":\"msg-queue-worker\"}}},{\"namespaceSelector\":{\"matchLabels\":{\"environment\":\"dev\",\"name\":\"8ad0ea\"}},\"podSelector\":{\"matchLabels\":{\"app\":\"Backup-oli\",\"env\":\"dev\",\"role\":\"backup\"}}},{\"namespaceSelector\":{\"matchLabels\":{\"environment\":\"dev\",\"name\":\"8ad0ea\"}},\"podSelector\":{\"matchLabels\":{\"app\":\"Offline-Indexing-oli\",\"env\":\"dev\",\"role\":\"schema-spy\"}}},{\"namespaceSelector\":{\"matchLabels\":{\"environment\":\"dev\",\"name\":\"8ad0ea\"}},\"podSelector\":{\"matchLabels\":{\"app\":\"Offline-Indexing-oli\",\"env\":\"dev\",\"role\":\"offline-indexer\"}}}],\"ports\":[{\"port\":5432,\"protocol\":\"TCP\"}]}],\"podSelector\":{\"matchLabels\":{\"app\":\"Offline-Indexing-oli\",\"env\":\"dev\",\"role\":\"db\"}}}}\n"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"oc.exe", Operation:"Update", APIVersion:"networking.k8s.io/v1", Time:(*v1.Time)(0xc0028419c0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0028419e0)}}}, Spec:v1.NetworkPolicySpec{PodSelector:v1.LabelSelector{MatchLabels:map[string]string{"app":"Offline-Indexing-oli", "env":"dev", "role":"db"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}, Ingress:[]v1.NetworkPolicyIngressRule{v1.NetworkPolicyIngressRule{Ports:[]v1.NetworkPolicyPort{v1.NetworkPolicyPort{Protocol:(*v1.Protocol)(0xc0021bd680), Port:(*intstr.IntOrString)(0xc002841a00)}}, From:[]v1.NetworkPolicyPeer{v1.NetworkPolicyPeer{PodSelector:(*v1.LabelSelector)(0xc002841a40), 
NamespaceSelector:(*v1.LabelSelector)(0xc002841a60), IPBlock:(*v1.IPBlock)(nil)}, v1.NetworkPolicyPeer{PodSelector:(*v1.LabelSelector)(0xc002841a80), NamespaceSelector:(*v1.LabelSelector)(0xc002841aa0), IPBlock:(*v1.IPBlock)(nil)}, v1.NetworkPolicyPeer{PodSelector:(*v1.LabelSelector)(0xc002841ac0), NamespaceSelector:(*v1.LabelSelector)(0xc002841ae0), IPBlock:(*v1.IPBlock)(nil)}, v1.NetworkPolicyPeer{PodSelector:(*v1.LabelSelector)(0xc002841b00), NamespaceSelector:(*v1.LabelSelector)(0xc002841b40), IPBlock:(*v1.IPBlock)(nil)}, v1.NetworkPolicyPeer{PodSelector:(*v1.LabelSelector)(0xc002841b80), NamespaceSelector:(*v1.LabelSelector)(0xc002841ba0), IPBlock:(*v1.IPBlock)(nil)}}}}, Egress:[]v1.NetworkPolicyEgressRule(nil), PolicyTypes:[]v1.PolicyType{"Ingress"}}}, watchesNamespaces:true, watchesAllPods:true, watchesOwnPods:true, flows:[]string{"ip, nw_dst=10.97.106.200, reg0=2075391, ip, nw_src=10.97.14.152, tcp, tp_dst=5432, "}, selectedIPs:[]string{"10.97.106.200"}}
2021-06-04T21:58:26.659875067Z I0604 21:58:26.659856 1280823 networkpolicy.go:622] NetworkPolicy 8ad0ea-dev/db-oli is unchanged


Version-Release number of selected component (if applicable):

4.6.25

How reproducible:

Random but occurs quite often across different projects

Steps to Reproduce:
1.
2.
3.

Actual results:

Communication between pods is blocked.

Expected results:


Additional info:

Comment 4 Matthew Robson 2021-06-09 16:56:11 UTC
Created attachment 1789629 [details]
sdn pods memory graph

Comment 5 Matthew Robson 2021-06-09 16:56:37 UTC
Created attachment 1789630 [details]
sdn pods cpu graph

Comment 7 Matthew Robson 2021-06-09 17:13:22 UTC
Created attachment 1789633 [details]
sdn pod memory sorted

Comment 8 Matthew Robson 2021-06-09 17:13:50 UTC
Created attachment 1789634 [details]
sdn pod cpu sorted

Comment 27 Dan Winship 2021-06-21 14:59:30 UTC
(In reply to Matthew Robson from comment #26)
> While there may be more we can look into with respect to why it ends up
> holding so much memory, can we at least look to add some logging to help
> warn or debug this issue in openshift-sdn?

Yup. Already filed a Jira issue about this last week: https://issues.redhat.com/browse/SDN-1960

Comment 28 Dan Winship 2021-07-27 14:12:42 UTC
OK, so:

1. The customer problem was worked around by removing NetworkPolicies that behaved pathologically under openshift-sdn
2. There is a Jira issue about providing better feedback to the user when this happens in the future
3. The support case has been closed

so there's nothing more to do here... I guess I'll call this "DEFERRED" since we created a Jira to address part of it

