Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1594751

Summary: ovs-multitenant randomly drops outgoing packets
Product: OpenShift Container Platform
Reporter: Juan Luis de Sousa-Valadas <jdesousa>
Component: Networking
Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED NOTABUG
Docs Contact:
Severity: urgent
Priority: urgent
CC: aos-bugs, bbennett, cdc, danw, dzhukous, gferrazs, jdesousa, pasik, rpenta
Version: 3.7.0
Flags: jdesousa: needinfo+
Target Milestone: ---
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-26 15:59:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  ovs flows (no flags)

Description Juan Luis de Sousa-Valadas 2018-06-25 10:44:37 UTC
Description of problem:

When using the redhat/openshift-ovs-multitenant SDN plugin, some outgoing packets are lost in (I believe) the virtual switch.


Client:
tshark -r  client.pcap -Y 'udp.port == 4789 and ip.addr == 1.240.14.235 and ip.addr == 1.240.26.1 and tcp.port == 60094'
 5783 2018-06-22 12:14:17.357092   1.240.26.1 → 1.240.14.235 TCP 124 60094 → 8080 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=870419283 TSecr=0 WS=128
 5789 2018-06-22 12:14:17.357965 1.240.14.235 → 1.240.26.1   TCP 124 8080 → 60094 [SYN, ACK] Seq=0 Ack=1 Win=26697 Len=0 MSS=8911 SACK_PERM=1 TSval=866627658 TSecr=870419283 WS=128
 5790 2018-06-22 12:14:17.358025   1.240.26.1 → 1.240.14.235 TCP 116 60094 → 8080 [ACK] Seq=1 Ack=1 Win=26752 Len=0 TSval=870419284 TSecr=866627658
 5791 2018-06-22 12:14:17.358088   1.240.26.1 → 1.240.14.235 HTTP 197 GET / HTTP/1.1  <- This is the HTTP GET

 Retransmissions 
 6136 2018-06-22 12:14:17.558977   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419485 TSecr=866627658
 6425 2018-06-22 12:14:17.759983   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419686 TSecr=866627658
 6954 2018-06-22 12:14:18.162986   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420089 TSecr=866627658
 7907 2018-06-22 12:14:18.970004   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420896 TSecr=866627658
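The retransmission timing above is itself informative: the gap between attempts starts at roughly 0.2 s and then doubles, which is the classic TCP RTO exponential backoff seen when segments are silently dropped rather than rejected. A quick sketch of computing the gaps (timestamps copied from frames 5791, 6136, 6425, 6954 and 7907 of the client capture above):

```shell
# Timestamps (seconds) of the original GET and its four retransmissions,
# taken from the client-side capture shown above.
printf '%s\n' 17.358088 17.558977 17.759983 18.162986 18.970004 |
awk 'NR > 1 { printf "%.3f\n", $1 - prev } { prev = $1 }'
```

This prints gaps of about 0.201, 0.201, 0.403 and 0.807 seconds: an initial ~200 ms RTO that doubles on each timeout.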

Pod:

$ tshark -r  pod.pcap -Y 'tcp.port == 60094'
   57 2018-06-22 12:14:17.357513   1.240.26.1 → 1.240.14.235 TCP 74 60094 → 8080 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=870419283 TSecr=0 WS=128
   58 2018-06-22 12:14:17.357544 1.240.14.235 → 1.240.26.1   TCP 74 8080 → 60094 [SYN, ACK] Seq=0 Ack=1 Win=26697 Len=0 MSS=8911 SACK_PERM=1 TSval=866627658 TSecr=870419283 WS=128
   59 2018-06-22 12:14:17.358351   1.240.26.1 → 1.240.14.235 TCP 66 60094 → 8080 [ACK] Seq=1 Ack=1 Win=26752 Len=0 TSval=870419284 TSecr=866627658
   60 2018-06-22 12:14:17.358418   1.240.26.1 → 1.240.14.235 HTTP 147 GET / HTTP/1.1  <- Receives the GET request
   61 2018-06-22 12:14:17.358439 1.240.14.235 → 1.240.26.1   TCP 66 8080 → 60094 [ACK] Seq=1 Ack=82 Win=26752 Len=0 TSval=866627659 TSecr=870419284 <- TCP ACK
   62 2018-06-22 12:14:17.358566 1.240.14.235 → 1.240.26.1   HTTP 200 HTTP/1.1 200 OK  (text/plain) <-Replies

   The pod receives everything but there seems to be a problem with the outgoing traffic from pod to node:
   63 2018-06-22 12:14:17.558845 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866627860 TSecr=870419284
   64 2018-06-22 12:14:17.559375   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
   65 2018-06-22 12:14:17.559393 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 61#1] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866627860 TSecr=870419485 SLE=1 SRE=82
   66 2018-06-22 12:14:17.760448   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
   67 2018-06-22 12:14:17.760485 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 61#2] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866628061 TSecr=870419686 SLE=1 SRE=82
   76 2018-06-22 12:14:18.161863 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866628463 TSecr=870419686
   77 2018-06-22 12:14:18.163394   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
   78 2018-06-22 12:14:18.163411 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 61#3] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866628464 TSecr=870420089 SLE=1 SRE=82
   79 2018-06-22 12:14:18.564846 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866628866 TSecr=870420089
   80 2018-06-22 12:14:18.970413   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
   81 2018-06-22 12:14:18.970432 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 61#4] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866629271 TSecr=870420896 SLE=1 SRE=82
   88 2018-06-22 12:14:19.370852 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866629672 TSecr=870420896
   89 2018-06-22 12:14:20.393646   1.240.26.1 → 1.240.14.235 TCP 66 60094 → 8080 [FIN, ACK] Seq=82 Ack=1 Win=26752 Len=0 TSval=870422319 TSecr=866627658
   90 2018-06-22 12:14:20.393750 1.240.14.235 → 1.240.26.1   TCP 66 8080 → 60094 [FIN, ACK] Seq=135 Ack=83 Win=26752 Len=0 TSval=866630694 TSecr=870422319
   92 2018-06-22 12:14:20.582370   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
   93 2018-06-22 12:14:20.582390 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 90#1] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866630883 TSecr=870422508 SLE=1 SRE=83
   94 2018-06-22 12:14:20.982872 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866631284 TSecr=870422508
  124 2018-06-22 12:14:23.802438   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
  125 2018-06-22 12:14:23.802465 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 90#2] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866634103 TSecr=870425728 SLE=1 SRE=83
  130 2018-06-22 12:14:24.210846 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866634512 TSecr=870425728
  222 2018-06-22 12:14:30.250525   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
  223 2018-06-22 12:14:30.250570 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 90#3] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866640551 TSecr=870432176 SLE=1 SRE=83
  226 2018-06-22 12:14:30.658847 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866640960 TSecr=870432176
  340 2018-06-22 12:14:43.130406   1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1 
  341 2018-06-22 12:14:43.130427 1.240.14.235 → 1.240.26.1   TCP 78 [TCP Dup ACK 90#4] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866653431 TSecr=870445056 SLE=1 SRE=83
  344 2018-06-22 12:14:43.554881 1.240.14.235 → 1.240.26.1   TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866653856 TSecr=870445056

Node:

$ tshark -r  node.pcap -Y 'udp.port == 4789 and ip.addr == 1.240.14.235 and ip.addr == 1.240.26.1 and tcp.port == 60094'
  182 2018-06-22 12:14:17.357463   1.240.26.1 → 1.240.14.235 TCP 124 60094 → 8080 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=870419283 TSecr=0 WS=128
  183 2018-06-22 12:14:17.357576 1.240.14.235 → 1.240.26.1   TCP 124 8080 → 60094 [SYN, ACK] Seq=0 Ack=1 Win=26697 Len=0 MSS=8911 SACK_PERM=1 TSval=866627658 TSecr=870419283 WS=128
  184 2018-06-22 12:14:17.358332   1.240.26.1 → 1.240.14.235 TCP 116 60094 → 8080 [ACK] Seq=1 Ack=1 Win=26752 Len=0 TSval=870419284 TSecr=866627658
  185 2018-06-22 12:14:17.358404   1.240.26.1 → 1.240.14.235 HTTP 197 GET / HTTP/1.1 <- This is the GET
       <- The reply is missing!
  186 2018-06-22 12:14:17.559346   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419485 TSecr=866627658
  187 2018-06-22 12:14:17.760385   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419686 TSecr=866627658
  195 2018-06-22 12:14:18.163368   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420089 TSecr=866627658
  196 2018-06-22 12:14:18.970383   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420896 TSecr=866627658
  202 2018-06-22 12:14:20.393598   1.240.26.1 → 1.240.14.235 TCP 116 60094 → 8080 [FIN, ACK] Seq=82 Ack=1 Win=26752 Len=0 TSval=870422319 TSecr=866627658
  208 2018-06-22 12:14:20.582338   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870422508 TSecr=866627658
  267 2018-06-22 12:14:23.802389   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870425728 TSecr=866627658
  441 2018-06-22 12:14:30.250451   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870432176 TSecr=866627658
  702 2018-06-22 12:14:43.130379   1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870445056 TSecr=866627658


Some outgoing packets do get switched, so I believe the Open vSwitch flows are correct.
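A first sanity check in situations like this is to confirm the flow table actually has entries for the affected pod IP; for a real diagnosis, running `ovs-appctl ofproto/trace` on the failing 5-tuple would be the next step. The sketch below greps a saved flow dump; the two flow lines are invented for illustration (only the pod IP 1.240.14.235 comes from the captures above):

```shell
# Hypothetical flow-dump excerpt; real output would come from e.g.
#   ovs-ofctl dump-flows br0
cat > /tmp/flows.txt <<'EOF'
 cookie=0x0, table=20, priority=100,arp,in_port=3,arp_spa=1.240.14.235 actions=goto_table:21
 cookie=0x0, table=70, priority=100,ip,nw_dst=1.240.14.235 actions=output:3
EOF
grep -c '1\.240\.14\.235' /tmp/flows.txt   # count flows mentioning the pod IP
```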


Version-Release number of selected component (if applicable):
atomic-openshift-node-3.7.46-1.git.0.e81594b.el7.x86_64

How reproducible:
This only happens on one node, intermittently, but it is easy to reproduce.

Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Dan Winship 2018-06-27 16:22:43 UTC
OVS shouldn't be dropping packets unless the machine is *very obviously* overloaded, which I assume you would have mentioned if that was the case. It seems more likely to me that there are bad OVS flows or something like that.

Can you get an sosreport from the node?

Comment 4 Juan Luis de Sousa-Valadas 2018-06-27 16:37:20 UTC
Created attachment 1455087 [details]
ovs flows

Comment 5 Dan Winship 2018-06-27 17:06:05 UTC
That doesn't include the IP that was failing before... is the node currently demonstrating the problem? If so, with what pod IPs?

Comment 9 Dan Winship 2018-07-10 18:22:00 UTC
> atomic-openshift-node-3.7.46-1.git.0.e81594b.el7.x86_64

Can you get them to confirm that they actually are running that version on all nodes?

Are they using EgressNetworkPolicies with dnsName matches? It looks like this might be bug 1571433 (which could cause outbound packets to be dropped if a remote DNS server was slow to respond while OpenShift was updating EgressNetworkPolicy rules), but that was supposed to be fixed in 3.7.46.

Comment 10 Juan Luis de Sousa-Valadas 2018-07-11 08:12:56 UTC
>  Can you get them to confirm that they actually are running that version on all nodes?
Yes, it's the same

> Are they using EgressNetworkPolicies with dnsName matches?
No. Multitenant

Comment 11 Dan Winship 2018-07-11 12:46:43 UTC
(In reply to Juan Luis de Sousa-Valadas from comment #10)
> > Are they using EgressNetworkPolicies with dnsName matches?
> No. Multitenant

No, EgressNetworkPolicy is something different from NetworkPolicy. They are definitely using EgressNetworkPolicy; I can see that in the OVS flows. The question is whether they're using it just with IP addresses or if they are using dnsName matches (which is the thing that was broken before that is supposed to be fixed in 3.7.46).

Comment 12 Dan Winship 2018-07-11 16:46:48 UTC
OK, the sosreport output shows that 3.7.46 was installed on June 5, and atomic-openshift-node.service was last restarted on June 12, so, yes, it's running 3.7.46, which (I've double-checked) has the EgressNetworkPolicy DNS fix.

Also, the provided pcaps show lossage from a pod to a node SDN IP, which wouldn't be affected by EgressNetworkPolicy anyway.

And yet... the failure *was* occurring at the time of the ENP reload, and was outbound-only...

Comment 14 Juan Luis de Sousa-Valadas 2018-07-16 14:36:11 UTC
There is no egressnetworkpolicy with dnsName. All of them are cidrSelector

Comment 17 Dan Winship 2018-08-22 18:15:02 UTC
(In reply to Dan Winship from comment #3)
> OVS shouldn't be dropping packets unless the machine is *very obviously*
> overloaded

sos_commands/network/ip_-s_-d_link in the sosreport does show 13872207 packets dropped on the OVS vxlan interface (the only interface it shows dropped packets for).
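For reference, a sketch of pulling that TX drop counter out of `ip -s -d link` style output; the interface block below is a mock-up (only the 13872207 figure comes from the sosreport, the other numbers are invented):

```shell
# Mock-up of one interface block from sos_commands/network/ip_-s_-d_link
cat > /tmp/ip_link.txt <<'EOF'
4: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 state UNKNOWN
    RX: bytes  packets  errors  dropped overrun mcast
    32212254720 44000000 0      0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    62277025792 51000000 0      13872207 0      0
EOF
# "dropped" is the 4th column of the data line following the TX header
awk '/TX: bytes/ { getline; print $4 }' /tmp/ip_link.txt
```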

OTOH I see now in the case:

> The node itself seems to respond slow generally. For example, If I do ssh on that node.

which seems to suggest a general (non-OVS) network or CPU load problem; if the problem was limited to OVS then it shouldn't affect ssh access, which wouldn't touch the OVS interfaces.

Is it possible that some pod that was running on this node was chewing up CPU or network? That might also explain why disabling the broken node just causes another node to become broken: because the bad pod gets rescheduled to another node and starts breaking things there.

The network interfaces show:

   eth0: 364 GB rx / 182 GB tx
   tun0: 148 GB rx / 312 GB tx
  vxlan:  30 GB rx /  58 GB tx

over 15 days. Is that the sort of network usage they expect to see or is that excessive?
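As a rough plausibility check (my arithmetic, not from the case; decimal GB and a flat 15-day window assumed), those totals average out to only a few megabits per second:

```shell
# GB over 15 days -> average Mbit/s, for the busiest direction of each interface
for pair in eth0_rx:364 tun0_tx:312 vxlan_tx:58; do
  name=${pair%%:*}; gb=${pair##*:}
  awk -v n="$name" -v g="$gb" \
    'BEGIN { printf "%s %.1f Mbit/s\n", n, g * 8 * 1000 / (15 * 86400) }'
done
```

Averaged out, these rates are tiny, so if a pod really was saturating the network it would have had to do so in bursts rather than as sustained load.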

The journal/messages files seem to be full of random debug spew and java tracebacks. Not sure how much that's expected.

Comment 18 Juan Luis de Sousa-Valadas 2018-09-05 10:24:06 UTC
This node is not schedulable, and according to the sar report the CPU is always between 94.4% and 99.5% idle.

There is a new sosreport.

Comment 21 Dan Winship 2018-11-07 14:24:14 UTC
(In reply to Juan Luis de Sousa-Valadas from comment #18)
> This node is not schedulable and checking the sar report, the CPU has always
> between 94.4 and 99.5 cpu idle.

But was the CPU/network problem occurring at any point during the period sar was running?

It's hard to get any information out of the updated sosreport without knowing when the problem was occurring.

I had asked some questions before which didn't get answered:

(In reply to Dan Winship from comment #17)
> Is it possible that some pod that was running on this node was chewing up
> CPU or network? That might also explain why disabling the broken node just
> causes another node to become broken: because the bad pod gets rescheduled
> to another node and starts breaking things there.
> 
> The network interfaces show:
> 
>    eth0: 364 GB rx / 182 GB tx
>    tun0: 148 GB rx / 312 GB tx
>   vxlan:  30 GB rx /  58 GB tx
> 
> over 15 days. Is that the sort of network usage they expect to see or is
> that excessive?

Even if the CPU usage isn't excessive, it's possible a pod might be able to saturate the network without saturating the CPU.

From what I can see, my best guess is still that there is a specific pod spamming the network.

Comment 22 Ben Bennett 2019-03-26 15:59:22 UTC
It looks like the customer indeed had a troublesome pod that was saturating the network. When it got evicted from one node, it would then break the next node it got scheduled on.