Bug 1594751
| Summary: | ovs-multitenant randomly drops outgoing packets |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | Juan Luis de Sousa-Valadas <jdesousa> |
| Component: | Networking |
| Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn |
| QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED NOTABUG |
| Docs Contact: | |
| Severity: | urgent |
| Priority: | urgent |
| CC: | aos-bugs, bbennett, cdc, danw, dzhukous, gferrazs, jdesousa, pasik, rpenta |
| Version: | 3.7.0 |
| Flags: | jdesousa: needinfo+ |
| Target Milestone: | --- |
| Target Release: | 3.7.z |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2019-03-26 15:59:22 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Attachments: | |
OVS shouldn't be dropping packets unless the machine is *very obviously* overloaded, which I assume you would have mentioned if that were the case. It seems more likely to me that there are bad OVS flows or something like that. Can you get an sosreport from the node?

---

Created attachment 1455087 [details]
ovs flows

---

That doesn't include the IP that was failing before... is the node currently demonstrating the problem? If so, with what pod IPs?

---

> atomic-openshift-node-3.7.46-1.git.0.e81594b.el7.x86_64

Can you get them to confirm that they actually are running that version on all nodes? Are they using EgressNetworkPolicies with dnsName matches? It looks like this might be bug 1571433 (which could cause outbound packets to be dropped if a remote DNS server was slow to respond while OpenShift was updating EgressNetworkPolicy rules), but that was supposed to be fixed in 3.7.46.

---

> Can you get them to confirm that they actually are running that version on all nodes?

Yes, it's the same.

> Are they using EgressNetworkPolicies with dnsName matches?

No. Multitenant.

---

(In reply to Juan Luis de Sousa-Valadas from comment #10)
> > Are they using EgressNetworkPolicies with dnsName matches?
> No. Multitenant

No, EgressNetworkPolicy is something different from NetworkPolicy. They are definitely using EgressNetworkPolicy; I can see that in the OVS flows. The question is whether they're using it just with IP addresses or with dnsName matches (which is the thing that was broken before and is supposed to be fixed in 3.7.46).

---

OK, the sosreport output shows that 3.7.46 was installed on June 5, and atomic-openshift-node.service was last restarted on June 12, so, yes, it's running 3.7.46, which (I've double-checked) has the EgressNetworkPolicy DNS fix. Also, the provided pcaps show lossage from a pod to a node SDN IP, which wouldn't be affected by EgressNetworkPolicy anyway. And yet... the failure *was* occurring at the time of the ENP reload, and was outbound-only...

---

There is no EgressNetworkPolicy with dnsName. All of them are cidrSelector.

---

(In reply to Dan Winship from comment #3)
> OVS shouldn't be dropping packets unless the machine is *very obviously*
> overloaded

sos_commands/network/ip_-s_-d_link in the sosreport does show 13872207 packets dropped for the OVS vxlan interface (which is the only interface it shows dropped packets for). OTOH, I now see in the case:

> The node itself seems to respond slow generally. For example, if I do ssh on that node.

which seems to suggest a general (non-OVS) network or CPU load problem; if the problem were limited to OVS, it shouldn't affect ssh access, which doesn't touch the OVS interfaces.

Is it possible that some pod that was running on this node was chewing up CPU or network? That might also explain why disabling the broken node just causes another node to become broken: the bad pod gets rescheduled to another node and starts breaking things there.

The network interfaces show:

    eth0:  364 GB rx / 182 GB tx
    tun0:  148 GB rx / 312 GB tx
    vxlan:  30 GB rx /  58 GB tx

over 15 days. Is that the sort of network usage they expect to see, or is it excessive?

The journal/messages files seem to be full of random debug spew and Java tracebacks. Not sure how much of that is expected.

---

This node is not schedulable, and checking the sar report, CPU idle has always been between 94.4% and 99.5%. There is a new sosreport.

---

(In reply to Juan Luis de Sousa-Valadas from comment #18)
> This node is not schedulable and checking the sar report, the CPU has always
> between 94.4 and 99.5 cpu idle.

But was the cpu/network problem occurring at any point during the period sar was running? It's hard to get any information out of the updated sosreport without knowing when the problem was occurring.

I had asked some questions before which didn't get answered:

(In reply to Dan Winship from comment #17)
> Is it possible that some pod that was running on this node was chewing up
> CPU or network? That might also explain why disabling the broken node just
> causes another node to become broken: because the bad pod gets rescheduled
> to another node and starts breaking things there.
>
> The network interfaces show:
>
>     eth0: 364 GB rx / 182 GB tx
>     tun0: 148 GB rx / 312 GB tx
>     vxlan: 30 GB rx / 58 GB tx
>
> over 15 days. Is that the sort of network usage they expect to see or is
> that excessive?

Even if the CPU usage isn't excessive, it's possible a pod might be able to saturate the network without saturating the CPU. From what I can see, my best guess is still that there is a specific pod which is spamming the network.

---

It looks like the customer indeed had a troublesome pod that was saturating the network. And when it got evicted from one node, it would then break the next node it got scheduled on.
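For reference, the distinction discussed in the thread: an EgressNetworkPolicy rule matches either an IP range (cidrSelector) or a DNS name (dnsName), and only dnsName matches were affected by bug 1571433. A minimal sketch of the two forms (the policy name and addresses below are made up for illustration, not taken from this case):

```yaml
# Hypothetical EgressNetworkPolicy showing both selector kinds.
# This customer's policies used cidrSelector rules only, which were
# not affected by the dnsName bug (1571433).
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 192.0.2.0/24   # IP-based match; unaffected
  - type: Deny
    to:
      dnsName: www.example.com     # DNS-based match; the previously broken path
```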
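As a sanity check on the interface counters quoted in the thread, the byte totals can be averaged into sustained rates. A back-of-the-envelope sketch, assuming decimal gigabytes over exactly 15 days:

```python
# Average sustained rates for the counters quoted above
# (eth0: 364 GB rx / 182 GB tx, tun0: 148/312, vxlan: 30/58 over 15 days).
SECONDS = 15 * 24 * 3600  # 15-day sample window

def avg_mbit_per_s(gigabytes: float, seconds: int = SECONDS) -> float:
    """Average rate in Mbit/s for a decimal-GB byte total over a window."""
    return gigabytes * 1e9 * 8 / seconds / 1e6

for name, rx_gb, tx_gb in [("eth0", 364, 182), ("tun0", 148, 312), ("vxlan", 30, 58)]:
    print(f"{name}: {avg_mbit_per_s(rx_gb):.1f} Mbit/s rx, "
          f"{avg_mbit_per_s(tx_gb):.1f} Mbit/s tx average")
```

The averages come out quite low (a couple of Mbit/s at most), which is consistent with the later point in the thread: a 15-day average can hide short bursts, so totals alone can neither confirm nor rule out a pod saturating the network.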
Description of problem:

When using the SDN redhat/openshift-ovs-multitenant, some outgoing packets are lost in (I believe) the virtual switch.

Client:

```
$ tshark -r client.pcap -Y 'udp.port == 4789 and ip.addr == 1.240.14.235 and ip.addr == 1.240.26.1 and tcp.port == 60094'
5783 2018-06-22 12:14:17.357092 1.240.26.1 → 1.240.14.235 TCP 124 60094 → 8080 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=870419283 TSecr=0 WS=128
5789 2018-06-22 12:14:17.357965 1.240.14.235 → 1.240.26.1 TCP 124 8080 → 60094 [SYN, ACK] Seq=0 Ack=1 Win=26697 Len=0 MSS=8911 SACK_PERM=1 TSval=866627658 TSecr=870419283 WS=128
5790 2018-06-22 12:14:17.358025 1.240.26.1 → 1.240.14.235 TCP 116 60094 → 8080 [ACK] Seq=1 Ack=1 Win=26752 Len=0 TSval=870419284 TSecr=866627658
5791 2018-06-22 12:14:17.358088 1.240.26.1 → 1.240.14.235 HTTP 197 GET / HTTP/1.1    <- This is the HTTP GET
```

Retransmissions:

```
6136 2018-06-22 12:14:17.558977 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419485 TSecr=866627658
6425 2018-06-22 12:14:17.759983 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419686 TSecr=866627658
6954 2018-06-22 12:14:18.162986 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420089 TSecr=866627658
7907 2018-06-22 12:14:18.970004 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420896 TSecr=866627658
```

Pod:

```
$ tshark -r pod.pcap -Y 'tcp.port == 60094'
57 2018-06-22 12:14:17.357513 1.240.26.1 → 1.240.14.235 TCP 74 60094 → 8080 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=870419283 TSecr=0 WS=128
58 2018-06-22 12:14:17.357544 1.240.14.235 → 1.240.26.1 TCP 74 8080 → 60094 [SYN, ACK] Seq=0 Ack=1 Win=26697 Len=0 MSS=8911 SACK_PERM=1 TSval=866627658 TSecr=870419283 WS=128
59 2018-06-22 12:14:17.358351 1.240.26.1 → 1.240.14.235 TCP 66 60094 → 8080 [ACK] Seq=1 Ack=1 Win=26752 Len=0 TSval=870419284 TSecr=866627658
60 2018-06-22 12:14:17.358418 1.240.26.1 → 1.240.14.235 HTTP 147 GET / HTTP/1.1    <- Receives the GET request
61 2018-06-22 12:14:17.358439 1.240.14.235 → 1.240.26.1 TCP 66 8080 → 60094 [ACK] Seq=1 Ack=82 Win=26752 Len=0 TSval=866627659 TSecr=870419284    <- TCP ACK
62 2018-06-22 12:14:17.358566 1.240.14.235 → 1.240.26.1 HTTP 200 HTTP/1.1 200 OK (text/plain)    <- Replies
```

The pod receives everything, but there seems to be a problem with the outgoing traffic from pod to node:

```
63 2018-06-22 12:14:17.558845 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866627860 TSecr=870419284
64 2018-06-22 12:14:17.559375 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
65 2018-06-22 12:14:17.559393 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 61#1] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866627860 TSecr=870419485 SLE=1 SRE=82
66 2018-06-22 12:14:17.760448 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
67 2018-06-22 12:14:17.760485 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 61#2] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866628061 TSecr=870419686 SLE=1 SRE=82
76 2018-06-22 12:14:18.161863 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866628463 TSecr=870419686
77 2018-06-22 12:14:18.163394 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
78 2018-06-22 12:14:18.163411 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 61#3] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866628464 TSecr=870420089 SLE=1 SRE=82
79 2018-06-22 12:14:18.564846 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866628866 TSecr=870420089
80 2018-06-22 12:14:18.970413 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
81 2018-06-22 12:14:18.970432 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 61#4] 8080 → 60094 [ACK] Seq=135 Ack=82 Win=26752 Len=0 TSval=866629271 TSecr=870420896 SLE=1 SRE=82
88 2018-06-22 12:14:19.370852 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [PSH, ACK] Seq=1 Ack=82 Win=26752 Len=134 TSval=866629672 TSecr=870420896
89 2018-06-22 12:14:20.393646 1.240.26.1 → 1.240.14.235 TCP 66 60094 → 8080 [FIN, ACK] Seq=82 Ack=1 Win=26752 Len=0 TSval=870422319 TSecr=866627658
90 2018-06-22 12:14:20.393750 1.240.14.235 → 1.240.26.1 TCP 66 8080 → 60094 [FIN, ACK] Seq=135 Ack=83 Win=26752 Len=0 TSval=866630694 TSecr=870422319
92 2018-06-22 12:14:20.582370 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
93 2018-06-22 12:14:20.582390 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 90#1] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866630883 TSecr=870422508 SLE=1 SRE=83
94 2018-06-22 12:14:20.982872 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866631284 TSecr=870422508
124 2018-06-22 12:14:23.802438 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
125 2018-06-22 12:14:23.802465 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 90#2] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866634103 TSecr=870425728 SLE=1 SRE=83
130 2018-06-22 12:14:24.210846 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866634512 TSecr=870425728
222 2018-06-22 12:14:30.250525 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
223 2018-06-22 12:14:30.250570 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 90#3] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866640551 TSecr=870432176 SLE=1 SRE=83
226 2018-06-22 12:14:30.658847 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866640960 TSecr=870432176
340 2018-06-22 12:14:43.130406 1.240.26.1 → 1.240.14.235 HTTP 147 [TCP Spurious Retransmission] GET / HTTP/1.1
341 2018-06-22 12:14:43.130427 1.240.14.235 → 1.240.26.1 TCP 78 [TCP Dup ACK 90#4] 8080 → 60094 [ACK] Seq=136 Ack=83 Win=26752 Len=0 TSval=866653431 TSecr=870445056 SLE=1 SRE=83
344 2018-06-22 12:14:43.554881 1.240.14.235 → 1.240.26.1 TCP 200 [TCP Retransmission] 8080 → 60094 [FIN, PSH, ACK] Seq=1 Ack=83 Win=26752 Len=134 TSval=866653856 TSecr=870445056
```

Node:

```
$ tshark -r node.pcap -Y 'udp.port == 4789 and ip.addr == 1.240.14.235 and ip.addr == 1.240.26.1 and tcp.port == 60094'
182 2018-06-22 12:14:17.357463 1.240.26.1 → 1.240.14.235 TCP 124 60094 → 8080 [SYN] Seq=0 Win=26733 Len=0 MSS=8911 SACK_PERM=1 TSval=870419283 TSecr=0 WS=128
183 2018-06-22 12:14:17.357576 1.240.14.235 → 1.240.26.1 TCP 124 8080 → 60094 [SYN, ACK] Seq=0 Ack=1 Win=26697 Len=0 MSS=8911 SACK_PERM=1 TSval=866627658 TSecr=870419283 WS=128
184 2018-06-22 12:14:17.358332 1.240.26.1 → 1.240.14.235 TCP 116 60094 → 8080 [ACK] Seq=1 Ack=1 Win=26752 Len=0 TSval=870419284 TSecr=866627658
185 2018-06-22 12:14:17.358404 1.240.26.1 → 1.240.14.235 HTTP 197 GET / HTTP/1.1    <- This is the GET <- The reply is missing!
186 2018-06-22 12:14:17.559346 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419485 TSecr=866627658
187 2018-06-22 12:14:17.760385 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870419686 TSecr=866627658
195 2018-06-22 12:14:18.163368 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420089 TSecr=866627658
196 2018-06-22 12:14:18.970383 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870420896 TSecr=866627658
202 2018-06-22 12:14:20.393598 1.240.26.1 → 1.240.14.235 TCP 116 60094 → 8080 [FIN, ACK] Seq=82 Ack=1 Win=26752 Len=0 TSval=870422319 TSecr=866627658
208 2018-06-22 12:14:20.582338 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870422508 TSecr=866627658
267 2018-06-22 12:14:23.802389 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870425728 TSecr=866627658
441 2018-06-22 12:14:30.250451 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870432176 TSecr=866627658
702 2018-06-22 12:14:43.130379 1.240.26.1 → 1.240.14.235 TCP 197 [TCP Retransmission] 60094 → 8080 [FIN, PSH, ACK] Seq=1 Ack=1 Win=26752 Len=81 TSval=870445056 TSecr=866627658
```

Some outgoing packets do get switched, so I believe the Open vSwitch flows are correct.

Version-Release number of selected component (if applicable):
atomic-openshift-node-3.7.46-1.git.0.e81594b.el7.x86_64

How reproducible:
Only happens on one node, intermittently, but easy to reproduce.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
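One detail worth noting from the client capture: the gaps between successive transmissions of the GET request grow roughly exponentially, which matches TCP's retransmission-timeout backoff (starting from the usual 200 ms minimum RTO) rather than random congestion loss. A quick sketch computing the gaps from the timestamps above (seconds within 12:14, frames 5791, 6136, 6425, 6954, 7907):

```python
# Fractional seconds within 12:14 for each (re)transmission of the GET,
# copied from the tshark output of client.pcap above.
timestamps = [17.358088, 17.558977, 17.759983, 18.162986, 18.970004]

# Gap between each transmission and the next, rounded to milliseconds.
gaps = [round(later - earlier, 3) for earlier, later in zip(timestamps, timestamps[1:])]
print(gaps)  # roughly 0.2 s, 0.2 s, 0.4 s, 0.8 s
```

Every copy of the GET (and later the FIN) is lost the same way, which points at a systematic drop on the pod-to-node path rather than occasional congestion.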