Bug 1834918
Summary: | High number of TX errors on geneve interfaces | |
---|---|---|---
Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Sai Sindhur Malleni <smalleni>
Component: | OVN | Assignee: | Ben Nemec <bnemec>
Status: | CLOSED DUPLICATE | QA Contact: | Jianlin Shi <jishi>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | RHEL 8.0 | CC: | asegurap, bbennett, ctrautma, dblack, dcbw, gnault, jbenc, jtaleric, mcambria, mcornea, mkarg, mmichels, rkhan
Target Milestone: | --- | Keywords: | UpcomingSprint
Target Release: | RHEL 8 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
 | 1841214 1843412 (view as bug list) | Environment: |
Last Closed: | 2020-06-01 15:28:23 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description (Sai Sindhur Malleni, 2020-05-12 16:27:44 UTC)
Hi, I'm from the OVN team. It's pretty difficult to get much information based on what's been presented here. OVN sets up the geneve tunnels that OCP uses, so it would be good to see logs from ovn-controller on the node(s) where you see the transmission errors. That way we can see whether OVN encountered any errors while setting up the tunnels. Similarly, the logs from ovs-vswitchd on the node(s) with transmission errors may give more information on what's going wrong here. In addition, the contents of the OVN southbound database may also be useful, so we can see how the interfaces and chassis have been configured.

One interesting data point here is that the ens2f0 and ens2f1 interfaces have lots of transmission errors, too. It's not just the geneve interfaces. So I'm curious whether something more widespread is going wrong here. I have a feeling someone from the networking-services kernel team will need to look into this if there's nothing obvious in the OVN logs that indicates errors. I'd say be prepared to dump more information for them as well.

The geneve tunnel is using ens2f0, just as an additional note.

As another data point, what is the kernel version?

Which kernel version, and which NIC? Also:
ethtool -k ens2f0
ethtool -k ens2f1

Seeing this on a newly deployed OCP cluster, with no workloads/pods running. Going to get must-gather data shortly.

The geneve tunnel is using ens2f0; ens2f1 is not being used.

2: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:fd:fe:ee:49:08 brd ff:ff:ff:ff:ff:ff
    inet 192.168.222.10/24 brd 192.168.222.255 scope global dynamic noprefixroute ens2f0
       valid_lft 2727sec preferred_lft 2727sec
    inet6 fe80::6b06:748a:ef56:1ae0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:fd:fe:ee:49:09 brd ff:ff:ff:ff:ff:ff

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-08-222601   True        False         178m    Cluster version is 4.5.0-0.nightly-2020-05-08-222601

Kernel: 4.18.0-147.8.1.el8_1.x86_64

sh-4.2# ethtool ens2f0
Settings for ens2f0:
        Supported ports: [ FIBRE ]
        Supported link modes:   25000baseSR/Full
                                10000baseSR/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None BaseR RS
        Advertised link modes:  25000baseSR/Full
                                10000baseSR/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Advertised FEC modes: None BaseR RS
        Speed: 25000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes
===========================================
sh-4.2# ethtool -i ens2f0
driver: i40e
version: 2.8.20-k
firmware-version: 6.01 0x80003554 1.1747.0
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
==============================================
sh-4.2# ethtool -k ens2f0
Features for ens2f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

As an additional data point, running any workloads (creating projects or pods) on the cluster fails with either TLS handshake and EOF errors or i/o timeout errors. Here are example errors from the client:

1. Unexpected error:
   <*url.Error | 0xc001bcb560>: {
       Op: "Post",
       URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
       Err: {s: "EOF"},
   }
   Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF

2. Get https://api.test714.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout

In Prometheus we continuously see node network transmit errors: https://snapshot.raintank.io/dashboard/snapshot/vmqPeuQ3AL8TDkorrC5wqDNe60Ap8tlp

I made sure we don't have any unexpected IPs/hosts in the baremetal environment by running an nmap scan. Happy to give access to the environment and help debug further.

Can these be turned off and try again?

tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
hw-tc-offload: on

I think the commands are (adjust ens2f0 accordingly):

ethtool -K ens2f0 tx off
ethtool -K ens2f0 sg off
ethtool -K ens2f0 tso off
ethtool -K ens2f0 gso off
ethtool -K ens2f0 tx-gre-segmentation off
ethtool -K ens2f0 tx-gre-csum-segmentation off
ethtool -K ens2f0 tx-udp_tnl-segmentation off
ethtool -K ens2f0 tx-udp_tnl-csum-segmentation off
ethtool -K ens2f0 tx-gso-partial off
ethtool -K ens2f0 hw-tc-offload off

But check to see if the values change to be sure.
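For anyone repeating the experiment suggested above, the same toggles can be applied to both NICs in one pass and the resulting state checked immediately. This is only a sketch using standard ethtool keywords; some features may report [fixed] and refuse to change, and toggles can be reverted by replacing 'off' with 'on':

# apply the suggested offload toggles and verify the result
for dev in ens2f0 ens2f1; do
    for feat in tx sg tso gso tx-gre-segmentation tx-gre-csum-segmentation \
                tx-udp_tnl-segmentation tx-udp_tnl-csum-segmentation \
                tx-gso-partial hw-tc-offload; do
        ethtool -K "$dev" "$feat" off
    done
    ethtool -k "$dev" | grep -E 'checksum|scatter|segmentation|gso|tc-offload'
done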
Actually, I already attached must-gather data earlier, never mind. Please use the first link.

*** Bug 1835376 has been marked as a duplicate of this bug. ***

Still seeing errors on the geneve interface. Do note that there are no errors seen on ens2f0 before and after the changes.

sh-4.2# ethtool -k genev_sys_6081
Features for genev_sys_6081:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: off
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

For ens2f0:

sh-4.2# ethtool -k ens2f0
Features for ens2f0:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ipv4: off
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: off
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: off
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off
tx-gre-csum-segmentation: off [requested on]
tx-ipxip4-segmentation: off
tx-ipxip6-segmentation: off
tx-udp_tnl-segmentation: off
tx-udp_tnl-csum-segmentation: off
tx-gso-partial: off
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

*** Bug 1809281 has been marked as a duplicate of this bug. ***

ens2f0 does not have any TX errors in the excerpt, unless we are looking at the wrong column.
Random question (and not a cause/fix for anything): should the bare metal deployments be using jumbo frames? I noticed the NICs were at 1500 MTU.

Irrespective of jumbo frames or not, shouldn't the MTU be set accordingly for geneve?

Even if the MTU isn't set accordingly, PMTU discovery should fix all this after a few drops. This assumes that:
a) TCP PMTU discovery is enabled,
b) ICMP messages are being generated at the geneve interface (doubtful, see below), and
c) iptables lets the ICMP "would frag" packet back to the sending pod.

Looking at a working AWS cluster, the veth interface of a pod uses an MTU of 8901 (likely the 9001-byte underlay MTU minus the roughly 100 bytes of headroom OVN-Kubernetes reserves for geneve encapsulation):

sh-4.4# ip -d link show b7da53d9b260336
13: b7da53d9b260336@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8901 qdisc noqueue master ovs-system state UP mode DEFAULT group default
    link/ether be:d4:0a:60:04:80 brd ff:ff:ff:ff:ff:ff link-netns 7f80269a-18ea-4fb7-b7d4-61be655894f1 promiscuity 1
    veth
    openvswitch_slave addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
sh-4.4#

It might be worth turning on jumbo frames to see what's different. This could be a problem, but not "the" problem.

Re (b) above: normally, a TCP segment using PMTU discovery would have the kernel automatically see that the TCP segment is larger than the MTU of the egress interface and trigger an ICMP "would frag" message. For geneve, the TCP segment is put in a UDP datagram after the geneve headers. Technically, the MTU of the egress interface isn't even known until later, after a route lookup is done to figure out what the next hop is. And if the UDP datagram is larger than the MTU of the egress interface, the source of that datagram is the local UDP stack, not the sending pod.

Coffee kicked in... Quick update to #c37: what's described above is accurate, but it only takes place at the L3-to-L2 boundary. Here we are already inside L2. The best that can be done is to IP-fragment every UDP/geneve datagram that is larger than the egress MTU, which will kill performance. To avoid this, as suggested in #c36, make the MTU of each interface connected to the bridge less than or equal to the smallest MTU of any (current or future) member of the bridge.

We'll need access to the cluster to continue digging into this. From the notes saved off from last time, we believe bare metal is using 1500 MTU everywhere, so things are consistent; we don't see a combination of jumbo and 1500 MTU. Focus is back to why the pod/veth is sending packets larger than the MTU (with a bad checksum). From an email from Michael:
> sh-4.4# ip -s link show genev_sys_6081
> 5: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc
> noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
> link/ether fe:b8:50:24:dd:9a brd ff:ff:ff:ff:ff:ff
> RX: bytes packets errors dropped overrun mcast
> 568426 3239 0 0 0 0
> TX: bytes packets errors dropped carrier collsns
> 1084980 3040 14736 0 0 0
>
> I think for geneve this counter is incremented here only:
> https://github.com/torvalds/linux/blob/master/drivers/net/geneve.c#L981
It's easy to confirm using dynamic debug:
ip -s -s a s dev genev_sys_6081
echo -n 'file drivers/net/geneve.c +p' > /sys/kernel/debug/dynamic_debug/control
sleep 5
echo -n 'file drivers/net/geneve.c -p' > /sys/kernel/debug/dynamic_debug/control
ip -s -s a s dev genev_sys_6081
Indeed, this produces dmesg messages:
[61503.656184] genev_sys_6081: no tunnel metadata
[61503.984187] genev_sys_6081: no tunnel metadata
...
The number of the messages matches the increment in tx_error stats. This confirms the drops happen due to no tunnel metadata.
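For reference, a minimal sketch of that cross-check, assuming root access on the node and the same interface name; it reuses the dynamic debug toggle from above and compares deltas over the same window:

# compare the tx_errors increment with the "no tunnel metadata" messages logged
dev=genev_sys_6081
before_err=$(cat /sys/class/net/$dev/statistics/tx_errors)
before_msg=$(dmesg | grep -c "$dev: no tunnel metadata")
echo 'file drivers/net/geneve.c +p' > /sys/kernel/debug/dynamic_debug/control
sleep 5
echo 'file drivers/net/geneve.c -p' > /sys/kernel/debug/dynamic_debug/control
after_err=$(cat /sys/class/net/$dev/statistics/tx_errors)
after_msg=$(dmesg | grep -c "$dev: no tunnel metadata")
echo "tx_errors delta: $((after_err - before_err)), messages logged: $((after_msg - before_msg))"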
Finally I was able to install perf and get some meaningful info out of the box. (For the sake of anyone else debugging this, the key command to run after sshing to a node is 'toolbox'.)

The tx_error messages are mostly caused by the 'coredns' and 'mdns-publisher' processes. They send UDP packets directly to the genev_sys_6081 interface (likely, they send to all interfaces). Understandably, those packets are dropped, as they don't (and can't) contain the lwt metadata. This is a misconfiguration of those two applications.

I'm also seeing some dropped packets sent by mld_ifc_timer_expire in the kernel. I'll look more into those.

(In reply to Jiri Benc from comment #44)
> Finally I was able to install perf and get some meaningful info out of the
> box. (For the sake of anyone else debugging this, the key command to run
> after sshing to a node is 'toolbox'.)
>
> The tx_error messages are mostly caused by the 'coredns' and 'mdns-publisher'
> processes. They send UDP packets directly to the genev_sys_6081 interface
> (likely, they send to all interfaces). Understandably, those packets are
> dropped, as they don't (and can't) contain the lwt metadata. This is a
> misconfiguration of those two applications.
>
> I'm also seeing some dropped packets sent by mld_ifc_timer_expire in the
> kernel. I'll look more into those.

Thanks Jiri. Who needs to do what to stop the UDP packets being sent to genev_sys_6081 and the other interfaces? Even if they might not be getting in the way of the scale testing, they will cause alarms for our customers. So we should try to find a cure for this large number of dropped packets, e.g. stop sending them. Are coredns and mdns-publisher in OVS or OVN, or somewhere else?

(In reply to Rashid Khan from comment #45)
> (In reply to Jiri Benc from comment #44)
> > Finally I was able to install perf and get some meaningful info out of the
> > box. (For the sake of anyone else debugging this, the key command to run
> > after sshing to a node is 'toolbox'.)
> >
> > The tx_error messages are mostly caused by the 'coredns' and 'mdns-publisher'
> > processes. They send UDP packets directly to the genev_sys_6081 interface
> > (likely, they send to all interfaces). Understandably, those packets are
> > dropped, as they don't (and can't) contain the lwt metadata. This is a
> > misconfiguration of those two applications.
> >
> > I'm also seeing some dropped packets sent by mld_ifc_timer_expire in the
> > kernel. I'll look more into those.
>
> Thanks Jiri. Who needs to do what to stop the UDP packets being sent to
> genev_sys_6081 and the other interfaces? Even if they might not be getting in
> the way of the scale testing, they will cause alarms for our customers. So we
> should try to find a cure for this large number of dropped packets, e.g. stop
> sending them. Are coredns and mdns-publisher in OVS or OVN, or somewhere else?

I cloned this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1841214 for the Network Edge team to investigate getting CoreDNS/mdns-publisher to stop whatever they are doing. It's likely we can close this bug soon, but I'd like to make sure there aren't other issues to look at (MTU, mostly).

CoreDNS-mDNS and mdns-publisher are handled by my team. We're on it!

(In reply to Antoni Segura Puimedon from comment #47)
> CoreDNS-mDNS and mdns-publisher are handled by my team. We're on it!

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1841214 as a clone for the network edge team (because DNS). Should that one get closed, and this one moved to the DNS component?
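As a side note for anyone reproducing the analysis above, a quick way to see the stray traffic directly is to capture on the geneve interface itself from the node's toolbox container. This is only a sketch, and it assumes the offending packets are mDNS (UDP port 5353), which is what CoreDNS-mDNS and mdns-publisher speak:

# capture a few of the packets that hit genev_sys_6081 directly
tcpdump -i genev_sys_6081 -nn -c 20 udp port 5353

If packets show up here, they are being handed to the geneve device without tunnel metadata and will be counted as TX errors, matching the behaviour described above.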
Closing this one as a duplicate of bug 1841214 since that bug now has a patch.

*** This bug has been marked as a duplicate of bug 1841214 ***