Description of problem:
We have a 3 master + 21 worker node bare metal deployment using OVNKubernetes. We are seeing widespread instability in the environment, with kube-scheduler pods restarting and several other TLS handshake errors. As part of debugging we ended up looking at the TX/RX errors on interfaces and found that the geneve interface on both the masters and the workers had a million+ transmit-side errors.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-04-113741

How reproducible:
100%

Steps to Reproduce:
1. Deploy with OVNKubernetes
2. Run some workload, e.g. launching pods
3. Observe interface errors

Actual results:
Million+ TX errors on the geneve tunnel interface

Expected results:
No/low number of errors

Additional info:
[root@master-0 core]# cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
b18be1669c605c1: 413588578 1240429 0 2 0 0 0 0 528657794 3171614 0 0 0 0 0 0
47a29baa21c6a62: 30323621 146108 0 2 0 0 0 0 297218856 2465274 0 0 0 0 0 0
ens2f1: 43299888 280361 0 0 0 0 0 254092 143884691 1239368 0 0 0 0 0 0
ens3f0: 49658654 308905 0 0 0 0 0 266360 143892720 1239605 0 0 0 0 0 0
231b3ee7b3e6a50: 2660710521 7689131 0 0 0 0 0 0 2643864491 10177787 0 0 0 0 0 0
be3338e7f9fd69b: 4778826958 4947955 0 2 0 0 0 0 11169925679 7876459 0 0 0 0 0 0
dd5337e8a56f5c7: 1474024 21534 0 2 0 0 0 0 283102010 2340514 0 0 0 0 0 0
lo: 124958714543 244429022 0 0 0 0 0 0 124958714543 244429022 0 0 0 0 0 0
ens3f1: 1000823673 7480273 0 0 0 0 0 5238809 45405061934 31714662 0 0 0 0 0 0
b895587b3045ce6: 283361201 1496326 0 2 0 0 0 0 1164723279 2781043 0 0 0 0 0 0
ff4817bc583218d: 342183647 1983976 0 0 0 0 0 0 928989884 4034835 0 0 0 0 0 0
ovn-k8s-gw0: 7463125754 20527322 0 0 0 0 0 0 21425455997 21625575 0 0 0 0 0 0
genev_sys_6081: 195463767146 125382597 0 0 0 0 0 0 44315269654 106570295 1672490 0 0 0 0 0
db6799ecd9fe7d7: 103399277 242733 0 2 0 0 0 0 365862649 2490227 0 0 0 0 0 0
81e19e8ad3a02bf: 268434434 1320423 0 2 0 0 0 0 506011280 3813758 0 0 0 0 0 0
ens2f0: 391148302407 787598228 0 0 0 0 0 58583981 305573768797 703018786 0 0 0 0 0 0
b41c47a2e579397: 138906457 1582257 0 2 0 0 0 0 1757501275 4015001 0 0 0 0 0 0
22aaf7e1448288b: 80244928 169520 0 2 0 0 0 0 264772250 1857246 0 0 0 0 0 0
ovn-k8s-mp0: 5957524390 12540359 0 0 0 0 0 0 3207601879 15507512 0 0 0 0 0 0
67a73da1b246095: 7393584784 94079402 0 2 0 0 0 0 193770401552 115219840 0 0 0 0 0 0
9f21a39be3d2a96: 30480411 185213 0 0 0 0 0 0 492539732 2513594 0 0 0 0 0 0
15a268026b10dca: 109745479 197920 0 2 0 0 0 0 361872488 2551184 0 0 0 0 0 0
br-int: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
05ec1f85d07d191: 1216 16 0 2 0 0 0 0 3130371 26736 0 0 0 0 0 0
ovs-system: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7969ef50ef4ea41: 70641502 296715 0 2 0 0 0 0 296875003 1870716 0 0 0 0 0 0
bfb62ef5f4b9b9d: 35273960253 3655281 0 4 0 0 0 0 2603761783 6036449 0 0 0 0 0 0
2c90a2828e22afe: 162249959 707440 0 2 0 0 0 0 412779541 3132318 0 0 0 0 0 0
608859240f881fe: 88663360 855339 0 2 0 0 0 0 359538039 3368729 0 0 0 0 0 0
br-local: 748259239 2326802 0 0 0 0 0 0 217376604 1672293 0 0 0 0 0 0
============================================================================
[root@worker002 core]# cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
ens2f0: 973209719 5952116 0 0 0 0 0 5951437 292296262 2008515 0 0 0 0 0 0
br-int: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
22e8c8998884bdc: 161124692 722182 0 2 0 0 0 0 314628934 2521846 0 0 0 0 0 0
genev_sys_6081: 82958882 625107 0 0 0 0 0 0 160187247 582702 1209563 0 0 0 0 0
ens2f1: 13606348217 65789313 0 0 0 0 0 60808605 6062554122 10972610 0 0 0 0 0 0
acddd4d9aebad40: 1711244 9697 0 2 0 0 0 0 11284661 37140 0 0 0 0 0 0
ovn-k8s-gw0: 184606517 1740705 0 0 0 0 0 0 930862619 1555233 0 0 0 0 0 0
br-local: 217050204 1105312 0 0 0 0 0 0 137365278 1209699 0 0 0 0 0 0
lo: 1152036105 4992300 0 0 0 0 0 0 1152036105 4992300 0 0 0 0 0 0
ovn-k8s-mp0: 333584092 510497 0 0 0 0 0 0 130119567 1355912 0 0 0 0 0 0
ovs-system: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
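Dumps like the ones above are easiest to scan with a small script. A minimal sketch, assuming the standard /proc/net/dev layout (two header lines, then one line per interface with 8 RX counters followed by 8 TX counters; the function and variable names here are illustrative, not from any tool mentioned in this bug):

```python
def tx_errors(proc_net_dev_text):
    """Return {interface: tx_errs} for interfaces with non-zero TX errors.

    Expects the raw contents of /proc/net/dev: two header lines, then
    '<name>: <8 RX counters> <8 TX counters>' per interface. TX errs is
    the third TX counter, i.e. field index 10 after the interface name.
    """
    errors = {}
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 16:  # skip anything that isn't a full stats row
            continue
        tx_errs = int(fields[10])
        if tx_errs > 0:
            errors[name.strip()] = tx_errs
    return errors

# Usage on a node:
#   with open("/proc/net/dev") as f:
#       print(tx_errors(f.read()))
```

Run against the master-0 dump above, this flags only genev_sys_6081.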
Hi, I'm from the OVN team. It's pretty difficult to get much information just based on what's been presented here. OVN sets up the geneve tunnels that OCP uses, so it would be good to see logs from ovn-controller on the node(s) where you see the transmission errors. That way we can see if OVN encountered any errors while setting up the tunnels. Similarly, the logs from ovs-vswitchd on the node(s) with transmission errors may also give more information on what's going wrong here. In addition, the contents of the OVN southbound database may also be useful; that way we can see how the interfaces and chassis have been configured.

One interesting data point here is that the ens2f0 and ens2f1 interfaces have lots of transmission errors, too. It's not just the geneve interfaces. So I'm curious if there's something more widespread going wrong here. I have a feeling someone from the networking-services kernel team will need to look into this if there's nothing obvious from the OVN logs that indicates errors. I'd say be prepared to dump more information for them as well.
Just as an additional note, the geneve tunnel is using ens2f0.
As another data point, what is the kernel version?
Which kernel version? And which NIC?
Also:
ethtool -k ens2f0
ethtool -k ens2f1
Seeing this on a newly deployed OCP cluster, with no workloads/pods running. Going to get must-gather data shortly. The geneve tunnel is using ens2f0; ens2f1 is not being used.

2: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:fd:fe:ee:49:08 brd ff:ff:ff:ff:ff:ff
    inet 192.168.222.10/24 brd 192.168.222.255 scope global dynamic noprefixroute ens2f0
       valid_lft 2727sec preferred_lft 2727sec
    inet6 fe80::6b06:748a:ef56:1ae0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 3c:fd:fe:ee:49:09 brd ff:ff:ff:ff:ff:ff

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-08-222601   True        False         178m    Cluster version is 4.5.0-0.nightly-2020-05-08-222601

Kernel: 4.18.0-147.8.1.el8_1.x86_64

sh-4.2# ethtool ens2f0
Settings for ens2f0:
    Supported ports: [ FIBRE ]
    Supported link modes: 25000baseSR/Full 10000baseSR/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Supported FEC modes: None BaseR RS
    Advertised link modes: 25000baseSR/Full 10000baseSR/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Advertised FEC modes: None BaseR RS
    Speed: 25000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: off
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000007 (7)
                           drv probe link
    Link detected: yes
===========================================
sh-4.2# ethtool -i ens2f0
driver: i40e
version: 2.8.20-k
firmware-version: 6.01 0x80003554 1.1747.0
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
==============================================
sh-4.2# ethtool -k ens2f0
Features for ens2f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
As an additional datapoint, running any workloads (creating projects or pods) on the cluster is failing with either TLS handshake and EOF errors or i/o timeout errors. Here are error examples from the client:

1. Unexpected error:
   <*url.Error | 0xc001bcb560>: {
       Op: "Post",
       URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
       Err: {s: "EOF"},
   }
   Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF

2. Get https://api.test714.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout

In Prometheus we continuously see node network transmit errors: https://snapshot.raintank.io/dashboard/snapshot/vmqPeuQ3AL8TDkorrC5wqDNe60Ap8tlp

I made sure we don't have any unexpected IPs/hosts in the bare metal environment by running an nmap. Happy to give access to the environment and help debug further.
Can these be turned off and tried again?

tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
hw-tc-offload: on

I think the commands are (adjust ens2f0 accordingly):

ethtool -K ens2f0 tx off
ethtool -K ens2f0 sg off
ethtool -K ens2f0 tso off
ethtool -K ens2f0 gso off
ethtool -K ens2f0 tx-gre-segmentation off
ethtool -K ens2f0 tx-gre-csum-segmentation off
ethtool -K ens2f0 tx-udp_tnl-segmentation off
ethtool -K ens2f0 tx-udp_tnl-csum-segmentation off
ethtool -K ens2f0 tx-gso-partial off
ethtool -K ens2f0 hw-tc-offload off

But check to see that the values actually change, to be sure.
Actually, I already attached must-gather data earlier, never mind. Please use the first link.
*** Bug 1835376 has been marked as a duplicate of this bug. ***
Still seeing errors on the geneve interface. Do note that there are no errors seen on ens2f0 before and after the changes.

sh-4.2# ethtool -k genev_sys_6081
Features for genev_sys_6081:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: off
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

For ens2f0:

sh-4.2# ethtool -k ens2f0
Features for ens2f0:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ipv4: off
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: off
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: off
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off
tx-gre-csum-segmentation: off [requested on]
tx-ipxip4-segmentation: off
tx-ipxip6-segmentation: off
tx-udp_tnl-segmentation: off
tx-udp_tnl-csum-segmentation: off
tx-gso-partial: off
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tls-hw-rx-offload: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off
tls-hw-tx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
*** Bug 1809281 has been marked as a duplicate of this bug. ***
ens2f0 does not have any TX errors in the excerpt unless we are looking at the wrong column.
Random question (and not a cause/fix for anything); should the bare metal deployments be using jumbo frames? I noticed the NICs were 1500 MTU.
Irrespective of jumbo frames or not, shouldn't the MTU be set accordingly for geneve?
Even if the MTU isn't set accordingly, PMTU discovery should fix all this after a few drops. This assumes that:
a) TCP PMTU discovery is enabled
b) ICMP messages are being generated at the geneve interface (doubtful, see below)
c) iptables lets the ICMP "would frag" packet back to the sending pod

Looking at a working AWS cluster, the veth interface of a pod uses an MTU of 8901.

sh-4.4# ip -d link show b7da53d9b260336
13: b7da53d9b260336@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8901 qdisc noqueue master ovs-system state UP mode DEFAULT group default
    link/ether be:d4:0a:60:04:80 brd ff:ff:ff:ff:ff:ff link-netns 7f80269a-18ea-4fb7-b7d4-61be655894f1 promiscuity 1
    veth openvswitch_slave addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
sh-4.4#

It might be worth turning on jumbo frames to see what's different. This could be a problem, but not "the" problem.

Re "b" above: normally, with PMTU discovery the kernel automatically sees that a TCP segment on egress exceeds the interface MTU and triggers an ICMP "would frag". For geneve, though, the TCP segment is put inside a UDP datagram after the geneve headers. Technically the MTU of the egress interface isn't even known until later, after a route lookup is done to figure out the next hop. And if that UDP datagram exceeds the MTU of the egress interface, its source is the local UDP stack, not the sending pod, so the pod never gets the ICMP error.
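The MTU arithmetic behind this is worth making explicit. A back-of-the-envelope sketch: the geneve encapsulation adds an outer IPv4 header, an outer UDP header, the geneve base header, plus the encapsulated inner Ethernet header. The constants below are assumptions for an IPv4 underlay with no geneve options; IPv6 underlays and option TLVs increase the overhead (which is presumably why OVN reserves more headroom, e.g. the 8901 pod MTU on a 9001-byte AWS underlay seen above):

```python
# Geneve encapsulation overhead for an IPv4 underlay (bytes).
OUTER_IPV4 = 20   # outer IPv4 header
OUTER_UDP = 8     # outer UDP header (dst port 6081)
GENEVE_BASE = 8   # geneve header with no options
INNER_ETH = 14    # encapsulated inner Ethernet header

def max_inner_mtu(underlay_mtu, geneve_opt_len=0):
    """Largest inner-interface MTU that avoids fragmenting the outer
    UDP/geneve packet on the underlay (IPv4, no VLAN tags assumed)."""
    overhead = OUTER_IPV4 + OUTER_UDP + GENEVE_BASE + geneve_opt_len + INNER_ETH
    return underlay_mtu - overhead

print(max_inner_mtu(1500))  # 1450 with a 1500-byte underlay and no options
```

So with the 1500-byte MTU on ens2f0 here, pod/veth interfaces should be at most 1450 by this accounting, and lower still once geneve options or IPv6 are in play.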
Coffee kicked in... A quick update to #c37. What's described above is accurate, but it only takes place at the L3-to-L2 boundary. Here we are already inside L2. The best that can be done is to IP-fragment every UDP/geneve datagram that exceeds the egress MTU, which will kill performance. To avoid this, as suggested in #c36, make the MTU of each interface connected to the bridge <= the smallest MTU of any (current or future) member of the bridge.
We'll need access to the cluster to continue digging into this. From the notes saved off last time, we believe the bare metal deployment uses 1500 MTU everywhere, so things are consistent; we don't see a mix of jumbo and 1500 MTU. Focus is back on why the pod/veth is sending packets larger than the MTU (with bad checksums).
From an email from Michael:

> sh-4.4# ip -s link show genev_sys_6081
> 5: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether fe:b8:50:24:dd:9a brd ff:ff:ff:ff:ff:ff
>     RX: bytes  packets  errors  dropped overrun mcast
>     568426     3239     0       0       0       0
>     TX: bytes  packets  errors  dropped carrier collsns
>     1084980    3040     14736   0       0       0
>
> I think for geneve this counter is incremented here only:
> https://github.com/torvalds/linux/blob/master/drivers/net/geneve.c#L981

It's easy to confirm using dynamic debug:

ip -s -s a s dev genev_sys_6081 ; echo -n 'file drivers/net/geneve.c +p' > /sys/kernel/debug/dynamic_debug/control ; sleep 5 ; echo -n 'file drivers/net/geneve.c -p' > /sys/kernel/debug/dynamic_debug/control ; ip -s -s a s dev genev_sys_6081

Indeed, this produces dmesg messages:

[61503.656184] genev_sys_6081: no tunnel metadata
[61503.984187] genev_sys_6081: no tunnel metadata
...

The number of messages matches the increment in the tx_errors stats. This confirms the drops happen due to "no tunnel metadata".
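The before/after counter comparison in that one-liner can also be scripted against sysfs, which exposes the same per-device statistics as `ip -s link`. A small sketch (the `root` parameter exists only to make the helper testable; on a node you'd use the default /sys/class/net path):

```python
import time

def read_tx_errors(dev, root="/sys/class/net"):
    """Read the kernel's tx_errors counter for a network device."""
    with open(f"{root}/{dev}/statistics/tx_errors") as f:
        return int(f.read())

def tx_error_delta(dev, interval=5.0, root="/sys/class/net"):
    """Return how many TX errors `dev` accumulated over `interval` seconds,
    mirroring the before/after 'ip -s -s a s dev genev_sys_6081' check."""
    before = read_tx_errors(dev, root)
    time.sleep(interval)
    return read_tx_errors(dev, root) - before

# Usage on a node:
#   print(tx_error_delta("genev_sys_6081"))
```

A steadily growing delta here, correlated with the "no tunnel metadata" dmesg lines, is the same evidence the dynamic-debug session gathered.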
Finally was able to install perf and get some meaningful info out of the box. (For the sake of anyone else debugging this, the key command to run after sshing to a node is 'toolbox'.)

The tx_errors messages are mostly caused by the 'coredns' and 'mdns-publisher' processes. They send UDP packets directly to the genev_sys_6081 interface (likely, they send to all interfaces). Understandably, those packets are dropped, as they don't (and can't) contain the lwt tunnel metadata. This is a misconfiguration of those two applications.

I'm also seeing some dropped packets sent by mld_ifc_timer_expire in the kernel. I'll look more into those.
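For anyone wondering what the application-side fix looks like: if a sender is simply multicasting on every interface, one mitigation is to pin outgoing multicast to a specific interface via IP_MULTICAST_IF instead of letting the stack iterate over all of them, so tunnel devices like genev_sys_6081 never see the traffic. This is a generic sketch, not the actual coredns/mdns-publisher patch; the address and the mDNS group/port are illustrative:

```python
import socket

def multicast_sender(local_ip):
    """UDP socket whose outgoing multicast is pinned to the interface
    that owns local_ip (e.g. the node's ens2f0 address), rather than
    being sent out every interface including geneve tunnels."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(local_ip))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    return sock

# Hypothetical usage for an mDNS announcement:
#   sock = multicast_sender("192.168.222.10")   # node's ens2f0 address
#   sock.sendto(payload, ("224.0.0.251", 5353))
```

Another option is to enumerate interfaces and skip tunnel/bridge devices by name; either way the point is that the sender, not the kernel, decides which interfaces carry the mDNS traffic.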
(In reply to Jiri Benc from comment #44) > Finally was able to install perf and get some meaningful info out of the > box. (For the sake of anyone else debugging this, the key command to run > after sshing to a node is 'toolbox'.) > > The tx_error messages are mostly caused by 'coredns' and 'mdns-publisher' > processes. They send the UDP packets directly to the genev_sys_6081 > interface (likely, they send to all interfaces). Understandingly, those > packets are dropped as they don't (and can't) contain the lwt metadata. This > is misconfiguration of those two applications. > > I'm seeing also some dropped packets sent by mld_ifc_timer_expire in the > kernel. I'll look more into those. Thanks Jiri Who needs to do what to stop sending the UDP packets to the genev_sys_6081 and to other interfaces. Even if they might not be getting in the way of the scale testing, they will cause alarms for our customers. So we should try to find a cure for these large number of dropped packets. e.g. stop sending them. Are coredns and mdns-publisher in OVS or OVN or somewhere else?
(In reply to Rashid Khan from comment #45)
> (In reply to Jiri Benc from comment #44)
> > The tx_error messages are mostly caused by 'coredns' and 'mdns-publisher'
> > processes. They send the UDP packets directly to the genev_sys_6081
> > interface (likely, they send to all interfaces). Understandably, those
> > packets are dropped as they don't (and can't) contain the lwt metadata.
> > This is a misconfiguration of those two applications.
>
> Thanks Jiri
> Who needs to do what to stop sending the UDP packets to the genev_sys_6081
> and to other interfaces.
> Even if they might not be getting in the way of the scale testing, they will
> cause alarms for our customers.
> So we should try to find a cure for these large number of dropped packets.
> e.g. stop sending them.
> Are coredns and mdns-publisher in OVS or OVN or somewhere else?

I cloned this bug to https://bugzilla.redhat.com/show_bug.cgi?id=1841214 for the Network Edge team to investigate getting CoreDNS/mdns-publisher to stop whatever they are doing. It's likely we can close this bug soon, but I'd like to make sure there aren't other issues to look at (MTU, mostly).
CoreDNS-mDNS and mdns-publisher are handled by my team. We're on it!
(In reply to Antoni Segura Puimedon from comment #47)
> CoreDNS-mDNS and mdns-publisher are handled by my team. We're on it!

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1841214 as a clone for the network edge team (because DNS). Should that one get closed, and this one moved to the DNS component?
Closing this one as a duplicate of bug 1841214 since that bug now has a patch. *** This bug has been marked as a duplicate of bug 1841214 ***