Create a way of reporting the number of packets that have been sent to ovn-controller via a controller() action. This can also report the actions taken by OVN, whether there were errors, and which datapaths punted the packets to ovn-controller. If the command can be centralized, that would be fantastic. However, if that is not feasible, we can instead have the command be at the ovn-controller layer.
Update based on OVN team meeting 25 October, 2021:

Initially, I suggested using the OVS coverage API to add statistics for each type of incoming packet-in that pinctrl handles. This would allow "coverage/show" to illustrate the hot points in pinctrl, so we could know what might be causing OVN to perform poorly. During the meeting today, though, the team came up with an alternate solution, since having 20-something #defines for the different coverage counters would be unpalatable.

Instead, what we can do is add coverage counters for the following sections:
1) process_packet_in(): Seeing this increase rapidly would tell us that ovn-controller is having to handle a great many packets.
2) notify_pinctrl_main(): Seeing this increase rapidly would tell us that pinctrl is waking up the main thread consistently.

Then, instead of adding coverage counters for each individual type of packet-in, we can add more debug-level logging across these functions. Specifically, the debug logs should state what type of message is being handled (e.g. DHCP, IGMP, DNS) and where the message was received from (source IP/MAC and OpenFlow port).

The idea is that an admin might notice OVN being slow, so they check the coverage counters. If they see one of the two coverage counters increasing rapidly, they can then enable debug logging and see what the culprit is. This way, they could see, for instance, that a certain VM is spamming DNS requests.
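As a rough illustration of the "check the coverage counters" step, here is a minimal Python sketch that compares two snapshots of "coverage/show"-style text and flags counters whose totals grew quickly. The counter names (example_packet_in, example_notify_main), the sample text, the 10-second interval, and the threshold are all made up for illustration; the only assumption about the real output is that each counter row ends in "total: N".

```python
import re

# One counter row: a name, some rate columns, then "total: N" at the end.
# Header lines do not end in "total: N", so they are skipped.
ROW = re.compile(r"^(\S+)\s+.*\btotal:\s*(\d+)\s*$")


def parse_totals(text):
    """Return {counter_name: total} for every counter row in the text."""
    totals = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            totals[match.group(1)] = int(match.group(2))
    return totals


def rapid_counters(before, after, interval_s, threshold_per_s):
    """Return {name: events_per_second} for counters at or above the threshold."""
    hot = {}
    for name, total in after.items():
        rate = (total - before.get(name, 0)) / interval_s
        if rate >= threshold_per_s:
            hot[name] = rate
    return hot


# Illustrative snapshots only; these are NOT real ovn-controller output
# and the counter names are hypothetical placeholders.
SAMPLE_T0 = """\
example_packet_in      0.0/sec   0.000/sec   0.0000/sec   total: 100
example_notify_main    0.0/sec   0.000/sec   0.0000/sec   total: 40
"""
SAMPLE_T1 = """\
example_packet_in      0.0/sec   0.000/sec   0.0000/sec   total: 5100
example_notify_main    0.0/sec   0.000/sec   0.0000/sec   total: 45
"""

hot = rapid_counters(parse_totals(SAMPLE_T0), parse_totals(SAMPLE_T1),
                     interval_s=10, threshold_per_s=100)
print(hot)
```

In practice the two snapshots would come from running "coverage/show" against ovn-controller twice, some seconds apart; a counter that shows up in the result is the cue to turn on debug logging.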
Suggested solution submitted upstream:
https://patchwork.ozlabs.org/project/ovn/patch/20211118105406.508257-1-mheib@redhat.com/

@mmichels, could you please take a look at this change and see whether it answers this BZ's requirements or whether anything more needs to be added?
Verified with a network of the following topology:

   ------router------
   |       |        |
   |       |        |
  ls1     ls2      ls3
   |       |        |
   |       |        |
  vm1     vm2      vm3

Reproduced on

[root@bz_1821965 ~]# rpm -qa |grep -E 'ovn|openvswitch'
openvswitch2.15-2.15.0-53.el8fdp.x86_64
ovn-2021-central-21.09.1-23.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
ovn-2021-host-21.09.1-23.el8fdp.x86_64
ovn-2021-21.09.1-23.el8fdp.x86_64
[root@bz_1821965 ~]# ls /var/run/ovn/
ovn-controller.1273217.ctl ovn-controller.pid ovnnb_db.ctl ovnnb_db.pid ovnnb_db.sock ovn-northd.1273050.ctl ovn-northd.pid ovnsb_db.ctl ovnsb_db.pid ovnsb_db.sock
[root@bz_1821965 ~]# ovs-appctl -t /var/run/ovn/ovn-controller.1273217.ctl vlog/set dbg
[root@bz_1821965 ~]# cat /var/log/ovn/ovn-controller.log | grep NXT_PACKET_IN2 | grep table_id=10 | wc -l
0
[root@bz_1821965 ~]# cat /var/log/ovn/ovn-controller.log | grep "pinctrl received packet-in" | grep opcode=PUT_ARP | grep OF_Table_ID=10 | wc -l
0
[root@bz_1821965 ~]# ovs-ofctl dump-flows br-int table=10 | grep arp | grep controller | grep -v n_packets=0 | wc -l
1

Verified on

[root@bz_1821965 ~]# rpm -qa |grep -E 'ovn|openvswitch'
ovn-2021-host-21.12.0-11.el8fdp.x86_64
openvswitch2.15-2.15.0-53.el8fdp.x86_64
ovn-2021-21.12.0-11.el8fdp.x86_64
ovn-2021-central-21.12.0-11.el8fdp.x86_64
openvswitch-selinux-extra-policy-1.0-28.el8fdp.noarch
[root@bz_1821965 ~]# ls /var/run/ovn/
ovn-controller.741050.ctl ovn-controller.pid ovnnb_db.ctl ovnnb_db.pid ovnnb_db.sock ovn-northd.740806.ctl ovn-northd.pid ovnsb_db.ctl ovnsb_db.pid ovnsb_db.sock
[root@bz_1821965 ~]# ovs-appctl -t /var/run/ovn/ovn-controller.741050.ctl vlog/set dbg
[root@bz_1821965 ~]# cat /var/log/ovn/ovn-controller.log | grep NXT_PACKET_IN2 | grep table_id=10 | wc -l
11
[root@bz_1821965 ~]# cat /var/log/ovn/ovn-controller.log | grep "pinctrl received packet-in" | grep opcode=PUT_ARP | grep OF_Table_ID=10 | wc -l
3
[root@bz_1821965 ~]# ovs-ofctl dump-flows br-int table=10 | grep arp | grep controller | grep -v n_packets=0 | wc -l
1
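The grep pipelines above each count one opcode/table pair at a time. As a hedged sketch, the same debug log lines could be tallied across all opcodes and tables in a single pass with Python. This assumes only what the greps rely on: that a relevant line contains "pinctrl received packet-in", an "opcode=NAME" token, and an "OF_Table_ID=N" token. The sample lines are illustrative stand-ins, not real log output, and the opcode PUT_DHCP_OPTS is used purely as a second example value.

```python
from collections import Counter
import re

OPCODE = re.compile(r"opcode=(\w+)")
TABLE = re.compile(r"OF_Table_ID=(\d+)")


def tally_packet_ins(lines, marker="pinctrl received packet-in"):
    """Count packet-in debug lines per (opcode, OpenFlow table id)."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue  # not a pinctrl packet-in debug line
        opcode = OPCODE.search(line)
        table = TABLE.search(line)
        if opcode and table:
            counts[(opcode.group(1), int(table.group(1)))] += 1
    return counts


# Illustrative lines only; the real log lines carry more fields than this.
SAMPLE_LINES = [
    "pinctrl received packet-in | opcode=PUT_ARP| OF_Table_ID=10",
    "pinctrl received packet-in | opcode=PUT_ARP| OF_Table_ID=10",
    "pinctrl received packet-in | opcode=PUT_DHCP_OPTS| OF_Table_ID=20",
    "some unrelated debug line",
]

counts = tally_packet_ins(SAMPLE_LINES)
for (opcode, table_id), n in counts.most_common():
    print(f"{opcode} table={table_id}: {n}")
```

To run it against a real log, the list could be replaced by an open file handle on /var/log/ovn/ovn-controller.log, since the function only needs an iterable of lines.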
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (ovn bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0674
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days