Description of problem:
----------------------
During scale ups of a cloud by the customer, some DPDK nodes that are not directly part of the scale up see network failures, for example [1].

[1]
2022-08-31T05:25:46.618Z|1639536|bond|INFO|interface dpdk1: link state down
2022-08-31T05:25:46.618Z|1639537|bond|INFO|interface dpdk1: will be disabled if it stays down for 100 ms
2022-08-31T05:25:46.618Z|1639538|bond|INFO|interface dpdk0: link state down
2022-08-31T05:25:46.618Z|1639539|bond|INFO|interface dpdk0: will be disabled if it stays down for 100 ms
2022-08-31T05:25:46.718Z|1639540|bond|INFO|interface dpdk1: disabled
2022-08-31T05:25:46.718Z|1639541|bond|INFO|interface dpdk0: disabled
2022-08-31T05:25:46.718Z|1639542|bond|INFO|bond dpdkbond0: all interfaces disabled
2022-08-31T05:25:48.734Z|1639543|bond|INFO|interface dpdk1: link state up
2022-08-31T05:25:48.734Z|1639544|bond|INFO|interface dpdk1: will be enabled if it stays up for 200 ms
2022-08-31T05:25:48.734Z|1639545|bond|INFO|interface dpdk0: link state up
2022-08-31T05:25:48.734Z|1639546|bond|INFO|interface dpdk0: will be enabled if it stays up for 200 ms
2022-08-31T05:25:48.734Z|1639547|bond|INFO|bond dpdkbond0: active interface is now dpdk1, skipping remaining 200 ms updelay (since no interface was enabled)
2022-08-31T05:25:48.734Z|1639548|bond|INFO|interface dpdk1: enabled
2022-08-31T05:25:48.934Z|1639549|bond|INFO|interface dpdk0: enabled

How reproducible:
----------------
100% in the customer Prod environment ("Everytime we deploy")

Steps to Reproduce:
------------------
In the customer Prod environment, proceed with a scale up

Actual results:
--------------
Some DPDK nodes that are not part of the scale up see network failures

Expected results:
----------------
There should be no network failures following the scale up

Additional info:
---------------
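The 100 ms / 200 ms windows in [1] are the bond's downdelay/updelay timers in Open vSwitch. A minimal sketch of how they can be checked on an affected node, assuming the port name dpdkbond0 taken from the log (the values configured in the actual deployment may differ):

  # Current bond state, member status and configured delays
  ovs-appctl bond/show dpdkbond0

  # Delays as stored on the Port record in the OVSDB
  ovs-vsctl get Port dpdkbond0 bond_updelay bond_downdelay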
Questions from Haresh Khandelwal (Red Hat) in case comment (with answers):

- Is this issue happening on a specific set of nodes or on all DPDK nodes?
  Answer: When the issue was first seen (scaling up), it happened on specific nodes only.

- I see it was mentioned that dpdkbond0 flapped when you tried to generate an sos report?
  Answer: We discovered yesterday that the issue is not specific to scale up; generating an sosreport (related to high load?) caused the same.

- Do the dpdk interfaces come up on their own after going down and bring dpdkbond0 back up?
  Answer: Yes, they come back on their own after a couple of seconds (see ref [1] in this BZ Description).

- 'iommu=pt' does not look like it is causing the issue at this stage to me.

- I see the PMD cores are isolated, so they should not be getting syscalls from sos-report generation (see the sketch after this list).

- Was there any link down observed at the uplink switch?
  Answer: There was no link down event at deployment time, but there was an event when we grabbed the sosreport on Sep 1st.

- balance-tcp was a TechPreview feature in OSP13. It causes re-circulation of the packets, degrading performance.
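To cross-check the "PMD cores are isolated" point, one way to look at the isolation and pinning on an affected node could be the following sketch (it assumes a typical OSP 13 OVS-DPDK layout; the other_config keys are the standard OVS ones and may be unset in this environment):

  # Kernel-level CPU isolation applied at boot (isolcpus/tuned)
  cat /proc/cmdline

  # CPUs assigned to the OVS PMD threads and to the non-PMD DPDK lcores
  ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask
  ovs-vsctl get Open_vSwitch . other_config:dpdk-lcore-mask

  # Per-PMD statistics; a sudden change in busy/idle cycles around the event
  # would point at PMD starvation rather than a physical link problem
  ovs-appctl dpif-netdev/pmd-stats-show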
There seems to be a correlation in time between an rsync operation being executed [1], the loss of dpdkbond0 (see ref [1] in this BZ Description above), and the ovs-vswitchd.log messages "Unreasonably long XXXXms poll interval" [2].

[1]
Aug 31 05:26:46 my-customer-host-name dockerd-current[11351]: + rsync -a -R --delay-updates --delete-after --exclude=/etc/puppetlabs/ --exclude=/opt/puppetlabs/ --exclude=/etc/pki/ca-trust/extracted --exclude=/etc/pki/ca-trust/source/anchors --exclude=/etc/pki/tls/certs/ca-bundle.crt --exclude=/etc/pki/tls/certs/ca-bundle.trust.crt --exclude=/etc/pki/tls/cert.pem --exclude=/etc/hosts --exclude=/etc/localtime --exclude=/etc/hostname /etc /root /opt /var/www /var/spool/cron /var/lib/config-data/neutron
Aug 31 05:27:16 my-customer-host-name dockerd-current[11351]: ++ stat -c %y /var/lib/config-data/iscsid.origin_of_time

[2]
ovs-vswitchd.log:2022-08-31T05:25:46.611Z|1639463|timeval|WARN|Unreasonably long 6006ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:25:53.837Z|01979|timeval|WARN|Unreasonably long 2026657399ms poll interval (7440ms user, 148678ms system)
ovs-vswitchd.log:2022-08-31T05:26:12.439Z|1639555|timeval|WARN|Unreasonably long 1181ms poll interval (0ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:17.407Z|1639558|timeval|WARN|Unreasonably long 1215ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:39.830Z|1639570|timeval|WARN|Unreasonably long 1089ms poll interval (2ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:52.265Z|02021|timeval|WARN|Unreasonably long 58428ms poll interval (0ms user, 5ms system)
ovs-vswitchd.log:2022-08-31T05:32:30.725Z|02142|timeval|WARN|Unreasonably long 337802ms poll interval (26ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:01:26.011Z|1647624|timeval|WARN|Unreasonably long 1241ms poll interval (1ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:01:53.389Z|1647701|timeval|WARN|Unreasonably long 1009ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:02:17.189Z|1647709|timeval|WARN|Unreasonably long 1856ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T16:02:59.896Z|1647720|timeval|WARN|Unreasonably long 1334ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:03:25.215Z|1647724|timeval|WARN|Unreasonably long 3445ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:04:08.779Z|1647733|timeval|WARN|Unreasonably long 1320ms poll interval (1ms user, 2ms system)
ovs-vswitchd.log:2022-08-31T17:13:01.105Z|1648508|timeval|WARN|Unreasonably long 1090ms poll interval (3ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T17:13:45.898Z|1648518|timeval|WARN|Unreasonably long 2951ms poll interval (1ms user, 0ms system)
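To make that correlation easier to inspect, a minimal sketch for pulling both event streams for the same time window (the log path and the docker unit name are assumptions based on a typical OSP 13 compute node; adjust to the actual environment):

  # ovs-vswitchd poll-interval warnings in the minutes around the flap
  grep 'Unreasonably long' /var/log/openvswitch/ovs-vswitchd.log \
      | grep '2022-08-31T05:2'

  # rsync activity from the config-data sync in the same window
  journalctl -u docker --since '2022-08-31 05:24:00' --until '2022-08-31 05:28:00' \
      | grep 'rsync -a -R'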
Log messages mentioned in comment #2 (ovs-vswitchd.log messages "Unreasonably long XXXXms poll interval") are most likely a consequence of losing dpdkbond0 rather than the cause, if we go by [1] and [2] below.

[1] https://access.redhat.com/solutions/4898121
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1809363
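One way to check that ordering on the affected node is to interleave the bond link events with the poll-interval warnings and see which appears first around each incident (a sketch; the log path is the stock OVS location and may differ in this deployment):

  # Both event types come from the same log and start with an ISO timestamp,
  # so a plain sort gives the chronological interleaving
  grep -E 'dpdk[01]: link state|Unreasonably long' \
      /var/log/openvswitch/ovs-vswitchd.log | sort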
Moving to NFV team as this does not seem to be directly related to Neutron. Please reach back if there's any problem, thanks!
OSP 13 was retired on June 27, 2023. As this was resolved by a knowledge base article, no further action is planned on this issue at this time.