Bug 2124935

Summary: During scale-ups of a cloud, some DPDK nodes that are not directly part of the scale-up see network failures
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: high
Priority: urgent
Reporter: Francois Palin <fpalin>
Assignee: OSP Team <rhos-maint>
QA Contact: Eran Kuris <ekuris>
CC: aruffin, ccoursey, cfontain, chrisw, egarciar, fboboc, hakhande, scohen
Keywords: Triaged, ZStream
Flags: fpalin: needinfo? (cfontain)
Type: Bug
Last Closed: 2023-07-11 20:01:49 UTC

Description Francois Palin 2022-09-07 13:57:53 UTC
Description of problem:
----------------------
During customer scale-ups of a cloud, some DPDK nodes that are not directly part of the scale-up see network failures; see [1] for an example.

[1]
2022-08-31T05:25:46.618Z|1639536|bond|INFO|interface dpdk1: link state down
2022-08-31T05:25:46.618Z|1639537|bond|INFO|interface dpdk1: will be disabled if it stays down for 100 ms
2022-08-31T05:25:46.618Z|1639538|bond|INFO|interface dpdk0: link state down
2022-08-31T05:25:46.618Z|1639539|bond|INFO|interface dpdk0: will be disabled if it stays down for 100 ms
2022-08-31T05:25:46.718Z|1639540|bond|INFO|interface dpdk1: disabled
2022-08-31T05:25:46.718Z|1639541|bond|INFO|interface dpdk0: disabled
2022-08-31T05:25:46.718Z|1639542|bond|INFO|bond dpdkbond0: all interfaces disabled
2022-08-31T05:25:48.734Z|1639543|bond|INFO|interface dpdk1: link state up
2022-08-31T05:25:48.734Z|1639544|bond|INFO|interface dpdk1: will be enabled if it stays up for 200 ms
2022-08-31T05:25:48.734Z|1639545|bond|INFO|interface dpdk0: link state up
2022-08-31T05:25:48.734Z|1639546|bond|INFO|interface dpdk0: will be enabled if it stays up for 200 ms
2022-08-31T05:25:48.734Z|1639547|bond|INFO|bond dpdkbond0: active interface is now dpdk1, skipping remaining 200 ms updelay (since no interface was enabled)
2022-08-31T05:25:48.734Z|1639548|bond|INFO|interface dpdk1: enabled
2022-08-31T05:25:48.934Z|1639549|bond|INFO|interface dpdk0: enabled
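
For reference, the 100 ms and 200 ms windows in the log above are OVS's per-port bond_downdelay and bond_updelay timers. As context only (not a fix for whatever is stalling the links), a minimal sketch of how these could be inspected and raised on an affected node, assuming the bond port is named dpdkbond0 as in the logs:

# show bond members, their state, and the current updelay/downdelay
ovs-appctl bond/show dpdkbond0

# raise the delays so a sub-second flap does not disable the whole bond
# (values in ms, chosen purely for illustration)
ovs-vsctl set port dpdkbond0 bond_downdelay=1000
ovs-vsctl set port dpdkbond0 bond_updelay=1000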


How reproducible:
----------------
100% in the customer's production environment ("Every time we deploy")

Steps to Reproduce:
------------------
In the customer's production environment, perform a scale-up

Actual results:
--------------
Some DPDK nodes that are not part of the scale-up see network failures

Expected results:
----------------
There should be no network failures following a scale-up

Additional info:
---------------

Comment 1 Francois Palin 2022-09-07 14:00:07 UTC
Questions from Haresh Khandelwal (Red Hat) in a case comment, with answers:

- Is this issue happening on a specific set of nodes or on all DPDK nodes?
    *answer: When the issue was first seen (during a scale-up), it happened on specific nodes only.
- I see it was mentioned that dpdkbond0 flapped when you tried to generate an sosreport?
    *answer: We discovered yesterday that the issue is not specific to scale-ups; generating an sosreport (related to high load?) caused the same failure.
- Do the DPDK interfaces come back up on their own after going down, bringing dpdkbond0 back up?
    *answer: Yes, they come back on their own after a couple of seconds (see ref [1] in this BZ Description).
- 'iommu=pt' does not look like it is causing the issue at this stage to me.
- I see the PMD cores are isolated, so they should not be handling syscalls from sosreport generation.
- Was there any link down observed at the uplink switch?
    *answer: There was no link-down event at deployment time, but there was an event when we grabbed the sosreport on Sep 1st.
- balance-tcp was a Tech Preview feature in OSP13. It causes recirculation of packets, degrading performance.
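
To check the points raised above (bond mode and PMD core isolation) on an affected node, something like the following could be used; a sketch assuming standard OVS tooling and a bond port named dpdkbond0:

# confirm the bond hash/balancing mode (balance-tcp was Tech Preview on OSP13)
ovs-vsctl get port dpdkbond0 bond_mode

# show which cores the DPDK PMD threads are pinned to
ovs-vsctl get Open_vSwitch . other_config:pmd-cpu-mask
ovs-appctl dpif-netdev/pmd-rxq-show

# verify those cores are actually isolated from the kernel scheduler
cat /sys/devices/system/cpu/isolated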

Comment 2 Francois Palin 2022-09-07 14:01:05 UTC
There appears to be a temporal correlation between the execution of an rsync operation [1], the loss of dpdkbond0 (see ref [1] in this BZ Description above),
and the ovs-vswitchd.log messages "Unreasonably long XXXXms poll interval" [2].

[1] 
Aug 31 05:26:46 my-customer-host-name dockerd-current[11351]: + rsync -a -R --delay-updates --delete-after --exclude=/etc/puppetlabs/ --exclude=/opt/puppetlabs/ --exclude=/etc/pki/ca-trust/extracted --exclude=/etc/pki/ca-trust/source/anchors --exclude=/etc/pki/tls/certs/ca-bundle.crt --exclude=/etc/pki/tls/certs/ca-bundle.trust.crt --exclude=/etc/pki/tls/cert.pem --exclude=/etc/hosts --exclude=/etc/localtime --exclude=/etc/hostname /etc /root /opt /var/www /var/spool/cron /var/lib/config-data/neutron
Aug 31 05:27:16 my-customer-host-name dockerd-current[11351]: ++ stat -c %y /var/lib/config-data/iscsid.origin_of_time

[2]
ovs-vswitchd.log:2022-08-31T05:25:46.611Z|1639463|timeval|WARN|Unreasonably long 6006ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:25:53.837Z|01979|timeval|WARN|Unreasonably long 2026657399ms poll interval (7440ms user, 148678ms system)
ovs-vswitchd.log:2022-08-31T05:26:12.439Z|1639555|timeval|WARN|Unreasonably long 1181ms poll interval (0ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:17.407Z|1639558|timeval|WARN|Unreasonably long 1215ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:39.830Z|1639570|timeval|WARN|Unreasonably long 1089ms poll interval (2ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:52.265Z|02021|timeval|WARN|Unreasonably long 58428ms poll interval (0ms user, 5ms system)
ovs-vswitchd.log:2022-08-31T05:32:30.725Z|02142|timeval|WARN|Unreasonably long 337802ms poll interval (26ms user, 0ms system)

ovs-vswitchd.log:2022-08-31T16:01:26.011Z|1647624|timeval|WARN|Unreasonably long 1241ms poll interval (1ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:01:53.389Z|1647701|timeval|WARN|Unreasonably long 1009ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:02:17.189Z|1647709|timeval|WARN|Unreasonably long 1856ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T16:02:59.896Z|1647720|timeval|WARN|Unreasonably long 1334ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:03:25.215Z|1647724|timeval|WARN|Unreasonably long 3445ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:04:08.779Z|1647733|timeval|WARN|Unreasonably long 1320ms poll interval (1ms user, 2ms system)

ovs-vswitchd.log:2022-08-31T17:13:01.105Z|1648508|timeval|WARN|Unreasonably long 1090ms poll interval (3ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T17:13:45.898Z|1648518|timeval|WARN|Unreasonably long 2951ms poll interval (1ms user, 0ms system)
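
One way to check this correlation is to merge the bond events and the long-poll warnings into a single time-ordered stream and set it against the rsync timestamps from the host logs. A rough sketch, assuming the default RHEL/OSP log paths and the time window from the excerpts above:

# bond state changes and long-poll warnings from ovs-vswitchd, in time order
grep -E 'dpdkbond0|Unreasonably long' /var/log/openvswitch/ovs-vswitchd.log | sort

# the rsync run from the container host syslog, for the same window
grep rsync /var/log/messages | grep 'Aug 31 05:2'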

Comment 3 Francois Palin 2022-09-07 14:37:50 UTC
The log messages mentioned in comment #2 (the ovs-vswitchd.log "Unreasonably long XXXXms poll interval" warnings)
are most likely a consequence of losing dpdkbond0 rather than the cause,
going by [1] and [2] below.

[1] https://access.redhat.com/solutions/4898121

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1809363
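
If the long poll intervals were instead the cause, the ovs-vswitchd main thread (which handles bond link-state processing) would have to be stalled before the bond goes down. A sketch of how that could be observed live with standard tools; nothing here is specific to this environment:

# per-thread CPU accounting for ovs-vswitchd; watch whether the main
# (non-PMD) thread stops getting scheduled during an rsync/sosreport run
pidstat -t -p "$(pidof ovs-vswitchd)" 1

# per-PMD packet-processing statistics, to confirm the datapath kept running
ovs-appctl dpif-netdev/pmd-stats-show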

Comment 9 Elvira 2022-09-12 14:02:49 UTC
Moving to the NFV team, as this does not seem to be directly related to Neutron. Please reach back if there's any problem, thanks!

Comment 15 Lon Hohberger 2023-07-11 20:01:49 UTC
OSP 13 was retired on June 27, 2023. As this was resolved by a knowledge base article, no further action is planned on this issue at this time.