Bug 2124935 - During scale ups of a cloud, some DPDK nodes that are not directly part of the scale up see network failures [NEEDINFO]
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: OSP Team
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-07 13:57 UTC by Francois Palin
Modified: 2023-07-11 20:01 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-11 20:01:49 UTC
Target Upstream Version:
Embargoed:
fpalin: needinfo? (cfontain)




Links
Red Hat Issue Tracker NFV-2635 (last updated 2022-09-12 14:16:11 UTC)
Red Hat Issue Tracker OSP-18597 (last updated 2022-09-07 14:06:35 UTC)
Red Hat Knowledge Base (Solution) 4846991 (last updated 2022-10-11 14:07:47 UTC)
Red Hat Knowledge Base (Solution) 6976507 (last updated 2022-10-11 14:07:47 UTC)

Description Francois Palin 2022-09-07 13:57:53 UTC
Description of problem:
----------------------
During scale-ups of the cloud by the customer, some DPDK nodes that are not directly part of the scale-up see network failures, for example [1].

[1]
2022-08-31T05:25:46.618Z|1639536|bond|INFO|interface dpdk1: link state down
2022-08-31T05:25:46.618Z|1639537|bond|INFO|interface dpdk1: will be disabled if it stays down for 100 ms
2022-08-31T05:25:46.618Z|1639538|bond|INFO|interface dpdk0: link state down
2022-08-31T05:25:46.618Z|1639539|bond|INFO|interface dpdk0: will be disabled if it stays down for 100 ms
2022-08-31T05:25:46.718Z|1639540|bond|INFO|interface dpdk1: disabled
2022-08-31T05:25:46.718Z|1639541|bond|INFO|interface dpdk0: disabled
2022-08-31T05:25:46.718Z|1639542|bond|INFO|bond dpdkbond0: all interfaces disabled
2022-08-31T05:25:48.734Z|1639543|bond|INFO|interface dpdk1: link state up
2022-08-31T05:25:48.734Z|1639544|bond|INFO|interface dpdk1: will be enabled if it stays up for 200 ms
2022-08-31T05:25:48.734Z|1639545|bond|INFO|interface dpdk0: link state up
2022-08-31T05:25:48.734Z|1639546|bond|INFO|interface dpdk0: will be enabled if it stays up for 200 ms
2022-08-31T05:25:48.734Z|1639547|bond|INFO|bond dpdkbond0: active interface is now dpdk1, skipping remaining 200 ms updelay (since no interface was enabled)
2022-08-31T05:25:48.734Z|1639548|bond|INFO|interface dpdk1: enabled
2022-08-31T05:25:48.934Z|1639549|bond|INFO|interface dpdk0: enabled
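
For reference, the 100 ms / 200 ms figures above are the bond's down/up delay thresholds, and the excerpt shows dpdkbond0 staying fully down for roughly two seconds. A minimal sketch (my addition, assuming Python 3 and the ovs-vswitchd.log line format shown above, with the bond port named dpdkbond0) to measure such outage windows across a whole log:

#!/usr/bin/env python3
"""Measure dpdkbond0 outage windows from ovs-vswitchd.log bond messages.

Assumptions (not from the report): the raw log uses the ISO-8601 timestamps
shown above and the bond port is named dpdkbond0.
"""
import re
import sys
from datetime import datetime

LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}T[\d:.]+Z)\|\d+\|bond\|INFO\|(.*)$")

def parse_ts(ts):
    # e.g. 2022-08-31T05:25:46.718Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def outages(lines):
    """Yield (start, duration) for each 'all interfaces disabled' window."""
    down_at = None
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        ts, msg = m.groups()
        if "dpdkbond0: all interfaces disabled" in msg and down_at is None:
            down_at = parse_ts(ts)
        elif msg.endswith(": enabled") and down_at is not None:
            # First member re-enabled: the bond can carry traffic again.
            yield down_at, parse_ts(ts) - down_at
            down_at = None

if __name__ == "__main__":
    with open(sys.argv[1] if len(sys.argv) > 1 else "ovs-vswitchd.log") as f:
        for start, duration in outages(f):
            print(f"bond down at {start} for ~{duration.total_seconds():.1f}s")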


How reproducible:
----------------
100% in the customer's production environment ("Every time we deploy")

Steps to Reproduce:
------------------
In the customer's production environment, perform a scale-up.

Actual results:
--------------
Some DPDK nodes that are not part of the scale-up see network failures.

Expected results:
----------------
There should be no network failures following a scale-up.

Additional info:
---------------

Comment 1 Francois Palin 2022-09-07 14:00:07 UTC
Questions from Haresh Khandelwal (Red Hat) in the support case comments (with answers):

- Is this issue happening on a specific set of nodes or on all DPDK nodes?
    *answer: When the issue was first seen (while scaling up), it happened on specific nodes only.
- I see it was mentioned that dpdkbond0 flapped when you tried to generate a sos report?
    *answer: We discovered yesterday that the issue is not specific to scale-ups; generating a sosreport (related to high load?) caused the same behavior.
- Do the DPDK interfaces come up on their own after going down and bring dpdkbond0 back up?
    *answer: Yes, they come back on their own after a couple of seconds (see ref [1] in this BZ Description).
- 'iommu=pt' does not look like the cause of the issue at this stage to me.
- I see the PMD cores are isolated, so they should not be hit by the syscalls from sos-report generation.
- Was any link down observed at the uplink switch?
    *answer: There was no link down event at deployment time, but there was an event when we grabbed a sosreport on Sep 1st.
- balance-tcp was a Technology Preview feature in OSP 13. It causes re-circulation of the packets, degrading performance (a quick way to confirm the configured bond mode is sketched after this list).
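
Regarding the last two points, a quick way to double-check the configured bond mode and the runtime bond state on an affected node (a sketch on my part; it assumes the ovs-vsctl and ovs-appctl binaries are available on the node and that the bond port is named dpdkbond0, as in the logs):

#!/usr/bin/env python3
"""Sketch: print the configured OVS bond mode and the runtime bond state.

Assumes ovs-vsctl / ovs-appctl are on PATH and the bond is named dpdkbond0.
"""
import subprocess

def run(cmd):
    """Run a command and return its stripped stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

if __name__ == "__main__":
    # Configured mode in OVSDB (e.g. balance-tcp, balance-slb, active-backup).
    print("bond_mode:", run(["ovs-vsctl", "get", "Port", "dpdkbond0", "bond_mode"]))
    # Runtime view: member link state, updelay/downdelay, LACP status.
    print(run(["ovs-appctl", "bond/show", "dpdkbond0"]))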

Comment 2 Francois Palin 2022-09-07 14:01:05 UTC
There seems to be a correlation in time between an rsync operation being executed [1], the loss of dpdkbond0 (see ref [1] in this BZ Description above),
and the ovs-vswitchd.log messages "Unreasonably long XXXXms poll interval" [2]; a rough correlation sketch follows the excerpts below.

[1] 
Aug 31 05:26:46 my-customer-host-name dockerd-current[11351]: + rsync -a -R --delay-updates --delete-after --exclude=/etc/puppetlabs/ --exclude=/opt/puppetlabs/ --exclude=/etc/pki/ca-trust/extracted --exclude=/etc/pki/ca-trust/source/anchors --exclude=/etc/pki/tls/certs/ca-bundle.crt --exclude=/etc/pki/tls/certs/ca-bundle.trust.crt --exclude=/etc/pki/tls/cert.pem --exclude=/etc/hosts --exclude=/etc/localtime --exclude=/etc/hostname /etc /root /opt /var/www /var/spool/cron /var/lib/config-data/neutron
Aug 31 05:27:16 my-customer-host-name dockerd-current[11351]: ++ stat -c %y /var/lib/config-data/iscsid.origin_of_time

[2]
ovs-vswitchd.log:2022-08-31T05:25:46.611Z|1639463|timeval|WARN|Unreasonably long 6006ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:25:53.837Z|01979|timeval|WARN|Unreasonably long 2026657399ms poll interval (7440ms user, 148678ms system)
ovs-vswitchd.log:2022-08-31T05:26:12.439Z|1639555|timeval|WARN|Unreasonably long 1181ms poll interval (0ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:17.407Z|1639558|timeval|WARN|Unreasonably long 1215ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:39.830Z|1639570|timeval|WARN|Unreasonably long 1089ms poll interval (2ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T05:26:52.265Z|02021|timeval|WARN|Unreasonably long 58428ms poll interval (0ms user, 5ms system)
ovs-vswitchd.log:2022-08-31T05:32:30.725Z|02142|timeval|WARN|Unreasonably long 337802ms poll interval (26ms user, 0ms system)

ovs-vswitchd.log:2022-08-31T16:01:26.011Z|1647624|timeval|WARN|Unreasonably long 1241ms poll interval (1ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:01:53.389Z|1647701|timeval|WARN|Unreasonably long 1009ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:02:17.189Z|1647709|timeval|WARN|Unreasonably long 1856ms poll interval (1ms user, 1ms system)
ovs-vswitchd.log:2022-08-31T16:02:59.896Z|1647720|timeval|WARN|Unreasonably long 1334ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:03:25.215Z|1647724|timeval|WARN|Unreasonably long 3445ms poll interval (2ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T16:04:08.779Z|1647733|timeval|WARN|Unreasonably long 1320ms poll interval (1ms user, 2ms system)

ovs-vswitchd.log:2022-08-31T17:13:01.105Z|1648508|timeval|WARN|Unreasonably long 1090ms poll interval (3ms user, 0ms system)
ovs-vswitchd.log:2022-08-31T17:13:45.898Z|1648518|timeval|WARN|Unreasonably long 2951ms poll interval (1ms user, 0ms system)
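
To make that correlation easier to check across the full logs, a rough sketch (my addition; it assumes Python 3, that the journal and OVS timestamps are in the same timezone, and that the input file names below are placeholders):

#!/usr/bin/env python3
"""For each rsync entry in the journal, list the ovs-vswitchd "Unreasonably
long ... poll interval" warnings that fall within +/- 120 seconds.

Assumptions on my part (not from the report): journal lines use the syslog
"Aug 31 05:26:46" timestamp format, the year is the one set below, and the
two logs share a timezone.
"""
import re
from datetime import datetime, timedelta

YEAR = 2022                      # syslog timestamps carry no year
WINDOW = timedelta(seconds=120)  # how close in time counts as "correlated"

OVS_PAT = re.compile(
    r"(\d{4}-\d{2}-\d{2}T[\d:.]+Z)\|\d+\|timeval\|WARN\|Unreasonably long (\d+)ms"
)

def journal_rsync_times(path):
    """Timestamps of journal lines that invoke rsync."""
    with open(path) as f:
        for line in f:
            if "rsync -a" in line:
                # e.g. "Aug 31 05:26:46 my-customer-host-name dockerd-current[...]"
                yield datetime.strptime(f"{YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")

def ovs_poll_warnings(path):
    """(timestamp, interval_ms) for each long-poll-interval warning."""
    with open(path) as f:
        for line in f:
            m = OVS_PAT.search(line)
            if m:
                yield (datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%fZ"),
                       int(m.group(2)))

if __name__ == "__main__":
    warnings = list(ovs_poll_warnings("ovs-vswitchd.log"))
    for rsync_ts in journal_rsync_times("journal.txt"):
        nearby = [(t, ms) for t, ms in warnings if abs(t - rsync_ts) <= WINDOW]
        print(rsync_ts, "->", nearby or "no nearby poll-interval warnings")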

Comment 3 Francois Palin 2022-09-07 14:37:50 UTC
The log messages mentioned in comment #2 (the ovs-vswitchd.log "Unreasonably long XXXXms poll interval" warnings)
are most likely a consequence of losing dpdkbond0 rather than the cause,
going by [1] and [2] below.

[1] https://access.redhat.com/solutions/4898121

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1809363
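
One way to read those warnings (my interpretation, not something stated in [1] or [2]): each warning reports both the wall-clock gap and the CPU time actually consumed (user + system), and a long gap with near-zero CPU time suggests the vswitchd main thread was blocked or descheduled rather than busy doing work, which fits the consequence-not-cause reading. A minimal sketch (assumes Python 3; the script file name is a placeholder) to pull those numbers out of the warnings:

#!/usr/bin/env python3
"""Compare wall-clock poll gaps against CPU time actually burned.

Usage (file and script names are placeholders):
    grep 'Unreasonably long' ovs-vswitchd.log | python3 poll_ratio.py
"""
import re
import sys

PAT = re.compile(
    r"Unreasonably long (\d+)ms poll interval \((\d+)ms user, (\d+)ms system\)"
)

for line in sys.stdin:
    m = PAT.search(line)
    if m:
        wall, user, system = map(int, m.groups())
        cpu = user + system
        # 10% is an arbitrary rough threshold, not an OVS-defined value.
        verdict = "mostly blocked/descheduled" if cpu < wall // 10 else "busy"
        print(f"wall={wall}ms cpu={cpu}ms -> {verdict}")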

Comment 9 Elvira 2022-09-12 14:02:49 UTC
Moving to NFV team as this does not seem to be directly related to Neutron. Please reach back if there's any problem, thanks!

Comment 15 Lon Hohberger 2023-07-11 20:01:49 UTC
OSP 13 was retired on June 27, 2023. As this was resolved by a knowledge base article, no further action is planned on this issue at this time.

