RDO tickets are now tracked in Jira https://issues.redhat.com/projects/RDO/issues/
Bug 2065504 - Neutron DVR breaks with kernel 4.18.0-365.el8.x86_64
Summary: Neutron DVR breaks with kernel 4.18.0-365.el8.x86_64
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RDO
Classification: Community
Component: openstack-neutron
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: trunk
Assignee: Daniel Alvarez Sanchez
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On: 2062870
Blocks:
Reported: 2022-03-18 01:25 UTC by Jonathan Mills
Modified: 2022-08-30 04:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-30 04:27:34 UTC
Embargoed:


Attachments
pcap of the working example (11.22 KB, application/vnd.tcpdump.pcap)
2022-03-18 15:44 UTC, Jonathan Mills
pcap of the broken example (8.45 KB, application/vnd.tcpdump.pcap)
2022-03-18 15:44 UTC, Jonathan Mills
Additional details on working example as text log (22.40 KB, text/plain)
2022-03-18 15:45 UTC, Jonathan Mills
Additional details on broken example as text log (17.15 KB, text/plain)
2022-03-18 15:46 UTC, Jonathan Mills

Description Jonathan Mills 2022-03-18 01:25:05 UTC
Description of problem:

Neutron Distributed Virtual Routers break, on the compute node side, when the node is patched up to CentOS 8 Stream kernel 4.18.0-365.el8.x86_64.  This happens regardless of whether the Neutron ML2 mechanism driver is openvswitch or linuxbridge.


Version-Release number of selected component (if applicable):

CentOS 8 Stream, fully patched as of this ticket
OpenStack Wallaby (RDO packages, CentOS Cloud SIG repo)

Neutron packages:
openstack-neutron-18.2.0-1.el8.noarch
openstack-neutron-common-18.2.0-1.el8.noarch
openstack-neutron-linuxbridge-18.2.0-1.el8.noarch
openstack-neutron-ml2-18.2.0-1.el8.noarch
python3-neutron-18.2.0-1.el8.noarch
python3-neutronclient-7.3.0-1.el8.noarch
python3-neutron-lib-2.10.2-1.el8.noarch


How reproducible:

100%


Steps to Reproduce:

1. Assuming you have a multi-node OpenStack cluster, running CentOS 8 Stream, with OpenStack Wallaby RPMs from RDO, and Neutron routers are in DVR mode...

2. Neutron network node L3 agent is in "dvr_snat" mode
3. Neutron compute node L3 agents are in "dvr_no_external" mode
4. The Neutron server's ML2 plugin can have mechanism_drivers set to "openvswitch,linuxbridge".  The problem is independent of the mechanism driver

5. Patch your compute node kernel from 4.18.0-348.el8.x86_64 up to 4.18.0-365.el8.x86_64 and reboot.  Ensure that nova-compute and the Neutron agents are running.

6. Boot a VM, assign a floating IP, and try to ssh to it.  
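
For anyone checking a node against steps 2 and 3, here is a minimal Python sketch that reads the agent_mode option from the text of an l3_agent.ini file and compares it with the modes listed above. The role names ("network", "compute") and the helper names are illustrative assumptions, not anything Neutron provides, and it assumes the option lives in the [DEFAULT] section.

```python
import configparser

# Modes taken from steps 2-3 above; the role names are illustrative only.
EXPECTED_AGENT_MODE = {"network": "dvr_snat", "compute": "dvr_no_external"}


def agent_mode(ini_text):
    """Return agent_mode from the [DEFAULT] section of l3_agent.ini text."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    # When the option is unset, the L3 agent falls back to "legacy".
    return parser["DEFAULT"].get("agent_mode", fallback="legacy").strip()


def matches_repro(role, ini_text):
    """True if the node's L3 agent mode matches the setup in this report."""
    return agent_mode(ini_text) == EXPECTED_AGENT_MODE[role]
```

On a real node you would pass in the contents of the agent's config file (typically /etc/neutron/l3_agent.ini), for example matches_repro("compute", open(path).read()).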


Actual results:

You'll be able to ping, as ICMP still works.  But ssh and other TCP-based traffic fails. Specifically, it fails at the DNAT.  SNAT and east-west traffic still work from the VM.  End users will be confused.

Expected results:

All of DNAT, SNAT, and East-West traffic should work with the VM.  In short, it should work normally.


Additional info:

We performed TCP dumps of the bridge interface to which the VM's tap device is attached.  What we saw was a bizarre randomization of TCP source ports from the VM. This results in massive TCP retransmits, until the handshaking breaks down, usually appearing as an ssh client timeout.  I'll try to upload a tcpdump pcap later.
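
One way to look for this symptom in a trace, as a rough Python sketch: it assumes the packets have already been decoded into simple records, and it assumes the symptom shows up as the VM answering from a source port other than the one the client addressed. The Packet type, the function name, and any addresses you feed it are hypothetical; this was not run against the attached pcaps.

```python
from collections import namedtuple

# Hypothetical, already-decoded packet record (not a real capture format).
Packet = namedtuple("Packet", "src_ip src_port dst_ip dst_port")


def mismatched_replies(packets, vm_ip):
    """Return packets sent by the VM from a port the peer never addressed."""
    # (peer_ip, peer_port) -> set of VM ports that peer has sent to
    addressed = {}
    bad = []
    for p in packets:
        if p.dst_ip == vm_ip:
            addressed.setdefault((p.src_ip, p.src_port), set()).add(p.dst_port)
        elif p.src_ip == vm_ip:
            ports = addressed.get((p.dst_ip, p.dst_port))
            # Only judge flows the peer initiated; ignore the VM's own
            # outbound connections, which the peer never addressed first.
            if ports is not None and p.src_port not in ports:
                bad.append(p)
    return bad
```

With a healthy flow (client to VM port 22, reply from port 22) the function returns an empty list; a reply leaving from some other port is returned as a mismatch.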

Comment 1 Jonathan Mills 2022-03-18 01:30:14 UTC
I should have mentioned...we can fix the problem 100% of the time simply by reverting our kernels from 4.18.0-365.el8.x86_64 to 4.18.0-348.el8.x86_64.  We have reverted the kernels on our production cloud hypervisors.  But clearly this isn't ideal, as we also have a duty to patch...

Comment 2 Robert Budden 2022-03-18 15:22:05 UTC
To add some additional information, we are able to live swap tenant virtual routers to centralized mode to restore North/South connectivity. This has obvious performance impacts, but may prove useful in narrowing down or debugging the issue. On the surface one might expect the North/South pieces of DVR and Centralized to be similar if not the same, but admittedly I have not yet dug into the code.

Comment 3 Jonathan Mills 2022-03-18 15:44:10 UTC
Created attachment 1866639 [details]
pcap of the working example

Comment 4 Jonathan Mills 2022-03-18 15:44:46 UTC
Created attachment 1866640 [details]
pcap of the broken example

Comment 5 Jonathan Mills 2022-03-18 15:45:31 UTC
Created attachment 1866641 [details]
Additional details on working example as text log

Comment 6 Jonathan Mills 2022-03-18 15:46:05 UTC
Created attachment 1866642 [details]
Additional details on broken example as text log

Comment 10 Yatin Karel 2022-07-20 13:58:36 UTC
I checked, and the issue still reproduces with the latest C8 kernel, 4.18.0-394.el8, and the 'dvr_no_external' L3 agent mode. It does not reproduce on CentOS 9-Stream.

On checking further with @ralonsoh, we found that it is caused by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2006167.
That fix was reverted in RHEL 9 as part of https://bugzilla.redhat.com/show_bug.cgi?id=2061850, which is why we don't see the issue in CentOS 9-Stream. The issue is not yet fixed in the RHEL 8 kernel: https://bugzilla.redhat.com/show_bug.cgi?id=2051413

Comment 11 Florian Westphal 2022-07-20 14:44:52 UTC
(In reply to Yatin Karel from comment #10)

It is fixed in RHEL8, in 4.18.0-397.el8. The bug you are referencing is filed vs. Fedora.
The RHEL8 bug is https://bugzilla.redhat.com/show_bug.cgi?id=2062870.

Comment 12 Yatin Karel 2022-07-20 15:08:49 UTC
(In reply to Florian Westphal from comment #11)

Thanks, Florian, for the link; I updated the bz reference.
I see that kernel-4.18.0-408.el8, which includes the revert, was built yesterday for C8-Stream, so it should soon be available in the C8-Stream repos.

Comment 13 Yatin Karel 2022-08-30 04:27:34 UTC
(In reply to Yatin Karel from comment #12)

kernel-4.18.0-408.el8 is now available in the C8-Stream repos and is working fine; tested with both Wallaby and Train [1][2].

Closing the bug based on this; feel free to reopen if you still see the issue with the latest kernel.

[1] https://review.rdoproject.org/zuul/build/e63e400e87324bd88e877fc326e310a8
[2] https://review.rdoproject.org/zuul/build/bcdef368fcb74cc7aced9fabe7d8b9e6
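
To summarize the kernel builds mentioned in this bug, here is a small Python sketch that classifies an el8 4.18.0 kernel release string. The thresholds come only from the comments above (348 reported working, 365 and 394 reported broken, fix stated for 397, 408 verified); builds 349 through 364 were not tested in this report, so the sketch returns "unknown" for them. The function name and return labels are made up for illustration.

```python
import re

# Matches e.g. "4.18.0-365.el8.x86_64", "kernel-4.18.0-408.el8",
# and z-stream style strings such as "4.18.0-348.7.1.el8_5.x86_64".
_RELEASE = re.compile(r"^(?:kernel-)?4\.18\.0-(\d+)(?:\.\d+)*\.el8")


def classify(release):
    """Classify a kernel release string using the builds cited in this bug."""
    m = _RELEASE.match(release)
    if m is None:
        return "not an el8 4.18.0 kernel"
    build = int(m.group(1))
    if build >= 397:   # fix stated for 397; 408 verified in C8-Stream
        return "fixed"
    if build >= 365:   # 365 and 394 were reported broken
        return "affected"
    if build <= 348:   # 348 was the reported working kernel
        return "reported working"
    return "unknown"   # 349-364 were not tested in this report
```

On a compute node, classify(platform.release()) (after import platform) gives the answer for the running kernel.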

