Bug 2257274

Summary: [OSP17.1] After upgrade to OSP16.2.6, Octavia Mgmt network for amphorae has random MTU changes: smaller MTU (1500) compared to the original value of 8950 (jumbo frames)
Product: Red Hat OpenStack
Reporter: Gregory Thiemonge <gthiemon>
Component: tripleo-ansible
Assignee: Gregory Thiemonge <gthiemon>
Status: CLOSED ERRATA
QA Contact: Bruna Bonguardo <bbonguar>
Severity: high
Docs Contact: Greg Rakauskas <gregraka>
Priority: high
Version: 17.1 (Wallaby)
CC: bbonguar, bcafarel, dalvarez, gregraka, gthiemon, jlibosva, jsoliman, mariel, mburns, midzik, njohnston, oschwart, pgrist, ralonsoh, tvainio, tweining
Target Milestone: z3
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Unspecified
Fixed In Version: tripleo-ansible-3.3.1-17.1.20231101230827.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, when jumbo frames were used for Networking service (neutron) tenant networks, shutting down a RHOSP Controller could sometimes reset the MTU of the RHOSP Load-balancing service (octavia) management interface (`o-hm0`) to a small value, such as 1500 or 1450. This problem usually occurred when the RHOSP Controller was rebooted for the first time, or when the Controller was abruptly terminated. With this update, RHOSP director ensures that Open vSwitch (OVS) is configured with the correct MTU when the `o-hm0` interface is created.
Clone Of: 2255253
Last Closed: 2024-05-22 20:42:36 UTC
Bug Depends On: 2255253
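
The Doc Text above says that director now configures OVS with the correct MTU when `o-hm0` is created. A minimal sketch of that idea follows; it is not the actual tripleo-ansible task. It assumes `o-hm0` is an OVS internal port on `br-int` and a target MTU of 8942, and the helper `ovs_hm_port_cmd` and its `APPLY` switch are hypothetical. As we understand it, the relevant piece is the `mtu_request` column of the OVS `Interface` table: it is stored in the OVS database, so OVS can reapply it when it recreates the internal port after a restart, whereas an MTU set only with `ip link set` is not persisted.

```shell
# Hypothetical helper: build (and optionally run) an ovs-vsctl command that
# creates the Octavia management port as an OVS internal port with a
# persistent MTU. Bridge, port name and MTU defaults are assumptions.
ovs_hm_port_cmd() {
  local bridge="${1:-br-int}" port="${2:-o-hm0}" mtu="${3:-8942}"
  # --may-exist keeps the command idempotent if the port already exists;
  # mtu_request is stored in OVSDB rather than only in the kernel.
  set -- ovs-vsctl --may-exist add-port "$bridge" "$port" \
    -- set Interface "$port" type=internal "mtu_request=$mtu"
  if [ "${APPLY:-0}" = 1 ]; then
    "$@"          # execute on the node
  else
    echo "$*"     # dry run: only print the command
  fi
}
```

By default the function only prints the command (for example `ovs_hm_port_cmd br-int o-hm0 8942`); setting `APPLY=1` would execute it on a node.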

Description Gregory Thiemonge 2024-01-08 14:53:15 UTC
+++ This bug was initially created as a clone of Bug #2255253 +++

Description of problem:
Hello, we have a customer (CU) who upgraded their environment to 16.2.6; the customer uses jumbo frames in the deployment templates.
- After the upgrade, some load balancers suddenly ended up in ERROR state.
- Failing them over did not fix anything: they ended up in PENDING_UPDATE state, and some amphorae went into ERROR state.

- We compared two sosreports from the same node, taken before and after the upgrade.
- Both contain q-devices (qbr/qvo/qvb) with an MTU of either 1500 or 8950, so not all of the 1500-MTU devices appeared after the upgrade.
- Result: the Octavia management network is not working for a large number of amphorae.
- For some amphorae, the Octavia management network has been set to an MTU of 1500, which is too low for the Octavia management health messages.

Many amphora-related interfaces have a small MTU, for example:
ip a | grep eaf87d51-ac 
862: qbreaf87d51-ac: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
863: qvoeaf87d51-ac@qvbeaf87d51-ac: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
864: qvbeaf87d51-ac@qvoeaf87d51-ac: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master qbreaf87d51-ac state UP group default qlen 1000
872: tapeaf87d51-ac: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue master qbreaf87d51-ac state UNKNOWN group default qlen 1000
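
On a node with many amphora ports, the mismatch shown above (qbr/qvo/qvb at 1500 while the tap device is at 8950) can be spotted with a small filter. This is a sketch, assuming `ip a` or `ip link show` output on stdin and the expected MTU of 8950 from this report; `find_mtu_mismatch` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ip a` / `ip link show` output on stdin and print
# "<interface> <mtu>" for every interface whose MTU differs from $1.
find_mtu_mismatch() {
  local expected="$1"
  awk -v want="$expected" '
    # Interface header lines look like: "862: qbrXXX: <FLAGS> mtu 1500 ..."
    $1 ~ /^[0-9]+:$/ {
      name = $2
      sub(/:$/, "", name)    # drop the trailing colon
      sub(/@.*/, "", name)   # drop the veth peer suffix, e.g. "@qvbXXX"
      for (i = 3; i < NF; i++)
        if ($i == "mtu" && $(i + 1) != want) print name, $(i + 1)
    }'
}
```

For example, `ip a | grep eaf87d51-ac | find_mtu_mismatch 8950` would list the qbr, qvo and qvb devices above, but not the tap device.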


Many load balancers are stuck in PENDING_UPDATE, which prevents any further actions on them.

We noticed the following upstream release note in [1]:
~~~
A new parameter octavia_provider_network_mtu is added to set the MTU to 1500 by default. This is important for deployments that allow jumbo frames while setting the management to the standard Ethernet MTU. The MTU can be still changed at any point during the initial octavia deployment or with the openstack network set --mtu command line.
~~~

That variable was added upstream in [2] and may be available downstream.
Changing the network MTU manually is also an option, by running:
 openstack network set --mtu 8950 <LB network ID>
This has not been tested yet, so it should be tried in a test lab first.
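
One cautious way to apply that manual workaround is to print the command first and only run it once it has been validated in a test lab. `openstack network show -c mtu -f value` and `openstack network set --mtu` are standard OpenStack client commands; the helper name `set_lb_mgmt_mtu`, its `APPLY` switch, and the default network name `lb-mgmt-net` are assumptions for illustration.

```shell
# Hypothetical helper: set the MTU of the Octavia management network.
# Prints the command by default (dry run); set APPLY=1 to execute it.
set_lb_mgmt_mtu() {
  local net="${1:-lb-mgmt-net}" mtu="${2:-8950}"
  if [ "${APPLY:-0}" = 1 ]; then
    # Show the current MTU, then change it.
    openstack network show -c mtu -f value "$net"
    openstack network set --mtu "$mtu" "$net"
  else
    echo "openstack network set --mtu $mtu $net"
  fi
}
```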

We need to investigate why the Octavia management network has a smaller MTU, and restore the original value of 8950.

[1] https://docs.openstack.org/releasenotes/openstack-ansible-os_octavia/unreleased.html
[2] https://review.opendev.org/c/openstack/openstack-ansible-os_octavia/+/864819

Version-Release number of selected component (if applicable):
RHOSP 16.2.6
puppet-octavia-15.5.1-2.20220821005128.a56b33a.el8ost.noarch

Actual results:
The MTU on the Octavia management network is smaller than the original value of 8950, and the Octavia management network is not working for a large number of amphorae.

Expected results:
The MTU on the Octavia management network keeps its original value of 8950.


Additional info:
sosreports from the Compute and Controller nodes are attached to the case.

Comment 6 Omer Schwartz 2024-04-01 09:06:54 UTC
I verified this on a host with RHOS-17.1-RHEL-9-20240320.n.1 by running the following commands.

(As indicated before) I deployed OSP 17.1 with jumbo frames, added Octavia to the deployment, and checked the MTUs on the controllers and networkers:


(overcloud) [stack@undercloud-0 ~]$ for host in controller-{0,1,2} networker-{0,1,2}; do echo $host; ssh $host.ctlplane "ip link show o-hm0"; done
controller-0
Warning: Permanently added 'controller-0.ctlplane' (ED25519) to the list of known hosts.
14: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:b3:79:a7 brd ff:ff:ff:ff:ff:ff
controller-1
Warning: Permanently added 'controller-1.ctlplane' (ED25519) to the list of known hosts.
14: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:62:df:0c brd ff:ff:ff:ff:ff:ff
controller-2
Warning: Permanently added 'controller-2.ctlplane' (ED25519) to the list of known hosts.
14: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:e3:69:24 brd ff:ff:ff:ff:ff:ff
networker-0
Warning: Permanently added 'networker-0.ctlplane' (ED25519) to the list of known hosts.
12: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:08:89:34 brd ff:ff:ff:ff:ff:ff
networker-1
Warning: Permanently added 'networker-1.ctlplane' (ED25519) to the list of known hosts.
12: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:0a:21:43 brd ff:ff:ff:ff:ff:ff
networker-2
Warning: Permanently added 'networker-2.ctlplane' (ED25519) to the list of known hosts.
12: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:93:a8:70 brd ff:ff:ff:ff:ff:ff

# we can see all have MTU == 8942
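
The 8942 value here differs from the 8950 in the original OSP 16.2 report. A plausible explanation, assuming a 9000-byte physical MTU and IPv4 tunnel endpoints: Neutron derives the tenant network MTU by subtracting the encapsulation overhead (30 bytes for VXLAN, typical with ML2/OVS in OSP 16.2, or the Geneve `max_header_size`, assumed here to be 38 as is common with ML2/OVN in OSP 17.1) plus a 20-byte IPv4 header. The hypothetical helper below only reproduces that arithmetic.

```shell
# Hypothetical helper reproducing Neutron-style tenant MTU arithmetic:
#   tenant_mtu = physical_mtu - encapsulation overhead - IPv4 header (20)
# Overhead values are assumptions: VXLAN 30 bytes, Geneve max_header_size 38.
tenant_mtu() {
  local physical="$1" encap
  case "$2" in
    vxlan)  encap=30 ;;
    geneve) encap=38 ;;
    *) echo "unknown encapsulation: $2" >&2; return 1 ;;
  esac
  echo $((physical - encap - 20))
}

tenant_mtu 9000 vxlan    # 8950, matching the OSP 16.2 report
tenant_mtu 9000 geneve   # 8942, matching this OSP 17.1 verification
```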

# Reboot controller-0 and networker-0
(overcloud) [stack@undercloud-0 ~]$ for host in controller-0 networker-0; do ssh ${host}.ctlplane sudo reboot; done
Warning: Permanently added 'controller-0.ctlplane' (ED25519) to the list of known hosts.
Warning: Permanently added 'networker-0.ctlplane' (ED25519) to the list of known hosts.

# Make sure their MTUs stayed the same after the reboot
(overcloud) [stack@undercloud-0 ~]$ for host in controller-{0,1,2} networker-{0,1,2}; do echo $host; ssh $host.ctlplane "ip link show o-hm0"; done
controller-0
Warning: Permanently added 'controller-0.ctlplane' (ED25519) to the list of known hosts.
10: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:b3:79:a7 brd ff:ff:ff:ff:ff:ff
controller-1
Warning: Permanently added 'controller-1.ctlplane' (ED25519) to the list of known hosts.
14: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:62:df:0c brd ff:ff:ff:ff:ff:ff
controller-2
Warning: Permanently added 'controller-2.ctlplane' (ED25519) to the list of known hosts.
14: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:e3:69:24 brd ff:ff:ff:ff:ff:ff
networker-0
Warning: Permanently added 'networker-0.ctlplane' (ED25519) to the list of known hosts.
10: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:08:89:34 brd ff:ff:ff:ff:ff:ff
networker-1
Warning: Permanently added 'networker-1.ctlplane' (ED25519) to the list of known hosts.
12: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:0a:21:43 brd ff:ff:ff:ff:ff:ff
networker-2
Warning: Permanently added 'networker-2.ctlplane' (ED25519) to the list of known hosts.
12: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:93:a8:70 brd ff:ff:ff:ff:ff:ff

# Hard reset networker-0 and networker-1
[root@osp-devnest-5 ~]# virsh reset networker-0
Domain 'networker-0' was reset

[root@osp-devnest-5 ~]# virsh reset networker-1
Domain 'networker-1' was reset

# Make sure the MTUs are the same:
[stack@undercloud-0 ~]$ for host in controller-{0,1,2} networker-{0,1,2}; do echo $host; ssh $host.ctlplane "ip link show o-hm0"; done
controller-0
Warning: Permanently added 'controller-0.ctlplane' (ED25519) to the list of known hosts.
10: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:b3:79:a7 brd ff:ff:ff:ff:ff:ff
controller-1
Warning: Permanently added 'controller-1.ctlplane' (ED25519) to the list of known hosts.
14: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:62:df:0c brd ff:ff:ff:ff:ff:ff
controller-2
Warning: Permanently added 'controller-2.ctlplane' (ED25519) to the list of known hosts.
14: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:e3:69:24 brd ff:ff:ff:ff:ff:ff
networker-0
Warning: Permanently added 'networker-0.ctlplane' (ED25519) to the list of known hosts.
11: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:08:89:34 brd ff:ff:ff:ff:ff:ff
networker-1
Warning: Permanently added 'networker-1.ctlplane' (ED25519) to the list of known hosts.
11: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:0a:21:43 brd ff:ff:ff:ff:ff:ff
networker-2
Warning: Permanently added 'networker-2.ctlplane' (ED25519) to the list of known hosts.
12: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8942 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether fa:16:3e:93:a8:70 brd ff:ff:ff:ff:ff:ff

The MTUs stayed the same after all the reboot and reset operations. This looks good to me, so I am moving the BZ status to VERIFIED.
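
The manual checks in this comment can also be run as a pass/fail report. A minimal sketch, assuming the same `<host>.ctlplane` SSH access and the 8942 MTU observed here; `hm0_mtu` and `check_hm0_mtu` are hypothetical helper names.

```shell
# Hypothetical helper: extract the MTU from `ip link show o-hm0` output on stdin.
hm0_mtu() {
  awk '$2 == "o-hm0:" { for (i = 3; i < NF; i++) if ($i == "mtu") print $(i + 1) }'
}

# Hypothetical check loop: report each node whose o-hm0 MTU differs from $1.
check_hm0_mtu() {
  local expected="$1" host mtu rc=0
  for host in controller-0 controller-1 controller-2 \
              networker-0 networker-1 networker-2; do
    # SSH warnings go to stderr, so only the `ip link` output is parsed.
    mtu=$(ssh "$host.ctlplane" ip link show o-hm0 | hm0_mtu)
    if [ "$mtu" = "$expected" ]; then
      echo "$host: OK ($mtu)"
    else
      echo "$host: MISMATCH (got '${mtu:-none}', expected $expected)"
      rc=1
    fi
  done
  return "$rc"
}
```

`check_hm0_mtu 8942` prints one line per node and returns non-zero if any node's `o-hm0` MTU differs, which makes it easy to rerun after each reboot or `virsh reset`.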

Comment 17 errata-xmlrpc 2024-05-22 20:42:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: openstack-tripleo-heat-templates and tripleo-ansible update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:2736