Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1931661

Summary: OSP 16.1 ovn octavia load balancers in error state - Pool <UUID> is immutable and cannot be updated
Product: Red Hat OpenStack
Reporter: Matt Flusche <mflusche>
Component: python-ovn-octavia-provider
Assignee: Brian Haley <bhaley>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Bruna Bonguardo <bbonguar>
Severity: urgent
Priority: urgent
Version: 16.1 (Train)
CC: afariasa, bhaley, cgoncalves, dhill, eolivare, gthiemon, igarciam, ihrachys, kfida, knakai, lpeer, ltomasbo, majopela, mdemaced, mdulko, pveiga, ramishra, scohen
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Type: Bug
Last Closed: 2021-04-26 14:18:25 UTC

Description Matt Flusche 2021-02-22 21:30:50 UTC
Description of problem:
OVN Octavia load balancers stop working. octavia.log contains the following errors:

Client-side error: Pool <UUID> is immutable and cannot be updated. format_exception /usr/lib/python3.6/site-packages/wsme/api.py:222

Perhaps: https://bugs.launchpad.net/neutron/+bug/1900763
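For context, the "immutable" message is Octavia's standard API response when an object is not in a mutable provisioning state. A minimal sketch of that check (all names here are illustrative, not Octavia's actual code):

```python
# Simplified sketch (illustrative names, not Octavia's real code) of the
# mutability check behind "Pool <UUID> is immutable and cannot be updated":
# the API rejects updates to any object whose provisioning_status is not
# ACTIVE, e.g. one stuck in PENDING_UPDATE after a failed provider call.

MUTABLE_STATUSES = {"ACTIVE"}


class ImmutableObject(Exception):
    """Raised when an object is mid-operation and cannot be changed."""


def check_mutable(obj_type, obj_id, provisioning_status):
    if provisioning_status not in MUTABLE_STATUSES:
        raise ImmutableObject(
            f"{obj_type} {obj_id} is immutable and cannot be updated")


# A pool left in PENDING_UPDATE rejects every further update:
try:
    check_mutable("Pool", "<UUID>", "PENDING_UPDATE")
except ImmutableObject as exc:
    print(exc)  # Pool <UUID> is immutable and cannot be updated
```

Until the object returns to ACTIVE (on its own or via a failover), every update attempt will log this same error, which is consistent with the repeating messages seen here.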

Version-Release number of selected component (if applicable):
16.1.x

How reproducible:
This specific environment


Additional info:
I'll provide additional details in private comments.

Comment 31 Rabi Mishra 2021-03-08 12:10:30 UTC
https://access.redhat.com/solutions/5858391 seems relevant.

Comment 37 Brian Haley 2021-03-15 19:47:01 UTC
Hi Ignacio,

Unfortunately these errors were happening even in the oldest logs in the sosreports I found on supportshell, so I can't tell exactly what operation might have triggered them.

One possibility is that the subnet was not created properly, and when Kuryr/CNO went to allocate a VIP with that IP, it was already in-use.
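If the VIP address was indeed already in use, that could be confirmed by listing the Neutron ports holding it (the IP address below is a placeholder, and this assumes access to the affected cloud):

```shell
# Show any existing port that already owns the candidate VIP address.
openstack port list --fixed-ip ip-address=192.0.2.10
```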

Another is that there was a loadbalancer create operation that was somehow stopped and left a stale DB entry.

I know there were some manual DB operations to clean things up, followed by a forced failover of (I think) this loadbalancer which cleared things up.
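For reference, the forced failover mentioned above can be issued from the CLI (the load balancer ID is a placeholder; this requires appropriate credentials against the affected cloud):

```shell
# Rebuild the load balancer's backing resources via the Octavia API.
openstack loadbalancer failover <lb-id>

# Then watch the statuses until provisioning_status returns to ACTIVE.
openstack loadbalancer show <lb-id> -c provisioning_status -c operating_status
```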

All the log files I've looked at in the case show this failure (AllocateVIPException). To get further in the RCA, I think we'd need the logs from when the failure stopped, so we could see what operation happened at that point in time.

Thanks,

-Brian