Bug 2033862

Summary: MachineSet is not scaling up due to an OpenStack error trying to create multiple ports with the same MAC address
Product: OpenShift Container Platform Reporter: Vincent Lours <vlours>
Component: Cloud Compute Assignee: Martin André <m.andre>
Cloud Compute sub component: OpenStack Provider QA Contact: Itzik Brown <itbrown>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: aos-bugs, enothen, igarciam, itbrown, kurathod, ltamagno, m.andre, mbooth, mfedosin, openshift-bugs-escalate, pprinett, ssonigra
Version: 4.8 Keywords: Triaged
Target Milestone: --- Flags: vlours: needinfo-
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: A bug in Cisco ACI's neutron implementation, present in RHOSP16, causes the query for subnets belonging to a given network to return unexpected results.
Consequence: The OpenStack cluster-api-provider could potentially try to provision instances with duplicated ports on the same subnet, leading to a failed provisioning.
Fix: Add additional filtering in the OpenStack cluster-api-provider to ensure there is no more than one port per subnet.
Result: It is now possible to deploy OCP on RHOSP16 with Cisco ACI.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:40:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2050064    
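
The Fix entry in the Doc Text above ("no more than one port per subnet") can be illustrated with a minimal Go sketch. This is not the actual change made in cluster-api-provider-openstack: the `PortRequest` type and the `dedupeBySubnet` helper are hypothetical names invented for illustration, and the sketch assumes every port request carries a non-empty subnet ID.

```go
package main

import "fmt"

// PortRequest is a hypothetical, simplified stand-in for the port options a
// provider might build before asking Neutron to create ports. It is NOT the
// real cluster-api-provider-openstack type.
type PortRequest struct {
	NetworkID string
	SubnetID  string
}

// dedupeBySubnet keeps the first port request seen for each subnet ID and
// drops any later request for the same subnet, preserving input order.
// Assumption: every request has a non-empty SubnetID.
func dedupeBySubnet(ports []PortRequest) []PortRequest {
	seen := make(map[string]struct{}, len(ports))
	out := make([]PortRequest, 0, len(ports))
	for _, p := range ports {
		if _, dup := seen[p.SubnetID]; dup {
			continue // a port on this subnet is already planned
		}
		seen[p.SubnetID] = struct{}{}
		out = append(out, p)
	}
	return out
}

func main() {
	ports := []PortRequest{
		{NetworkID: "net-a", SubnetID: "subnet-1"},
		{NetworkID: "net-a", SubnetID: "subnet-1"}, // duplicate, e.g. from an unexpected subnet query result
		{NetworkID: "net-a", SubnetID: "subnet-2"},
	}
	for _, p := range dedupeBySubnet(ports) {
		fmt.Printf("%s/%s\n", p.NetworkID, p.SubnetID)
	}
}
```

Filtering on the client side like this guards against a subnet query that returns the same subnet more than once, whatever the underlying Neutron plugin does.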

Description Vincent Lours 2021-12-18 06:49:41 UTC
Description of problem:
As described in BZ 1936511 (https://bugzilla.redhat.com/show_bug.cgi?id=1936511), a customer is facing an issue when trying to deploy new VM(s) in OpenStack.

Version-Release number of selected component (if applicable):
RHOCP 4.8.20

How reproducible:
It seems reproducible in OpenStack environments.

Actual results:
Provisioning a new VM fails.

Expected results:
The MachineSet should be able to create the desired VM using flags.

Additional info:
This is related to BZ 1936511, which was closed as a duplicate of BZ 1955969 (https://bugzilla.redhat.com/show_bug.cgi?id=1955969).
The patch was included in RHOCP 4.8.3, so the issue should already be fixed.

Would it be possible to confirm that the fix has not been reverted and was correctly implemented?

Comment 2 Martin André 2021-12-20 16:09:39 UTC
Hi, while I can't say the problem isn't in CAPO, I do not believe the patch at https://github.com/openshift/cluster-api-provider-openstack/pull/181 is at fault - there must be another issue at play.

I can see from the attached customer case that the issue started appearing after a migration from OSP13 to OSP16. That likely means they also switched from OVS to OVN for the OpenStack networking, and it's possible they're hitting an OVN bug (such as https://bugzilla.redhat.com/show_bug.cgi?id=1947823). It's also possible that other OpenShift overlays could be causing this issue. I remember a similar issue with Cisco ACI (https://bugzilla.redhat.com/show_bug.cgi?id=2002295), which I believe this customer is using.

We would need more info to help us debug. Could you provide us with a must-gather?

Comment 4 Vincent Lours 2021-12-21 01:16:29 UTC
Hi Martin,

Thank you for sharing the information.

The customer has updated the case, saying that the workaround provided in the KCS is not suitable for an IPI install.
Based on your last comment, I will request additional information from the customer.

As the must-gather is available from the case, would it be possible to get someone assigned to this BZ?

Comment 5 Martin André 2021-12-21 09:55:59 UTC
Could you also provide the problematic MachineSet? The 4.8 must-gather I was looking at only included the `hub-2m8kz-worker-0` machineset that seems to work as expected where replicas == availableReplicas. I can also see machines from this machineset would in theory only be attached to 1 subnet, assuming the filter returns only one match.

Comment 29 Itzik Brown 2022-02-24 14:48:41 UTC
Since we don't have the specific setup, the only way I could verify was to scale up a worker and make sure it becomes ready.

Used:
OCP 4.11.0-0.nightly-2022-02-23-185405 
RHOS-16.2-RHEL-8-20211129.n.1

Comment 33 errata-xmlrpc 2022-08-10 10:40:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069