Bug 2049892

Summary: Installation fails due to API VIP on control plane nodes
Product: OpenShift Container Platform
Component: Machine Config Operator
Sub component: platform-baremetal
Reporter: Nick Carboni <ncarboni>
Assignee: Bob Fournier <bfournie>
QA Contact: Victor Voronkov <vvoronko>
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
CC: aos-bugs, bnemec, tsedovic
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2022-02-02 21:37:05 UTC

Description Nick Carboni 2022-02-02 21:00:41 UTC
Created attachment 1858758
Logs from 4.9 failure

Description of problem:

During installation, the API VIP is being assigned to control plane nodes even though the API server is not yet running there (it is still running only on the bootstrap node).
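
For triage, here is a minimal diagnostic sketch (Python) that checks whether the local node holds the API VIP while its API server is not actually serving. The VIP address, the apiserver port, and the /readyz endpoint are assumptions/placeholders, not values taken from the attached logs:

#!/usr/bin/env python3
"""Quick check: does this node hold the API VIP, and is the local API
server actually answering?  Diagnostic sketch only; the VIP address and
apiserver URL below are placeholders, not values from this bug."""

import ssl
import subprocess
import urllib.request

API_VIP = "192.0.2.5"                              # hypothetical API VIP
APISERVER_URL = "https://localhost:6443/readyz"    # assumes the usual apiserver port


def node_holds_vip(vip: str) -> bool:
    """Return True if any local interface currently carries the VIP."""
    out = subprocess.run(["ip", "-o", "addr", "show"],
                         capture_output=True, text=True, check=True).stdout
    return any(f" {vip}/" in line for line in out.splitlines())


def apiserver_ready(url: str) -> bool:
    """Return True if the local API server answers its readiness endpoint."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE    # diagnostic only; skip cert validation
    try:
        with urllib.request.urlopen(url, timeout=5, context=ctx) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    holds = node_holds_vip(API_VIP)
    ready = apiserver_ready(APISERVER_URL)
    print(f"holds VIP: {holds}, apiserver ready: {ready}")

Run on a control plane node during installation; "holds VIP: True" together with "apiserver ready: False" corresponds to the premature assignment described above.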

Version-Release number of MCO (Machine Config Operator) (if applicable):
4.9.9 and 4.8.22

Platform (AWS, VSphere, Metal, etc.):
Metal - Specifically the assisted-installer SaaS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Not sure

How reproducible:
Unsure - we've seen two similar failures in the assisted installer SaaS in the last week or so.

Actual results:

The API VIP is assigned to control plane nodes while the API server is still running only on the bootstrap node.

Expected results:

The API VIP should only be assigned to the node running the API server.

Additional info:

Added logs for both instances of the problem we've seen.

Comment 2 Ben Nemec 2022-02-02 21:36:36 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2022050 . See the following sequence of log messages on the bootstrap:

time="2022-01-30T06:17:24Z" level=info msg="Command message successfully sent to Keepalived container control socket: stop\n"
time="2022-01-30T06:17:24Z" level=info msg="Command message successfully sent to Keepalived container control socket: reload\n"

The client sent: stop
Sun Jan 30 06:17:24 2022: Stopping
The client sent: reload
Sun Jan 30 06:17:24 2022: (tm-nc-oam-et-rm17-rack02_API) sent 0 priority
Sun Jan 30 06:17:24 2022: (tm-nc-oam-et-rm17-rack02_API) removing VIPs.
Sun Jan 30 06:17:25 2022: Stopped - used 0.017990 user time, 0.069454 system time
Sun Jan 30 06:17:25 2022: CPU usage (self/children) user: 0.008299/0.018071 system: 0.006205/0.070616
Sun Jan 30 06:17:25 2022: Stopped Keepalived v2.1.5 (07/13,2020)

I've proposed a backport of the fix to 4.9, which should take care of this.
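
For anyone matching other failures against this signature, here is a small sketch (Python) that scans a saved log for the stop-then-reload ordering shown above. The matched substrings come from the excerpt; the log path is only a placeholder for wherever the installer/must-gather logs were saved:

#!/usr/bin/env python3
"""Scan a saved log for a 'stop' command to the keepalived control socket
immediately followed by a 'reload', the ordering called out above."""

import sys

STOP_MARKER = "Keepalived container control socket: stop"
RELOAD_MARKER = "Keepalived container control socket: reload"


def find_stop_then_reload(path: str) -> list[tuple[int, str, str]]:
    """Return (line number, stop line, reload line) triples where a reload
    is logged right after a stop."""
    hits = []
    prev_line = ""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if RELOAD_MARKER in line and STOP_MARKER in prev_line:
                hits.append((lineno, prev_line.rstrip(), line.rstrip()))
            prev_line = line
    return hits


if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "monitor.log"   # placeholder path
    for lineno, stop, reload_ in find_stop_then_reload(log):
        print(f"line {lineno}: reload sent right after stop")
        print(f"  {stop}")
        print(f"  {reload_}")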

Comment 3 Ben Nemec 2022-02-02 21:37:05 UTC

*** This bug has been marked as a duplicate of bug 2022050 ***

Comment 4 Nick Carboni 2022-02-03 13:17:19 UTC
This was also seen in a 4.8 install. Is it worth also backporting there?
Or do you think that one was a separate issue?

Comment 5 Ben Nemec 2022-02-03 16:25:27 UTC
I've proposed a backport to 4.8, but the 4.8 logs look different. From what I can tell, there are two interfaces on the VIP subnet, which causes keepalived to bounce the VIP back and forth between them. That configuration won't work because we have no way to know which interface is supposed to be used; there should only be one interface on the node with an address on the VIP subnet.
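
To check for that condition on a node, here is a minimal sketch (Python) that lists interfaces with an address on the VIP subnet and flags the case where more than one matches. The subnet below is a placeholder, not the subnet from the attached logs:

#!/usr/bin/env python3
"""Flag nodes where more than one interface has an address on the VIP
subnet, which leaves keepalived with no way to pick the right one."""

import ipaddress
import subprocess

VIP_SUBNET = ipaddress.ip_network("192.0.2.0/24")   # hypothetical VIP subnet


def interfaces_on_subnet(subnet) -> dict[str, list[str]]:
    """Map interface name -> addresses that fall inside the VIP subnet."""
    out = subprocess.run(["ip", "-o", "-4", "addr", "show"],
                         capture_output=True, text=True, check=True).stdout
    found: dict[str, list[str]] = {}
    for line in out.splitlines():
        # Example line: "2: ens3    inet 192.0.2.10/24 brd 192.0.2.255 ..."
        fields = line.split()
        ifname, addr = fields[1], fields[3]
        if ipaddress.ip_interface(addr).ip in subnet:
            found.setdefault(ifname, []).append(addr)
    return found


if __name__ == "__main__":
    matches = interfaces_on_subnet(VIP_SUBNET)
    for name, addrs in matches.items():
        print(f"{name}: {', '.join(addrs)}")
    if len(matches) > 1:
        print("More than one interface has an address on the VIP subnet -- "
              "keepalived cannot tell which one should carry the VIP.")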