Bug 2049892 - Installation fails due to API vip on control plane nodes
Summary: Installation fails due to API vip on control plane nodes
Keywords:
Status: CLOSED DUPLICATE of bug 2022050
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Bob Fournier
QA Contact: Victor Voronkov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-02 21:00 UTC by Nick Carboni
Modified: 2022-02-03 16:25 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-02-02 21:37:05 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Nick Carboni 2022-02-02 21:00:41 UTC
Created attachment 1858758 [details]
Logs from 4.9 failure

Description of problem:

During installation, the API VIP is being assigned to control plane nodes even though the API server is not yet running there (it is still running on the bootstrap node).

Version-Release number of MCO (Machine Config Operator) (if applicable):
4.9.9 and 4.8.22

Platform (AWS, VSphere, Metal, etc.):
Metal - Specifically the assisted-installer SaaS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Not sure

How reproducible:
Unsure - we've seen two similar failures in the assisted installer SaaS in the last week or so.

Actual results:

During installation the API VIP is assigned to a control plane node while the API server is still running only on the bootstrap, and the installation fails.
Expected results:

The API VIP should only be assigned to the node running the API server.

Additional info:

Added logs for both instances of the problem we've seen.
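
For reference, a minimal sketch of the kind of readiness probe that should gate VIP ownership: only a node whose local API server answers should be eligible to hold the VIP. The endpoint, port, and exit-code convention below are assumptions for illustration, not the actual OpenShift keepalived check script.

// apicheck.go - illustrative only: a probe of the kind that should gate
// whether a node may claim the API VIP. Endpoint and timeout are assumed.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// The apiserver may serve a self-signed cert during bootstrap,
			// so certificate verification is skipped for this probe.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// Only a node whose local kube-apiserver answers readyz should be
	// eligible to hold the API VIP.
	resp, err := client.Get("https://localhost:6443/readyz")
	if err != nil {
		fmt.Fprintf(os.Stderr, "local API server not reachable: %v\n", err)
		os.Exit(1) // non-zero exit => check fails => do not take the VIP
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "local API server not ready: %s\n", resp.Status)
		os.Exit(1)
	}
	fmt.Println("local API server ready; node may hold the VIP")
}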

Comment 2 Ben Nemec 2022-02-02 21:36:36 UTC
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2022050. See the following sequence of log messages on the bootstrap:

time="2022-01-30T06:17:24Z" level=info msg="Command message successfully sent to Keepalived container control socket: stop\n"
time="2022-01-30T06:17:24Z" level=info msg="Command message successfully sent to Keepalived container control socket: reload\n"

The client sent: stop
Sun Jan 30 06:17:24 2022: Stopping
The client sent: reload
Sun Jan 30 06:17:24 2022: (tm-nc-oam-et-rm17-rack02_API) sent 0 priority
Sun Jan 30 06:17:24 2022: (tm-nc-oam-et-rm17-rack02_API) removing VIPs.
Sun Jan 30 06:17:25 2022: Stopped - used 0.017990 user time, 0.069454 system time
Sun Jan 30 06:17:25 2022: CPU usage (self/children) user: 0.008299/0.018071 system: 0.006205/0.070616
Sun Jan 30 06:17:25 2022: Stopped Keepalived v2.1.5 (07/13,2020)

I've proposed a backport of the fix to 4.9, which should take care of this.
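
For context, a minimal sketch of how commands like the ones in the log above might be sent over a unix control socket; the socket path and wire format here are assumptions for illustration, not the actual baremetal-runtimecfg code. The point is that once the "stop" is processed, keepalived sends priority 0 and removes the VIP, so the trailing "reload" cannot restore it.

// keepalivedctl.go - illustrative sketch of sending control commands to a
// keepalived monitor over a unix socket. Socket path and protocol are assumed.
package main

import (
	"fmt"
	"net"
	"os"
)

const controlSocket = "/run/keepalived/keepalived.sock" // assumed path

func sendCommand(cmd string) error {
	conn, err := net.Dial("unix", controlSocket)
	if err != nil {
		return fmt.Errorf("dial control socket: %w", err)
	}
	defer conn.Close()

	// Commands are plain newline-terminated strings in this sketch.
	if _, err := conn.Write([]byte(cmd + "\n")); err != nil {
		return fmt.Errorf("send %q: %w", cmd, err)
	}
	return nil
}

func main() {
	// The log shows "stop" immediately followed by "reload" at the same
	// timestamp; after the stop is handled, the reload has nothing to act on.
	for _, cmd := range []string{"stop", "reload"} {
		if err := sendCommand(cmd); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("sent %q to keepalived control socket\n", cmd)
	}
}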

Comment 3 Ben Nemec 2022-02-02 21:37:05 UTC

*** This bug has been marked as a duplicate of bug 2022050 ***

Comment 4 Nick Carboni 2022-02-03 13:17:19 UTC
This was also seen in a 4.8 install. Is it worth also backporting there?
Or do you think that one was a separate issue?

Comment 5 Ben Nemec 2022-02-03 16:25:27 UTC
I've proposed a backport to 4.8 as well, but the 4.8 logs look different. From what I can tell, there are two interfaces on the VIP subnet, which causes keepalived to bounce back and forth between them. I don't think that configuration can work, because we have no way to know which interface is supposed to be used. There should only be one interface on the node with an address on the VIP subnet.
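
To illustrate the ambiguity, here is a minimal sketch of interface selection on the VIP subnet; the subnet value is a placeholder and this is not the actual keepalived/runtimecfg logic. With two matching interfaces there is no way to pick the right one, which is the 4.8 failure mode described above.

// vipiface.go - illustrative sketch: find which local interface has an
// address in the VIP subnet. The CIDR below is a placeholder.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Placeholder VIP subnet; the real value comes from the install config.
	_, vipNet, err := net.ParseCIDR("192.0.2.0/24")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	ifaces, err := net.Interfaces()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	var matches []string
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			continue
		}
		for _, addr := range addrs {
			if ipNet, ok := addr.(*net.IPNet); ok && vipNet.Contains(ipNet.IP) {
				matches = append(matches, iface.Name)
				break
			}
		}
	}

	switch len(matches) {
	case 0:
		fmt.Println("no interface on the VIP subnet; this node cannot hold the VIP")
	case 1:
		fmt.Printf("VIP interface: %s\n", matches[0])
	default:
		// The ambiguous case from the 4.8 logs: there is no way to know
		// which of these interfaces should carry the VIP.
		fmt.Printf("ambiguous: multiple interfaces on the VIP subnet: %v\n", matches)
	}
}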

