Created attachment 1858758 [details]
Logs from 4.9 failure

Description of problem:
During installation the API VIP is being assigned to control plane nodes even though the API server is not running there yet (it is running on the bootstrap).

Version-Release number of MCO (Machine Config Operator) (if applicable):
4.9.9 and 4.8.22

Platform (AWS, VSphere, Metal, etc.):
Metal - Specifically the assisted-installer SaaS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):
Not sure

How reproducible:
Unsure - we've seen two similar failures in the assisted installer SaaS in the last week or so.

Actual results:

Expected results:
The API VIP should only be assigned to the node running the API server

Additional info:
Added logs for both instances of the problem we've seen.
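To illustrate the expected behavior above, here is a minimal Go sketch of a keepalived-style track script that only reports healthy when a local API server answers, so a node would only be eligible to hold the API VIP while it is actually serving the API. This is an illustration only, not the check shipped by the MCO/baremetal-runtimecfg; the port (6443) and the /readyz path are assumptions based on the default kube-apiserver setup.

// apicheck.go - illustrative track-script-style check: exit 0 only when a
// local kube-apiserver answers, so keepalived would only raise priority
// (and claim the API VIP) on a node that is actually serving the API.
package main

import (
	"crypto/tls"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// The apiserver serves its own cert; for this kind of probe we
			// only care that something is answering on the API port.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// Assumed endpoint: kube-apiserver readiness on the local node.
	resp, err := client.Get("https://localhost:6443/readyz")
	if err != nil {
		os.Exit(1) // no local API server -> do not claim the VIP
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		os.Exit(1)
	}
	os.Exit(0) // local API server is healthy -> eligible to hold the VIP
}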
This looks like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2022050 . See the following sequence of log messages on the bootstrap:

time="2022-01-30T06:17:24Z" level=info msg="Command message successfully sent to Keepalived container control socket: stop\n"
time="2022-01-30T06:17:24Z" level=info msg="Command message successfully sent to Keepalived container control socket: reload\n"

The client sent: stop
Sun Jan 30 06:17:24 2022: Stopping
The client sent: reload
Sun Jan 30 06:17:24 2022: (tm-nc-oam-et-rm17-rack02_API) sent 0 priority
Sun Jan 30 06:17:24 2022: (tm-nc-oam-et-rm17-rack02_API) removing VIPs.
Sun Jan 30 06:17:25 2022: Stopped - used 0.017990 user time, 0.069454 system time
Sun Jan 30 06:17:25 2022: CPU usage (self/children) user: 0.008299/0.018071 system: 0.006205/0.070616
Sun Jan 30 06:17:25 2022: Stopped Keepalived v2.1.5 (07/13,2020)

I've proposed a backport of the fix to 4.9 which should take care of this.
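For readers unfamiliar with the log lines above: the monitor process drives the keepalived container through a control socket, and the log shows "stop" followed by "reload" being sent, after which keepalived drops the VIP. A rough Go sketch of that kind of interaction is below; the socket path is a placeholder assumption, not necessarily the path the OpenShift monitor actually uses.

// keepalivedctl.go - rough sketch of sending a command ("reload", "stop") to a
// keepalived control socket over a unix-domain connection. The socket path is
// a placeholder; it is not necessarily what the OpenShift monitor uses.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func sendCommand(socketPath, cmd string) error {
	conn, err := net.DialTimeout("unix", socketPath, 2*time.Second)
	if err != nil {
		return fmt.Errorf("connecting to %s: %w", socketPath, err)
	}
	defer conn.Close()

	// Commands are newline-terminated, matching the "stop\n" / "reload\n"
	// strings visible in the monitor log above.
	if _, err := conn.Write([]byte(cmd + "\n")); err != nil {
		return fmt.Errorf("sending %q: %w", cmd, err)
	}
	return nil
}

func main() {
	const socketPath = "/var/run/keepalived/keepalived.sock" // hypothetical path
	if err := sendCommand(socketPath, "reload"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}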
*** This bug has been marked as a duplicate of bug 2022050 ***
This was also seen in a 4.8 install. Is it worth also backporting there? Or do you think that one was a separate issue?
I've proposed a backport to 4.8, but the 4.8 logs look different. From what I can tell, there are two interfaces with addresses on the VIP subnet, and that is causing keepalived to bounce back and forth between them. I don't think that setup can work, because we have no way to know which interface is supposed to be used. There should only be one interface on the node with an address on the VIP subnet.
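To make that last point easy to check on a node, here is a small Go diagnostic sketch that lists every interface carrying an address inside the VIP subnet; in the expected configuration exactly one interface should be printed, and two or more matches the bouncing behavior described above. The subnet below is an example value, not taken from the failing node.

// vipsubnet.go - diagnostic sketch: list every interface that has an address
// inside the VIP subnet. Exactly one match is expected; two or more matches
// the "bouncing" behavior described above. The subnet is an example value.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	_, vipNet, err := net.ParseCIDR("192.168.111.0/24") // assumed VIP subnet
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	ifaces, err := net.Interfaces()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	matches := 0
	for _, iface := range ifaces {
		addrs, err := iface.Addrs()
		if err != nil {
			continue
		}
		for _, addr := range addrs {
			if ipNet, ok := addr.(*net.IPNet); ok && vipNet.Contains(ipNet.IP) {
				fmt.Printf("%s has %s on the VIP subnet\n", iface.Name, ipNet.IP)
				matches++
			}
		}
	}

	if matches > 1 {
		fmt.Println("more than one interface on the VIP subnet: keepalived cannot pick one reliably")
	}
}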