Bug 1950316 - [IPI baremetal] installation without worker randomly fails due to ingress vip
Summary: [IPI baremetal] installation without worker randomly fails due to ingress vip
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Yossi Boaron
QA Contact: Nataf Sharabi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-16 11:51 UTC by Borja Aranda
Modified: 2022-06-15 17:07 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-01 16:39:03 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather (13.20 MB, application/gzip)
2021-04-16 12:14 UTC, Borja Aranda

Description Borja Aranda 2021-04-16 11:51:24 UTC
### Version
~~~
# openshift-baremetal-install version
openshift-baremetal-install 4.8.0-0.nightly-2021-04-13-091654
built from commit 65c320503d99a3eddffca3c35d74722b4f7c7ef1
release image registry.ci.openshift.org/ocp/release@sha256:24f70d82374d531d067bccfef026e7ddb075056409fc3c70e2bca02083df08f5
~~~

### Platform
Baremetal IPI


### Issue

In 4.8, baremetal IPI installation uses keepalived to manage the Ingress VIP.

If the environment does not have worker nodes, keepalived will place the ingress VIP on one of the three master nodes.

By default, the ingress operator deploys only two replicas of the default router, which can randomly trigger the bug:

If the routers are *not* deployed on the master holding the ingress VIP, then the cluster operators that depend on routes (such as authentication or console) cannot complete their rollout and fail, preventing the cluster from installing successfully until a router is deployed on the master holding the Ingress VIP.
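
For reference, a quick way to see where the two router replicas landed relative to the masters (the commands below are illustrative):
~~~
# Default router deployment: only 2 replicas by default
oc -n openshift-ingress get deployment router-default

# Which nodes those replicas were scheduled on
oc -n openshift-ingress get pods -o wide
~~~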

Comment 1 Borja Aranda 2021-04-16 12:14:21 UTC
Created attachment 1772422 [details]
must-gather

Comment 2 Ben Nemec 2021-04-22 20:51:41 UTC
Are you sure the failure in this must-gather was related to the VIP? I'm seeing the "too many requests" error in the authentication pod logs, not a problem talking to ingress. Also, looking at the ingress pods, it appears the VIP correctly transitioned when new ingress pods were started. I'm not sure what prompted the change in ingress, but I see the pods started around 11:59 and the VIP transitioned from the .21 node to the .20 node at that time, which makes sense because neither of the ingress pods was running on .21 at that point. It's hard to say what was happening before that because we don't have logs from before the ingress pods restarted, but it suggests the healthcheck and VIP placement were working correctly.

The other thing I will note is that this deployment used a release from before the double VIP fix merged. It's possible that bug would manifest as the VIP being on a node without ingress. You would have to look at all three masters and see if two of them have the VIP configured. If so, then it's the same bug which has already been fixed in later releases.

So to make progress on this, I need to see a must-gather where the failure was caused by an issue talking to ingress, and I would also like to know whether the problem still occurs on a build from this week, since the double VIP fix merged last week.
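
If it helps, a rough way to check that (the command is a sketch; <master-name> and <INGRESS_VIP> are placeholders for the real node names and ingress VIP):
~~~
# Run against each of the three masters; the VIP should appear on exactly one of them
oc debug node/<master-name> -- chroot /host ip -o -4 addr show | grep <INGRESS_VIP>
~~~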

Comment 3 Ben Nemec 2021-05-11 16:53:35 UTC
Any update? Is this still happening? We did fix one duplicate VIP issue since this was opened. We're currently working on another, although I don't think that would affect the ingress VIP.

Comment 4 Yossi Boaron 2021-05-26 12:28:35 UTC
Keepalived for the ingress VIP doesn't select one of the nodes randomly; it should pick a node that a router pod is running on (see [1] and the sketch below), so even for a deployment with 0 worker nodes the ingress VIP is supposed to land on the right master node.

As Ben mentioned, we solved a bug [2] related to VIP management that could cause a node without a router pod to hold the VIP.

Please let us know if you are still having this issue.


[1] https://github.com/openshift/machine-config-operator/blob/master/templates/master/00-master/on-prem/files/keepalived-keepalived.yaml#L39-#L43
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1931505
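
In other words, keepalived's priority for the ingress VIP tracks a check that only passes on nodes where the default router is actually serving. A minimal sketch of that kind of check (the real script referenced in [1] differs; the localhost:1936/healthz endpoint used here is an assumption):
~~~
# Succeeds (exit 0) only if the local default router answers its health check
curl -s -o /dev/null -f http://localhost:1936/healthz
~~~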

Comment 5 Kiran Thyagaraja 2021-06-01 16:39:03 UTC
Closing this for now. The needinfo request has been overdue for more than a month. Please feel free to reopen.

