Bug 2093990 - [Submariner] - Deployment of 2 gateway nodes on aws platform fails
Summary: [Submariner] - Deployment of 2 gateway nodes on aws platform fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Advanced Cluster Management for Kubernetes
Classification: Red Hat
Component: Submariner
Version: rhacm-2.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: rhacm-2.6
Assignee: Mike Kolesnik
QA Contact: Maxim Babushkin
Docs Contact: Christopher Dawson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-06 14:10 UTC by Maxim Babushkin
Modified: 2022-10-03 20:23 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-03 20:23:57 UTC
Target Upstream Version:
Embargoed:
bot-tracker-sync: rhacm-2.6+


Attachments (Terms of Use)
Submariner gather logs (613.78 KB, application/gzip)
2022-06-06 14:10 UTC, Maxim Babushkin


Links
System ID Private Priority Status Summary Last Updated
Github stolostron backlog issues 23007 0 None None None 2022-06-06 17:40:18 UTC
Github stolostron submariner-addon issues 386 0 None closed cloud prepare for AWS is invoked periodically 2022-07-07 13:00:32 UTC
Github stolostron submariner-addon pull 411 0 None Merged Skip syncing unchanged submariner configs 2022-07-07 13:00:32 UTC
Github stolostron submariner-addon pull 416 0 None Merged [release-2.5] Skip syncing unchanged submariner configs 2022-07-17 10:05:12 UTC

Description Maxim Babushkin 2022-06-06 14:10:53 UTC
Created attachment 1887192 [details]
Submariner gather logs

**What happened**:
ACM 2.5
Submariner 0.12.1
Deployment platforms: AWS and GCP

Deployment of submariner with 2 gateway nodes.
On GCP platform, 2 GW nodes deployed and labeled successfully.
But on the AWS platform, the following occurred:
1) The gateway node VM was created
2) It was labeled with the gateway label
3) After 10-15 seconds, the VM was destroyed
4) The gateway pod started and was destroyed after a few seconds
5) The whole sequence restarted in a loop

**What you expected to happen**:
The AWS platform should successfully create and label the required number of gateway nodes (2 in our case).
When deploying with 1 gateway node, everything works without any issue.

**How to reproduce it (as minimally and precisely as possible)**:
Deploy submariner from the acm addon and specify 2 gateways for each cluster.

**Environment**:
- Submariner version (use `subctl version`): 0.12.1
- Kubernetes version (use `kubectl version`): 4.9.36
- Diagnose information (use `subctl diagnose all`): Log attached
- Gather information (use `subctl gather`) Log attached
- Cloud provider or hardware configuration: AWS provider
- Install tools: Addon deployment from ACM hub

Comment 6 Mike Kolesnik 2022-06-09 12:45:38 UTC
I have a hypothesis on why this is happening (though realistically this should be happening even with one gateway cluster, I'm not sure why that seems to work).

AWS preparation is handled on the hub, which does so successfully.
However, when the spoke tries to sync (probably due to some update event on one of the relevant resources), it counts the gateway nodes and miscounts them due to this lookup in `updateGatewayStatus`: https://github.com/stolostron/submariner-addon/blob/main/pkg/spoke/submarineragent/config_controller.go#L528-L531

Then the addon on the spoke updates the status of the submariner config with a "failed" condition due to "InsufficientNodes".

This (probably) causes the hub controller to try to resync the operation (or it could simply be reacting to the event from the submariner config being updated with the failed condition). This results in an endless loop, since the controller on the hub runs cloud prepare again (and since the addon uses manifest work for deployment, it might not be idempotent, causing the MachineSet to be destroyed and recreated).

This is wrong, since the nodes are actually labeled with "submariner.io/gateway" but not necessarily with "node-role.kubernetes.io/worker" (the MachineSet used by cloud-prepare doesn't label the node as a worker, so that the node stays dedicated).

Therefore, if we fix the code in `updateGatewayStatus` to only look for nodes with the gateway label, this should work as intended.
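The miscount described above can be sketched as follows. This is a minimal, self-contained Go sketch: the `node` type, the two counting helpers, and the sample data are hypothetical simplifications of the real controller code, which operates on `corev1.Node` objects via a lister.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a Kubernetes node's labels.
type node struct {
	labels map[string]string
}

const (
	gatewayLabel = "submariner.io/gateway"
	workerLabel  = "node-role.kubernetes.io/worker"
)

// countGatewayNodesBuggy mirrors the miscount: it requires both the worker
// role and the gateway label, so dedicated gateway nodes created by
// cloud-prepare (which lack the worker label) are never counted.
func countGatewayNodesBuggy(nodes []node) int {
	count := 0
	for _, n := range nodes {
		if _, worker := n.labels[workerLabel]; !worker {
			continue
		}
		if n.labels[gatewayLabel] == "true" {
			count++
		}
	}
	return count
}

// countGatewayNodesFixed implements the suggested fix: look only for the
// gateway label.
func countGatewayNodesFixed(nodes []node) int {
	count := 0
	for _, n := range nodes {
		if n.labels[gatewayLabel] == "true" {
			count++
		}
	}
	return count
}

func main() {
	// Two dedicated gateway nodes (gateway label set, no worker role)
	// plus one plain worker node.
	nodes := []node{
		{labels: map[string]string{gatewayLabel: "true"}},
		{labels: map[string]string{gatewayLabel: "true"}},
		{labels: map[string]string{workerLabel: ""}},
	}
	fmt.Println(countGatewayNodesBuggy(nodes), countGatewayNodesFixed(nodes))
}
```

With two dedicated gateway nodes present, the buggy count reports zero (triggering "InsufficientNodes") while the fixed count reports two.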

For comparison, I did a cloud prepare with `subctl` and it was behaving as expected. Also, it's clear from the attached hub log that the cloud prepare worked as expected.

Additionally, the hub controller might be inefficiently trying to re-prepare the cluster on each sync; it should probably have some checks to skip preparing the cluster if it's fine (not really related to this bug, but may be worth checking).
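One possible shape for such a skip check, sketched in Go under the assumption that the controller records a hash of the last applied config (the `submarinerConfigSpec` type and the helper names here are hypothetical, not the actual submariner-addon API):

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// submarinerConfigSpec is a hypothetical, trimmed-down stand-in for the
// config fields the hub controller acts on.
type submarinerConfigSpec struct {
	Gateways     int    `json:"gateways"`
	InstanceType string `json:"instanceType"`
}

// specHash returns a stable hash of the spec. The controller could store it
// (e.g. in an annotation) after a successful cloud prepare.
func specHash(spec submarinerConfigSpec) string {
	raw, _ := json.Marshal(spec)
	return fmt.Sprintf("%x", sha256.Sum256(raw))
}

// shouldSync reports whether cloud prepare needs to run again: only when
// the spec differs from the last successfully applied one.
func shouldSync(lastApplied string, spec submarinerConfigSpec) bool {
	return lastApplied != specHash(spec)
}

func main() {
	spec := submarinerConfigSpec{Gateways: 2, InstanceType: "c5d.large"}
	applied := specHash(spec)

	// Unchanged spec: skip cloud prepare instead of re-running it.
	fmt.Println(shouldSync(applied, spec))
	// Changed spec (3 gateways): sync is needed.
	fmt.Println(shouldSync(applied, submarinerConfigSpec{Gateways: 3, InstanceType: "c5d.large"}))
}
```

A check along these lines would break the feedback loop described in comment 6, since a "failed" status update alone would no longer cause the MachineSet to be destroyed and recreated.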

Comment 7 Nir Yechiel 2022-06-15 08:54:15 UTC
This one seems to be related --> cloud prepare for AWS is invoked periodically: https://github.com/stolostron/submariner-addon/issues/386

Comment 9 Maxim Babushkin 2022-08-05 05:07:58 UTC
The fix has been verified both on 2.6/0.13.0 and 2.5.2/0.12.2 versions.

