Created attachment 1887192 [details]
Submariner gather logs

**What happened**:

ACM 2.5, Submariner 0.12.1
Deployment platforms: AWS and GCP

Deployment of Submariner with 2 gateway nodes. On the GCP platform, both gateway nodes were deployed and labeled successfully. But on the AWS platform, the following steps occurred:
1) A gateway node VM was created
2) It was labeled with the gateway label
3) After 10-15 seconds, the VM was destroyed
4) The gateway pod started and was destroyed after a few seconds
5) Everything started again in a loop

**What you expected to happen**:

The AWS platform should successfully create and label the required number of gateway nodes (2 in our case). When deploying with 1 gateway node, everything works without any issue.

**How to reproduce it (as minimally and precisely as possible)**:

Deploy Submariner from the ACM addon and specify 2 gateways for each cluster.

**Environment**:
- Submariner version (use `subctl version`): 0.12.1
- Kubernetes version (use `kubectl version`): 4.9.36
- Diagnose information (use `subctl diagnose all`): log attached
- Gather information (use `subctl gather`): log attached
- Cloud provider or hardware configuration: AWS
- Install tools: addon deployment from ACM hub
I have a hypothesis on why this is happening (though realistically it should also happen with a single gateway; I'm not sure why that case seems to work).

AWS preparation is handled on the hub, and it completes successfully. However, when the spoke tries to sync (probably due to an update event on one of the relevant resources), it miscounts the number of gateway nodes because of this lookup in `updateGatewayStatus`: https://github.com/stolostron/submariner-addon/blob/main/pkg/spoke/submarineragent/config_controller.go#L528-L531

The addon on the spoke then updates the status of the Submariner config with a "failed" condition due to "InsufficientNodes". This (probably) causes the hub controller to try to resync the operation (or it could simply be reacting to the event generated by the Submariner config being updated with the failed condition). The result is an endless loop: the controller on the hub runs cloud prepare again, and since the addon uses ManifestWork for deployment, the operation might not be idempotent, causing the MachineSet to be destroyed and recreated.

The lookup is wrong because nodes are actually labeled with "submariner.io/gateway" but not necessarily with "node-role.kubernetes.io/worker" (the MachineSet used by cloud-prepare doesn't apply the worker label, so that the node is a dedicated gateway node). Therefore, if we fix the code in `updateGatewayStatus` to look only for nodes with the gateway label, this should work as intended.

For comparison, I ran cloud prepare with `subctl` and it behaved as expected. It's also clear from the attached hub log that cloud prepare worked as expected.

Additionally, the hub controller might be inefficiently re-preparing the cluster on each sync; it should probably have a check to skip preparing the cluster if it's already fine (not really related to this bug, but may be worth checking).
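To make the proposed fix concrete, here is a minimal sketch of the counting logic described above. The `node` type, `countGateways` helper, and the sample node set are hypothetical stand-ins for the addon's real types; only the two label keys come from the report. The point is that a dedicated gateway node carries `submariner.io/gateway` but not `node-role.kubernetes.io/worker`, so requiring both labels undercounts:

```go
package main

import "fmt"

const (
	// Label applied by cloud prepare to gateway nodes.
	gatewayLabel = "submariner.io/gateway"
	// Worker role label, absent on dedicated gateway nodes.
	workerLabel = "node-role.kubernetes.io/worker"
)

// node is a hypothetical stand-in for a Kubernetes node.
type node struct {
	name   string
	labels map[string]string
}

// countGateways mirrors the proposed fix: count nodes carrying the
// gateway label only, without also requiring the worker role label.
func countGateways(nodes []node) int {
	count := 0
	for _, n := range nodes {
		if n.labels[gatewayLabel] == "true" {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		// Dedicated gateway: no worker role label.
		{"gw-1", map[string]string{gatewayLabel: "true"}},
		// Gateway that also happens to be a worker.
		{"gw-2", map[string]string{gatewayLabel: "true", workerLabel: ""}},
		// Plain worker, not a gateway.
		{"worker-1", map[string]string{workerLabel: ""}},
	}
	fmt.Println(countGateways(nodes)) // prints 2
}
```

A lookup that additionally requires `workerLabel` would report only 1 gateway for the same node set, triggering the "InsufficientNodes" condition even though two gateways exist.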
This issue seems to be related (cloud prepare for AWS is invoked periodically): https://github.com/stolostron/submariner-addon/issues/386
The fix has been verified on both the 2.6/0.13.0 and 2.5.2/0.12.2 versions.