Created attachment 1887192 [details]
Submariner gather logs

**What happened**:

ACM 2.5, Submariner 0.12.1
Deployment platforms: AWS and GCP

Deployment of Submariner with 2 gateway nodes. On the GCP platform, both gateway nodes were deployed and labeled successfully. But on the AWS platform, the following steps occurred:
1) A gateway node VM was created
2) It was labeled with the gateway label
3) After 10-15 seconds, the VM was destroyed
4) The gateway pod started and was destroyed after a few seconds
5) Everything started again in a loop

**What you expected to happen**:

The AWS platform should successfully create and label the required number of gateway nodes (2 in our case). When deploying with 1 gateway node, everything works without any issue.

**How to reproduce it (as minimally and precisely as possible)**:

Deploy Submariner from the ACM addon and specify 2 gateways for each cluster.

**Environment**:
- Submariner version (use `subctl version`): 0.12.1
- Kubernetes version (use `kubectl version`): 4.9.36
- Diagnose information (use `subctl diagnose all`): log attached
- Gather information (use `subctl gather`): log attached
- Cloud provider or hardware configuration: AWS
- Install tools: addon deployment from ACM hub
I have a hypothesis on why this is happening (though realistically it should also happen with a single gateway; I'm not sure why that case seems to work).

AWS preparation is handled on the hub, and it completes successfully. However, when the spoke tries to sync (probably due to an update event on one of the relevant resources), it miscounts the number of gateway nodes because of this lookup in `updateGatewayStatus`: https://github.com/stolostron/submariner-addon/blob/main/pkg/spoke/submarineragent/config_controller.go#L528-L531

The addon on the spoke then updates the status of the Submariner config with a "failed" condition due to "InsufficientNodes". This (probably) causes the hub controller to try to resync the operation (or it could simply be reacting to the event generated by the Submariner config being updated with the failed condition). The result is an endless loop: the controller on the hub runs cloud prepare again, and since the addon uses ManifestWork for deployment, the operation might not be idempotent, causing the MachineSet to be destroyed and recreated.

The lookup is wrong because nodes are actually labeled with "submariner.io/gateway" but not necessarily with "node-role.kubernetes.io/worker" (the MachineSet used by cloud-prepare doesn't apply the worker label, so that the node is a dedicated gateway node). Therefore, if we fix the code in `updateGatewayStatus` to look only for nodes with the gateway label, this should work as intended.

For comparison, I ran cloud prepare with `subctl` and it behaved as expected. It's also clear from the attached hub log that cloud prepare worked as expected.

Additionally, the hub controller might be inefficiently re-preparing the cluster on each sync; it should probably have a check to skip preparing the cluster if it's already fine (not really related to this bug, but may be worth checking).
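To make the proposed fix concrete, here is a minimal sketch of the counting logic described above. The `node` type, `countGateways` helper, and the sample node set are hypothetical stand-ins for the addon's real types; only the two label keys come from the report. The point is that a dedicated gateway node carries `submariner.io/gateway` but not `node-role.kubernetes.io/worker`, so requiring both labels undercounts:

```go
package main

import "fmt"

const (
	// Label applied by cloud prepare to gateway nodes.
	gatewayLabel = "submariner.io/gateway"
	// Worker role label, absent on dedicated gateway nodes.
	workerLabel = "node-role.kubernetes.io/worker"
)

// node is a hypothetical stand-in for a Kubernetes node.
type node struct {
	name   string
	labels map[string]string
}

// countGateways mirrors the proposed fix: count nodes carrying the
// gateway label only, without also requiring the worker role label.
func countGateways(nodes []node) int {
	count := 0
	for _, n := range nodes {
		if n.labels[gatewayLabel] == "true" {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		// Dedicated gateway: no worker role label.
		{"gw-1", map[string]string{gatewayLabel: "true"}},
		// Gateway that also happens to be a worker.
		{"gw-2", map[string]string{gatewayLabel: "true", workerLabel: ""}},
		// Plain worker, not a gateway.
		{"worker-1", map[string]string{workerLabel: ""}},
	}
	fmt.Println(countGateways(nodes)) // prints 2
}
```

A lookup that additionally requires `workerLabel` would report only 1 gateway for the same node set, triggering the "InsufficientNodes" condition even though two gateways exist.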
This issue seems to be related (cloud prepare for AWS is invoked periodically): https://github.com/stolostron/submariner-addon/issues/386
The fix has been verified on both the 2.6/0.13.0 and 2.5.2/0.12.2 versions.