Bug 1970464

Summary: Google Cloud does not reflect the correct information for the new master created after restoring an etcd member in OCP
Product: OpenShift Container Platform
Component: Cloud Compute
Sub component: Other Providers
Version: 4.6
Reporter: Pamela Escorza <pescorza>
Assignee: Joel Speed <jspeed>
QA Contact: sunzhaohua <zhsun>
CC: aos-bugs, jspeed, mimccune, rsandu, wking
Status: CLOSED DEFERRED
Severity: medium
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Last Closed: 2021-07-06 14:43:17 UTC
Type: Bug

Description Pamela Escorza 2021-06-10 14:08:20 UTC
Description of problem:
In an OCP IPI installation on GCP, each master belongs to an instance group created per zone:

 $ gcloud compute instance-groups list --sort-by=Name
NAME                                     LOCATION        SCOPE  NETWORK                    MANAGED  INSTANCES
k8s-ig--9dbb74ab8cd189b6                 europe-west2-c  zone   vcp-1                      No       2           ---> worker,infra
k8s-ig--9dbb74ab8cd189b6                 europe-west2-b  zone   vcp-1                      No       2           ---> worker,infra
k8s-ig--9dbb74ab8cd189b6                 europe-west2-a  zone   vcp-1                      No       2           ---> worker,infra
ocp-int-79462-master-europe-west2-a      europe-west2-a  zone   vcp-1                      No       1           ---> master
ocp-int-79462-master-europe-west2-b      europe-west2-b  zone   vcp-1                      No       1           ---> master
ocp-int-79462-master-europe-west2-c      europe-west2-c  zone   vcp-1                      No       1           ---> master

The load balancer created for the internal API:

ocp-int-79462-api-internal
Frontend
  Protocol:           TCP
  Scope:              europe-west2 
  Subnetwork          vcp1-we1nip (10.17.0.0/16)
  IP:Ports            10.17.0.25:6443,22623
Backend
  Region:             europe-west2 
  Network:            vcp-1
  Endpoint protocol:  TCP 
  Session affinity:     None 
  Health check:         ocp-int-79462-api-internal

  Instance group                          Zone                  Healthy        Autoscaling         Use as failover group
  ocp-int-79462-master-europe-west2-a	  europe-west2-a        1 of 1         No configuration	   No 	
  ocp-int-79462-master-europe-west2-b	  europe-west2-b 	    1 of 1         No configuration	   No 	
  ocp-int-79462-master-europe-west2-c	  europe-west2-c        1 of 1         No configuration	   No 	

After restoring an etcd member as per the documentation:
https://docs.openshift.com/container-platform/4.6/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

The restored master is not reflected at all as part of the load balancer for the internal API:
  Instance group                          Zone                  Healthy        Autoscaling         Use as failover group
  ocp-int-79462-master-europe-west2-a	  europe-west2-a        1 of 1         No configuration	   No 	
  ocp-int-79462-master-europe-west2-b	  europe-west2-b 	    0 of 0         No configuration	   No 	
  ocp-int-79462-master-europe-west2-c	  europe-west2-c        1 of 1         No configuration	   No 	

Instead, the restored master belongs to a worker instance group:

k8s-ig--9dbb74ab8cd189b6                 europe-west2-c  zone   vcp-1                      No       2           ---> worker,infra
k8s-ig--9dbb74ab8cd189b6                 europe-west2-b  zone   vcp-1                      No       3           ---> worker,infra and the new master node
k8s-ig--9dbb74ab8cd189b6                 europe-west2-a  zone   vcp-1                      No       2           ---> worker,infra

This bug has been opened to verify the correct procedure for getting the restored master back into the master instance group where it belongs.
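
For reference, group membership can be confirmed directly with gcloud (a sketch; the group name and zone here are taken from the listing above):

~~~
$ gcloud compute instance-groups list-instances k8s-ig--9dbb74ab8cd189b6 \
    --zone=europe-west2-b
~~~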


Version-Release number of selected component (if applicable):
OpenShift 4.6 IPI on Google Cloud 

How reproducible:
Deploy an IPI OCP 4.6 cluster on GCP and restore an etcd member following the procedure in the documentation:
https://docs.openshift.com/container-platform/4.6/backup_and_restore/replacing-unhealthy-etcd-member.html#restore-replace-stopped-etcd-member_replacing-unhealthy-etcd-member

Steps to Reproduce:
1. Once the etcd member is restored, verify the instance group of the restored master


Actual results:
The restored master is not attached to the master instance group; it is attached to a worker instance group.

Expected results:
The restored master should be part of the master instance group for its zone.

Comment 1 Joel Speed 2021-06-14 09:41:18 UTC
Could we please collect a must-gather from the customer cluster? In particular, I think we need to take a look at the machines and the machine controller.
The load balancer attachment for master machines is the responsibility of the machine controller. We need to check that the correct load balancers are noted on the machine spec.
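
For reference, a must-gather can be collected with the standard command (the destination directory here is just an example):

~~~
$ oc adm must-gather --dest-dir=./must-gather.local
~~~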

Comment 3 Joel Speed 2021-06-15 10:21:00 UTC
I've had a look through the must-gather and can see that there are no `targetPools` listed on the `providerSpec`. This means that the machine controller isn't adding the new master to the load balancers. I believe part of the process of adding the master to the load balancer is to move it to the correct instance groups. Has the customer tried adding the name of the master target pools to the machine spec? The field is a list of strings.

I'm going to try to bring up a cluster to verify this, but it may take some time for me to get one up and check this out
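
In the meantime, a quick way to inspect the field (a sketch; this assumes the standard master machine label):

~~~
$ oc -n openshift-machine-api get machines \
    -l machine.openshift.io/cluster-api-machine-role=master \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerSpec.value.targetPools}{"\n"}{end}'
~~~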

Comment 4 Joel Speed 2021-06-16 10:09:27 UTC
I've had a play around with a cluster on GCP today. This isn't a bug but a misconfiguration.

Looking at the must-gather, the `targetPools` field is missing from the master instances. Could you please ask the customer to ensure that they add the `targetPools` value on their master instances, for example:

spec:
  providerSpec:
    value:
      targetPools:
      - ocp-int-79462-api-internal
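
One way to apply this is with a merge patch (a sketch; `<master-machine-name>` is a placeholder for each master Machine's name):

~~~
$ oc -n openshift-machine-api patch machine <master-machine-name> \
    --type merge \
    -p '{"spec":{"providerSpec":{"value":{"targetPools":["ocp-int-79462-api-internal"]}}}}'
~~~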

> As the restored master member belongs to a worker instance group:

> k8s-ig--9dbb74ab8cd189b6                 europe-west2-c  zone   vcp-1                      No       2           ---> worker,infra
> k8s-ig--9dbb74ab8cd189b6                 europe-west2-b  zone   vcp-1                      No       3           ---> worker,infra and the new master node
> k8s-ig--9dbb74ab8cd189b6                 europe-west2-a  zone   vcp-1                      No       2           ---> worker,infra

This is actually not true. Looking at the must-gather, they have 7 worker machines, and these are all accounted for in these instance groups. These instance groups are related to a Service object of type LoadBalancer. Because each of the master nodes has the label `node.kubernetes.io/exclude-from-external-load-balancers: ""`, they are not included in the backends for this service and do not appear in these instance groups either.
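
That label can be confirmed with a simple selector query (a sketch):

~~~
$ oc get nodes -l node.kubernetes.io/exclude-from-external-load-balancers -o name
~~~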

I think if the customer adds the `targetPools` to their master instances then the controller should fix things up and make sure that the instance is added to the load balancer appropriately.

Looking at the latest clusters being built, this is done automatically for newer clusters; perhaps this was missed in the installer when this particular cluster was installed.

Comment 5 Pamela Escorza 2021-06-18 12:42:11 UTC
Hi Joel, 

Please allow me to clarify: the information in the description is from a cluster where I replicated the behavior, while the must-gather is from the customer's cluster.
I have set up a new cluster and applied the `targetPools` as suggested, and the result is that provisioning is not working:
~~~
$ oc get machine -n openshift-machine-api
NAME                                PHASE          TYPE            REGION         ZONE             AGE
ocp-template-4pvnk-master-0         Running        n1-standard-4   europe-west2   europe-west2-a   23h
ocp-template-4pvnk-master-1         Running        n1-standard-4   europe-west2   europe-west2-b   23h
ocp-template-4pvnk-master-3         Provisioning   n1-standard-4   europe-west2   europe-west2-c   4h36m
ocp-template-4pvnk-worker-a-m88kr   Running        n1-standard-4   europe-west2   europe-west2-a   23h
ocp-template-4pvnk-worker-b-dgxlq   Running        n1-standard-4   europe-west2   europe-west2-b   23h
ocp-template-4pvnk-worker-c-kr2tq   Running        n1-standard-4   europe-west2   europe-west2-c   23h
~~~

~~~
$ oc get event -n openshift-machine-api
LAST SEEN   TYPE      REASON         OBJECT                                MESSAGE
13m         Warning   FailedUpdate   machine/ocp-template-4pvnk-master-3   ocp-template-4pvnk-master-3: reconciler failed to Update machine: unable to get targetpool: googleapi: Error 404: The resource 'projects/pescorza-tam-ocp-cee/regions/europe-west2/targetPools/ocp-template-4pvnk-api-internal' was not found, notFound
~~~
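
For reference, whether any target pools exist in the region can be checked with (a sketch):

~~~
$ gcloud compute target-pools list --regions=europe-west2
~~~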

Load Balancer details:
~~~
ocp-template-4pvnk-api-internal 
Frontend
  Protocol:           TCP
  Scope:              europe-west2 
  Subnetwork          vcp1-we1nip (10.17.0.0/16)
  IP:Ports            10.17.0.43:6443,22623
Backend
  Region:             europe-west2 
  Network:            vcp-1
  Endpoint protocol:  TCP 
  Session affinity:     None 
  Health check:         ocp-template-4pvnk-api-internal

  Instance group                              Zone              Healthy    Autoscaling         Use as failover group
  ocp-template-4pvnk-master-europe-west2-a   europe-west2-a    1 of 1     No configuration    No
  ocp-template-4pvnk-master-europe-west2-b   europe-west2-b    1 of 1     No configuration    No
  ocp-template-4pvnk-master-europe-west2-c   europe-west2-c    0 of 0     No configuration    No
~~~

Checking the instance groups after restoring the etcd master:
~~~
$ gcloud compute instance-groups list --sort-by=Name
NAME                                      LOCATION        SCOPE  NETWORK  MANAGED  INSTANCES
k8s-ig--5d0a1f297ee7a62e                  europe-west2-c  zone   vcp-1    No       2
k8s-ig--5d0a1f297ee7a62e                  europe-west2-b  zone   vcp-1    No       1
k8s-ig--5d0a1f297ee7a62e                  europe-west2-a  zone   vcp-1    No       1
ocp-template-4pvnk-master-europe-west2-a  europe-west2-a  zone   vcp-1    No       1
ocp-template-4pvnk-master-europe-west2-b  europe-west2-b  zone   vcp-1    No       1
ocp-template-4pvnk-master-europe-west2-c  europe-west2-c  zone   vcp-1    No       0
~~~

On the other hand, as per the documentation:

https://docs.openshift.com/container-platform/4.6/installing/installing_gcp/installing-gcp-private.html#private-clusters-about-gcp_installing-gcp-private
" The internal load balancer relies on instance groups rather than the target pools that the network load balancers use."


Could you please let me know how to proceed from here?

Comment 6 Joel Speed 2021-06-21 14:54:39 UTC
Ahh, I did not realise that this was a private GCP cluster.

In this case, as far as I know, Machine API cannot yet manage these internal API load balancers. My suggestion is that the customer manually adds the instance back to the load balancer instance group themselves, as this is something that is done manually by the installer.
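
A sketch of that manual step, using the names from comment 5 (the instance name is a placeholder):

~~~
$ gcloud compute instance-groups unmanaged add-instances \
    ocp-template-4pvnk-master-europe-west2-c \
    --zone=europe-west2-c \
    --instances=<restored-master-instance-name>
~~~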

We should add a note to the documentation that the customer may need to do this on GCP private clusters.

Comment 7 Pamela Escorza 2021-06-25 14:25:29 UTC
@jspeed: is there something else we can do in order to improve this manual process?

Comment 8 Joel Speed 2021-06-25 14:56:43 UTC
We may be able to add support for these kinds of load balancers in the longer term, but we would need an RFE for this, as it will need investigation and is likely to be a larger piece of work.

Comment 9 Pamela Escorza 2021-07-06 11:12:56 UTC
Hi @jspeed, Customer requested the enhancement so RFE is in place:  https://issues.redhat.com/browse/RFE-1975
Please let me know if any further information is required. thank you.

Comment 10 Joel Speed 2021-07-06 11:58:43 UTC
Would you please fill out the four sections in the description of the RFE to the best of your ability, to help our product manager understand the use cases and requirements.

Comment 11 Pamela Escorza 2021-07-06 12:18:29 UTC
@jspeed, done now. Not sure why it was not available from the creation of the RFE as I filled out the form.

Comment 12 Joel Speed 2021-07-06 14:43:17 UTC
Thanks, going to close this out for now and we can continue the discussion via the RFE