Bug 1970464
| Summary: | Google Cloud is not reflecting the correct information of the new master created after restoring etcd member from OCP | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pamela Escorza <pescorza> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | medium | ||
| Priority: | unspecified | CC: | aos-bugs, jspeed, mimccune, rsandu, wking |
| Version: | 4.6 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | All | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-07-06 14:43:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Pamela Escorza
2021-06-10 14:08:20 UTC
Could we please collect a must-gather from the customer cluster? In particular, I think we need to take a look at the machines and the machine controller. The load balancer attachment for master machines is the responsibility of the machine controller, so we need to check that the correct load balancers are noted on the machine spec.

I've had a look through the must-gather and can see that there are no `targetPools` listed on the `providerSpec`. This means that the machine controller isn't adding the new master to the load balancers. I believe part of the process of adding the master to the load balancer is to move it to the correct instance groups. Has the customer tried adding the name of the master target pools to the machine spec? The field is a list of strings.

I'm going to try to bring up a cluster to verify this, but it may take some time for me to get one up and check this out.

I've had a play around with a cluster on GCP today. This isn't a bug but a misconfiguration. Looking at the must-gather, the `targetPools` field is missing from the master instances. Could you please ask the customer to ensure that on their master instances, they add the `targetPools` value:
~~~
spec:
  providerSpec:
    value:
      targetPools:
      - ocp-int-79462-api-internal
~~~
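One hedged way to apply a change like this is a JSON merge patch via `oc patch`. This is a sketch only: the machine name `ocp-template-4pvnk-master-3` and the target pool name are taken from this bug report and will differ in other clusters. The patch body is printed first so it can be inspected, and the `oc` call itself is left commented out since it requires cluster access:

```shell
# Sketch: build a merge patch that adds targetPools to a master Machine's
# providerSpec. Substitute your own machine and target-pool names.
MACHINE="ocp-template-4pvnk-master-3"
TARGET_POOL="ocp-int-79462-api-internal"
PATCH=$(printf '{"spec":{"providerSpec":{"value":{"targetPools":["%s"]}}}}' "$TARGET_POOL")
echo "$PATCH"
# Apply with (requires cluster access):
# oc patch machine "$MACHINE" -n openshift-machine-api --type merge -p "$PATCH"
```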
> As the restored master member belongs to a worker instance group:
> k8s-ig--9dbb74ab8cd189b6 europe-west2-c zone vcp-1 No 2 ---> worker,infra
> k8s-ig--9dbb74ab8cd189b6 europe-west2-b zone vcp-1 No 3 ---> worker,infra and the new master node
> k8s-ig--9dbb74ab8cd189b6 europe-west2-a zone vcp-1 No 2 ---> worker,infra
This is actually not true: looking at the must-gather, they have 7 worker machines, and these are all accounted for in these instance groups. These instance groups are related to a Service object of type LoadBalancer. Because each of the master nodes has the label `node.kubernetes.io/exclude-from-external-load-balancers: ""`, they are not included in the backends for this service and do not appear in these instance groups either.
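The exclusion label mentioned above can be inspected directly. A minimal sketch (assuming `oc` access, so the live query is left commented out):

```shell
# The well-known label that keeps master nodes out of service load balancer
# backends, and hence out of the k8s-ig--* instance groups discussed above.
EXCLUDE_LABEL="node.kubernetes.io/exclude-from-external-load-balancers"
echo "masters are excluded via label: $EXCLUDE_LABEL"
# Live check, listing nodes that carry the label (requires cluster access):
# oc get nodes -l "$EXCLUDE_LABEL" -o name
```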
I think if the customer adds the `targetPools` to their master instances then the controller should fix things up and make sure that the instance is added to the load balancer appropriately.
Looking at the latest clusters being built, this is done automatically for newer clusters; perhaps this was missed in the installer when this particular cluster was installed.
Hi Joel, please allow me to clarify: the information in the description is about a cluster where I had replicated the behavior, while the must-gather is from the customer cluster. So I have a new cluster and have applied the `targetPools` value as suggested, and the result is that provisioning is not working:

~~~
$ oc get machine -n openshift-machine-api
NAME                                PHASE          TYPE            REGION         ZONE             AGE
ocp-template-4pvnk-master-0         Running        n1-standard-4   europe-west2   europe-west2-a   23h
ocp-template-4pvnk-master-1         Running        n1-standard-4   europe-west2   europe-west2-b   23h
ocp-template-4pvnk-master-3         Provisioning   n1-standard-4   europe-west2   europe-west2-c   4h36m
ocp-template-4pvnk-worker-a-m88kr   Running        n1-standard-4   europe-west2   europe-west2-a   23h
ocp-template-4pvnk-worker-b-dgxlq   Running        n1-standard-4   europe-west2   europe-west2-b   23h
ocp-template-4pvnk-worker-c-kr2tq   Running        n1-standard-4   europe-west2   europe-west2-c   23h
~~~

~~~
$ oc get event -n openshift-machine-api
LAST SEEN   TYPE      REASON         OBJECT                                MESSAGE
13m         Warning   FailedUpdate   machine/ocp-template-4pvnk-master-3   ocp-template-4pvnk-master-3: reconciler failed to Update machine: unable to get targetpool: googleapi: Error 404: The resource 'projects/pescorza-tam-ocp-cee/regions/europe-west2/targetPools/ocp-template-4pvnk-api-internal' was not found, notFound
~~~

Load balancer details:

~~~
ocp-template-4pvnk-api-internal
Frontend  Protocol: TCP  Scope: europe-west2  Subnetwork: vcp1-we1nip (10.17.0.0/16)  IP:Ports: 10.17.0.43:6443,22623
Backend   Region: europe-west2  Network: vcp-1  Endpoint protocol: TCP  Session affinity: None  Health check: ocp-template-4pvnk-api-internal
Instance group                        Zone             Healthy   Autoscaling        Use as failover group
ocp-int-79462-master-europe-west2-a   europe-west2-a   1 of 1    No configuration   No
ocp-int-79462-master-europe-west2-b   europe-west2-b   1 of 1    No configuration   No
ocp-int-79462-master-europe-west2-c   europe-west2-c   0 of 0    No configuration   No
~~~

Checking instance groups after restoring the etcd master:

~~~
$ gcloud compute instance-groups list --sort-by=Name
NAME                                       LOCATION         SCOPE   NETWORK   MANAGED   INSTANCES
k8s-ig--5d0a1f297ee7a62e                   europe-west2-c   zone    vcp-1     No        2
k8s-ig--5d0a1f297ee7a62e                   europe-west2-b   zone    vcp-1     No        1
k8s-ig--5d0a1f297ee7a62e                   europe-west2-a   zone    vcp-1     No        1
ocp-template-4pvnk-master-europe-west2-a   europe-west2-a   zone    vcp-1     No        1
ocp-template-4pvnk-master-europe-west2-b   europe-west2-b   zone    vcp-1     No        1
ocp-template-4pvnk-master-europe-west2-c   europe-west2-c   zone    vcp-1     No        0
~~~

On the other hand, per the documentation (https://docs.openshift.com/container-platform/4.6/installing/installing_gcp/installing-gcp-private.html#private-clusters-about-gcp_installing-gcp-private): "The internal load balancer relies on instance groups rather than the target pools that the network load balancers use." Could you please let me know how to proceed from here?

Ahh, I did not realise that this was a private GCP cluster. In this case, as far as I know, the Machine API cannot yet manage these internal API load balancers. My suggestion is that the customer manually adds the instance back to the load balancer instance group themselves, as this is something that is done manually by the installer. We should add a note to the documentation that the customer may need to do this on GCP private clusters.

@jspeed: is there something else we can do in order to improve this manual process?

We may be able to add support in the longer term for these kinds of load balancers, but we would need an RFE for this, as it will need investigation and is likely to be a larger piece of work.

Hi @jspeed, the customer requested the enhancement, so the RFE is in place: https://issues.redhat.com/browse/RFE-1975. Please let me know if any further information is required. Thank you.

Would you please fill out the four sections in the description of the RFE to the best of your ability, to help our product manager understand the use cases and requirements.

@jspeed, done now. Not sure why it was not available from the creation of the RFE, as I filled out the form.
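The manual workaround suggested above for private GCP clusters (re-adding the restored master to the zonal instance group that backs the internal API load balancer) can be sketched as follows. The group, zone, and instance names are taken from the outputs in this bug and are assumptions for other clusters; the command is printed rather than executed so it can be reviewed first:

```shell
# Sketch: re-add a restored master to its (unmanaged) zonal instance group.
# Adjust GROUP/ZONE/INSTANCE for your cluster before running.
GROUP="ocp-template-4pvnk-master-europe-west2-c"
ZONE="europe-west2-c"
INSTANCE="ocp-template-4pvnk-master-3"
CMD="gcloud compute instance-groups unmanaged add-instances $GROUP --zone=$ZONE --instances=$INSTANCE"
echo "$CMD"
# After running it, verify membership with:
# gcloud compute instance-groups list-instances "$GROUP" --zone="$ZONE"
```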
Thanks, going to close this out for now; we can continue the discussion via the RFE.