1956609 – [cluster-machine-approver] CSRs for replacement control plane nodes not approved after restore from backup

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-04 04:22 UTC by Maru Newby
Modified:	2021-07-27 23:06 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:06:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-machine-approver pull 121	None	open	Bug 1956609: Bypass cache when reading Node and Machine objects	2021-05-26 14:40:04 UTC
Github	openshift cluster-machine-approver pull 123	None	open	Bug 1956609: Use a direct client for uncached reads	2021-06-10 13:53:39 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 23:06:25 UTC

Description Maru Newby 2021-05-04 04:22:13 UTC

It appears that the machine approver detects machines added during the disruptive job's quorum restore test slowly or not at all. This results in CSRs from the kubelet on those machines being denied because the approver cannot find the machines indicated in the CSRs.

The test in question has failed the following disruptive builds:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1389227167677681664

- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1386740421345939456

- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1379894501044129792

The common symptom from the machine approver logs is that it can't find the machine for the request e.g.:

`csr-457n7: failed to find machine for node ip-10-0-155-29.us-west-2.compute.internal, cannot approve`

All 3 builds show the machines for which CSRs are being denied as being present for a sustained period of time (>45 minutes) before the test times out, and a stream of 'failed to find machine' in the approver logs for that time period. This may be an indication that the mechanism the machine approver is using to read the machines (apparently controller-runtime) is not reading the latest API state and is instead returning stale cached data.

An additional build does show approval, but approval took long enough that the cluster didn't have time to stabilize before the timeout:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1387401783969583104  

Given the apparent lack of existing test coverage on the machine approver to prevent this kind of regression, I'll be enabling a disruptive job on the cluster-machine-approver repo.

Comment 1 Joel Speed 2021-05-12 10:57:34 UTC

I've spent a bit of time over the last couple of days trying to reproduce this, which has involved familiarising myself with the test suite, getting clusters up, and finding the right combination of options to get the test to run, so I'll post the steps here in case anyone else wants to try:

1. Ensure you have an SSH key pair on the AWS region you want 
2. Create a cluster using an IPI install and provide the key from step 1
3. Create a bastion host https://docs.openshift.com/container-platform/4.7/networking/accessing-hosts.html, again using the same key.
  a. I'd recommend using a container linux style AMI as this will automatically create a "core" user for you
  b. If you use something else, you'll need to set up a user called "core" and ensure that the public key is present in their `authorized_keys` file.
4. Note down the public IP of the bastion host
5. Checkout the branch from https://github.com/openshift/origin/pull/25774
6. Build the test binary using make: `make WHAT=cmd/openshift-tests`
7. Ensure your KUBECONFIG env var is set correctly
8. Run the test: `KUBE_SSH_USER=core KUBE_SSH_BASTION=<bastion-host-ip>:22 KUBE_SSH_KEY_PATH=<path-to-private-key> ./openshift-tests run all --run "Cluster should restore itself after quorum loss" --timeout 90m`
  a. I found the timeout is important: if you leave it out, the test will cancel after 15 minutes (which in my case was just after the masters were deleted, but before the restore)

In my attempts so far at running this test, I have not managed to reproduce the issue.
That said, in both attempts, the cluster-machine-approver pod has been rescheduled during the test execution.

I suspect that this problem may be dependent on which host the pod is initially running on, ie, if it is running on the "survivor" control plane host, it shouldn't be rescheduled and therefore might exhibit the issues described.

I need to determine if the "survivor" is always the same control plane host, ie, host 0, and make sure the approver pod exists on that host before starting the test to try and reproduce further.

---

Note, I'm removing the blocker flag as, based on what I'm observing, this is not a release or upgrade blocker

Comment 2 Joel Speed 2021-05-12 11:00:30 UTC

Looking at the code, the survivor is randomly chosen: https://github.com/openshift/origin/blob/d704a4d2ab5e55731d11770c11eacd666940b944/test/extended/dr/quorum_restore.go#L95

To test my theory, I will modify the test to ensure that I can deterministically either place the machine approver pod on or off the surviving node and observe how it behaves in both scenarios

Comment 3 Joel Speed 2021-05-12 11:10:28 UTC

In all 3 of the linked cases, the pod creation timestamp for the machine-approver lines up with the creation timestamp of the remaining master node.
Had it been evicted from another machine it would be much newer than this.

I've seen a situation before where, because the pod on the node was trying to reach the API server service, but the only healthy target was on the node that the traffic originated from, the iptables rule actually blocked the traffic.
I wonder if this could be a similar situation.

Comment 4 Joel Speed 2021-05-12 12:24:08 UTC

Having now tried with a pod that wasn't rescheduled during the test, I still have been unable to reproduce the issue.
Will give it another go, but at this point I'm out of ideas on other things to try to reproduce this.

Comment 5 Michael Gugino 2021-05-12 18:10:54 UTC

Looking at the machine-approver logs, I see: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1389227167677681664/artifacts/e2e-aws-disruptive/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-868f6686bb-qpk7s_machine-approver-controller.log

I0503 15:27:57.661847       1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662007       1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662017       1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662033       1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662046       1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF

There's no indication afterwards that those connections timed out or we reconnected to a healthy API server.  Either the timeout logic is broken for this component, or we reconnected to an API server that was partitioned and the API server accepted the connection.

We know based on previous bugs that when the machine-controller connects to the bootstrap API server, we will see the EOF.  However, that instance goes away, and the connection eventually successfully times out and we reconnect to an actual API server.  There are no reconnect statements in this log, which leads me to believe the API server is indeed partitioned and accepting connections, at least for a period of time.

Comment 6 Maru Newby 2021-05-26 03:43:26 UTC

@mgugino EOF events are to be expected. This is disruptive testing that involves taking down 2 masters and restoring the cluster. The apiserver will be offline as part of the restore.

The tail of the log you link to demonstrates exactly the symptom that prompted me to file this bz - the machine approver doesn't see nodes in its cache that are clearly present according to must-gather.

A more recent result is a different indication that the machine approver's informer is not behaving correctly on restore:

E0525 21:18:46.848937       1 csr_check.go:284] csr-m4qc4: node ci-op-28j88qkl-e3deb-z8lpk-master-2 already exists, cannot approve
I0525 21:18:46.848955       1 controller.go:172] csr-m4qc4: CSR not authorized

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-gcp-disruptive/1397271757248794624/artifacts/e2e-gcp-disruptive/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-5d5587cd4-l5t4k_machine-approver-controller.log

This despite the PR in question explicitly ensuring the deletion of the nodes in question after restore, and the must gather for the build not including the nodes that the machine approver is convinced still exist:

https://github.com/openshift/origin/pull/25774/files#diff-eafa870d969a05bf5828dc75ceaed394396b44e7beec7a8229a305a4c0f0b1c6R188

This is another good indication that the machine approver is broken by cluster restore. Maybe a consequence of the switch to controller-runtime? This underscores the importance of gating the machine approver on the disruptive test so that regressions of critical functionality like cluster restore can be caught pre-merge.

Comment 7 Joel Speed 2021-05-26 14:14:15 UTC

I've been digging into this today, looking through the informer code. I don't think this is an issue with controller-runtime itself, but more an issue with the switch to controller-runtime.

I've posted a question in slack to confirm what I'm seeing/understanding here, https://kubernetes.slack.com/archives/C0EG7JC6T/p1622038243031200

But the observation I'm making is that the informer store in client go only removes objects when it sees a delete event. There doesn't seem to be any sort of reset action for the store. This could explain why the client is returning these stale objects.

Before the switch to controller-runtime, the client we were using was a direct client, accessing the API for each request.
When we switched to controller-runtime, we started using a cached client that reads directly from the informer store.

We can continue using controller-runtime but still use a direct client. I'm tempted to raise a PR to do this and see if that helps with these failures.
I think our controller is sufficiently low volume in terms of API requests that we should be ok with a non-cached client.
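
The difference between the two read paths can be illustrated with a toy model. This is a minimal Python sketch, not controller-runtime or client-go code; the class names and machine names are hypothetical, and the "server" is just an in-memory set. It assumes the failure mode discussed above, where the store only changes when an event arrives.

```python
class FakeAPIServer:
    """Stand-in for the API server: the source of truth for object names."""

    def __init__(self, names):
        self.objects = set(names)


class CachedReader:
    """Mimics an informer-backed store: it only changes when events arrive."""

    def __init__(self, server):
        self.store = set(server.objects)  # snapshot from the initial list

    def on_event(self, kind, name):
        if kind == "ADDED":
            self.store.add(name)
        elif kind == "DELETED":
            self.store.discard(name)

    def exists(self, name):
        return name in self.store


class DirectReader:
    """Mimics an uncached client: every read goes to the server."""

    def __init__(self, server):
        self.server = server

    def exists(self, name):
        return name in self.server.objects


server = FakeAPIServer({"master-0", "master-1", "master-2"})
cached = CachedReader(server)
direct = DirectReader(server)

# Restore: two masters are replaced, but no watch events reach the cache.
server.objects = {"master-0", "master-3", "master-4"}

print(cached.exists("master-3"), direct.exists("master-3"))  # False True
print(cached.exists("master-1"), direct.exists("master-1"))  # True False
```

The two printed lines correspond to the two symptoms seen in CI: a machine that exists but "cannot be found", and a node that was deleted but "already exists".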

Comment 8 Michael Gugino 2021-05-26 16:08:31 UTC

IMO, yes, the cache can go stale, but it shouldn't be stale forever. We have observed timeouts elsewhere for a similar situation when connecting to the bootstrap API server. Eventually, since that server actually goes away, our client times out and reconnects to a real server.

My concern is that we're connecting to a server that has not yet rejoined the cluster. The server accepts the API connection, but we will never receive updates. I think the proposed fix will inform us one way or the other regarding this.

Comment 9 Joel Speed 2021-05-26 16:14:18 UTC

+1 that was my theory, cutting out the cache will either resolve the issue, or show us that it is not in fact a caching issue with the CMA, in which case we can start looking at why the API is returning the wrong data.

I think given the nature of this problem and the fact we have been unable to reproduce, the only way to observe this will be to merge and then revisit within a couple of weeks to see if the CI is still reporting the same failures.

Comment 11 Maru Newby 2021-06-02 04:19:41 UTC

The problem is still occurring after the switch to non-caching client:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1398040226802176000
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1399766491112411136

> +1 that was my theory, cutting out the cache will either resolve the issue, or show us that it is not in fact a caching issue with the CMA, in which case we can start looking at why the API is returning the wrong data.

The must-gathers from the builds failing due to the reported issue indicate that the API has the correct data. Machines are collected despite CMA reporting they don't exist, or nodes are missing despite CMA reporting they exist. In both cases, the CMA is incorrectly denying CSR requests.

I do not believe this is the result of the CMA being connected to a stale/out-of-sync API server. This test removes 2 of the masters and restores on the 3rd. That 3rd master is the only API server left standing, and the subsequently collected must-gather can only be from that API server. The CMA's refusal to admit new nodes precludes the possibility of other API servers being involved.

To your point about the issue being potentially related to whether the CMA is rescheduled or not, what needs to happen to ensure the CMA is healthy after a restore regardless of the node it was running on before control plane nodes are removed?

Comment 12 Maru Newby 2021-06-02 15:47:21 UTC

In 4.8, CMA went from being able to consistently approve newly added machines after restore to being able to do so only some of the time. Not being able to consistently restore a cluster that has lost masters represents a severe problem for the product, and not one that suggests waiting weeks to resolve.

What do you need - and from whom - to justify the time and energy needed to fix this regression?

Note that an optional disruptive job is now enabled to always run against CMA (https://github.com/openshift/release/pull/18284/files) and the quorum restore test fixes have been merged (https://github.com/openshift/origin/pull/25774). This should simplify both reproduction and validation of any potential fix.

Note that a code fix is not the only destination on the path to resolution. A manual fix that we could document would be a good interim step for customers even if it didn't ensure the desired behavior in CI.

Comment 13 Maru Newby 2021-06-03 01:03:01 UTC

Now that https://github.com/openshift/origin/pull/25774 has merged, I am able to reproduce the failure reported in this bz on ci builds running on a no-op PR: 

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/556/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-disruptive/1400191738550685696

Comment 14 Joel Speed 2021-06-08 14:24:26 UTC

Maru, apologies for the delay in response, I was out for a little over a week and am just catching up on what has been missed.

> I do not believe this is the result of the CMA being connected to a stale/out-of-sync API server. This test removes 2 of the masters and restores on the 3rd. That 3rd master is the only API server left standing, and the subsequently collected must-gather can only be from that API server. The CMA's refusal to admit new nodes precludes the possibility of other API servers being involved.

It was my understanding that part of this test includes restoring the master. IIRC the kubelet not having been admitted to the cluster does not prevent the static pods from being created. When these new machines come up, do they already have the static pod manifests, or do the manifests only get added after the node has come up as a "worker"? I thought the separate MCP meant they come up straight away as masters with those static pods there.
If that is the case, then I think this doesn't rule out a weird API server state, because it could connect/reconnect to the restored API which may have some issues just after starting? I'm not 100% sure on this one but I think you are probably right that this isn't an API problem anyway.

> To your point about the issue being potentially related to whether the CMA is rescheduled or not, what needs to happen to ensure the CMA is healthy after a restore regardless of the node it was running on before control plane nodes are removed?

We don't know the answer to this yet; I think further investigation is required. IMO, kicking the pod and getting a new one will probably be sufficient, but we need to be able to reproduce this and check it before we can say that. I've been unable to reproduce so far.

> What do you need - and from who - to justify the time and energy needed to fix this regression? 

Our team is under pressure from a lot of directions at the moment and I am not 100% convinced this is as urgent as you suggest. It is very easy for a customer to work around this: they can either kick the pod during the restore or manually approve the certificates with a single `oc` command.
We need time to investigate this and determine where the issue actually is. We are still not 100% convinced that the issue lies within our codebase. While we have made changes, other teams have also made changes and as part of our discussion today, we came up with several scenarios in which we think we could observe stale data.

I will spend time over the next few days to investigate this further and come back with more detail once I have managed to reproduce the issue and find extra detail to add.

> Note that a code fix is not the only destination on the path to resolution. A manual fix that we could document would be a good interim step for customers even if it didn't ensure the desired behavior in CI.

Customers who are in this situation can use the standard kubectl and oc commands to list the CSRs, observe those that haven't been approved and approve them manually. Alternatively, deleting the machine approver (CMA) pod is likely to also bring it back from the dead if it's in this scenario. A fresh pod will definitely not have any stale data and should therefore be able to approve the CSRs as it starts.
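
The manual workaround above could be scripted roughly as follows. This is a hedged Python sketch, assuming the input is shaped like the output of `oc get csr -o json`; the function name and the sample CSR names are hypothetical. It only prints `oc adm certificate approve` commands rather than running them, so each CSR can be reviewed before approval.

```python
import json


def pending_csr_names(csr_list):
    """Return names of CSRs with neither an Approved nor a Denied condition."""
    names = []
    for item in csr_list.get("items", []):
        conditions = item.get("status", {}).get("conditions") or []
        decided = any(c.get("type") in ("Approved", "Denied") for c in conditions)
        if not decided:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample shaped like `oc get csr -o json` output.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "csr-pending"}, "status": {}},
    {"metadata": {"name": "csr-approved"},
     "status": {"conditions": [{"type": "Approved"}]}},
    {"metadata": {"name": "csr-no-status"}}
  ]
}
""")

for name in pending_csr_names(SAMPLE):
    print(f"oc adm certificate approve {name}")
```

Against a real cluster, the input would come from `oc get csr -o json` instead of the embedded sample.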

Where do you think is appropriate to document this kind of workaround for the moment, if a customer were to come across this issue in the wild?

Comment 15 Joel Speed 2021-06-09 12:20:43 UTC

I have created https://github.com/openshift/cluster-machine-approver/pull/122 which increases the log level of the cluster machine approver.
I will keep retesting this until we get a failure that matches the symptoms of this BZ. Based on previous comments this should happen as we are effectively running the same code as in #13.

Once we have the verbose log output we can determine exactly which requests and responses are being made/received and therefore should be able to work out if our current understanding and theories are true.

Comment 16 Joel Speed 2021-06-10 11:21:12 UTC

This test run failed [1] and we have the 2 master machines stuck in provisioned [2] and the logs from the CMA at level 9 [3].
I've got quite a busy schedule today but will try to find some time to dig through and work out what went wrong. If not, I'll take a look next week as I'm out tomorrow (unless anyone else has the time to look)

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/122/pull-ci-openshift-cluster-machine-approver-master-e2e-aws-disruptive/1402894467832221696 
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/122/pull-ci-openshift-cluster-machine-approver-master-e2e-aws-disruptive/1402894467832221696/artifacts/e2e-aws-disruptive/gather-extra/artifacts/machines.json
[3] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/122/pull-ci-openshift-cluster-machine-approver-master-e2e-aws-disruptive/1402894467832221696/artifacts/e2e-aws-disruptive/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-79c8548fd4-sdl5v_machine-approver-controller.log

Comment 17 Joel Speed 2021-06-10 11:48:01 UTC

Interesting notes from the logs:

Throughout the entire log, the list call is only made twice:

> I0610 08:21:15.590013       1 round_trippers.go:435] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?limit=500&resourceVersion=0'
> I0610 08:21:15.637426       1 round_trippers.go:454] GET https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?limit=500&resourceVersion=0 200 OK in 47 milliseconds

and just over 10 minutes later (10 minutes is the resync period):

> I0610 08:32:46.807883       1 round_trippers.go:435] curl -k -v -XGET  -H "Authorization: Bearer <masked>" -H "Accept: application/json, */*" -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" 'https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?resourceVersion=28499'
> I0610 08:32:46.821522       1 round_trippers.go:454] GET https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?resourceVersion=28499 200 OK in 13 milliseconds

I would have expected to see a full list every 10-12 minutes based on the client-go informer code. For some reason this isn't happening.
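
One way to count list versus watch calls in a verbose log like this is to classify the `round_trippers.go` response lines by their query string. The following is a small Python sketch under that assumption; the function name is hypothetical and the sample lines are copied from the excerpts in this comment.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the response lines ("... round_trippers.go:454] GET <url> ..."),
# not the "curl -k -v -XGET ..." request lines.
REQUEST_RE = re.compile(r"round_trippers\.go:\d+\] GET (\S+)")


def classify_requests(log_lines, resource="/machines"):
    """Count list and watch requests made against the given resource path."""
    counts = {"list": 0, "watch": 0}
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if not url.path.endswith(resource):
            continue
        query = parse_qs(url.query)
        kind = "watch" if query.get("watch") == ["true"] else "list"
        counts[kind] += 1
    return counts


BASE = "https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines"
SAMPLE_LINES = [
    "I0610 08:21:15.637426       1 round_trippers.go:454] GET " + BASE
    + "?limit=500&resourceVersion=0 200 OK in 47 milliseconds",
    "I0610 08:32:46.821522       1 round_trippers.go:454] GET " + BASE
    + "?resourceVersion=28499 200 OK in 13 milliseconds",
    "I0610 08:32:46.823810       1 round_trippers.go:454] GET " + BASE
    + "?allowWatchBookmarks=true&resourceVersion=28937&timeoutSeconds=389&watch=true"
    + " 200 OK in 0 milliseconds",
    "I0610 09:06:08.677083       1 round_trippers.go:454] GET " + BASE
    + "?allowWatchBookmarks=true&resourceVersion=49660&timeoutSeconds=327&watch=true"
    + "  in 0 milliseconds",
]

print(classify_requests(SAMPLE_LINES))  # {'list': 2, 'watch': 2}
```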

For the rest of the time, we see watches attempting to start but finishing very quickly, this is unexpected, watches should last for some period of time:

> I0610 08:32:46.822748       1 round_trippers.go:435] curl -k -v -XGET  -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" -H "Accept: application/json, */*" 'https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?allowWatchBookmarks=true&resourceVersion=28937&timeoutSeconds=389&watch=true'
> I0610 08:32:46.823810       1 round_trippers.go:454] GET https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?allowWatchBookmarks=true&resourceVersion=28937&timeoutSeconds=389&watch=true 200 OK in 0 milliseconds
> I0610 08:32:46.823881       1 round_trippers.go:460] Response Headers:
> I0610 08:32:46.823911       1 round_trippers.go:463]     Date: Thu, 10 Jun 2021 08:32:46 GMT
> I0610 08:32:46.823937       1 round_trippers.go:463]     Audit-Id: dd18b896-fa83-4153-bb44-f081a4e83064
> I0610 08:32:46.823948       1 round_trippers.go:463]     Cache-Control: no-cache, private
> I0610 08:32:46.823954       1 round_trippers.go:463]     Content-Type: application/json
> I0610 08:32:46.823958       1 round_trippers.go:463]     X-Kubernetes-Pf-Flowschema-Uid: ed937d16-bb40-4e9a-9ec2-1b056fac2ed6
> I0610 08:32:46.823963       1 round_trippers.go:463]     X-Kubernetes-Pf-Prioritylevel-Uid: 0d9b12a4-2e75-45cc-9dc7-fc6c5442ffad

On some of these occasions, headers are returned, on others, they aren't. (This one is around the time the test is running, the new machines were created at 09:13):

> I0610 09:06:08.676659       1 round_trippers.go:435] curl -k -v -XGET  -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" -H "Accept: application/json, */*" 'https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?allowWatchBookmarks=true&resourceVersion=49660&timeoutSeconds=327&watch=true'
> I0610 09:06:08.677083       1 round_trippers.go:454] GET https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?allowWatchBookmarks=true&resourceVersion=49660&timeoutSeconds=327&watch=true  in 0 milliseconds
> I0610 09:06:08.677101       1 round_trippers.go:460] Response Headers:

In the minutes after the new machines are added, we can see this continue, including GOAWAY signal from the API server and the fact that no events for machines are sent during the watch.

> I0610 09:16:34.551581       1 streamwatcher.go:114] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=5, ErrCode=NO_ERROR, debug=""
> I0610 09:16:34.551662       1 reflector.go:530] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:241: Watch close - *v1beta1.Machine total 0 items received
> I0610 09:16:34.551850       1 round_trippers.go:435] curl -k -v -XGET  -H "Authorization: Bearer <masked>" -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" 'https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?allowWatchBookmarks=true&resourceVersion=49660&timeoutSeconds=525&watch=true'
> I0610 09:16:34.552052       1 round_trippers.go:454] GET https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines?allowWatchBookmarks=true&resourceVersion=49660&timeoutSeconds=525&watch=true  in 0 milliseconds
> I0610 09:16:34.552101       1 round_trippers.go:460] Response Headers:

Looking at the requests being made here for the new watches, we can see that the resourceVersion is set to 49660. This means that any event at or before this resource version would not be delivered.
The first request with this resource version is logged at `I0610 09:01:03.706503`. This is BEFORE the new machine objects are created.
Yet the new Machine objects have resource versions of 42214 and 42131, which are lower than the resource version the watch starts from.

What has happened here is that the backup being restored in the test is not fully up to date with the latest state of etcd.
Restoring the etcd backup has rewound the resource version, and this is causing the controller to miss the events for the Machine objects being created.
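
The effect of the rewind can be modelled in a few lines. This is a simplified Python sketch: it treats resource versions as integers purely for illustration (the Kubernetes API defines them as opaque strings), and the machine names are hypothetical. The version numbers are the ones quoted above.

```python
def events_after(events, resource_version):
    """Return the events a watch starting at resource_version would deliver."""
    return [e for e in events if e["rv"] > resource_version]


# The two replacement Machines, created after the restore rewound etcd.
EVENTS = [
    {"type": "ADDED", "name": "new-master-a", "rv": 42131},
    {"type": "ADDED", "name": "new-master-b", "rv": 42214},
]

# The informer keeps watching from the pre-restore version, so it sees nothing.
print(events_after(EVENTS, 49660))       # []

# A watch starting from an older version would have delivered both events.
print(len(events_after(EVENTS, 28937)))  # 2
```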

I want to be clear that this isn't a problem limited to CMA; it will affect all controllers, as this is core informer code that is running here.

Open Questions:
1. Why did my "remove the cache" fix not work? I don't see the list calls happening, which implies we are still going through the cache.
2. Why aren't we seeing list calls? These should be happening every 10 minutes as far as I'm aware
3. How can we make the informer intelligent enough to detect the resource version going backwards and reset?
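
For question 3, one possible heuristic is sketched below in Python. It is an illustration only, not how client-go behaves: it assumes resource versions can be compared numerically, which the Kubernetes API does not guarantee, and the `Store` class, object names, and the post-restore list version 42300 are hypothetical.

```python
class Store:
    """Toy informer store that resets itself from a full list."""

    def __init__(self):
        self.objects = {}
        self.watch_from = 0

    def relist(self, server_objects, server_rv):
        """Replace the store from a full list; report whether a rewind was seen."""
        rewound = server_rv < self.watch_from
        self.objects = dict(server_objects)  # drop anything the server no longer has
        self.watch_from = server_rv          # restart the watch from the listed version
        return rewound


store = Store()
store.relist({"master-0": 100, "master-1": 101, "master-2": 102}, 49660)

# After the restore, the server reports a lower version than the one last seen.
rewound = store.relist(
    {"master-0": 100, "master-3": 42131, "master-4": 42214}, 42300
)
print(rewound, sorted(store.objects), store.watch_from)
# True ['master-0', 'master-3', 'master-4'] 42300
```

The sketch only works if a full list actually happens after the restore, which is exactly what question 2 asks about.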

Comment 19 Maru Newby 2021-06-14 17:03:44 UTC

(In reply to Joel Speed from comment #14)

> > I do not believe this is the result of the CMA being connected to a stale/out-of-sync API server. This test removes 2 of the masters and restores on the 3rd. That 3rd master is the only API server left standing, and the subsequently collected must-gather can only be from that API server. The CMA's refusal to admit new nodes precludes the possibility of other API servers being involved.
> 
> It was my understanding that part of this test includes restoring the
> master. IIRC the kubelet not having been admitted to the cluster does not
> prevent the static pods from being created. When these new machines come up,
> do they have the static pod manifests already or do they only get added
> after the node has come up as a "worker", I thought the separate MCP meant
> they come up straight away as masters with those static pods there.
> If that is the case, then I think this doesn't rule out a weird API server
> state, because it could connect/reconnect to the restored API which may have
> some issues just after starting? I'm not 100% sure on this one but I think
> you are probably right that this isn't an API problem anyway.

Static pods are created by operators. Creation involves the scheduling of a non-static installation pod to a node, which mounts the manifest path on the host and writes the pod to that mounted path. Ergo, it is impossible for a new apiserver static pod to be running on a machine that has not yet been successfully registered as a node.

Comment 20 Joel Speed 2021-06-15 09:22:31 UTC

Just wanted to add a note here that, based on the PR I was using to debug this [1], I ran a new test [2] yesterday which included the merged fix.

We can see from the machine approver logs [3] in this run that it is listing on every reconcile and the cache is not being used, ergo, the issue should be resolved now.

> I0614 15:58:59.798965       1 controller.go:114] Reconciling CSR: csr-7tb67
> I0614 15:58:59.799221       1 round_trippers.go:435] curl -k -v -XGET  -H "User-Agent: machine-approver/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Authorization: Bearer <masked>" 'https://172.30.0.1:443/apis/machine.openshift.io/v1beta1/machines'

[1]: https://github.com/openshift/cluster-machine-approver/pull/122
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/122/pull-ci-openshift-cluster-machine-approver-master-e2e-aws-disruptive/1404425266184327168
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-machine-approver/122/pull-ci-openshift-cluster-machine-approver-master-e2e-aws-disruptive/1404425266184327168/artifacts/e2e-aws-disruptive/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-5c4cdfc8f5-2b5b4_machine-approver-controller.log

Comment 22 sunzhaohua 2021-06-18 05:50:59 UTC

@jspeed @mnewby I have no idea how to reproduce and verify this bug. Do we have detailed steps for how to replace a master node? Or, based on Comment 20, can we move this to verified?

Comment 23 Joel Speed 2021-06-18 09:09:55 UTC

I think this is incredibly hard to actually reproduce. I think the verification here is a combination of comment 20 and the fact the disruptive test is no longer failing at the rate it was before.

I think this can be moved to verified

Comment 24 sunzhaohua 2021-06-18 14:31:42 UTC

Thanks Joel, moving to verified.

Comment 26 errata-xmlrpc 2021-07-27 23:06:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
