It appears that the machine approver is detects the addition of machines added during the disruptive job's quorum restore test slowly or not at all. This results in CSRs from the kubelet on those machines being denied due to the approver not being able to find the machines indicated in the CSRs.
The test in question has failed the following disruptive builds:
The common symptom from the machine approver logs is that it can't find the machine for the request e.g.:
`csr-457n7: failed to find machine for node ip-10-0-155-29.us-west-2.compute.internal, cannot approve`
All 3 builds show the machines for which CSRs are being denied as being present for a sustained period of time (>45 minutes) before the test times out, and a stream of 'failed to find machine' in the approver logs for that time period. This may be an indication that the mechanism the machine approver is using to read the machines (apparently controller-runtime) is not reading the latest API state and is instead returning stale cached data.
An additional build does show approval, but approval too long enough that the cluster didn't have time to stabilize before timeout:
Given the apparent lack of existing test coverage on the machine approver to prevent this kind of regression, I'll be enabling a disruptive job on the cluster-machine-approver repo.
I've spent a bit of time over the last couple of days to investigate reproducing this, which has involved familiarising myself with the test suite and getting clusters up and the right combinations of options to get the test to run, so I'll post here in case anyone else wants to try
1. Ensure you have an SSH key pair on the AWS region you want
2. Create a cluster using an IPI install and provide the key from step 1
3. Create a bastion host https://docs.openshift.com/container-platform/4.7/networking/accessing-hosts.html, again using the same key.
a. I'd recommend using a container linux style AMI as this will automatically create a "core" user for you
b. If you use something else, you'll need to set up a user called "core" and ensure that the public key is present in their `authorized_keys` file.
4. Note down the public IP of the bastion host
5. Checkout the branch from https://github.com/openshift/origin/pull/25774
6. Build the test binary using make: `make WHAT=cmd/openshift-tests`
7. Ensure your KUBECONFIG env var is set correctly
8. Run the test: KUBE_SSH_USER=core KUBE_SSH_BASTION=<bastion-host-ip>:22 KUBE_SSH_KEY_PATH=<path-to-private-key> ./openshift-tests run all --run "Cluster should restore itself after quorum loss" --timeout 90m
a. I found the timeout is important, if you leave it out the test will cancel after 15 minutes (which in my case was just after the masters were deleted, but before the restore)
In my attempts so far at running this test, I have not managed to reproduce the issue.
That said, in both attempts, the cluster-machine-approver pod has been rescheduled during the test execution.
I suspect that this problem may be dependent on which host the pod is initially running on, ie, if it is running on the "survivor" control plane host, it shouldn't be rescheduled and therefore might exhibit the issues described.
I need to determine if the "survivor" is always the same control plane host, ie, host 0, and make sure the approver pod exists on that host before starting the test to try and reproduce further.
Note, I'm removing the blocker flag as, based on what I'm observing, this is not a release or upgrade blocker
Looking at the code, the survivor is randomly chosen: https://github.com/openshift/origin/blob/d704a4d2ab5e55731d11770c11eacd666940b944/test/extended/dr/quorum_restore.go#L95
To test my theory, I will modify the test to ensure that I can deterministically either place the machine approver pod on or off the surviving node and observe how it behaves in both scenarios
In all 3 of the linked cases, the pod creation timestamp for the machine-approver lines up with the creation timestamp of the remaining master node.
Had it been evicted from another machine it would be much newer than this.
I've seen a situation before where, because the pod on the node was trying to reach the API server service, but the only healthy target was on the node that the traffic originated from, the iptables rule actually blocked the traffic.
I wonder if this could be a similar situation.
Having now tried with a pod that wasn't rescheduled during the test, I still have been unable to reproduce the issue.
Will give it another go, but at this point I'm out of ideas on other things to try to reproduce this.
Looking at the machine-approver logs, I see: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25774/pull-ci-openshift-origin-master-e2e-aws-disruptive/1389227167677681664/artifacts/e2e-aws-disruptive/gather-extra/artifacts/pods/openshift-cluster-machine-approver_machine-approver-868f6686bb-qpk7s_machine-approver-controller.log
I0503 15:27:57.661847 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662007 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662017 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662033 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0503 15:27:57.662046 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
There's no indication afterwards that those connections timed out or we reconnected to a healthy API server. Either the timeout logic is broken for this component, or we reconnected to an API server that was partitioned and the API server accepted the connection.
We know based on previous bugs that when the machine-controller connects to the bootstrap API server, we will see the EOF. However, that instance goes away, and the connection eventually successfully times out and we reconnect to an actual API server. There are no reconnect statements in this log, which leads me to believe the API server is indeed partitioned and accepting connections, at least for a period of time.