Description of problem:
After successfully installing a spoke cluster, the assisted-service pod continuously outputs error logs saying that the cluster does not exist:

time="2021-05-19T19:13:24Z" level=error msg="cluster 44c6d686-3dfc-4d92-8244-3efd0166b09e does not exist" func="github.com/openshift/assisted-service/pkg/auth.(*LocalAuthenticator).AuthAgentAuth" file="/go/src/github.com/openshift/origin/pkg/auth/local_authenticator.go:77" pkg=auth

How reproducible:
100%

Steps to Reproduce:
1. Create ClusterDeployment, AgentClusterInstall, InfraEnv, and BMH
2. Wait for installation to finish
3. View logs from the assisted-service pod

Actual results:
Error messages that the cluster does not exist are output every 3 seconds.

Expected results:
No error messages regarding the previously installed cluster.

Additional info:
I imagine this pertains to detaching the cluster after installation is complete, but the error messages make for a poor user experience and do not reflect the actual state of the cluster.
I'm working to reproduce it so I can investigate. Trey, any chance I can take a look at your setup?
I was able to reproduce this with test-infra, with no operator involved:

$ make deploy_nodes_with_install NUM_MASTERS=1 ENABLE_KUBE_API=true

I suspect what is described in the bug are responses to requests from the ctrl pod.

Result:
service log: https://gist.github.com/nmagnezi/e056103d286a95d1d96a65ef8159643a
ctrl pod log: https://gist.github.com/nmagnezi/4343aed8bda850141c00040380cd7dc9
@itsoiref Isn't this similar to the issue we had in the past where the controller continuously tried to report to a cluster which had been deleted?
I guess the controller should just check the error code and quit in the case of a 404. Since this behavior doesn't affect anything, I'm setting the severity to low.
Previously it was decided that the controller should stop running 10 minutes after it starts getting 401 or 404 responses. We have a monitor for this, and the controller stops itself after 10 minutes. Do we want to make this period shorter?
So let's update the logic: in the case of 404, the monitor will stop the controller immediately. In the case of 401, the monitor will keep trying for 3 minutes to accommodate possible issues in the cloud deployment's authentication service. We can optimize the 401 behavior further if we tell the controller that the service is running on-prem / managed by the operator.
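A rough sketch of that proposed exit logic, assuming a small monitor type inside the controller (names like exitMonitor and shouldStop are illustrative only, not the actual assisted-installer code):

// Sketch of the proposed controller exit logic; hypothetical names,
// the real assisted-installer monitor may be structured differently.
package main

import (
	"fmt"
	"net/http"
	"time"
)

type exitMonitor struct {
	unauthorizedSince time.Time     // zero until the first 401 is seen
	gracePeriod       time.Duration // e.g. 3 minutes for 401 responses
}

// shouldStop reports whether the controller should exit, based on the
// latest status code returned by assisted-service:
//   - 404: the cluster no longer exists in the service, exit immediately.
//   - 401: keep retrying for gracePeriod to ride out transient issues in
//     the cloud deployment's authentication service, then exit.
func (m *exitMonitor) shouldStop(statusCode int) bool {
	switch statusCode {
	case http.StatusNotFound:
		return true
	case http.StatusUnauthorized:
		if m.unauthorizedSince.IsZero() {
			m.unauthorizedSince = time.Now()
		}
		return time.Since(m.unauthorizedSince) >= m.gracePeriod
	default:
		// Any other response resets the 401 timer.
		m.unauthorizedSince = time.Time{}
		return false
	}
}

func main() {
	m := &exitMonitor{gracePeriod: 3 * time.Minute}
	fmt.Println(m.shouldStop(http.StatusNotFound))     // true: stop right away
	fmt.Println(m.shouldStop(http.StatusUnauthorized)) // false until 3 minutes of 401s have elapsed
}

If the controller knows it is running on-prem / operator-managed, the 401 grace period could be skipped entirely, as suggested above.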
I think this is more critical: in the SNO case, with 1000 clusters being installed concurrently, we'll have 1000 clusters spamming the service (CPU and logs) for 5-10 minutes.
In the case of 404 it will take up to a few minutes for the controller to exit. In the case of 401 it will take 3 minutes to exit. I think this is good behavior. We tested the previous behavior, which took 10 minutes to stop, with 1000 clusters and it looked good. This change will optimize it even more.
*** Bug 1964720 has been marked as a duplicate of this bug. ***
Fix is here: https://github.com/openshift/assisted-installer/pull/287
Verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438