Bug 1962347 - Cluster does not exist logs after successful installation
Summary: Cluster does not exist logs after successful installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Igal Tsoiref
QA Contact: bjacot
URL:
Whiteboard: AI-Team-Core KNI-EDGE-4.8
Duplicates: 1964720
Depends On:
Blocks:
Reported: 2021-05-19 19:17 UTC by Trey West
Modified: 2021-07-27 23:09 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:09:14 UTC
Target Upstream Version:
Embargoed:
nmagnezi: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:09:36 UTC

Description Trey West 2021-05-19 19:17:24 UTC
Description of problem:

After successfully installing a spoke cluster, the assisted-service pod continuously outputs error logs that the cluster does not exist.

time="2021-05-19T19:13:24Z" level=error msg="cluster 44c6d686-3dfc-4d92-8244-3efd0166b09e does not exist" func="github.com/openshift/assisted-service/pkg/auth.(*LocalAuthenticator).AuthAgentAuth" file="/go/src/github.com/openshift/origin/pkg/auth/local_authenticator.go:77" pkg=auth


How reproducible:
100%

Steps to Reproduce:
1. Create ClusterDeployment, AgentClusterInstall, InfraEnv, and BMH
2. Wait for installation to finish
3. View logs from assisted-service pod

Actual results:

Error messages that cluster does not exist are output every 3 seconds.


Expected results:

No error messages regarding the previously installed cluster


Additional info:
I imagine that this pertains to detaching the cluster after installation is complete, but the error messages are not good for user experience or reflective of the state of the cluster.

Comment 1 Nir Magnezi 2021-05-25 11:17:46 UTC
I'm working to reproduce it so I can investigate.

Trey, any chance I can take a look at your setup?

Comment 2 Nir Magnezi 2021-05-25 15:52:44 UTC
I was able to reproduce this with test-infra, and with no operator involved.

$ make deploy_nodes_with_install NUM_MASTERS=1 ENABLE_KUBE_API=true


I suspect that what is described in the bug are responses to requests from the ctrl pod.

Result:
service log: https://gist.github.com/nmagnezi/e056103d286a95d1d96a65ef8159643a
ctrl pod log: https://gist.github.com/nmagnezi/4343aed8bda850141c00040380cd7dc9

Comment 3 Ronnie Lazar 2021-05-25 17:08:58 UTC
@itsoiref Isn't this similar to the issue we had in the past where the reporter continuously tried to report to a cluster that had been deleted?

Comment 4 Eran Cohen 2021-05-25 17:31:51 UTC
I guess the controller should just check the error code and quit in case of 404.
Since this behavior doesn't affect anything, I'm setting the severity to low.

Comment 5 Igal Tsoiref 2021-05-25 19:00:39 UTC
It was previously decided that the controller should stop running 10 minutes after it starts getting 401 or 404 responses. We have a monitor for this, and the controller stopped itself after 10 minutes.
Do we want to make this period shorter?

Comment 6 Eran Cohen 2021-05-26 07:14:33 UTC
So let's update the logic to:
in the case of 404 the monitor will stop the controller immediately.
In the case of 401 the monitor will keep trying for 3 minutes in order to accommodate possible issues in the cloud deployment authentication service.

We can optimize the behavior for 401 if we tell the controller that the service is running on-prem / managed by the operator.

Comment 8 Ronnie Lazar 2021-05-26 08:27:41 UTC
I think this is more critical since in the SNO case, with 1000 clusters being installed concurrently, we'll have 1000 clusters spamming the service (CPU and logs) for 5-10 minutes.

Comment 9 Igal Tsoiref 2021-05-26 08:47:16 UTC
In case of 404 it will take up to minutes for controller to exit
In case of 401 it will take 3 minutes to exit. 
I think this is good behavior. We tested previous behavior that took 10 minutes to stop with 1000 clusters and it was looking good. This change will optimize it even more.

Comment 10 bjacot 2021-05-26 12:52:53 UTC
*** Bug 1964720 has been marked as a duplicate of this bug. ***

Comment 11 Eran Cohen 2021-05-31 08:27:56 UTC
Fix is here: https://github.com/openshift/assisted-installer/pull/287

Comment 14 Trey West 2021-06-22 17:42:14 UTC
Verified

Comment 16 errata-xmlrpc 2021-07-27 23:09:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

