Bug 1962347

Summary: Cluster does not exist logs after successful installation
Product: OpenShift Container Platform
Reporter: Trey West <trwest>
Component: assisted-installer
assisted-installer sub component: Deployment Operator
Assignee: Igal Tsoiref <itsoiref>
QA Contact: bjacot
Status: CLOSED ERRATA
Severity: high
Priority: urgent
CC: alazar, aos-bugs, bjacot, danili, ercohen, itsoiref, mfilanov
Version: 4.8
Keywords: Triaged
Target Milestone: ---
Flags: nmagnezi: needinfo-
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Core KNI-EDGE-4.8
Last Closed: 2021-07-27 23:09:14 UTC
Type: Bug

Description Trey West 2021-05-19 19:17:24 UTC
Description of problem:

After successfully installing a spoke cluster, the assisted-service pod continuously outputs error logs stating that the cluster does not exist.

time="2021-05-19T19:13:24Z" level=error msg="cluster 44c6d686-3dfc-4d92-8244-3efd0166b09e does not exist" func="github.com/openshift/assisted-service/pkg/auth.(*LocalAuthenticator).AuthAgentAuth" file="/go/src/github.com/openshift/origin/pkg/auth/local_authenticator.go:77" pkg=auth


How reproducible:
100%

Steps to Reproduce:
1. Create ClusterDeployment, AgentClusterInstall, InfraEnv, and BMH
2. Wait for installation to finish
3. View logs from assisted-service pod

Actual results:

Error messages stating that the cluster does not exist are output every 3 seconds.


Expected results:

No error messages regarding the previously installed cluster


Additional info:
I imagine this pertains to detaching the cluster after installation is complete, but the error messages are bad for the user experience and do not reflect the actual state of the cluster.
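
For illustration only, here is a minimal standalone Go sketch of the kind of agent-auth lookup that produces this log line. This is not the actual assisted-service code; the store, function names, and log format below are hypothetical. The point is that once the cluster record is gone, every poll from the spoke keeps failing the lookup and gets logged as an error.

package main

import (
    "fmt"
    "log"
    "time"
)

// clusterStore stands in for the service's cluster database (hypothetical).
type clusterStore struct {
    clusters map[string]struct{}
}

// authAgent mimics the shape of an agent-auth check: it succeeds only if the
// cluster referenced by the request still exists in the store.
func (s *clusterStore) authAgent(clusterID string) error {
    if _, ok := s.clusters[clusterID]; !ok {
        return fmt.Errorf("cluster %s does not exist", clusterID)
    }
    return nil
}

func main() {
    // Empty store: the cluster has already been deregistered after installation.
    store := &clusterStore{clusters: map[string]struct{}{}}

    // The controller on the spoke keeps polling, so the service logs the same
    // error over and over (every 3 seconds in the report above).
    for i := 0; i < 3; i++ {
        if err := store.authAgent("44c6d686-3dfc-4d92-8244-3efd0166b09e"); err != nil {
            log.Printf("level=error msg=%q pkg=auth", err)
        }
        time.Sleep(3 * time.Second)
    }
}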

Comment 1 Nir Magnezi 2021-05-25 11:17:46 UTC
I'm working to reproduce it so I can investigate.

Trey, any chance I can take a look at your setup?

Comment 2 Nir Magnezi 2021-05-25 15:52:44 UTC
I was able to reproduce this with test-infra, and with no operator involved.

$ make deploy_nodes_with_install NUM_MASTERS=1 ENABLE_KUBE_API=true


I suspect that what is described in the bug are responses to requests from the ctrl pod.

Result:
service log: https://gist.github.com/nmagnezi/e056103d286a95d1d96a65ef8159643a
ctrl pod log: https://gist.github.com/nmagnezi/4343aed8bda850141c00040380cd7dc9

Comment 3 Ronnie Lazar 2021-05-25 17:08:58 UTC
@itsoiref Isn't this similar to the issue we had in the past where the controller continuously tried to report to a cluster which was deleted?

Comment 4 Eran Cohen 2021-05-25 17:31:51 UTC
I guess the controller should just check the error code and quit in the case of 404.
Since this behavior doesn't affect anything, I'm setting the severity to low.

Comment 5 Igal Tsoiref 2021-05-25 19:00:39 UTC
It was previously decided that the controller should stop running 10 minutes after it starts getting 401 or 404 responses. We have a monitor for this, and the controller stops itself after 10 minutes.
Do we want to make this period shorter?

Comment 6 Eran Cohen 2021-05-26 07:14:33 UTC
So let's update the logic to:
In the case of 404, the monitor will stop the controller immediately.
In the case of 401, the monitor will keep trying for 3 minutes in order to accommodate possible issues in the cloud deployment's authentication service.

We can optimize the behavior for 401 if we tell the controller that the service is running on-prem / managed by the operator.
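
For reference, a minimal sketch of this proposed monitor behavior in standalone Go. The real change belongs to openshift/assisted-installer (see the PR linked in comment 11); everything here, including pollOnce and the exact loop shape, is hypothetical and only illustrates "exit immediately on 404, tolerate 401 for up to 3 minutes".

package main

import (
    "log"
    "net/http"
    "time"
)

const unauthorizedGracePeriod = 3 * time.Minute // proposed tolerance for 401s

// pollOnce is a placeholder for the controller's periodic call back to the
// service (e.g. posting status for its cluster). It returns the HTTP status.
func pollOnce() int {
    // In this sketch the cluster is already deregistered, so the service
    // would answer 404 (or 401 when auth is involved).
    return http.StatusNotFound
}

func main() {
    firstUnauthorized := time.Time{}

    for {
        switch code := pollOnce(); code {
        case http.StatusNotFound:
            // 404: the cluster is gone; stop the controller immediately.
            log.Println("cluster no longer exists, exiting controller")
            return
        case http.StatusUnauthorized:
            // 401: tolerate transient auth issues for up to 3 minutes.
            if firstUnauthorized.IsZero() {
                firstUnauthorized = time.Now()
            }
            if time.Since(firstUnauthorized) > unauthorizedGracePeriod {
                log.Println("unauthorized for over 3 minutes, exiting controller")
                return
            }
        default:
            firstUnauthorized = time.Time{} // a healthy response resets the 401 timer
        }
        time.Sleep(3 * time.Second)
    }
}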

Comment 8 Ronnie Lazar 2021-05-26 08:27:41 UTC
I think this is more critical since, in the SNO case, with 1000 clusters being installed concurrently, we'll have 1000 clusters spamming the service (CPU and logs) for 5-10 minutes.

Comment 9 Igal Tsoiref 2021-05-26 08:47:16 UTC
In the case of 404, it will take minutes at most for the controller to exit.
In the case of 401, it will take 3 minutes to exit.
I think this is good behavior. We tested the previous behavior, which took 10 minutes to stop, with 1000 clusters, and it looked good. This change will optimize it even more.

Comment 10 bjacot 2021-05-26 12:52:53 UTC
*** Bug 1964720 has been marked as a duplicate of this bug. ***

Comment 11 Eran Cohen 2021-05-31 08:27:56 UTC
Fix is here: https://github.com/openshift/assisted-installer/pull/287

Comment 14 Trey West 2021-06-22 17:42:14 UTC
Verified

Comment 16 errata-xmlrpc 2021-07-27 23:09:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438