Bug 1962347
| Summary: | Cluster does not exist logs after successful installation | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Trey West <trwest> |
| Component: | assisted-installer | Assignee: | Igal Tsoiref <itsoiref> |
| assisted-installer sub component: | Deployment Operator | QA Contact: | bjacot |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | urgent | CC: | alazar, aos-bugs, bjacot, danili, ercohen, itsoiref, mfilanov |
| Version: | 4.8 | Keywords: | Triaged |
| Target Milestone: | --- | Flags: | nmagnezi: needinfo- |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | AI-Team-Core KNI-EDGE-4.8 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:09:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Trey West
2021-05-19 19:17:24 UTC
I'm working to reproduce it so I can investigate. Trey, any chance I can take a look at your setup?

I was able to reproduce this with test-infra, and with no operator involved:

    $ make deploy_nodes_with_install NUM_MASTERS=1 ENABLE_KUBE_API=true

I suspect what is described in the bug are responses to requests from the ctrl pod.

Result:

service log: https://gist.github.com/nmagnezi/e056103d286a95d1d96a65ef8159643a

ctrl pod log: https://gist.github.com/nmagnezi/4343aed8bda850141c00040380cd7dc9

@itsoiref Isn't this similar to the issue we had in the past where the controller continuously tried to report to a cluster which was deleted? I guess the controller should just check the error code and quit in case of 404. Since this behavior doesn't affect anything, I'm setting the severity to low.

It was previously decided that the controller should stop running 10 minutes after it started to get 401 or 404. We have a monitor for it, and the controller stopped itself after 10 minutes. Do we want to make this period shorter?

So let's update the logic: in the case of 404 the monitor will stop the controller immediately. In the case of 401 the monitor will keep trying for 3 minutes in order to accommodate possible issues in the cloud deployment authentication service. We can optimize the behavior for 401 if we tell the controller that the service is running on-prem / managed by the operator. (See the illustrative sketch at the end of this report.)

I think this is more critical since in the SNO case, with 1000 clusters being installed concurrently, we'll have 1000 clusters spamming the service (CPU and logs) for 5-10 minutes.

In case of 404 it will take up to minutes for the controller to exit. In case of 401 it will take 3 minutes to exit. I think this is good behavior. We tested the previous behavior that took 10 minutes to stop with 1000 clusters, and it was looking good. This change will optimize it even more.

*** Bug 1964720 has been marked as a duplicate of this bug. ***

Verified

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
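For illustration only, here is a minimal Go sketch of the 404/401 policy discussed above (stop immediately on 404, tolerate 401 for a 3-minute grace period). The function and variable names are hypothetical and this is not the actual assisted-installer controller code.

```go
package main

import (
	"log"
	"net/http"
	"os"
	"time"
)

// unauthorizedGrace is the assumed 3-minute window during which 401s are
// tolerated, per the behavior described in the comments above.
const unauthorizedGrace = 3 * time.Minute

// monitorServiceErrors polls checkCluster (assumed to return the HTTP status
// of the last request to the service) and decides when the controller exits.
func monitorServiceErrors(checkCluster func() int) {
	firstUnauthorized := time.Time{}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		switch checkCluster() {
		case http.StatusNotFound:
			// Cluster was deleted on the service side: stop immediately
			// instead of continuing to spam the service.
			log.Println("cluster not found (404), exiting controller")
			os.Exit(0)
		case http.StatusUnauthorized:
			// Auth failures may be transient (e.g. cloud auth hiccups),
			// so keep retrying for a short grace period before exiting.
			if firstUnauthorized.IsZero() {
				firstUnauthorized = time.Now()
			}
			if time.Since(firstUnauthorized) > unauthorizedGrace {
				log.Println("still unauthorized (401) after grace period, exiting controller")
				os.Exit(0)
			}
		default:
			// Any other response resets the 401 timer.
			firstUnauthorized = time.Time{}
		}
	}
}

func main() {
	// Example usage with a stub that always reports 404; in a real
	// controller this would query the service for the cluster.
	monitorServiceErrors(func() int { return http.StatusNotFound })
}
```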