Bug 1980930
Summary: | Machine-api-operator is going through leader election even when API rollout takes ~60 sec in SNO | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri>
Component: | Cloud Compute | Assignee: | Mike Fedosin <mfedosin>
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | mfedosin, nelluri, rfreiman
Version: | 4.9 | |
Target Milestone: | --- | |
Target Release: | 4.9.0 | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | chaos | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-10-18 17:39:27 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1984730 | |
Description (Naga Ravi Chaitanya Elluri, 2021-07-09 22:52:56 UTC):
> We can see that the machine-api-operator restarted at 18:57:37 UTC, and the log of the exited container on the node shows it was trying to acquire the lease during that time: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/machine-api-operator-leader-election/machine-api-operator-container.log.
Looking at this log, I can see that the leader election was lost 60 seconds after the first error that is logged.
I think this is entirely reasonable behaviour, as the lease may have already been renewed 30 seconds before; this seems very much like a timing issue.
After we lose the election, the only safe thing to do is to panic: we have to ensure the controller shuts down to prevent duplicate controllers running at the same time.
The numbers for the lease duration/refresh were all chosen based on other components within OpenShift and some advice from Clayton after we initially set these too aggressively (we were refreshing every 15 seconds IIRC).
We were causing too much load on the API server, so we had to back this off to what we have now.
I don't think there's actually an issue here, or at least, I don't think there's one we can safely protect against without extending the lease considerably and ensuring that renewal happens while more than 60 seconds remain on the lease.
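
For context, this kind of operator leader election is built on client-go's leaderelection package, which surrenders control through the OnStoppedLeading callback exactly as described above. Below is a minimal sketch of such a setup; the lease name, namespace, and timing values are illustrative assumptions, not the operator's actual configuration:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection blocks, running `run` only while this process
// holds the lease, and exits the process if the lease is ever lost.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "machine-api-operator", // hypothetical lease name
			Namespace: "openshift-machine-api",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 90 * time.Second, // how long the lease stays valid without renewal
		RenewDeadline: 60 * time.Second, // give up renewing this long after the last success
		RetryPeriod:   30 * time.Second, // how often acquire/renew attempts are retried
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// Once the lease is lost, another instance may already be
				// active, so the only safe option is to stop immediately.
				klog.Fatal("leader election lost")
			},
		},
	})
}
```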
Thanks for taking a look, Joel. One of the goals for SNO is to bring the API rollout time down to 60 seconds and have the various cluster operators and their components handle the API downtime gracefully: https://github.com/openshift/library-go/pull/1104. Is it possible to wait for 90 seconds after the error/hitting an issue talking to the API before triggering the election and restarting, to handle it gracefully instead? Thoughts?

> Is it possible to wait for 90 seconds after the error/hitting an issue talking to the API before triggering the election and restarting to handle it gracefully instead? Thoughts?
No, this is not safe. If we can't access the API, we have to assume it's our fault or that the node we are running on has become partitioned. In that scenario, another pod would be scheduled elsewhere, start up, and take the lease from us as soon as it expires. Waiting beyond the lease is not safe.
The only possible option here is extending the lease durations and refresh periods, but I believe the values we are using are standard across OCP. Are you not seeing similar issues in other components?
Also, given that our component will come back and work again after the restart, is this really an issue?
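
To make the timing argument concrete: client-go retries renewal every RetryPeriod and gives up once RenewDeadline has elapsed since the last successful renewal, so as a rule of thumb the longest API outage a leader can ride out is roughly RenewDeadline minus one RetryPeriod. A back-of-the-envelope check under that assumption (the "tight" values below are hypothetical, chosen only to illustrate the failure mode):

```go
package main

import (
	"fmt"
	"time"
)

// toleratesOutage assumes the last renewal attempt that can still succeed
// starts about one RetryPeriod before the RenewDeadline expires.
func toleratesOutage(renewDeadline, retryPeriod, outage time.Duration) bool {
	return renewDeadline-retryPeriod >= outage
}

func main() {
	outage := 60 * time.Second // SNO's target API rollout time

	// Hypothetical tight values: not enough headroom for a 60s outage.
	fmt.Println(toleratesOutage(60*time.Second, 30*time.Second, outage)) // false

	// The values recommended later in this bug: comfortable headroom.
	fmt.Println(toleratesOutage(107*time.Second, 26*time.Second, outage)) // true
}
```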
@jspeed you can find more info here: https://github.com/openshift/enhancements/pull/832. For SNO, our goal is for all the operators to be able to tolerate 60s of API downtime.

I've reviewed the enhancement, and I think the proposed changes are acceptable to our team. We will make sure that our components get updated to the recommended values from the enhancement. For reference, these are LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s.

As per the discussions in the comments and the validation done, we can confirm that no panics or leader elections happen after the fix merged. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
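
For reference, a sketch of plugging the recommended values into a client-go LeaderElectionConfig; the lock and callbacks are caller-supplied assumptions (see the earlier sketch), not the operator's actual code:

```go
package main

import (
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// recommendedElectionConfig applies the timeouts recommended in
// openshift/enhancements#832: LeaseDuration=137s, RenewDeadline=107s,
// RetryPeriod=26s.
func recommendedElectionConfig(lock resourcelock.Interface, cb leaderelection.LeaderCallbacks) leaderelection.LeaderElectionConfig {
	return leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks:     cb,
	}
}
```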