Bug 2022050
Summary: | [BM][IPI] Failed during bootstrap - unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Yurii Prokulevych <yprokule> |
Component: | Networking | Assignee: | Ben Nemec <bnemec> |
Networking sub component: | runtime-cfg | QA Contact: | Victor Voronkov <vvoronko> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | unspecified | CC: | achernet, aos-bugs, bfournie, bnemec, lshilin, mcornea, ncarboni, shardy, stbenjam, vvoronko |
Version: | 4.9 | Keywords: | Regression |
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: Timing issue with bootstrap keepalived process management
Consequence: Keepalived stopped when it should be running
Fix: Multiple keepalived commands are no longer sent within a short time period, which avoids the timing issue
Result: Keepalived runs and stops only when it should
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-03-10 16:26:42 UTC | Type: | Bug |
Bug Depends On: | |||
Bug Blocks: | 2049903 |
Description
Yurii Prokulevych
2021-11-10 16:11:17 UTC
Hmm, this is weird. It looks like the problem is that we're sending simultaneous stop and reload messages to keepalived, and apparently the stop is winning even though reload is sent second:

monitor logs:
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: stop\n"
time="2021-11-10T13:57:25Z" level=info msg="Command message successfully sent to Keepalived container control socket: reload\n"

keepalived logs:
The client sent: stop
The client sent: reload
Wed Nov 10 13:57:25 2021: Stopping

This causes the VIP to be unavailable even though it seems the API has come back up.

My current theory is that something in the main eventloop of the monitor is taking an unusually long time and the handleBootstrapStopKeepalived messages are stacking up and being sent in close proximity to each other instead of with a gap. The stop and reload functions on the keepalived side get called essentially simultaneously and the reload executes before the stop has finished killing the process. As a result, reload sends a signal to a process that's about to die instead of restarting it like we want.

I think we can fix this by just adding a short delay between stop and reload messages. I'll have a patch up shortly.

*** Bug 2049892 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056