Bug 1949664 - Spurious keepalived liveness probe failures
Summary: Spurious keepalived liveness probe failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ben Nemec
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1989751
 
Reported: 2021-04-14 18:15 UTC by Ben Nemec
Modified: 2021-08-04 15:38 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:00:48 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2528 | 0 | None | open | Bug 1949664: [on-prem] Disable liveness probe until keepalived.conf exists | 2021-04-14 21:54:43 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:01:03 UTC

Description Ben Nemec 2021-04-14 18:15:22 UTC
Description of problem: The keepalived liveness probe frequently fails during deployment with "/bin/bash: line 0: kill: `': not a pid or valid job spec". There's no indication in the logs that anything is actually wrong with keepalived, so I suspect there might be an issue with the liveness probe itself.


Version-Release number of selected component (if applicable): 4.8


How reproducible: Seems to be intermittent


Steps to Reproduce:
1. Deploy using dev-scripts
2. Check journal on one of the nodes. Sometimes the message above will be present and the keepalived container will have been restarted.

Actual results: Liveness probe errors and unexpected keepalived restarts.


Expected results: Keepalived starts and runs normally.


Additional info: My working theory right now is that passing the pgrep output to kill in [0] occasionally gets tripped up. I want to try using pkill directly to avoid passing the output through the shell.

0: https://github.com/openshift/machine-config-operator/blob/3fe1270f5d11040119b1f977d6a5604b4e9a80a2/templates/common/on-prem/files/keepalived.yaml#L128
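
For illustration, the failure mode suspected above can be reproduced in isolation. This is a minimal sketch, not the probe from [0]: it assumes the probe has the general shape "look up a PID with pgrep, then hand it to kill", and it uses a made-up process name (nosuchproc_xyz) so that pgrep finds nothing. When pgrep matches no process, the command substitution expands to an empty string and kill rejects it, which is consistent with the empty `' in the error message quoted in the description.

```shell
#!/bin/bash
# Hypothetical process name, chosen so that pgrep matches nothing.
proc="nosuchproc_xyz"

# pgrep prints nothing and exits non-zero when no process matches,
# so pid ends up as an empty string.
pid="$(pgrep -o -x "$proc")"

# kill is handed an empty argument and fails; capture its message and status.
kill_err="$(kill -s USR1 "$pid" 2>&1)"
kill_status=$?
echo "kill status: $kill_status"
echo "kill message: $kill_err"

# pkill does the lookup and the signalling in one step, so nothing is
# passed through the shell; it simply exits non-zero when nothing matches.
pkill -USR1 -o -x "$proc"
pkill_status=$?
echo "pkill status: $pkill_status"
```

Either form still fails when the process is absent; the pkill variant only removes the empty-argument error, it does not make a missing process count as healthy.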

Comment 1 Ben Nemec 2021-04-14 21:53:18 UTC
I was mistaken about the problem here. It's just that it takes too long to populate keepalived.conf on the first start. I have a patch proposed to fix that.
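
The idea in the linked pull request title ("Disable liveness probe until keepalived.conf exists") can be sketched as a guard in front of the health check. This is only an illustration under stated assumptions: the function name probe, the config path passed to it, and the pgrep-based check are all hypothetical and are not taken from the actual patch.

```shell
#!/bin/bash
# Hypothetical probe wrapper.
#   $1: path to the keepalived config file (assumed location)
#   $2: process name to look for (defaults to keepalived)
probe() {
    # Until the config file exists, keepalived may not be running yet,
    # so report success instead of triggering a container restart.
    if [ ! -e "$1" ]; then
        return 0
    fi
    # Config is present: run the real check. Here that is only
    # "a matching process exists", as a stand-in for the actual probe.
    pgrep -x "${2:-keepalived}" >/dev/null
}
```

A call such as `probe /etc/keepalived/keepalived.conf` (path assumed) would then succeed during the initial window while the file is still being populated, and fall through to the real check afterwards.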

Comment 3 Nataf Sharabi 2021-06-03 10:50:19 UTC
After the fix, the bug did not reproduce.

Verifying.

Comment 7 errata-xmlrpc 2021-07-27 23:00:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

