Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1949664

Summary: Spurious keepalived liveness probe failures
Product: OpenShift Container Platform
Reporter: Ben Nemec <bnemec>
Component: Machine Config Operator
Assignee: Ben Nemec <bnemec>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.8
CC: bperkins, rioliu
Target Milestone: ---
Keywords: Triaged
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:00:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1989751

Description Ben Nemec 2021-04-14 18:15:22 UTC
Description of problem: The keepalived liveness probe frequently fails during deployment with "/bin/bash: line 0: kill: `': not a pid or valid job spec". There's no indication in the logs that anything is actually wrong with keepalived, so I suspect there might be an issue with the liveness probe itself.


Version-Release number of selected component (if applicable): 4.8


How reproducible: Seems to be intermittent


Steps to Reproduce:
1. Deploy using dev-scripts
2. Check journal on one of the nodes. Sometimes the message above will be present and the keepalived container will have been restarted.

Actual results: Liveness probe errors and unexpected keepalived restarts.


Expected results: Keepalived starts and runs normally.


Additional info: My working theory right now is that passing the pgrep output to kill in [0] occasionally gets tripped up. I want to try using pkill directly so that no PID has to be passed through shell command substitution.

0: https://github.com/openshift/machine-config-operator/blob/3fe1270f5d11040119b1f977d6a5604b4e9a80a2/templates/common/on-prem/files/keepalived.yaml#L128
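
The working theory above can be illustrated with a minimal sketch. This is not the probe command from the linked template; the process name `nosuchproc` is a hypothetical placeholder chosen so that nothing matches. It shows how an empty pgrep result handed to bash's kill builtin produces the "not a pid or valid job spec" message, and how pkill does the matching and signalling in one step with no intermediate shell output.

```shell
#!/usr/bin/env bash
# Hypothetical process name, assumed not to be running anywhere on the host.
name="nosuchproc"

# pgrep prints nothing when no process matches, so $pid ends up empty.
pid="$(pgrep -o -x "$name")"

# The empty string is then passed to the kill builtin, which rejects it
# with the "not a pid or valid job spec" error and a non-zero status.
msg="$(kill -s USR1 "$pid" 2>&1)"
kill_status=$?
echo "kill exit status: $kill_status"
echo "kill message: $msg"

# pkill matches and signals in a single step; when nothing matches it
# simply returns a non-zero status and prints no error about a PID.
pkill -USR1 -o -x "$name"
pkill_status=$?
echo "pkill exit status: $pkill_status"
```

Either form still fails when the process is absent; the difference is only in whether the failure shows up as a confusing kill error or as a plain non-zero exit status.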

Comment 1 Ben Nemec 2021-04-14 21:53:18 UTC
I was mistaken about the problem here. It's just that it takes too long to populate keepalived.conf on the first start. I have a patch proposed to fix that.
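
For illustration only, the timing issue described here could be handled by waiting for the config file before treating keepalived as unhealthy. The sketch below is an assumption, not the proposed patch (this report does not describe its contents): `wait_for_file` is a hypothetical helper, and a temporary file stands in for keepalived.conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until a file exists and is non-empty.
# Usage: wait_for_file PATH TIMEOUT_SECONDS
# Returns 0 once the file is populated, 1 if the timeout expires first.
wait_for_file() {
  local path="$1" timeout="$2" waited=0
  until [ -s "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Demo: a temporary file stands in for keepalived.conf and is written
# after a short delay, simulating slow config generation on first start.
conf="$(mktemp -d)/keepalived.conf"
( sleep 1; echo "# rendered config" > "$conf" ) &

if wait_for_file "$conf" 5; then
  result="ready"
else
  result="timeout"
fi
wait
echo "config state: $result"
```

A guard like this only separates "not started yet" from "started and unhealthy"; the timeout value is arbitrary and would need to match how long the first config render can actually take.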

Comment 3 Nataf Sharabi 2021-06-03 10:50:19 UTC
After the fix, the bug no longer reproduces.

Verifying.

Comment 7 errata-xmlrpc 2021-07-27 23:00:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2
bug fix and security update), and where to find the updated files, follow the
link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438