Description of problem:
The keepalived liveness probe frequently fails during deployment with "/bin/bash: line 0: kill: `': not a pid or valid job spec". There's no indication in the logs that anything is actually wrong with keepalived, so I suspect there might be an issue with the liveness probe itself.

Version-Release number of selected component (if applicable):
4.8

How reproducible:
Seems to be intermittent

Steps to Reproduce:
1. Deploy using dev-scripts
2. Check the journal on one of the nodes. Sometimes the message above will be present and the keepalived container will have been restarted.

Actual results:
Liveness probe errors and unexpected keepalived restarts.

Expected results:
Keepalived starts and runs normally.

Additional info:
My working theory right now is that passing the pgrep output to kill in [0] is occasionally getting tripped up. I want to try using pkill directly to avoid passing output through the shell.

0: https://github.com/openshift/machine-config-operator/blob/3fe1270f5d11040119b1f977d6a5604b4e9a80a2/templates/common/on-prem/files/keepalived.yaml#L128
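To illustrate the working theory above, here is a minimal shell sketch of how an empty PID lookup leads to the quoted error, plus one possible guard. It is not the probe from [0]: the function names are hypothetical, and signal 0 is used only because it checks for a process without delivering anything.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, and signal 0 (an existence
# check that delivers nothing) is used so the example is harmless to run.

# Unguarded form: if the PID lookup produced no output, kill receives an
# empty argument and bash rejects it as "not a pid or valid job spec".
signal_unguarded() {
    kill -s 0 "$1"
}

# Guarded form: treat an empty lookup as "process not running" and return
# non-zero without calling kill at all.
signal_guarded() {
    if [ -z "$1" ]; then
        echo "no matching process found" >&2
        return 1
    fi
    kill -s 0 "$1"
}
```

A caller would pass the lookup result, for example `signal_guarded "$(pgrep -o keepalived)"`. Both forms still fail when nothing matches; the guard only replaces the confusing kill error with a clearer message.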
I was mistaken about the problem here. It's just that it takes too long to populate keepalived.conf on the first start. I have a patch proposed to fix that.
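For context, one way a probe could tolerate a config file that has not been rendered yet is sketched below. This is an assumption for illustration, not necessarily what the proposed patch does; `probe_when_configured` and its arguments are hypothetical.

```shell
#!/bin/bash
# Sketch only: probe_when_configured is a hypothetical helper.
# $1 is the config file path; the remaining arguments are the real check.
probe_when_configured() {
    conf="$1"
    shift
    # -s is true only for a file that exists and is non-empty.
    if [ ! -s "$conf" ]; then
        # Config not rendered yet: report healthy instead of failing.
        return 0
    fi
    # Config is present, so run the real check and return its status.
    "$@"
}
```

A call might look like `probe_when_configured /etc/keepalived/keepalived.conf pgrep -o keepalived`, where the config path is an assumption. The trade-off is that a config that never appears would go unnoticed by the liveness probe and would have to be caught some other way.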
After the fix, the bug no longer reproduces. Verifying.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438