Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1820362

Summary: crio configuration error on RHEL 7.8 workers
Product: OpenShift Container Platform Reporter: David Critch <dcritch>
Component: Node    Assignee: Ryan Phillips <rphillips>
Status: CLOSED DUPLICATE QA Contact: Sunil Choudhary <schoudha>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.3.z    CC: aos-bugs, jokerman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-02 20:32:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description David Critch 2020-04-02 20:29:45 UTC
Description of problem:
When attempting to add RHEL 7.8 workers to an OCP 4.3 cluster, crio is improperly configured: the crio config file points to a nonexistent 'conmon' binary.

Version-Release number of selected component (if applicable):
RHEL 7.8 workers
OCP 4.3.8 - UPI installation

How reproducible:
Always

Steps to Reproduce:
1. Deploy OCP4 RHCOS workers
2. Provision and prep a RHEL 7.8 server to join the cluster
3. Run openshift-ansible/playbooks/scaleup.yml


Actual results:
On the first run, the playbook errors with:
TASK [openshift_node : Restart the CRI-O service] *******************************************************************************************************************************************************************
Thursday 02 April 2020  19:33:54 +0000 (0:00:01.910)       0:04:51.209 ********
fatal: [worker-2.dc.cloud.lab.eng.bos.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to start service crio: Job for crio.service failed because the control process exited with error code. See \"systemctl status crio.service\" and \"journalctl -xe\" for details.\n"}


Checking the logs on the host:

Apr  2 14:08:29 worker-0 crio: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory. Triggering wipe
Apr  2 14:08:29 worker-0 crio: time="2020-04-02 14:08:29.197775094Z" level=fatal msg="runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Apr  2 14:08:29 worker-0 systemd: crio.service: main process exited, code=exited, status=1/FAILURE
Apr  2 14:08:29 worker-0 systemd: Unit crio.service entered failed state.
Apr  2 14:08:29 worker-0 systemd: crio.service failed.

The issue is that on RHEL 7.8, conmon is installed at /usr/bin/conmon, not /usr/libexec/crio/conmon. After fixing /etc/crio/crio.conf, the play gets further but then fails:
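For reference, the workaround amounts to pointing the conmon setting at the RHEL 7.8 binary location. A minimal sketch of the relevant /etc/crio/crio.conf fragment (the exact section layout can vary between CRI-O versions):

```toml
# /etc/crio/crio.conf on a RHEL 7.8 worker (sketch; section names per CRI-O's TOML layout)
[crio.runtime]
# Broken value pushed down to the node:
#   conmon = "/usr/libexec/crio/conmon"
# Working value on RHEL 7.8, where the conmon package installs to /usr/bin:
conmon = "/usr/bin/conmon"
```

After editing, restart the service with `systemctl restart crio` to pick up the change.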

TASK [openshift_node : Approve node CSR] ****************************************************************************************************************************************************************************
Thursday 02 April 2020  20:09:40 +0000 (0:00:00.734)       0:06:01.697 ********
FAILED - RETRYING: Approve node CSR (6 retries left).
FAILED - RETRYING: Approve node CSR (5 retries left).
FAILED - RETRYING: Approve node CSR (4 retries left).
FAILED - RETRYING: Approve node CSR (3 retries left).
FAILED - RETRYING: Approve node CSR (2 retries left).
FAILED - RETRYING: Approve node CSR (1 retries left).
failed: [worker-2.dc.cloud.lab.eng.bos.redhat.com -> localhost] (item=worker-2.dc.cloud.lab.eng.bos.redhat.com) => {"ansible_loop_var": "item", "attempts": 6, "changed": true, "cmd": "count=0; for csr in `oc --config=/root/rc/ocp/dc-ocp4/auth/kubeconfig get csr --no-headers  | grep \" system:node:worker-2.dc.cloud.lab.eng.bos.redhat.com \"  | cut -d \" \" -f1`;\ndo\n  oc --config=/root/rc/ocp/dc-ocp4/auth/kubeconfig adm certificate approve ${csr};\n  if [ $? -eq 0 ];\n  then\n    count=$((count+1));\n  fi;\ndone; exit $((!count));\n", "delta": "0:00:00.158457", "end": "2020-04-02 20:10:12.107882", "item": "worker-2.dc.cloud.lab.eng.bos.redhat.com", "msg": "non-zero return code", "rc": 1, "start": "2020-04-02 20:10:11.949425", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

At some point during the second run, crio.conf is reverted to the broken version. After fixing the configuration file again and restarting the services, the node finally joins the cluster, though the CSR must be approved manually.
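The manual CSR approval step can be sketched as below. This is a sketch, not the playbook's own logic: it assumes the kubeconfig path from the log above and the usual `oc get csr` output layout, where the CONDITION column comes last.

```shell
#!/bin/sh
# Kubeconfig path taken from the failing task's log; adjust for your cluster.
KUBECONFIG=${KUBECONFIG:-/root/rc/ocp/dc-ocp4/auth/kubeconfig}
export KUBECONFIG

# Filter `oc get csr --no-headers` output down to the names of pending
# requests; the CONDITION column is the last field on each line.
pending_csrs() {
  awk '$NF == "Pending" {print $1}'
}

# Approve every pending CSR (requires a reachable cluster and admin rights).
# Guarded so the script is a no-op where `oc` is not installed.
if command -v oc >/dev/null 2>&1; then
  oc get csr --no-headers | pending_csrs | xargs -r -n1 oc adm certificate approve
fi
```

New nodes typically generate two CSRs (client, then serving), so the command may need to be run more than once as each appears.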

Expected results:
Clean run of scaleup.yml

Additional info:
I did a grep for conmon in openshift-ansible and didn't find anything, which is why I think this configuration is pushed down by OpenShift itself.

Comment 1 David Critch 2020-04-02 20:32:36 UTC

*** This bug has been marked as a duplicate of bug 1819679 ***