Bug 1820362
| Summary: | crio configuration error on RHEL 7.8 workers | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Critch <dcritch> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.z | CC: | aos-bugs, jokerman |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-02 20:32:36 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
*** This bug has been marked as a duplicate of bug 1819679 ***
Description of problem:

When attempting to add RHEL 7.8 workers to an OCP 4.3 cluster, crio is improperly configured: the crio config file points to a nonexistent 'conmon' binary.

Version-Release number of selected component (if applicable):

- RHEL 7.8 workers
- OCP 4.3.8 (UPI installation)

How reproducible:

Always

Steps to Reproduce:
1. Deploy OCP4 RHCOS workers
2. Provision and prep a RHEL 7.8 server to join the cluster
3. Run openshift-ansible/playbooks/scaleup.yml

Actual results:

On the first run, the playbook errors with:

```
TASK [openshift_node : Restart the CRI-O service] *******************************************************************
Thursday 02 April 2020  19:33:54 +0000 (0:00:01.910)       0:04:51.209 ********
fatal: [worker-2.dc.cloud.lab.eng.bos.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to start service crio: Job for crio.service failed because the control process exited with error code. See \"systemctl status crio.service\" and \"journalctl -xe\" for details.\n"}
```

Checking the logs on the host:

```
Apr  2 14:08:29 worker-0 crio: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory. Triggering wipe
Apr  2 14:08:29 worker-0 crio: time="2020-04-02 14:08:29.197775094Z" level=fatal msg="runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Apr  2 14:08:29 worker-0 systemd: crio.service: main process exited, code=exited, status=1/FAILURE
Apr  2 14:08:29 worker-0 systemd: Unit crio.service entered failed state.
Apr  2 14:08:29 worker-0 systemd: crio.service failed.
```

The issue is that on RHEL 7.8, conmon is installed at /usr/bin/conmon, not /usr/libexec/crio/conmon.
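A minimal sketch of the manual workaround, based on the fatal log line above. The assumption is that /etc/crio/crio.conf carries a `conmon = "/usr/libexec/crio/conmon"` setting, while the RHEL 7.8 conmon package installs the binary at /usr/bin/conmon; `fix_conmon_path` is a name invented here for illustration.

```shell
# Hypothetical helper: rewrite the conmon setting in a crio.conf-style
# file to the RHEL 7.8 binary location (/usr/bin/conmon).
fix_conmon_path() {
  local conf="$1"   # path to a crio.conf-style file
  # Replace whatever conmon path is configured with the RHEL 7.8 one.
  sed -i 's|^conmon = .*|conmon = "/usr/bin/conmon"|' "$conf"
}

# On the worker this would be followed by a restart, e.g.:
#   fix_conmon_path /etc/crio/crio.conf && systemctl restart crio
```

Note that, as described below, a re-run of scaleup.yml reverts the file to the broken version, so this edit does not survive the playbook.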
After fixing /etc/crio/crio.conf, the play gets further but fails:

```
TASK [openshift_node : Approve node CSR] ****************************************************************************
Thursday 02 April 2020  20:09:40 +0000 (0:00:00.734)       0:06:01.697 ********
FAILED - RETRYING: Approve node CSR (6 retries left).
FAILED - RETRYING: Approve node CSR (5 retries left).
FAILED - RETRYING: Approve node CSR (4 retries left).
FAILED - RETRYING: Approve node CSR (3 retries left).
FAILED - RETRYING: Approve node CSR (2 retries left).
FAILED - RETRYING: Approve node CSR (1 retries left).
failed: [worker-2.dc.cloud.lab.eng.bos.redhat.com -> localhost] (item=worker-2.dc.cloud.lab.eng.bos.redhat.com) => {"ansible_loop_var": "item", "attempts": 6, "changed": true, "cmd": "count=0; for csr in `oc --config=/root/rc/ocp/dc-ocp4/auth/kubeconfig get csr --no-headers | grep \" system:node:worker-2.dc.cloud.lab.eng.bos.redhat.com \" | cut -d \" \" -f1`;\ndo\n oc --config=/root/rc/ocp/dc-ocp4/auth/kubeconfig adm certificate approve ${csr};\n if [ $? -eq 0 ];\n then\n count=$((count+1));\n fi;\ndone; exit $((!count));\n", "delta": "0:00:00.158457", "end": "2020-04-02 20:10:12.107882", "item": "worker-2.dc.cloud.lab.eng.bos.redhat.com", "msg": "non-zero return code", "rc": 1, "start": "2020-04-02 20:10:11.949425", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
```

At some point during the second run, crio.conf is reverted to the broken version. After fixing the configuration file again and restarting services, the node finally joins the cluster, and the CSR must be approved manually.

Expected results:

A clean run of scaleup.yml.

Additional info:

I grepped for conmon in openshift-ansible and didn't find anything, which is why I think this is a config pushed down by OpenShift.
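For reference, the inline loop in the failing "Approve node CSR" task above can be sketched as a small function: list CSRs, keep those whose requestor matches the joining node, and approve each one with `oc adm certificate approve`. The function name is invented here, and the kubeconfig is assumed to be set in the environment rather than passed via `--config` as in the playbook.

```shell
# Hypothetical equivalent of the playbook's CSR-approval step.
# Assumes KUBECONFIG points at the cluster and the node's pending CSRs
# show a requestor of system:node:<node-name>.
approve_node_csrs() {
  local node="$1" csr
  # Match any `oc get csr` line mentioning this node and take the CSR name
  # (first column), then approve each match.
  for csr in $(oc get csr --no-headers \
      | awk -v n="system:node:${node}" '$0 ~ n { print $1 }'); do
    oc adm certificate approve "$csr"
  done
}

# e.g. approve_node_csrs worker-2.dc.cloud.lab.eng.bos.redhat.com
```

In this failure the loop exits non-zero simply because no matching CSR exists yet (crio never came up, so the kubelet never requested one), which is why the task retries and then gives up.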