Bug 1809906 - [4.4] crio service can't be started on rhel78 rc worker nodes
Summary: [4.4] crio service can't be started on rhel78 rc worker nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Jindrich Novy
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1819679 1820051
Reported: 2020-03-04 06:53 UTC by Yadan Pei
Modified: 2020-05-04 11:44 UTC

Fixed In Version: conmon-2.0.11-2.rhaos4.4.el8
Doc Type: Bug Fix
Doc Text:
Cause: RHEL 7 nodes could not join an OCP cluster correctly because crio did not start. Consequence: Fix: conmon was bumped to a build that contains the fix. Result:
Clone Of:
: 1819679 1820051
Environment:
Last Closed: 2020-05-04 11:44:29 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:44:47 UTC

Description Yadan Pei 2020-03-04 06:53:06 UTC
Description of problem:
The conmon package in the rhel7-extra repository we use is conmon-2.0.8-1.el7.x86_64.rpm. It installs successfully, but it does not provide the /usr/libexec/crio/conmon file, so crio fails to start. This file can be found in conmon-2.0.8-2.el7.x86_64.rpm.
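
One way to confirm whether an installed package ships the file is to search its file list for the exact path. The sketch below is illustrative only: the helper name list_has_path is hypothetical, and it assumes the file list arrives on stdin, for example from rpm -ql.

```shell
#!/bin/sh
# list_has_path is a hypothetical helper, not part of any package.
# It returns 0 when the file list on stdin contains exactly the given path.
list_has_path() {
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$1"
}

# Possible usage on a worker node (requires rpm):
#   rpm -ql conmon | list_has_path /usr/libexec/crio/conmon \
#       || echo "conmon does not provide /usr/libexec/crio/conmon"
```

Matching the whole line avoids false positives from longer paths that merely contain the same prefix.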


Version-Release number of selected component (if applicable):
OCP 4.3.2
rhel_repo: http://download.eng.bos.redhat.com/rhel-7/rel-eng/RHEL-7/RHEL-7.8-20200225.1/compose/Server/x86_64/os
rhel_optional_repo: http://download.eng.bos.redhat.com/rhel-7/rel-eng/RHEL-7/RHEL-7.8-20200225.1/compose/Server-optional/x86_64/os

How reproducible:
Always

Steps to Reproduce:
1. Set up an OCP cluster at version 4.3.2, then add RHEL 7.8 RC worker nodes.


Actual results:
1. The crio service on RHEL 7.8 RC worker nodes can't be started. Here are some logs and messages:
$ systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2020-03-04 01:24:27 EST; 3min 4s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 10693 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 10693 (code=exited, status=1/FAILURE)

Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: Starting Open Container Initiative Daemon...
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 crio[10693]: time="2020-03-04 01:24:27.618774609-05:00" level=fatal msg="runtime config: ...ctory"
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: crio.service: main process exited, code=exited, status=1/FAILURE
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: Failed to start Open Container Initiative Daemon.
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: Unit crio.service entered failed state.
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: crio.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

$ journalctl -a | grep crio
Mar 04 01:24:19 qe-lpt-4613-k6nlr-rhel-3 ansible-systemd[10394]: Invoked with no_block=False force=None name=crio daemon_reexec=False enabled=True daemon_reload=False state=None masked=None scope=None user=None
Mar 04 01:24:25 qe-lpt-4613-k6nlr-rhel-3 ansible-systemd[10665]: Invoked with no_block=False force=None name=crio daemon_reexec=False enabled=None daemon_reload=False state=restarted masked=None scope=None user=None
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 crio[10670]: version file /var/lib/crio/version not found: open /var/lib/crio/version: no such file or directory. Triggering wipe
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 crio[10693]: time="2020-03-04 01:24:27.618774609-05:00" level=fatal msg="runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: crio.service: main process exited, code=exited, status=1/FAILURE
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: Unit crio.service entered failed state.
Mar 04 01:24:27 qe-lpt-4613-k6nlr-rhel-3 systemd[1]: crio.service failed.
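
The fatal message above is the key line: crio exits because it cannot stat the configured conmon path. As a rough sketch (the function name missing_conmon_path is hypothetical, and the pattern only matches the exact wording of the error quoted in this report), the missing path can be pulled out of the log like this:

```shell
#!/bin/sh
# missing_conmon_path is a hypothetical helper for illustration.
# It reads log lines on stdin and prints the path crio failed to stat,
# based on the "invalid conmon path" message shown in this report.
missing_conmon_path() {
    sed -n 's/.*invalid conmon path: stat \([^:]*\): no such file or directory.*/\1/p'
}

# Possible usage on a worker node:
#   journalctl -u crio --no-pager | missing_conmon_path | sort -u
```

Lines without the error produce no output, so an empty result means this particular failure was not logged.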

$ rpm -qa | grep conmon
conmon-2.0.8-1.el7.x86_64

$ rpm -qi conmon-2.0.8-1.el7.x86_64
Name        : conmon
Epoch       : 2
Version     : 2.0.8
Release     : 1.el7
Architecture: x86_64
Install Date: Wed Mar  4 01:23:04 2020
Group       : Unspecified
Size        : 82409
License     : ASL 2.0
Signature   : RSA/SHA256, Fri Jan 17 10:14:59 2020, Key ID 199e2f91fd431d51
Source RPM  : conmon-2.0.8-1.el7.src.rpm
Build Date  : Mon Dec 16 08:01:19 2019
Build Host  : x86-vm-27.build.eng.bos.redhat.com
Relocations : (not relocatable)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor      : Red Hat, Inc.
URL         : https://github.com/containers/conmon
Summary     : OCI container runtime monitor
Description :
OCI container runtime monitor.

$ repoquery -i conmon-2.0.8-1.el7.x86_64
Failed to set locale, defaulting to C

Name        : conmon
Version     : 2.0.8
Release     : 1.el7
Architecture: x86_64
Size        : 82409
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Group       : Unspecified
URL         : https://github.com/containers/conmon
Repository  : rhel7-extra
Summary     : OCI container runtime monitor
Source      : conmon-2.0.8-1.el7.src.rpm
Description :
OCI container runtime monitor.

$ cat /etc/yum.repos.d/qe_additional.repo 
[rhel7]
name=rhel7
baseurl=http://download.eng.bos.redhat.com/rhel-7/rel-eng/RHEL-7/RHEL-7.8-20200225.1/compose/Server/x86_64/os
metadata_expire=1
enabled=1
gpgcheck=0
[rhel7-extra]
name=rhel7-extra
baseurl=http://download.eng.bos.redhat.com/rhel-7/rel-eng/EXTRAS-7/latest-EXTRAS-7.8-RHEL-7/compose/Server/x86_64/os
metadata_expire=1
enabled=1
gpgcheck=0
[rhel7-optional]
name=rhel7-optional
baseurl=http://download.eng.bos.redhat.com/rhel-7/rel-eng/RHEL-7/RHEL-7.8-20200225.1/compose/Server-optional/x86_64/os
metadata_expire=1
enabled=1
gpgcheck=0
[aos-v4-install]
name=aos-v4-install
baseurl=http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/4.3/latest/x86_64/os/
metadata_expire=1
enabled=1
gpgcheck=0


Expected results:
1. The crio service should start and run successfully.

Additional info:
Adding RHEL 7.7 worker nodes works as expected. Adding RHEL 7.8 worker nodes to OCP 4.1 and OCP 4.2 works well too.

Comment 1 Peter Hunt 2020-03-04 14:26:33 UTC
Jindrich,

There's a make target in conmon: `make crio` that will install conmon there (this will have to have happened after we remove the local from /usr/local/bin). Would you be able to also call this in the rpm spec?

Comment 9 Yadan Pei 2020-04-02 09:49:03 UTC
1. Install a cluster with 4.4.0-0.nightly-2020-04-01-213929.
2. Add RHEL 7.8 workers to the cluster. The RHEL workers are created successfully.

$ oc get node -o wide
NAME                                   STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
qe-ui-date0402-z9fjq-compute-0         Ready    worker   158m    v1.17.1   10.0.98.28    <none>        Red Hat Enterprise Linux CoreOS 44.81.202004011917-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
qe-ui-date0402-z9fjq-compute-1         Ready    worker   158m    v1.17.1   10.0.97.141   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004011917-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
qe-ui-date0402-z9fjq-compute-2         Ready    worker   158m    v1.17.1   10.0.98.70    <none>        Red Hat Enterprise Linux CoreOS 44.81.202004011917-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
qe-ui-date0402-z9fjq-control-plane-0   Ready    master   179m    v1.17.1   10.0.99.228   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004011917-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
qe-ui-date0402-z9fjq-control-plane-1   Ready    master   179m    v1.17.1   10.0.99.210   <none>        Red Hat Enterprise Linux CoreOS 44.81.202004011917-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
qe-ui-date0402-z9fjq-control-plane-2   Ready    master   178m    v1.17.1   10.0.97.97    <none>        Red Hat Enterprise Linux CoreOS 44.81.202004011917-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el8
qe-ui-date0402-z9fjq-rhel2-0           Ready    worker   3m24s   v1.17.1   10.0.97.162   <none>        Red Hat Enterprise Linux Server 7.8 (Maipo)                    3.10.0-1127.el7.x86_64        cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7

$ yum list conmon --showdup
Failed to set locale, defaulting to C
Loaded plugins: search-disabled-repos
Installed Packages
conmon.x86_64                                         2:2.0.11-2.rhaos4.4.el7                                         @aos-v4-devel-install
Available Packages
conmon.x86_64                                         2:2.0.8-1.el7                                                   rhel7-extra          
conmon.x86_64                                         2:2.0.11-2.rhaos4.4.el7                                         aos-v4-devel-install 


conmon 2.0.11-2.rhaos4.4.el7 is installed rather than the 2.0.8-1.el7 build from rhel7-extra.
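
The same check can be scripted when verifying several workers. This is a minimal sketch, assuming the yum list output keeps the "Installed Packages" and "Available Packages" headings shown above; the function name installed_conmon_version is hypothetical.

```shell
#!/bin/sh
# installed_conmon_version is a hypothetical helper for illustration.
# It reads `yum list conmon --showdup` output on stdin and prints the
# version column of conmon entries listed under "Installed Packages".
installed_conmon_version() {
    awk '
        /^Installed Packages/ { installed = 1; next }
        /^Available Packages/ { installed = 0 }
        installed && $1 ~ /^conmon\./ { print $2 }
    '
}

# Possible usage on a worker node:
#   yum list conmon --showdup | installed_conmon_version
```

Entries under "Available Packages" are ignored, so the rhel7-extra build is not reported unless it is the one actually installed.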

Comment 11 errata-xmlrpc 2020-05-04 11:44:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

