
Bug 1992557

Summary:	failed to start cri-o service due to /usr/libexec/crio/conmon is missing
Product:	OpenShift Container Platform	Reporter:	Yunfei Jiang <yunjiang>
Component:	Node	Assignee:	Peter Hunt <pehunt>
Node sub component:	CRI-O	QA Contact:	Mike Fiedler <mifiedle>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	aos-bugs, bparees, dornelas, gpei, jchaloup, jiwei, jligon, krmoser, miabbott, mifiedle, mko, mnguyen, mrussell, nstielau, sippy, stbenjam, tsze, wking
Version:	4.9	Keywords:	FastFix, TestBlocker
Target Milestone:	---
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	cri-o-1.22.0-34.rhaos4.9.git78f06f2.el8	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1993119		Environment:	job=promote-release-openshift-machine-os-content-e2e-aws-4.9=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-aws=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-serial=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6=all job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-assisted=all
Last Closed:	2021-10-18 17:45:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1993119, 1993385

Description Yunfei Jiang 2021-08-11 10:06:34 UTC

While installing an arm64-based OCP cluster on AWS, the bootstrap process failed. The crio.log on one master machine shows:

Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549627954Z" level=info msg="Starting CRI-O, version: 1.22.0-33.rhaos4.9.git78f06f2.el8, git: ()"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549833976Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549844421Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549947944Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.557624432Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.563474971Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.565985708Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.616635679Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: crio.service: Consumed 115ms CPU time
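
When the journal is long, the fatal entry can be pulled out on its own. The following is a minimal sketch and not part of the original report: the function name `crio_fatal_messages` is hypothetical, and it assumes the `level=fatal msg="..."` layout seen in the log above.

```shell
# Hypothetical helper: read CRI-O journal text on stdin and print the
# msg="..." payload of every level=fatal entry, one per line.
# Assumes each entry ends with the closing quote of its msg field.
crio_fatal_messages() {
    sed -n 's/.*level=fatal msg="\(.*\)"[[:space:]]*$/\1/p'
}
```

On a node, it could be fed from the unit's journal, for example `journalctl -u crio --no-pager | crio_fatal_messages`.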




Version:
OCP: 4.9.0-0.nightly-arm64-2021-08-11-014517
rhcos: 49.84.202108101747-0
Crio version: 1.22.0-33.rhaos4.9.git78f06f2.el8

Platform:
ARM on AWS

How to reproduce it (as minimally and precisely as possible)?
Install an arm64-based OCP cluster on AWS via IPI

Additional info:
The previous version, 4.9.0-0.nightly-arm64-2021-08-09-045415, works well. The package differences between the two RHCOS builds are:

Package (NEVR)   49.84.202108060947-0                                 49.84.202108101747-0
cri-o            cri-o-0-1.22.0-28.rhaos4.9.git126b893.el8-aarch64    cri-o-0-1.22.0-33.rhaos4.9.git78f06f2.el8-aarch64

Comment 1 Jianli Wei 2021-08-11 12:33:00 UTC

Hit the same blocking issue when attempting an IPI install on GCP.

Version:
OCP: 4.9.0-0.nightly-2021-08-11-014539
rhcos: 49.84.202108091543-0 (2021-08-09T15:46:28Z)
Crio version: cri-o-1.22.0-28.rhaos4.9.git126b893.el8.x86_64


Noticed the following log entry on one of the control plane nodes:

Aug 11 06:06:42 jiwei-0811-02-vkzw5-master-0.c.openshift-qe.internal crio[1455]: time="2021-08-11 06:06:42.605300219Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"

Comment 2 Stephen Benjamin 2021-08-11 13:06:56 UTC

*** Bug 1992628 has been marked as a duplicate of this bug. ***

Comment 3 Stephen Benjamin 2021-08-11 13:08:12 UTC

This appears to be blocking AWS image promotion jobs for amd64 too.

Comment 4 Jan Chaloupka 2021-08-11 13:18:06 UTC

This also causes the following jobs to fail during bootstrapping:
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-serial
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6

Comment 5 Micah Abbott 2021-08-11 13:20:54 UTC

This looks like a `cri-o` + `conmon` incompatibility; sending to Node for triage

Latest RHCOS 4.9 has `cri-o-1.22.0-33.rhaos4.9.git78f06f2.el8` and `conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7`

Comment 6 Peter Hunt 2021-08-11 13:41:06 UTC

It seems conmon is now being pulled from RHEL instead of the snowflake RHCOS build we were previously using. The latter installed it in a special CRI-O-specific path (/usr/libexec/crio/conmon). I have updated the cri-o spec to stop using this special path, which makes it compatible with both conmon builds.
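
To see which location a given node actually provides, the candidate paths can be probed directly. This is a rough sketch under stated assumptions: `find_conmon` is a hypothetical helper name, and `/usr/bin/conmon` in the usage note is assumed to be where the RHEL package installs the binary (the report itself only names the CRI-O-specific path).

```shell
# Hypothetical helper: print the first candidate path that is an
# executable regular file. Returns non-zero when no candidate qualifies.
find_conmon() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

For example, `find_conmon /usr/libexec/crio/conmon /usr/bin/conmon` would print whichever of the two exists first, or return non-zero if neither does.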

Comment 8 Stephen Benjamin 2021-08-12 10:45:30 UTC

The nightlies from today have the new package, but still exhibit this problem:

-- Logs begin at Thu 2021-08-12 09:41:56 UTC, end at Thu 2021-08-12 10:15:15 UTC. --
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272156288Z" level=info msg="Starting CRI-O, version: 1.22.0-34.rhaos4.9.git78f06f2.el8, git: ()"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272580708Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272600734Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272744686Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.284951055Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.294647247Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.297825297Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.346246003Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: crio.service: Consumed 170ms CPU time



Sample job: 
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp/1425727727222132736

Comment 9 Stephen Benjamin 2021-08-12 10:46:44 UTC

This is the correct job link: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp/1425752811122987008
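
Because the log in Comment 8 shows 1.22.0-34 still looking for /usr/libexec/crio/conmon, one thing worth checking is whether a configuration file sets the path explicitly. That is an assumption rather than something confirmed in this report. The sketch below lists every `conmon = "..."` setting found in the files it is given; `conmon_settings` is a hypothetical helper name.

```shell
# Hypothetical helper: for each existing file argument, print
# "<file>: <value>" for every line of the form  conmon = "<value>".
# Related keys such as conmon_cgroup or conmon_env are not matched.
conmon_settings() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        sed -n 's/^[[:space:]]*conmon[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$f" |
        while IFS= read -r value; do
            printf '%s: %s\n' "$f" "$value"
        done
    done
}
```

A typical invocation would be `conmon_settings /etc/crio/crio.conf /etc/crio/crio.conf.d/*`. An empty value means CRI-O searches `$PATH` for the binary.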

Comment 10 Prashanth Sundararaman 2021-08-12 14:02:49 UTC

*** Bug 1992995 has been marked as a duplicate of this bug. ***

Comment 11 Michael Nguyen 2021-08-12 14:07:43 UTC

On a single RHCOS node (OCP not involved), I can start cri-o with no issues now.

[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
[core@cosa-devsh ~]$ sudo systemctl start crio
[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-08-12 13:53:17 UTC; 3s ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 1673 (crio)
    Tasks: 15
   Memory: 92.2M
   CGroup: /system.slice/crio.service
           └─1673 /usr/bin/crio

Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.078425101Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NE>
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.132764115Z" level=info msg="Conmon does support the --sync option"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.133483246Z" level=info msg="No seccomp profile specified, using the internal default"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.133611743Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.146611033Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.155160092Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.184052352Z" level=info msg="Found CNI network podman (type=bridge) at /etc/cni/net.d/87-podman-bridge.conflist"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.184291202Z" level=info msg="Updated default CNI network name to crio"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.251163336Z" level=info msg="Serving metrics on :9537 via HTTP"
Aug 12 13:53:17 cosa-devsh systemd[1]: Started Container Runtime Interface for OCI (CRI-O).
[core@cosa-devsh ~]$ rpm -q cri-o conmon
cri-o-1.22.0-34.rhaos4.9.git78f06f2.el8.x86_64
conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7.x86_64
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● ostree://b2b64c89c62afe2fc03e10a63ff66bcee8f2b6a691e8b0dd2723c6f96c46f58f
                   Version: 49.84.202108120339-0 (2021-08-12T03:42:57Z)


---------------------------------------------



This was the previous RHCOS build with the older cri-o:

[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
[core@cosa-devsh ~]$ sudo systemctl start crio
Job for crio.service failed because the control process exited with error code.
See "systemctl status crio.service" and "journalctl -xe" for details.
[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2021-08-12 14:04:11 UTC; 3s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 1641 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1641 (code=exited, status=1/FAILURE)

Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.009398474Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.009412791Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.009519135Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.021998249Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.030649093Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.105317646Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NE>
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.148800742Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 12 14:04:11 cosa-devsh systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 14:04:11 cosa-devsh systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 12 14:04:11 cosa-devsh systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
[core@cosa-devsh ~]$ rpm -q cri-o conmon
cri-o-1.22.0-33.rhaos4.9.git78f06f2.el8.x86_64
conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7.x86_64
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● ostree://80dddf7dcfffafd4c3fa4575c87c6ee4058f6d544ba8854d2a01efb316d7750a
                   Version: 49.84.202108110218-0 (2021-08-11T02:21:55Z)

Comment 15 Mike Fiedler 2021-08-16 19:00:17 UTC

Verified on 4.9.0-0.nightly-2021-08-14-065522

Comment 16 Lisa Ranjbar 2021-08-26 23:09:15 UTC

*** Bug 1992723 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2021-10-18 17:45:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759