Installing an arm64-based OCP cluster on AWS, the bootstrap process failed. Checking crio.log on one master machine:

Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549627954Z" level=info msg="Starting CRI-O, version: 1.22.0-33.rhaos4.9.git78f06f2.el8, git: ()"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549833976Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549844421Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.549947944Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.557624432Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.563474971Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.565985708Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
Aug 11 06:00:27 ip-10-0-158-196 crio[1357]: time="2021-08-11 06:00:27.616635679Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
Aug 11 06:00:27 ip-10-0-158-196 systemd[1]: crio.service: Consumed 115ms CPU time

Version:
OCP: 4.9.0-0.nightly-arm64-2021-08-11-014517
rhcos: 49.84.202108101747-0
Crio version: 1.22.0-33.rhaos4.9.git78f06f2.el8
Platform: ARM on AWS

How to reproduce it (as minimally and precisely as possible)?
Install an arm64-based OCP cluster on AWS via IPI.

Additional info: the previous version 4.9.0-0.nightly-arm64-2021-08-09-045415 works well. The differences are:

Package (NEVR)   49.84.202108060947-0                                49.84.202108101747-0
cri-o            cri-o-0-1.22.0-28.rhaos4.9.git126b893.el8-aarch64   cri-o-0-1.22.0-33.rhaos4.9.git78f06f2.el8-aarch64
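For anyone triaging similar failures: the fatal log line corresponds to CRI-O stat()-ing the configured conmon binary during runtime-config validation at startup. A minimal shell sketch of that check (illustrative only, not CRI-O's actual Go code; `check_conmon` is a hypothetical helper):

```shell
#!/bin/sh
# Hypothetical helper mimicking CRI-O's startup validation: stat() the
# configured conmon path and fail fatally if the binary is missing.
check_conmon() {
  path="$1"
  if stat "$path" >/dev/null 2>&1; then
    echo "conmon OK: $path"
    return 0
  fi
  echo "validating runtime config: conmon validation: invalid conmon path: stat $path: no such file or directory" >&2
  return 1
}

# On the affected builds /usr/libexec/crio/conmon does not exist, so the
# check fails and crio.service exits with status 1.
check_conmon /usr/libexec/crio/conmon || echo "crio.service would exit 1/FAILURE"
```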
Got the same blocker issue when trying IPI on GCP.

Version:
OCP: 4.9.0-0.nightly-2021-08-11-014539
rhcos: 49.84.202108091543-0 (2021-08-09T15:46:28Z)
Crio version: cri-o-1.22.0-28.rhaos4.9.git126b893.el8.x86_64

Noticed the log below on one of the control nodes:

Aug 11 06:06:42 jiwei-0811-02-vkzw5-master-0.c.openshift-qe.internal crio[1455]: time="2021-08-11 06:06:42.605300219Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
*** Bug 1992628 has been marked as a duplicate of this bug. ***
This appears to be blocking AWS image promotion jobs for amd64 too.
Causes the following jobs to fail bootstrapping as well:
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-serial
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi
- https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-blocking#periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6
This looks like a `cri-o` + `conmon` incompatibility; sending to Node for triage. Latest RHCOS 4.9 has `cri-o-1.22.0-33.rhaos4.9.git78f06f2.el8` and `conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7`.
It seems conmon is now being pulled from RHEL instead of the snowflake RHCOS build we were previously using; the latter put it in a special cri-o-specific path. I have updated the cri-o spec to stop using this special path, which is compatible with both conmon packages.
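The fix above can be sketched as a path fallback (a hypothetical `resolve_conmon` helper for illustration, not the actual spec change): instead of hardcoding the RHCOS-only /usr/libexec/crio/conmon location, also accept the standard RHEL packaging location /usr/bin/conmon, so either layout works.

```shell
#!/bin/sh
# Hypothetical sketch of the fallback: prefer the old cri-o-specific
# location, then fall back to the standard RHEL location. The $1 prefix
# lets the demo run against a fake root instead of the real filesystem.
resolve_conmon() {
  root="${1:-}"
  for p in "$root/usr/libexec/crio/conmon" "$root/usr/bin/conmon"; do
    if [ -x "$p" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  echo "invalid conmon path" >&2
  return 1
}

# Demo: fake root laid out like the RHEL conmon package (binary in /usr/bin).
fake_root="$(mktemp -d)"
mkdir -p "$fake_root/usr/bin"
touch "$fake_root/usr/bin/conmon"
chmod +x "$fake_root/usr/bin/conmon"
resolve_conmon "$fake_root"
rm -rf "$fake_root"
```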
The nightlies from today have the new package, but still exhibit this problem:

-- Logs begin at Thu 2021-08-12 09:41:56 UTC, end at Thu 2021-08-12 10:15:15 UTC. --
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: Starting Container Runtime Interface for OCI (CRI-O)...
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272156288Z" level=info msg="Starting CRI-O, version: 1.22.0-34.rhaos4.9.git78f06f2.el8, git: ()"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272580708Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272600734Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.272744686Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.284951055Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.294647247Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.297825297Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 crio[1489]: time="2021-08-12 09:48:54.346246003Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).
Aug 12 09:48:54 ci-op-qlt3hjx0-00eff-7xjv5-master-0 systemd[1]: crio.service: Consumed 170ms CPU time

Sample job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp/1425727727222132736
This is the correct job link: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp/1425752811122987008
*** Bug 1992995 has been marked as a duplicate of this bug. ***
On a single RHCOS node (OCP not involved), I can start cri-o with no issues now.

[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
[core@cosa-devsh ~]$ sudo systemctl start crio
[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-08-12 13:53:17 UTC; 3s ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 1673 (crio)
    Tasks: 15
   Memory: 92.2M
   CGroup: /system.slice/crio.service
           └─1673 /usr/bin/crio

Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.078425101Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NE>
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.132764115Z" level=info msg="Conmon does support the --sync option"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.133483246Z" level=info msg="No seccomp profile specified, using the internal default"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.133611743Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.146611033Z" level=info msg="Found CNI network crio (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.155160092Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.184052352Z" level=info msg="Found CNI network podman (type=bridge) at /etc/cni/net.d/87-podman-bridge.conflist"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.184291202Z" level=info msg="Updated default CNI network name to crio"
Aug 12 13:53:17 cosa-devsh crio[1673]: time="2021-08-12 13:53:17.251163336Z" level=info msg="Serving metrics on :9537 via HTTP"
Aug 12 13:53:17 cosa-devsh systemd[1]: Started Container Runtime Interface for OCI (CRI-O).

[core@cosa-devsh ~]$ rpm -q cri-o conmon
cri-o-1.22.0-34.rhaos4.9.git78f06f2.el8.x86_64
conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7.x86_64
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● ostree://b2b64c89c62afe2fc03e10a63ff66bcee8f2b6a691e8b0dd2723c6f96c46f58f
                   Version: 49.84.202108120339-0 (2021-08-12T03:42:57Z)

---------------------------------------------
This was the previous RHCOS build with the older cri-o:

[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://github.com/cri-o/cri-o
[core@cosa-devsh ~]$ sudo systemctl start crio
Job for crio.service failed because the control process exited with error code.
See "systemctl status crio.service" and "journalctl -xe" for details.
[core@cosa-devsh ~]$ sudo systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2021-08-12 14:04:11 UTC; 3s ago
     Docs: https://github.com/cri-o/cri-o
  Process: 1641 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1641 (code=exited, status=1/FAILURE)

Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.009398474Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.009412791Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.009519135Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.021998249Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.030649093Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.105317646Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NE>
Aug 12 14:04:11 cosa-devsh crio[1641]: time="2021-08-12 14:04:11.148800742Z" level=fatal msg="validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 12 14:04:11 cosa-devsh systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 14:04:11 cosa-devsh systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 12 14:04:11 cosa-devsh systemd[1]: Failed to start Container Runtime Interface for OCI (CRI-O).

[core@cosa-devsh ~]$ rpm -q cri-o conmon
cri-o-1.22.0-33.rhaos4.9.git78f06f2.el8.x86_64
conmon-2.0.29-1.module+el8.4.0+11822+6cc1e7d7.x86_64
[core@cosa-devsh ~]$ rpm-ostree status
State: idle
Deployments:
● ostree://80dddf7dcfffafd4c3fa4575c87c6ee4058f6d544ba8854d2a01efb316d7750a
                   Version: 49.84.202108110218-0 (2021-08-11T02:21:55Z)
Verified on 4.9.0-0.nightly-2021-08-14-065522.
*** Bug 1992723 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759