+++ This bug was initially created as a clone of Bug #1995809 +++
+++ This bug was initially created as a clone of Bug #1995785 +++

Description of problem:

Another piece of the fallout from https://bugzilla.redhat.com/show_bug.cgi?id=1993385 is an interesting interaction between rpm-ostree and older versions of the MCO. If a cluster was ever at a version where the MCO configured /etc/crio/crio.conf (4.5 or earlier), then updates to the cri-o RPM will not update that crio.conf file (for example, to update the conmon path). Since the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1993385 only updated the MCO to *not* specify the conmon path in the drop-in template (expecting it to fall back to the CRI-O default of ""), the pre-existing value in /etc/crio/crio.conf (left untouched by the RPM fix) prevails, causing cri-o to expect conmon at /usr/libexec/crio/conmon, which no longer exists. This prevents nodes from coming up.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Upgrade a node from 4.5 to the affected versions (going through each minor version).
2. Notice cri-o does not come up, in similar ways to https://bugzilla.redhat.com/show_bug.cgi?id=1993385.

Actual results:
The node does not come up.

Expected results:
The node starts.

Additional info:
Successfully upgraded 4.4.31 (with a containerruntimeconfig change) -> 4.5.41 -> 4.6.42 -> 4.7 + https://github.com/openshift/machine-config-operator/pull/2725. After the upgrade, see below for the crio config. Can this bug be considered verified?

$ oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.7.0-0.ci.test-2021-08-20-095005-ci-ln-l8x290b-latest   True        False         7m20s     Cluster version is 4.7.0-0.ci.test-2021-08-20-095005-ci-ln-l8x290b-latest

Access a master/worker node to make sure the crio service is running:

$ oc debug node/ip-10-0-142-171.us-east-2.compute.internal
Starting pod/ip-10-0-142-171us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.142.171
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf
   Active: active (running) since Fri 2021-08-20 12:11:04 UTC; 11min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 1426 (crio)
    Tasks: 28
   Memory: 3.3G
      CPU: 3min 59.553s
   CGroup: /system.slice/crio.service
           └─1426 /usr/bin/crio --enable-metrics=true --metrics-port=9537

Make sure the changes can be found at /etc/crio/crio.conf.d:

[crio]
internal_wipe = true
storage_driver = "overlay"
storage_option = [
    "overlay.override_kernel_check=1",
]

[crio.api]
stream_address = ""
stream_port = "10010"

[crio.runtime]
selinux = true
conmon = ""
conmon_cgroup = "pod"
default_env = [
    "NSS_SDB_USE_CACHE=no",
]
log_level = "info"
cgroup_manager = "systemd"
default_sysctls = [
    "net.ipv4.ping_group_range=0 2147483647",
]
hooks_dir = [
    "/etc/containers/oci/hooks.d",
    "/run/containers/oci/hooks.d",
]
manage_ns_lifecycle = true

[crio.image]
global_auth_file = "/var/lib/kubelet/config.json"
pause_image = "registry.build01.ci.openshift.org/ci-ln-l8x290b/stable@sha256:b650d1a5798534f222e52b1d951f49f4d4b8b0af3b817055d9dc6eb9b8705054"
pause_image_auth_file = "/var/lib/kubelet/config.json"
pause_command = "/usr/bin/pod"

[crio.network]
network_dir = "/etc/kubernetes/cni/net.d/"
plugin_dirs = [
    "/var/lib/cni/bin",
    "/usr/libexec/cni",
]

[crio.metrics]
enable_metrics = true
metrics_port = 9537
I believe Peter Hunt found a quicker reproducer for this that does not involve going all the way back to 4.4.z:

1. Install 4.6.z (I used 4.6.42).
2. oc debug to a worker, edit /etc/crio/crio.conf, make some changes (I changed log_level and turned metrics on), and save the file.
3. I also created a containerruntime config, but that might be optional.
4. Upgrade to 4.7.25. The upgrade will get stuck with a node NotReady.
5. ssh into the NotReady node and verify /etc/crio/crio.conf is still there.
6. systemctl status crio

Aug 20 21:43:23 ip-10-0-213-225 crio[6294]: time="2021-08-20 21:43:23.957070985Z" level=fatal msg="Validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: Failed to start Open Container Initiative Daemon.

Next step: repeat with an upgrade to a build of https://github.com/openshift/machine-config-operator/pull/2725 to see if it fixes the issue.
Repeated the steps in comment 2, this time upgrading to a payload built from https://github.com/openshift/machine-config-operator/pull/2725, and the upgrade was successful. Post install:

# crio config | grep conmon
INFO[0000] Starting CRI-O, version: 1.20.4-11.rhaos4.7.git9d682e1.el8, git: ()
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
# Path to the conmon binary, used for monitoring the OCI runtime.
conmon = ""
# Cgroup setting for conmon
conmon_cgroup = "pod"
# Environment variable list for the conmon process, used for passing necessary
# environment variables to conmon or the runtime.
conmon_env = [
Verified on 4.7.0-0.nightly-2021-08-21-153346 using the updated reproducer steps in comment 2:

1. Install 4.6.42.
2. oc debug to a worker, edit /etc/crio/crio.conf, make some changes (I changed log_level and turned metrics on), and save the file.
3. Create a containerruntime config with the following contents:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: high-pid-limit
  containerRuntimeConfig:
    pidsLimit: 2048

4. oc label machineconfigpool worker custom-crio=high-pid-limit
5. oc get mcp worker -w and watch for all workers to be ready
6. oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-08-21-153346

- Verify the upgrade is successful.
- oc debug to the node where crio.conf was modified and verify the customizations are still in place.
- crio config | grep conmon and verify the value is "" and not /usr/libexec/crio/conmon.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.7.28 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3262