Bug 1995810
Summary: | long living clusters may fail to upgrade because of an invalid conmon path | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | Node | Assignee: | Peter Hunt <pehunt> |
Node sub component: | CRI-O | QA Contact: | Mike Fiedler <mifiedle> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | | |
Priority: | high | CC: | aos-bugs, kuiwang, mifiedle, pehunt, schoudha, smilner, wking |
Version: | 4.7 | Keywords: | FastFix, Regression, Upgrades |
Target Milestone: | --- | | |
Target Release: | 4.7.z | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | 1995809 | Environment: | |
Last Closed: | 2021-09-01 18:24:07 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1995809 | | |
Bug Blocks: | | | |
Description
W. Trevor King
2021-08-19 19:35:52 UTC
Successfully upgraded 4.4.31 (with containerruntimeconfig change) -> 4.5.41 -> 4.6.42 -> 4.7 + https://github.com/openshift/machine-config-operator/pull/2725

After upgrade - see below for crio config. Can this bug be considered verified?

    oc get clusterversion
    NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.7.0-0.ci.test-2021-08-20-095005-ci-ln-l8x290b-latest   True        False         7m20s   Cluster version is 4.7.0-0.ci.test-2021-08-20-095005-ci-ln-l8x290b-latest

Access a master/worker node to make sure the crio service is running:

    oc debug node/ip-10-0-142-171.us-east-2.compute.internal
    Starting pod/ip-10-0-142-171us-east-2computeinternal-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.0.142.171
    If you don't see a command prompt, try pressing enter.
    sh-4.4#
    sh-4.4# chroot /host
    sh-4.4# systemctl status crio
    ● crio.service - Open Container Initiative Daemon
       Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
      Drop-In: /etc/systemd/system/crio.service.d
               └─10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf
       Active: active (running) since Fri 2021-08-20 12:11:04 UTC; 11min ago
         Docs: https://github.com/cri-o/cri-o
     Main PID: 1426 (crio)
        Tasks: 28
       Memory: 3.3G
          CPU: 3min 59.553s
       CGroup: /system.slice/crio.service
               └─1426 /usr/bin/crio --enable-metrics=true --metrics-port=9537

Make sure the changes can be found at /etc/crio/crio.conf.d:

    [crio]
    internal_wipe = true
    storage_driver = "overlay"
    storage_option = [
        "overlay.override_kernel_check=1",
    ]

    [crio.api]
    stream_address = ""
    stream_port = "10010"

    [crio.runtime]
    selinux = true
    conmon = ""
    conmon_cgroup = "pod"
    default_env = [
        "NSS_SDB_USE_CACHE=no",
    ]
    log_level = "info"
    cgroup_manager = "systemd"
    default_sysctls = [
        "net.ipv4.ping_group_range=0 2147483647",
    ]
    hooks_dir = [
        "/etc/containers/oci/hooks.d",
        "/run/containers/oci/hooks.d",
    ]
    manage_ns_lifecycle = true

    [crio.image]
    global_auth_file = "/var/lib/kubelet/config.json"
    pause_image = "registry.build01.ci.openshift.org/ci-ln-l8x290b/stable@sha256:b650d1a5798534f222e52b1d951f49f4d4b8b0af3b817055d9dc6eb9b8705054"
    pause_image_auth_file = "/var/lib/kubelet/config.json"
    pause_command = "/usr/bin/pod"

    [crio.network]
    network_dir = "/etc/kubernetes/cni/net.d/"
    plugin_dirs = [
        "/var/lib/cni/bin",
        "/usr/libexec/cni",
    ]

    [crio.metrics]
    enable_metrics = true
    metrics_port = 9537

I believe that Peter Hunt found a quicker reproducer for this that does not involve going all the way back to 4.4.z.

1. Install 4.6.z (I used 4.6.42)
2. oc debug to a worker and edit /etc/crio/crio.conf and make some changes (I changed loglevel and turned metrics on) and save the file
3. I also created a containerruntime config but that might be optional
4. Upgrade to 4.7.25. Upgrade will get stuck with a node NotReady
5. ssh into the NotReady node and verify /etc/crio/crio.conf is still there
6. systemctl status crio

    Aug 20 21:43:23 ip-10-0-213-225 crio[6294]: time="2021-08-20 21:43:23.957070985Z" level=fatal msg="Validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
    Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
    Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: crio.service: Failed with result 'exit-code'.
    Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: Failed to start Open Container Initiative Daemon.

Next step: repeat with an upgrade to a build of https://github.com/openshift/machine-config-operator/pull/2725 to see if it fixes the issue.

Repeated the steps in comment 2, this time upgrading to a payload built from https://github.com/openshift/machine-config-operator/pull/2725 and the upgrade was successful.

Post install:

    # crio config | grep conmon
    INFO[0000] Starting CRI-O, version: 1.20.4-11.rhaos4.7.git9d682e1.el8, git: ()
    INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
    # Path to the conmon binary, used for monitoring the OCI runtime.
    conmon = ""
    # Cgroup setting for conmon
    conmon_cgroup = "pod"
    # Environment variable list for the conmon process, used for passing necessary
    # environment variables to conmon or the runtime.
    conmon_env = [

Verified on 4.7.0-0.nightly-2021-08-21-153346 using the updated reproducer steps in comment 2

1. Install 4.6.42
2. oc debug to a worker and edit /etc/crio/crio.conf and make some changes (I changed loglevel and turned metrics on) and save the file
3. Create a containerruntime config with the following contents

    apiVersion: machineconfiguration.openshift.io/v1
    kind: ContainerRuntimeConfig
    metadata:
      name: set-pids-limit
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-crio: high-pid-limit
      containerRuntimeConfig:
        pidsLimit: 2048

4. oc label machineconfigpool worker custom-crio=high-pid-limit
5. oc get mcp worker -w and watch for all workers to be ready
6. oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-08-21-153346

- verify upgrade successful
- oc debug to the node where crio.conf was modified and verify customizations are still in place
- crio config | grep conmon and verify value is "" and not /usr/libexec/crio/conmon

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.7.28 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3262
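
For anyone who wants to script the final verification step (checking that `conmon` is "" rather than a path such as /usr/libexec/crio/conmon that no longer exists), here is a minimal sketch in Python. It is not part of CRI-O or the machine-config-operator; the function names `find_conmon_path` and `conmon_problem` are hypothetical, and the regex assumes the `conmon = "..."` line format shown in the `crio config` output above.

```python
import os
import re
from typing import Callable, Optional

# Matches an uncommented `conmon = "..."` line. Keys such as conmon_cgroup
# and conmon_env, and comment lines starting with '#', do not match.
_CONMON_RE = re.compile(r'^\s*conmon\s*=\s*"([^"]*)"\s*$', re.MULTILINE)


def find_conmon_path(config_text: str) -> Optional[str]:
    """Return the configured conmon value, or None if no conmon key is set."""
    match = _CONMON_RE.search(config_text)
    return match.group(1) if match else None


def conmon_problem(
    config_text: str,
    exists: Callable[[str], bool] = os.path.exists,  # injectable for testing
) -> Optional[str]:
    """Return a description of the problem, or None if the setting looks fine."""
    path = find_conmon_path(config_text)
    if not path:
        # Key absent or empty: the state the verification in this bug expects.
        return None
    if not exists(path):
        # The failure mode from this bug: an explicit path to a missing binary.
        return f"invalid conmon path: {path} does not exist"
    return None
```

On a node, the text to check could come from the output of `crio config` or from the contents of /etc/crio/crio.conf; passing a custom `exists` callable allows the check to run off-node against a saved copy of the config.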