Bug 1995810 - long living clusters may fail to upgrade because of an invalid conmon path
Summary: long living clusters may fail to upgrade because of an invalid conmon path
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Peter Hunt
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On: 1995809
Blocks:
 
Reported: 2021-08-19 19:35 UTC by W. Trevor King
Modified: 2021-09-01 18:24 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1995809
Environment:
Last Closed: 2021-09-01 18:24:07 UTC
Target Upstream Version:
Embargoed:




Links
System                    ID                                             Last Updated
Github                    openshift machine-config-operator pull 2725   2021-08-19 19:40:46 UTC
Red Hat Product Errata    RHSA-2021:3262                                 2021-09-01 18:24:17 UTC

Description W. Trevor King 2021-08-19 19:35:52 UTC
+++ This bug was initially created as a clone of Bug #1995809 +++

+++ This bug was initially created as a clone of Bug #1995785 +++

Description of problem:
Another piece of the fallout from https://bugzilla.redhat.com/show_bug.cgi?id=1993385 is an interesting interaction between rpm-ostree and older versions of the MCO. If a cluster was ever at a version where the MCO configured /etc/crio/crio.conf (4.5 or earlier), then updates to the cri-o rpm will not update that crio.conf file (for example, to update the conmon path). Since the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1993385 only changed the MCO to *not* specify the conmon path in the drop-in template (expecting it to fall back to the CRI-O default of ""), the pre-existing value in /etc/crio/crio.conf (left untouched by the rpm fix) prevails, causing CRI-O to expect conmon at /usr/libexec/crio/conmon, which no longer exists. As a result, the node does not come up.
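
For illustration, here is a minimal sketch of that interaction as seen from an affected node; the exact file contents and the 00-default drop-in name are assumptions for the example, not captured output:

# Stale file written by the MCO at 4.5 or earlier; later cri-o rpm updates never rewrite it.
grep '^conmon' /etc/crio/crio.conf
# conmon = "/usr/libexec/crio/conmon"    <- illustrative stale value

# After the fix for bug 1993385 the MCO drop-in no longer sets conmon at all, and CRI-O
# only overrides keys that a drop-in actually sets, so the stale value above survives
# instead of falling back to the built-in default of "".
grep '^conmon' /etc/crio/crio.conf.d/00-default || echo "conmon not set in the drop-in"

# The path the stale value points at is no longer shipped:
stat /usr/libexec/crio/conmon    # -> No such file or directory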

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Upgrade a node from 4.5 to the affected versions (going through each minor version)
2. Notice that CRI-O does not come up, in a similar way to https://bugzilla.redhat.com/show_bug.cgi?id=1993385 (a check for the symptom is sketched below)
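
To check for the symptom from the node itself, something like the following works; the error text it greps for is the fatal conmon validation message CRI-O logs in this case (the full line appears in the crio.service output later in this bug):

journalctl -u crio --no-pager | grep 'invalid conmon path'
# expect: ... Validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory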


Actual results:
the node does not come up

Expected results:
the node starts

Additional info:

Comment 1 Mike Fiedler 2021-08-20 15:54:51 UTC
Successfully upgraded 4.4.31 (with containerruntimeconfig change) -> 4.5.41 -> 4.6.42 -> 4.7 + https://github.com/openshift/machine-config-operator/pull/2725
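
For reference, the chained upgrades can be driven with oc adm upgrade; this is only a sketch, with the release image pullspecs left as placeholders:

oc adm upgrade --force --allow-explicit-upgrade --to-image <4.5.41 release image>
# wait for the update to finish, then repeat for 4.6.42 and finally for a 4.7 payload
# built with https://github.com/openshift/machine-config-operator/pull/2725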

After the upgrade, see the CRI-O config below. Can this bug be considered verified?

oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-08-20-095005-ci-ln-l8x290b-latest   True        False         7m20s   Cluster version is 4.7.0-0.ci.test-2021-08-20-095005-ci-ln-l8x290b-latest

Access a master/worker node to make sure the crio service is running:

oc debug node/ip-10-0-142-171.us-east-2.compute.internal
Starting pod/ip-10-0-142-171us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.142.171
If you don't see a command prompt, try pressing enter.
sh-4.4#
sh-4.4# chroot /host
sh-4.4# systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf
   Active: active (running) since Fri 2021-08-20 12:11:04 UTC; 11min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 1426 (crio)
    Tasks: 28
   Memory: 3.3G
      CPU: 3min 59.553s
   CGroup: /system.slice/crio.service
           └─1426 /usr/bin/crio --enable-metrics=true --metrics-port=9537

Make sure the changes can be found in /etc/crio/crio.conf.d:

[crio]
internal_wipe = true
storage_driver = "overlay"
storage_option = [
    "overlay.override_kernel_check=1",
]

[crio.api]
stream_address = ""
stream_port = "10010"

[crio.runtime]
selinux = true
conmon = ""
conmon_cgroup = "pod"
default_env = [
    "NSS_SDB_USE_CACHE=no",
]
log_level = "info"
cgroup_manager = "systemd"
default_sysctls = [
    "net.ipv4.ping_group_range=0 2147483647",
]
hooks_dir = [
    "/etc/containers/oci/hooks.d",
    "/run/containers/oci/hooks.d",
]
manage_ns_lifecycle = true

[crio.image]
global_auth_file = "/var/lib/kubelet/config.json"
pause_image = "registry.build01.ci.openshift.org/ci-ln-l8x290b/stable@sha256:b650d1a5798534f222e52b1d951f49f4d4b8b0af3b817055d9dc6eb9b8705054"
pause_image_auth_file = "/var/lib/kubelet/config.json"
pause_command = "/usr/bin/pod"

[crio.network]
network_dir = "/etc/kubernetes/cni/net.d/"
plugin_dirs = [
    "/var/lib/cni/bin",
    "/usr/libexec/cni",
]

[crio.metrics]
enable_metrics = true
metrics_port = 9537
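
As a sketch, the drop-in above can be pulled straight from a node like this; the 00-default filename is the usual MCO-rendered name and is an assumption here:

oc debug node/<node-name> -- chroot /host cat /etc/crio/crio.conf.d/00-default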

Comment 2 Mike Fiedler 2021-08-20 21:52:02 UTC
I believe that Peter Hunt found a quicker reproducer for this that does not involve going all the way back to 4.4.z.

1. Install 4.6.z (I used 4.6.42)
2. oc debug to a worker, edit /etc/crio/crio.conf, make some changes (I changed the log level and turned metrics on), and save the file (a sketch of one way to do this follows the log below)
3. I also created a ContainerRuntimeConfig, but that might be optional
4. Upgrade to 4.7.25. The upgrade will get stuck with a node NotReady
5. ssh into the NotReady node and verify /etc/crio/crio.conf is still there
6. systemctl status crio

Aug 20 21:43:23 ip-10-0-213-225 crio[6294]: time="2021-08-20 21:43:23.957070985Z" level=fatal msg="Validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory"
Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: crio.service: Failed with result 'exit-code'.
Aug 20 21:43:23 ip-10-0-213-225 systemd[1]: Failed to start Open Container Initiative Daemon.
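
For step 2 above, the edit itself is arbitrary; a minimal sketch of one way to do it (the keys changed are illustrative, not the exact edit used here):

oc debug node/<worker-node>
chroot /host
vi /etc/crio/crio.conf    # e.g. set log_level = "debug" and enable_metrics = true, then save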

Next step: repeat with an upgrade to a build of https://github.com/openshift/machine-config-operator/pull/2725 to see if it fixes the issue.

Comment 3 Mike Fiedler 2021-08-21 00:03:12 UTC
Repeated the steps in comment 2, this time upgrading to a payload built from https://github.com/openshift/machine-config-operator/pull/2725 and the upgrade was successful.

Post install:

# crio config | grep conmon
INFO[0000] Starting CRI-O, version: 1.20.4-11.rhaos4.7.git9d682e1.el8, git: () 
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL 
# Path to the conmon binary, used for monitoring the OCI runtime.
conmon = ""
# Cgroup setting for conmon
conmon_cgroup = "pod"
# Environment variable list for the conmon process, used for passing necessary
# environment variables to conmon or the runtime.
conmon_env = [

Comment 5 Mike Fiedler 2021-08-21 17:16:14 UTC
Verified on 4.7.0-0.nightly-2021-08-21-153346 using the updated reproducer steps in comment 2

1. Install 4.6.42
2. oc debug to a worker, edit /etc/crio/crio.conf, make some changes (I changed the log level and turned metrics on), and save the file
3. Create a ContainerRuntimeConfig with the following contents:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
 name: set-pids-limit
spec:
 machineConfigPoolSelector:
   matchLabels:
     custom-crio: high-pid-limit
 containerRuntimeConfig:
   pidsLimit: 2048


4. oc label machineconfigpool worker custom-crio=high-pid-limit
5. oc get mcp worker -w and watch for all workers to be ready
6. oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-08-21-153346

- Verify the upgrade is successful
- oc debug to the node where crio.conf was modified and verify the customizations are still in place
- Run crio config | grep conmon and verify the value is "" and not /usr/libexec/crio/conmon (see the sketch below)
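
Spelled out, the last two checks might look like the following sketch; the node name is a placeholder and the expected value of "" comes from the steps above:

oc debug node/<modified-worker>
chroot /host
cat /etc/crio/crio.conf      # customizations from step 2 should still be present
crio config | grep conmon    # expect conmon = "" rather than /usr/libexec/crio/conmon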

Comment 8 errata-xmlrpc 2021-09-01 18:24:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.7.28 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3262

