Bug 1995785 - long living clusters may fail to upgrade because of an invalid conmon path
Summary: long living clusters may fail to upgrade because of an invalid conmon path
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Peter Hunt
QA Contact: Mike Fiedler
URL:
Whiteboard: UpdateRecommendationsBlocked
Depends On:
Blocks: 1995809
 
Reported: 2021-08-19 18:32 UTC by Peter Hunt
Modified: 2022-11-24 08:20 UTC (History)
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1995809
Environment:
Last Closed: 2021-10-18 17:47:27 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2723 0 None None None 2021-08-19 18:47:05 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:47:42 UTC

Description Peter Hunt 2021-08-19 18:32:09 UTC
Description of problem:
Another piece of the fallout from https://bugzilla.redhat.com/show_bug.cgi?id=1993385 is an interesting interaction between rpm-ostree and older versions of the MCO. If a cluster was ever at a version where the MCO configured /etc/crio/crio.conf (4.5 or earlier), then updates to the cri-o RPM will not update the crio.conf file (for instance, to update the conmon path). Since the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1993385 only updated the MCO to *not* specify the conmon path in the drop-in template (expecting CRI-O's default of "" to take over), the pre-existing value in /etc/crio/crio.conf (left unchanged by the RPM fix) prevails, causing cri-o to expect conmon at /usr/libexec/crio/conmon, which no longer exists. As a result, nodes fail to come up.
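
The failure mode above can be checked for directly. Below is a minimal sketch of that check; the sample crio.conf written to a temp directory stands in for a node's real /etc/crio/crio.conf, so the snippet runs anywhere:

```shell
#!/bin/sh
# Sketch: detect a stale conmon path of the kind left behind in
# /etc/crio/crio.conf by a 4.5-era MCO. A sample file in a temp
# directory stands in for the real config so this runs anywhere.
tmpdir=$(mktemp -d)
cat > "$tmpdir/crio.conf" <<'EOF'
[crio.runtime]
conmon = "/usr/libexec/crio/conmon"
EOF

# Extract the pinned conmon path from the config.
conmon_path=$(sed -n 's/^conmon = "\(.*\)"$/\1/p' "$tmpdir/crio.conf")
echo "configured conmon: $conmon_path"

# On an affected node the pinned binary no longer exists, so CRI-O
# refuses to start with "invalid conmon path".
if [ -n "$conmon_path" ] && [ ! -x "$conmon_path" ]; then
  echo "stale: $conmon_path is configured but not present"
fi
rm -rf "$tmpdir"
```

An empty conmon value (conmon = "") tells CRI-O to fall back to its built-in default, which is why the eventual fix drops the explicit path.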

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. upgrade a node from 4.5 to the affected versions (going through each minor version)
2. notice cri-o does not come up, in a similar way to https://bugzilla.redhat.com/show_bug.cgi?id=1993385


Actual results:
the node does not come up

Expected results:
the node starts

Additional info:

Comment 1 W. Trevor King 2021-08-19 19:00:51 UTC
We've tombstoned 4.7.25 and 4.8.6 on this in https://github.com/openshift/cincinnati-graph-data/pull/995

Comment 3 W. Trevor King 2021-08-20 01:39:52 UTC
Working through a reproducer, I sent cluster-bot a 'launch 4.5.41'.  Confirming the version after receiving the cluster:

  $ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
  4.5.41

Pulling in a ContainerRuntimeConfig from [1], because I hear that we need some kind of divergence from stock to trigger the bug:

  $ cat highpids.yaml 
  apiVersion: machineconfiguration.openshift.io/v1
  kind: ContainerRuntimeConfig
  metadata:
   name: set-pids-limit
  spec:
   machineConfigPoolSelector:
     matchLabels:
       custom-crio: high-pid-limit
   containerRuntimeConfig:
     pidsLimit: 2048
  $ oc apply -f highpids.yaml 
  $ oc label -n openshift-machine-api machineconfigpool worker custom-crio=high-pid-limit
  $ oc get -n openshift-machine-api machineconfigpool worker -w
  NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
  worker   rendered-worker-a0892f651d1c00e2f9456596af125622   True      False      False      3              3                   3                     0                      23m
  ...
  worker   rendered-worker-a0892f651d1c00e2f9456596af125622   False     True       False      3              1                   1                     0                      26m

That's far enough.  We only need one; having more nodes pick up the new config before the MCO gets bumped during the update just makes the problem more obvious later.  Trigger the update to 4.6, setting the channel first, because we clear the channel in CI [2] and cluster-bot uses that CI config.

  $ oc adm upgrade channel stable-4.6  # requires a 4.9+ 'oc' binary
  warning: No channels known to be compatible with the current version "4.5.41"; unable to validate "stable-4.6".
  $ oc adm upgrade --to 4.6.42

All three compute nodes ended up catching up before the CVO started updating the MCO to 4.6:

  $ oc get -n openshift-machine-api machineconfigpool worker   
  NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
  worker   rendered-worker-12a38dd697f5c238e747cdcdecdd98cc   True      False      False      3              3                   3                     0                      32m
  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.6.42: 15% complete
  ...

Update eventually completes:

  $ oc adm upgrade
  Cluster version is 4.6.42

And off to the vulnerable 4.7.25, to try and reproduce the "Validating runtime config: conmon validation: invalid conmon path: stat /usr/libexec/crio/conmon: no such file or directory":

  $ oc adm upgrade channel candidate-4.7  # requires a 4.9+ 'oc' binary
  $ oc adm upgrade --to 4.7.25

And then a while later:

  $ oc adm upgrade
  Cluster version is 4.7.25
  $ oc get -n openshift-machine-api machineconfigpools
  NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
  master   rendered-master-f4a193c15abb46959e590b6254c7bb22   True      False      False      3              3                   3                     0                      174m
  worker   rendered-worker-bddea6e49777a4615148d9fd7412a2b7   True      False      False      3              3                   3                     0                      174m

So we failed to reproduce the original bug.  I'll try again starting with 4.4.33...

[1]: https://github.com/openshift/machine-config-operator/blob/release-4.5/docs/ContainerRuntimeConfigDesign.md#example
[2]: https://github.com/openshift/release/pull/8631

Comment 4 Sunil Choudhary 2021-08-20 12:26:19 UTC
Followed upgrade path 4.5.41 -> 4.6.42 -> 4.7.25. Applied the container runtime config on 4.5.41 before starting the upgrade.

Failed to reproduce the bug. Currently an upgrade from 4.4.33 is in progress.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.25    True        False         6m26s   Cluster version is 4.7.25


$ oc describe clusterversion
Name:         version
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-08-20T08:17:58Z
  Generation:          5
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:clusterID:
        f:upstream:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2021-08-20T08:17:58Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:channel:
        f:desiredUpdate:
          .:
          f:force:
          f:image:
          f:version:
    Manager:      oc
    Operation:    Update
    Time:         2021-08-20T11:02:13Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableUpdates:
        f:conditions:
        f:desired:
          .:
          f:channels:
          f:image:
          f:url:
          f:version:
        f:history:
        f:observedGeneration:
        f:versionHash:
    Manager:         cluster-version-operator
    Operation:       Update
    Time:            2021-08-20T11:56:18Z
  Resource Version:  132274
  Self Link:         /apis/config.openshift.io/v1/clusterversions/version
  UID:               5cf3aab5-a992-4524-9bc2-b0ee6d32711c
Spec:
  Channel:     candidate-4.7
  Cluster ID:  f532fd70-41ef-4be7-8847-56f591c189b7
  Desired Update:
    Force:    false
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Version:  4.7.25
  Upstream:   https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-08-20T08:54:35Z
    Message:               Done applying 4.7.25
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-08-20T12:09:22Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2021-08-20T12:09:52Z
    Message:               Cluster version is 4.7.25
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-08-20T08:18:12Z
    Status:                True
    Type:                  RetrievedUpdates
  Desired:
    Channels:
      candidate-4.7
      candidate-4.8
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    URL:      https://access.redhat.com/errata/RHBA-2021:3188
    Version:  4.7.25
  History:
    Completion Time:    2021-08-20T12:09:52Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Started Time:       2021-08-20T11:02:28Z
    State:              Completed
    Verified:           true
    Version:            4.7.25
    Completion Time:    2021-08-20T10:57:28Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b
    Started Time:       2021-08-20T09:52:35Z
    State:              Completed
    Verified:           true
    Version:            4.6.42
    Completion Time:    2021-08-20T08:54:35Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
    Started Time:       2021-08-20T08:18:12Z
    State:              Completed
    Verified:           false
    Version:            4.5.41
  Observed Generation:  5
  Version Hash:         N_wDQ8h9xO8=
Events:                 <none>


$ oc describe containerruntimeconfig set-pids-limit
Name:         set-pids-limit
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         ContainerRuntimeConfig
Metadata:
  Creation Timestamp:  2021-08-20T09:35:34Z
  Finalizers:
    99-worker-12fbe9f1-357e-47c4-bf8d-f33e9272bc46-containerruntime
    99-worker-generated-containerruntime
  Generation:  1
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:containerRuntimeConfig:
          .:
          f:pidsLimit:
        f:machineConfigPoolSelector:
          .:
          f:matchLabels:
            .:
            f:custom-crio:
    Manager:      kubectl-create
    Operation:    Update
    Time:         2021-08-20T09:35:34Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"99-worker-12fbe9f1-357e-47c4-bf8d-f33e9272bc46-containerruntime":
          v:"99-worker-generated-containerruntime":
      f:spec:
        f:containerRuntimeConfig:
          f:logSizeMax:
          f:overlaySize:
      f:status:
        .:
        f:conditions:
        f:observedGeneration:
    Manager:         machine-config-controller
    Operation:       Update
    Time:            2021-08-20T10:37:27Z
  Resource Version:  118733
  Self Link:         /apis/machineconfiguration.openshift.io/v1/containerruntimeconfigs/set-pids-limit
  UID:               0691d0c1-7a9b-4c29-88ea-341aaf900ea0
Spec:
  Container Runtime Config:
    Pids Limit:  2048
  Machine Config Pool Selector:
    Match Labels:
      Custom - Crio:  high-pid-limit
Status:
  Conditions:
    Last Transition Time:  2021-08-20T09:35:39Z
    Message:               Error: could not find any MachineConfigPool set for ContainerRuntimeConfig set-pids-limit
    Status:                False
    Type:                  Failure
    Last Transition Time:  2021-08-20T11:49:56Z
    Message:               Success
    Status:                True
    Type:                  Success
  Observed Generation:     1
Events:                    <none>

Comment 5 Sunil Choudhary 2021-08-20 15:52:50 UTC
Followed upgrade path 4.4.33 -> 4.5.41 -> 4.6.42 -> 4.7.25. Applied the container runtime config on 4.4.33 before starting the upgrade.

Could not trigger the bug.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.25    True        False         19m     Cluster version is 4.7.25

$ oc describe clusterversion
Name:         version
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-08-20T10:18:19Z
  Generation:          8
  Resource Version:    158626
  Self Link:           /apis/config.openshift.io/v1/clusterversions/version
  UID:                 f67762f4-e704-4cf8-aa98-efe822557da5
Spec:
  Channel:     candidate-4.7
  Cluster ID:  3d9b11f4-1742-47df-b491-47e96446e8dc
  Desired Update:
    Force:    false
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Version:  4.7.25
  Upstream:   https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-08-20T10:40:41Z
    Message:               Done applying 4.7.25
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-08-20T13:27:23Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2021-08-20T15:00:54Z
    Message:               Cluster version is 4.7.25
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-08-20T10:18:25Z
    Status:                True
    Type:                  RetrievedUpdates
  Desired:
    Channels:
      candidate-4.7
      candidate-4.8
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    URL:      https://access.redhat.com/errata/RHBA-2021:3188
    Version:  4.7.25
  History:
    Completion Time:    2021-08-20T15:00:54Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:d1cb6c18cb7bd7207855101752e05a7c8a7f99c8e339af9c23cec364055169f3
    Started Time:       2021-08-20T13:53:08Z
    State:              Completed
    Verified:           true
    Version:            4.7.25
    Completion Time:    2021-08-20T13:27:53Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:59e2e85f5d1bcb4440765c310b6261387ffc3f16ed55ca0a79012367e15b558b
    Started Time:       2021-08-20T12:09:06Z
    State:              Completed
    Verified:           true
    Version:            4.6.42
    Completion Time:    2021-08-20T12:05:21Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:c67fe644d1c06e6d7694e648a40199cb06e25e1c3cfd5cd4fdac87fd696d2297
    Started Time:       2021-08-20T11:11:11Z
    State:              Completed
    Verified:           true
    Version:            4.5.41
    Completion Time:    2021-08-20T10:40:41Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:a035dddd8a5e5c99484138951ef4aba021799b77eb9046f683a5466c23717738
    Started Time:       2021-08-20T10:18:25Z
    State:              Completed
    Verified:           false
    Version:            4.4.33
  Observed Generation:  8
  Version Hash:         N_wDQ8h9xO8=
Events:                 <none>

$ oc get containerruntimeconfig
NAME             AGE
set-pids-limit   4h22m

Comment 6 Petr Muller 2021-08-20 16:18:05 UTC
We were able to recover our cluster by doing the following (needs SSH access):

1. The cluster gets stuck mid-upgrade, with one master node NotReady.
2. On the two Ready master nodes, create the following file:

# cat /etc/crio/crio.conf.d/02-conmon
[crio.runtime]
conmon = ""

Note that on Ready masters you can use `oc debug`; in that case the path will be /host/etc/crio/crio.conf.d/02-conmon.

3. On the NotReady master, create the same file. On this master `oc debug` will not work, so you'll need SSH (without SSH configured, we were able to connect to the EC2 instance through a serial terminal, boot it into single-user mode, and add an SSH key).
4. On the NotReady master, restart the cri-o service first and then the kubelet service.

This revived the master node, and then the upgrade process proceeded normally.
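
The recovery steps above can be sketched as a script. This is a local sketch, not the exact procedure: HOST_ROOT is a stand-in for the node filesystem (it would be /host inside an `oc debug node/<name>` shell on the Ready masters), and the service restarts are shown as comments since they only apply on the node itself:

```shell
#!/bin/sh
# Sketch of the recovery drop-in: force CRI-O back to its default
# (empty) conmon path. HOST_ROOT is a local stand-in for the node
# filesystem; inside `oc debug node/<name>` it would be /host.
HOST_ROOT=${HOST_ROOT:-$(mktemp -d)}

mkdir -p "$HOST_ROOT/etc/crio/crio.conf.d"
cat > "$HOST_ROOT/etc/crio/crio.conf.d/02-conmon" <<'EOF'
[crio.runtime]
conmon = ""
EOF
echo "wrote $HOST_ROOT/etc/crio/crio.conf.d/02-conmon"

# On the NotReady master (reachable only over SSH, since oc debug
# needs a running kubelet), restart the services in this order:
#   systemctl restart crio
#   systemctl restart kubelet
```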

Comment 7 Peter Hunt 2021-08-23 15:05:35 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* Users who have ever manually changed their /etc/crio/crio.conf and attempt to upgrade to the affected versions (4.7.24 or 4.8.5)
* Potentially, users who applied a ContainerRuntimeConfig before OpenShift 4.4 and have kept upgrading their clusters all the way to the affected versions.

What is the impact?  Is it serious enough to warrant blocking edges?
* Nodes that upgrade go NotReady and require manual intervention to fix. 

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* Admin must SSH to the node and apply a drop-in cri-o config file. Since cri-o does not start, `oc debug node/` is not sufficient.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, this is a regression

Comment 8 Mike Fiedler 2021-08-23 19:46:13 UTC
Verified on 4.9.0-0.nightly-2021-08-22-070405

1. Install 4.8.5
2. oc debug to a worker, edit /etc/crio/crio.conf to make some changes (I changed the log level and turned metrics on), and save the file
3. Create a containerruntime config with the following contents

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
 name: set-pids-limit
spec:
 machineConfigPoolSelector:
   matchLabels:
     custom-crio: high-pid-limit
 containerRuntimeConfig:
   pidsLimit: 2048


4. oc label machineconfigpool worker custom-crio=high-pid-limit
5. oc get mcp worker -w and watch for all workers to be ready
6. oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-08-22-070405

- verify upgrade successful
- oc debug to the node where crio.conf was modified and verify customizations are still in place
- crio config | grep conmon and verify value is "" and not /usr/libexec/crio/conmon
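
The final check above can be scripted as well. A sketch, with a sample dump standing in for real `crio config` output so it runs without a cluster:

```shell
#!/bin/sh
# Sketch of the verification: confirm the effective conmon value is ""
# (CRI-O's built-in default) rather than the removed
# /usr/libexec/crio/conmon path. On a real node you would feed this
# the output of `crio config` instead of the sample below.
crio_dump='[crio.runtime]
conmon = ""
pids_limit = 2048'

effective=$(printf '%s\n' "$crio_dump" | sed -n 's/^conmon = "\(.*\)"$/\1/p')
if [ -z "$effective" ]; then
  echo "conmon OK: empty, CRI-O uses its built-in default"
else
  echo "conmon still pinned to: $effective"
fi
```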

Comment 11 errata-xmlrpc 2021-10-18 17:47:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

