Bug 2100894

Summary:	Possible to cause misconfiguration of container runtime soon after cluster creation
Product:	OpenShift Container Platform	Reporter:	Naveen Malik <nmalik>
Component:	Node	Assignee:	Qi Wang <qiwan>
Node sub component:	CRI-O	QA Contact:	Sunil Choudhary <schoudha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	cblecker, kmudalia, openshift-bugzilla-robot, pmagotra
Version:	4.10	Keywords:	ServiceDeliveryBlocker
Target Milestone:	---
Target Release:	4.10.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-01 11:35:39 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	2076355
Bug Blocks:

Description Naveen Malik 2022-06-24 15:21:05 UTC

Description of problem:
It is possible to trigger duplication of a ContainerRuntimeConfiguration when multiple exist for a given set of nodes. The duplicated config is the first in the list. When both configs manage the same setting, the duplicate lands last in the order, which effectively means that what was the second config is now overridden.

Version-Release number of selected component (if applicable):
Tested and reproduced on 4.10.18 OSD clusters.
Observed on production customer OSD cluster version 4.10.6.

How reproducible:
About 25%.

Steps to Reproduce:
1. Create an OSD cluster. Set up an IDP. Note: SRE used backplane for access and did not set up an IDP.
2. Log in to the cluster as soon as possible.
3. Wait for at least one worker to have pids_limit = 4096, applied by the custom-crio ContainerRuntimeConfiguration

oc -n default debug node/$(oc get nodes | grep worker | grep -v infra | awk '{print $1}' | head -n1) -- "chroot /host crio config | grep pids_limit"

4. Apply new ContainerRuntimeConfiguration to bump pids_limit to 65000

cat << EOF | oc create -f-
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: new-large-pidlimit
spec:
  containerRuntimeConfig:
    pidsLimit: 65000
  machineConfigPoolSelector:
    matchExpressions:
    - key: pools.operator.machineconfiguration.openshift.io/worker
      operator: Exists
EOF

5. Wait for at least one worker to have pids_limit = 65000, applied by new-large-pidlimit ContainerRuntimeConfiguration

oc -n default debug node/$(oc get nodes | grep worker | grep -v infra | awk '{print $1}' | head -n1) -- "chroot /host crio config | grep pids_limit"

6. Verify there are only 2 machineconfig for containerruntime

oc get machineconfig | grep containerruntime

7. Force CVO to reconcile things

oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=0
sleep 5
oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=1

8. Check machineconfig for containerruntime again. If the problem is triggered (25% chance observed in testing), you'll now see a 3rd machineconfig. This 3rd one (with a -2 suffix) will be a duplicate of the original machineconfig created for "custom-crio".

oc get machineconfig | grep containerruntime
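
The check in steps 6 and 8 can be scripted. A minimal sketch, assuming the generated MachineConfigs keep the "-generated-containerruntime" naming (with an optional "-N" suffix) seen in this report. The function name count_ctrcfg_mc is hypothetical; it reads `oc get machineconfig --no-headers` rows from stdin so it can be tried without a cluster:

```shell
#!/bin/sh
# Count generated containerruntime MachineConfigs.
# Input (stdin): rows from `oc get machineconfig --no-headers`.
# Assumption: names end in "-generated-containerruntime", optionally followed by "-N".
count_ctrcfg_mc() {
  awk '$1 ~ /-generated-containerruntime(-[0-9]+)?$/ { n++ } END { print n + 0 }'
}

# Example (needs cluster access, so it is left commented out):
#   n=$(oc get machineconfig --no-headers | count_ctrcfg_mc)
#   [ "$n" -le 2 ] || echo "unexpected: $n containerruntime MachineConfigs"
```

A count above 2 after step 7 would correspond to the duplicate described in step 8.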

Actual results:
3 MachineConfig for containerruntime exist in this order:
1. custom-crio
2. new-large-pidlimit
3. custom-crio (duplicate)

Expected results:
2 MachineConfig for containerruntime exist in this order:
1. custom-crio
2. new-large-pidlimit

Additional info:
OSD creates a ContainerRuntimeConfiguration called "custom-crio" that sets pids_limit for workers to 4096. We support customers creating a second ContainerRuntimeConfiguration to adjust that limit and other configurations. Therefore the second, customer-created ContainerRuntimeConfiguration is expected to be rendered into a MachineConfig, and usually is.

Given that this reproduces while the cluster is new, while Nodes are being updated and COs are progressing, it is likely a timing issue. The "master" nodes are also being updated while this is happening. To reproduce more consistently, CVO was scaled down and then up to trigger a reconcile, which creates the rogue 3rd MachineConfig.

Must-gathers will be provided in a private comment.

Comment 2 Naveen Malik 2022-06-24 15:34:15 UTC

Note: I tested my theory of a race condition at startup on 11 clusters (user error on the 12th!). I did NOT reproduce the issue when all nodes were done progressing, all COs were done progressing, and none were degraded. The test was otherwise the same; only the wait conditions changed.

Changes:
* after login, wait for all nodes to finish progressing and CO to be done progressing and none degraded
* after creating second ContainerRuntimeConfig wait for pids_limit to be updated on all nodes before scaling CVO
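
The ClusterOperator part of the first wait condition could be checked roughly like this. It is a sketch only: the function name cos_settled is hypothetical, and it assumes the `oc get clusteroperators --no-headers` columns are NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE. A row with an empty VERSION shifts the columns and is reported as not settled, which errs on the side of waiting longer.

```shell
#!/bin/sh
# Succeed (exit 0) only if every ClusterOperator row shows
# Available=True, Progressing=False and Degraded=False.
# Input (stdin): rows from `oc get clusteroperators --no-headers`.
# Assumption: columns are NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE.
cos_settled() {
  awk '
    $3 != "True" || $4 != "False" || $5 != "False" { bad = 1 }
    END { exit (bad || NR == 0) }   # empty input counts as not settled
  '
}

# Example (needs cluster access, so it is left commented out):
#   until oc get clusteroperators --no-headers | cos_settled; do sleep 30; done
```

Node rollout is not covered by this check and would need its own wait.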

Comment 3 Naveen Malik 2022-06-24 18:50:17 UTC

A timeline on the customer cluster that shows this is hard to establish with 100% certainty. What I do see is the creation timestamp on resources in the cluster. Further complicating this, additional changes were made on the cluster since this was triggered, so the -1 machineconfig has been deleted. What is of interest, though, is the age of 99-worker-generated-containerruntime-2, which is a duplicate of 99-worker-generated-containerruntime. It was created 44 days later!

$ oc get machineconfig | grep containerruntime
99-worker-generated-containerruntime               e6ba00b885558712d660a3704c071490d999de6f   3.2.0             79d
99-worker-generated-containerruntime-2             e6ba00b885558712d660a3704c071490d999de6f   3.2.0             35d
99-worker-generated-containerruntime-3             e6ba00b885558712d660a3704c071490d999de6f   3.2.0             17d
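
The 44-day figure is the difference between the AGE values above (79d - 35d). A small sketch of that arithmetic, assuming AGE is the last column and is expressed in whole days; the function name mc_age_gaps is hypothetical and rows with other age units are skipped:

```shell
#!/bin/sh
# For each generated containerruntime MachineConfig, print how many days
# after the oldest one it was created.
# Input (stdin): rows from `oc get machineconfig --no-headers`.
# Assumption: AGE is the last column and looks like "79d"; other units are skipped.
mc_age_gaps() {
  awk '
    $1 ~ /-generated-containerruntime(-[0-9]+)?$/ && $NF ~ /^[0-9]+d$/ {
      age = $NF
      sub(/d$/, "", age)
      names[++n] = $1
      ages[n] = age + 0
      if (age + 0 > oldest) oldest = age + 0
    }
    END {
      for (i = 1; i <= n; i++) printf "%s +%dd\n", names[i], oldest - ages[i]
    }
  '
}

# Example (needs cluster access, so it is left commented out):
#   oc get machineconfig --no-headers | mc_age_gaps
```

Fed the three rows above, it reports +0d, +44d and +62d respectively.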

Comment 6 Qi Wang 2022-07-05 17:07:04 UTC

*** Bug 2104160 has been marked as a duplicate of this bug. ***

Comment 9 Sunil Choudhary 2022-07-27 13:23:39 UTC

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-07-26-232654   True        False         79m     Cluster version is 4.10.0-0.nightly-2022-07-26-232654

% oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-142-35.us-east-2.compute.internal    Ready    worker   88m   v1.23.5+012e945
ip-10-0-149-126.us-east-2.compute.internal   Ready    master   93m   v1.23.5+012e945
ip-10-0-168-61.us-east-2.compute.internal    Ready    master   93m   v1.23.5+012e945
ip-10-0-179-76.us-east-2.compute.internal    Ready    worker   88m   v1.23.5+012e945
ip-10-0-218-35.us-east-2.compute.internal    Ready    master   94m   v1.23.5+012e945
ip-10-0-219-184.us-east-2.compute.internal   Ready    worker   88m   v1.23.5+012e945

% oc debug node/ip-10-0-142-35.us-east-2.compute.internal                                                                                             
Starting pod/ip-10-0-142-35us-east-2computeinternal-debug ...
…

sh-4.4# crio config | grep pids_limit
INFO[2022-07-27 13:09:33.787028081Z] Starting CRI-O, version: 1.23.3-11.rhaos4.10.gitddf4b1a.1.el8, git: () 
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL 
pids_limit = 4096


% oc get containerruntimeconfig
NAME               AGE
new-max-pidlimit   6m35s
pidlimit           23m

% oc get mc                    
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
00-worker                                          dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-master-container-runtime                        dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-master-kubelet                                  dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-worker-container-runtime                        dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-worker-kubelet                                  dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
99-master-generated-crio-seccomp-use-default                                                  3.2.0             88m
99-master-generated-registries                     dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
99-master-ssh                                                                                 3.2.0             90m
99-worker-generated-containerruntime               dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             23m
99-worker-generated-containerruntime-1             dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             6m40s
99-worker-generated-crio-seccomp-use-default                                                  3.2.0             88m
99-worker-generated-registries                     dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
99-worker-ssh                                                                                 3.2.0             90m
rendered-master-1f5449d03a8fb49f0ff3d741eb363a4c   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
rendered-worker-d229647baf68ce03bce6557c7890110d   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             23m
rendered-worker-d92fd0744b797e11843570f0b681e971   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
rendered-worker-efaf76f5ebf797d15ef5c6014919afed   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             6m35s

% oc debug node/ip-10-0-142-35.us-east-2.compute.internal                                                                                             
Starting pod/ip-10-0-142-35us-east-2computeinternal-debug ...
…

sh-4.4# crio config | grep pids_limit
INFO[2022-07-27 13:17:32.805457991Z] Starting CRI-O, version: 1.23.3-11.rhaos4.10.gitddf4b1a.1.el8, git: () 
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL 
pids_limit = 65000


% oc get mc | grep -i containerruntime      
99-worker-generated-containerruntime               dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             29m
99-worker-generated-containerruntime-1             dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             12m
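
For repeated verification runs, the pids_limit value can be pulled out of the `crio config` output without the surrounding INFO lines. A minimal sketch; the function name pids_limit_from_crio_config is hypothetical and it only handles a non-negative integer value:

```shell
#!/bin/sh
# Print the pids_limit value from `crio config` output read on stdin,
# ignoring the INFO log lines that crio prints around it.
pids_limit_from_crio_config() {
  sed -n 's/^[[:space:]]*pids_limit[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Example (needs cluster access, so it is left commented out; <node> is a placeholder):
#   oc debug node/<node> -- chroot /host crio config 2>/dev/null | pids_limit_from_crio_config
```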

Comment 11 errata-xmlrpc 2022-08-01 11:35:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.25 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5730