Bug 2100894
Summary: | Possible to cause misconfiguration of container runtime soon after cluster creation | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Naveen Malik <nmalik> |
Component: | Node | Assignee: | Qi Wang <qiwan> |
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | cblecker, kmudalia, openshift-bugzilla-robot, pmagotra |
Version: | 4.10 | Keywords: | ServiceDeliveryBlocker |
Target Milestone: | --- | ||
Target Release: | 4.10.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2022-08-01 11:35:39 UTC | Type: | Bug |
Bug Depends On: | 2076355 | ||
Bug Blocks: |
Description
Naveen Malik
2022-06-24 15:21:05 UTC
Note: I tested my theory of a race condition at startup on 11 clusters (user error on the 12th!). I did NOT reproduce the issue when all nodes had finished progressing, all cluster operators (COs) had finished progressing, and none were degraded. The test was otherwise the same; only the conditions to wait on changed.

Changes:
* after login, wait for all nodes to finish progressing and for all COs to finish progressing with none degraded
* after creating the second ContainerRuntimeConfig, wait for pids_limit to be updated on all nodes before scaling CVO

The timeline on the customer cluster that shows this is hard to be 100% certain about. What I do see is the creation timestamp on resources in the cluster. Further complicating this, additional changes were made on the cluster after this triggered, so the -1 machineconfig has been deleted. What is of interest, though, is the age of 99-worker-generated-containerruntime-2, which is a duplicate of 99-worker-generated-containerruntime. It was created 44 days later!

$ oc get machineconfig | grep containerruntime
99-worker-generated-containerruntime     e6ba00b885558712d660a3704c071490d999de6f   3.2.0   79d
99-worker-generated-containerruntime-2   e6ba00b885558712d660a3704c071490d999de6f   3.2.0   35d
99-worker-generated-containerruntime-3   e6ba00b885558712d660a3704c071490d999de6f   3.2.0   17d

*** Bug 2104160 has been marked as a duplicate of this bug. ***

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-07-26-232654   True        False         79m     Cluster version is 4.10.0-0.nightly-2022-07-26-232654

% oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-142-35.us-east-2.compute.internal    Ready    worker   88m   v1.23.5+012e945
ip-10-0-149-126.us-east-2.compute.internal   Ready    master   93m   v1.23.5+012e945
ip-10-0-168-61.us-east-2.compute.internal    Ready    master   93m   v1.23.5+012e945
ip-10-0-179-76.us-east-2.compute.internal    Ready    worker   88m   v1.23.5+012e945
ip-10-0-218-35.us-east-2.compute.internal    Ready    master   94m   v1.23.5+012e945
ip-10-0-219-184.us-east-2.compute.internal   Ready    worker   88m   v1.23.5+012e945

% oc debug node/ip-10-0-142-35.us-east-2.compute.internal
Starting pod/ip-10-0-142-35us-east-2computeinternal-debug ...
…
sh-4.4# crio config | grep pids_limit
INFO[2022-07-27 13:09:33.787028081Z] Starting CRI-O, version: 1.23.3-11.rhaos4.10.gitddf4b1a.1.el8, git: ()
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
pids_limit = 4096

% oc get containerruntimeconfig
NAME               AGE
new-max-pidlimit   6m35s
pidlimit           23m

% oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
00-worker                                          dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-master-container-runtime                        dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-master-kubelet                                  dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-worker-container-runtime                        dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
01-worker-kubelet                                  dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
99-master-generated-crio-seccomp-use-default                                                  3.2.0             88m
99-master-generated-registries                     dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
99-master-ssh                                                                                 3.2.0             90m
99-worker-generated-containerruntime               dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             23m
99-worker-generated-containerruntime-1             dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             6m40s
99-worker-generated-crio-seccomp-use-default                                                  3.2.0             88m
99-worker-generated-registries                     dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
99-worker-ssh                                                                                 3.2.0             90m
rendered-master-1f5449d03a8fb49f0ff3d741eb363a4c   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
rendered-worker-d229647baf68ce03bce6557c7890110d   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             23m
rendered-worker-d92fd0744b797e11843570f0b681e971   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             88m
rendered-worker-efaf76f5ebf797d15ef5c6014919afed   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0             6m35s

% oc debug node/ip-10-0-142-35.us-east-2.compute.internal
Starting pod/ip-10-0-142-35us-east-2computeinternal-debug ...
…
sh-4.4# crio config | grep pids_limit
INFO[2022-07-27 13:17:32.805457991Z] Starting CRI-O, version: 1.23.3-11.rhaos4.10.gitddf4b1a.1.el8, git: ()
INFO Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL
pids_limit = 65000

% oc get mc | grep -i containerruntime
99-worker-generated-containerruntime     dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0   29m
99-worker-generated-containerruntime-1   dc29945da95a65f460ad50ad1bbc10e1918a9c61   3.2.0   12m

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.25 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5730
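For anyone who wants to script the wait conditions from the description (let the cluster settle before creating a second ContainerRuntimeConfig, then confirm pids_limit on a node), the following is a rough sketch and not the procedure used in this bug. The function names and the 30m default timeout are hypothetical, and mapping "all nodes done progressing" to the worker MachineConfigPool reporting Updated is an assumption.

```shell
# Sketch only: function names, the timeout, and the chosen conditions are assumptions.

# Block until no cluster operator is progressing or degraded and the
# worker MachineConfigPool reports Updated (assumed stand-in for
# "all nodes done progressing").
wait_for_settled_cluster() {
  timeout="${1:-30m}"
  oc wait clusteroperators --all --for=condition=Progressing=False --timeout="$timeout" &&
  oc wait clusteroperators --all --for=condition=Degraded=False --timeout="$timeout" &&
  oc wait machineconfigpool/worker --for=condition=Updated --timeout="$timeout"
}

# Read `crio config` output on stdin and print the first pids_limit value.
# Lines that do not set pids_limit (for example INFO log lines) are ignored.
extract_pids_limit() {
  sed -n 's/^[[:space:]]*pids_limit[[:space:]]*=[[:space:]]*\(-\{0,1\}[0-9][0-9]*\).*/\1/p' |
    head -n 1
}

# Print the pids_limit that CRI-O reports on the node named in $1.
# stderr is discarded to drop the debug-pod banner and CRI-O INFO lines.
node_pids_limit() {
  oc debug "node/$1" -- chroot /host crio config 2>/dev/null | extract_pids_limit
}
```

A possible use would be to call wait_for_settled_cluster before applying the second ContainerRuntimeConfig, call it again afterwards, and then compare the output of node_pids_limit for each worker against the value set in that ContainerRuntimeConfig.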