Description of problem:
An invalid value in a ContainerRuntimeConfig can leave a node stuck in NotReady status, for example when overlaySize is set to -8G.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-02-17-224627

How reproducible:
Always

Steps to Reproduce:
1. oc label mcp worker custom-crio-overlay=overlay-size
2. oc create -f containerRuntimeConfig.yaml

   apiVersion: machineconfiguration.openshift.io/v1
   kind: ContainerRuntimeConfig
   metadata:
     name: overlay-size
   spec:
     containerRuntimeConfig:
       overlaySize: -8G
     machineConfigPoolSelector:
       matchLabels:
         custom-crio-overlay: overlay-size

3. oc get ctrcfg overlay-size -o yaml
4. oc get mcp

Actual results:
3. The create reports success:

   - lastTransitionTime: "2021-02-19T08:57:01Z"
     message: Success
     status: "True"
     type: Success

4. The worker mcp keeps rolling out to the new mc, but the node gets stuck in NotReady status:

   $ oc get mcp
   NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
   master   rendered-master-b5197ba2e0836cecc7271b0ef2884dad   True      False      False      3              3                   3                     0                      28h
   worker   rendered-worker-d273fbfcb7db5becb14ce943da4d1210   False     True       False      2              0                   0                     0                      28h

   $ oc get node
   NAME                                                      STATUS                        ROLES    AGE   VERSION
   minmli0218-cthhn-m-0.c.openshift-qe.internal              Ready                         master   28h   v1.20.0+ba45583
   minmli0218-cthhn-m-1.c.openshift-qe.internal              Ready                         master   28h   v1.20.0+ba45583
   minmli0218-cthhn-m-2.c.openshift-qe.internal              Ready                         master   28h   v1.20.0+ba45583
   minmli0218-cthhn-worker-a-5sjmq.c.openshift-qe.internal   NotReady,SchedulingDisabled   worker   28h   v1.20.0+ba45583
   minmli0218-cthhn-worker-b-xplxp.c.openshift-qe.internal   Ready                         worker   28h   v1.20.0+ba45583

Expected results:
3. The status should flag the value -8G as invalid.

Additional info:
Not every invalid value causes the node reboot to fail. I tried setting pidsLimit: 0 in a ctrcfg; it also reported success on creation, but it did not generate any new mc, so no mcp rollout happened. I think we should either add validation for the ctrcfg fields or add a clarification in the docs to warn customers about the risk of setting invalid values in a ctrcfg.
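For reference, overlaySize is parsed as a Kubernetes resource quantity (see the decode error later in this bug), and a negative quantity parses successfully, so rejecting it takes an explicit sign check. A minimal sketch of such a check, assuming the field decodes into resource.Quantity (the actual MCO validation path may differ):

   package main

   import (
       "fmt"

       "k8s.io/apimachinery/pkg/api/resource"
   )

   func main() {
       // "-8G" is syntactically valid quantity notation, so parsing alone
       // does not catch it; the sign has to be checked explicitly.
       q := resource.MustParse("-8G")
       if q.Sign() < 0 {
           fmt.Printf("Error: invalid overlaySize %q, cannot be less than 0\n", q.String())
       }
   }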
Not fixed on version 4.8.0-0.nightly-2021-03-08-184701.

For the parameter "overlaySize", the ctrcfg CR does not recognize a non-numeric value (such as 9asadG) as invalid: it neither reports an error message nor triggers an mcp rollout. I think this should work the way the parameter "pidsLimit" does (when you input a non-numeric value there, you can't save the ctrcfg CR file at all):

   spec:
     containerRuntimeConfig:
       overlaySize: 9asadG
     machineConfigPoolSelector:
       matchLabels:
         custom-crio-overlay: overlay-size
   status:
     conditions:
     - lastTransitionTime: "2021-03-09T08:42:15Z"
       message: 'Error: invalid overlaySize "-3G", cannot be less than 0'
       status: "False"
       type: Failure
     observedGeneration: 1

Another issue is that the lastTransitionTime of a historical condition should keep its historical value, not the latest time:

   spec:
     containerRuntimeConfig:
       overlaySize: 0G
     machineConfigPoolSelector:
       matchLabels:
         custom-crio-overlay: overlay-size
   status:
     conditions:
     - lastTransitionTime: "2021-03-09T08:44:47Z"  # should keep the historical time, but it is synced to the same latest time as the next item
       message: 'Error: invalid overlaySize "-3G", cannot be less than 0'
       status: "False"
       type: Failure
     - lastTransitionTime: "2021-03-09T08:44:47Z"
       message: Success
       status: "True"
       type: Success
     observedGeneration: 3

For the parameter "pidsLimit", the ctrcfg description shows no error message:

   $ oc get ctrcfg set-pids-limit-master -o yaml
   apiVersion: machineconfiguration.openshift.io/v1
   kind: ContainerRuntimeConfig
   metadata:
     creationTimestamp: "2021-03-09T09:36:32Z"
     generation: 1
     managedFields:
     ...
       manager: oc
       operation: Update
       time: "2021-03-09T09:36:32Z"
     name: set-pids-limit-master
     resourceVersion: "75129"
     uid: 7416f64e-a60b-42a7-8d38-dca3d72a1d86
   spec:
     containerRuntimeConfig:
       pidsLimit: -90
     machineConfigPoolSelector:
       matchLabels:
         custom-crio: pidLimit

   [lyman@localhost env]$ oc get mcp
   NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
   master   rendered-master-ea560dfc69f54e1eb2b30821f68e56cc   True      False      False      3              3                   3                     0                      157m
   worker   rendered-worker-c14e3bbca7d1434ad8024ce7587120c4   True      False      False      3              3                   3                     0                      157m

For the parameter "log_size_max", whatever value I set (0, positive, or negative), the ctrcfg description resets the item to empty:

   spec:
     containerRuntimeConfig: {}  # this line shows empty
     machineConfigPoolSelector:
       matchLabels:
         custom-crio-max: log_size_max
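For what it's worth, a non-numeric string like 9asadG fails quantity parsing outright, which is consistent with the CR being rejected at decode time rather than producing a status condition. A small sketch illustrating the parse failure:

   package main

   import (
       "fmt"

       "k8s.io/apimachinery/pkg/api/resource"
   )

   func main() {
       // Unlike "-8G" (valid syntax, invalid sign), "9asadG" does not
       // match the quantity grammar at all, so parsing returns an error.
       if _, err := resource.ParseQuantity("9asadG"); err != nil {
           fmt.Println("parse error:", err)
       }
   }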
@minmli Could you double-check the pidsLimit? I tested a release payload I built from the MCO master, and the error message showed up (it has a print verb issue though, which I will fix):

   $ oc describe containerruntimeconfig.machineconfiguration.openshift.io/set-pids
   ...
   ...
   Spec:
     Container Runtime Config:
       Pids Limit:  -10
     Machine Config Pool Selector:
       Match Labels:
         Custom - Crio - Pids:  pids
   Status:
     Conditions:
       Last Transition Time:  2021-03-18T18:05:27Z
       Message:               Error: invalid PidsLimit %!q(int64=-10)
       Status:                False
       Type:                  Failure
     Observed Generation:     1
   Events:                    <none>

Could you provide the "log_size_max" specification? That parameter is logSizeMax; I checked that the status updates after I applied the CR:

   apiVersion: machineconfiguration.openshift.io/v1
   kind: ContainerRuntimeConfig
   ...
   spec:
     containerRuntimeConfig:
       logSizeMax: 10K
     machineConfigPoolSelector:
       matchLabels:
         custom-crio-log: log
Hi Qi Wang, I checked the pidsLimit and got the same results as you pasted, so my previous issue no longer exists. As for "log_size_max", I was referring to https://docs.openshift.com/container-platform/4.7/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks. That page doesn't mention the correct parameter name "logSizeMax", so I will file a doc bug to fix it.
@minmli @qiwang Given MinLi's last comment, can we set this bug to ON_QA again, or perhaps to VERIFIED?
The `Last Transition Time` seems to work fine; could you also double-check this?

   Spec:
     Container Runtime Config:
       Pids Limit:  10
     Machine Config Pool Selector:
       Match Labels:
         Custom - Crio - Log:  log
   Status:
     Conditions:
       Last Transition Time:  2021-03-22T22:21:01Z
       Message:               Success
       Status:                True
       Type:                  Success
       Last Transition Time:  2021-03-22T22:33:03Z
       Message:               Error: invalid PidsLimit '\n'
       Status:                False
       Type:                  Failure
     Observed Generation:     2
   Events:                    <none>
@Tom Sweeney, this bug still has some flakes to fix, so we cannot move it to ON_QA yet.
@Qi Wang, oh, I get the point: if the current parameter value is invalid, the controller probes every minute to check whether the bad value still exists, so the 'Last Transition Time' keeps up to date; if the check succeeds, the 'Last Transition Time' is not updated again. So the 'Last Transition Time' really does work fine.
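A hypothetical illustration of that behavior (a toy model, not the actual MCO code): as long as validation keeps failing, each retry rewrites the Failure condition with a fresh timestamp, while a Success condition is written once and never refreshed.

   package main

   import (
       "fmt"
       "time"
   )

   type condition struct {
       condType, message  string
       lastTransitionTime time.Time
   }

   // syncOnce models one resync pass: a failing validation rewrites the
   // Failure condition (refreshing its timestamp) and requests a retry;
   // a success writes the Success condition once and stops retrying.
   func syncOnce(conds map[string]*condition, valid bool) (retry bool) {
       if !valid {
           conds["Failure"] = &condition{"Failure", "Error: invalid value", time.Now()}
           return true
       }
       if _, done := conds["Success"]; !done {
           conds["Success"] = &condition{"Success", "Success", time.Now()}
       }
       return false
   }

   func main() {
       conds := map[string]*condition{}
       syncOnce(conds, false)
       first := conds["Failure"].lastTransitionTime
       time.Sleep(10 * time.Millisecond) // stand-in for the one-minute resync
       syncOnce(conds, false)
       fmt.Println("Failure timestamp advanced:", conds["Failure"].lastTransitionTime.After(first))
   }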
(In reply to MinLi from comment #9)
> @Qi Wang, oh, I get the point: if the current parameter value is invalid,
> the controller probes every minute to check whether the bad value still
> exists, so the 'Last Transition Time' keeps up to date; if the check
> succeeds, the 'Last Transition Time' is not updated again.
> So the 'Last Transition Time' really does work fine.

Thanks for the explanation! We discussed the non-numeric overlaySize bug; it is a WONTFIX. The CRD decode validation has been taken out of the MCO code. I can only set up a PR as a follow-up to this BZ to fix the print verb issue.
The reason we dropped support for KubeletConfig validation is explained in https://github.com/openshift/machine-config-operator/issues/2357.
@Qi Wang, according to Comment 12, does that mean we can handle the non-numeric overlaySize bug?
(In reply to MinLi from comment #13)
> @Qi Wang, according to Comment 12, does that mean we can handle the
> non-numeric overlaySize bug?

No. ContainerRuntimeConfig is defined locally, but the overlaySize field is defined as a resource.Quantity from kubernetes/apimachinery/pkg/api/resource, so it cannot be validated inside the MCO. @harpatil Is my understanding correct? If it is fixable, at least I saw this error in the pod log:

   W0316 16:45:13.344669       1 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.ContainerRuntimeConfig ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: v1.ContainerRuntimeConfig.Spec: v1.ContainerRuntimeConfigSpec.MachineConfigPoolSelector: ContainerRuntimeConfig: v1.ContainerRuntimeConfiguration.OverlaySize: unmarshalerDecoder: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', error found in #10 byte of ...|\":\"9asadG\"},\"machine|..., bigger context ...|:

Maybe the MCO can capture this error somewhere? But I got lost in the code for now, so I don't have a fix.
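One hypothetical way to surface that decode error (a sketch only, not the MCO's actual code; the function name and field paths here are assumptions) would be to pre-validate the raw, unstructured spec before the typed decode:

   package main

   import (
       "fmt"

       "k8s.io/apimachinery/pkg/api/resource"
       "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
   )

   // validateRawOverlaySize is a hypothetical pre-check run against the
   // unstructured object before the typed decode, so a bad quantity could
   // be reported in the CR status instead of only in the pod log.
   func validateRawOverlaySize(raw map[string]interface{}) error {
       spec, found, err := unstructured.NestedMap(raw, "spec", "containerRuntimeConfig")
       if err != nil || !found {
           return err
       }
       if s, ok := spec["overlaySize"].(string); ok {
           if _, perr := resource.ParseQuantity(s); perr != nil {
               return fmt.Errorf("invalid overlaySize %q: %v", s, perr)
           }
       }
       return nil
   }

   func main() {
       raw := map[string]interface{}{
           "spec": map[string]interface{}{
               "containerRuntimeConfig": map[string]interface{}{"overlaySize": "9asadG"},
           },
       }
       fmt.Println(validateRawOverlaySize(raw))
   }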
@Harshal Patil, thanks for your explanation; to my understanding, we can fix the validation of overlaySize. cc: @Qi Wang
@Harshal Patil, ok, got you; it's acceptable if we add a note about the validation of ContainerRuntimeConfig to the docs.

Btw, will you fix the print verb issue that @Qi Wang mentioned in Comment 10? If yes, we can reopen this bug to track it.

   Spec:
     Container Runtime Config:
       Pids Limit:  -10
     Machine Config Pool Selector:
       Match Labels:
         Custom - Crio - Pids:  pids
   Status:
     Conditions:
       Last Transition Time:  2021-03-18T18:05:27Z
       Message:               Error: invalid PidsLimit %!q(int64=-10)
       Status:                False
       Type:                  Failure
     Observed Generation:     1
   Events:                    <none>
(In reply to MinLi from comment #20)
> @Harshal Patil, ok, got you; it's acceptable if we add a note about the
> validation of ContainerRuntimeConfig to the docs.
>
> Btw, will you fix the print verb issue that @Qi Wang mentioned in Comment
> 10? If yes, we can reopen this bug to track it.
>
>    Spec:
>      Container Runtime Config:
>        Pids Limit:  -10
>      Machine Config Pool Selector:
>        Match Labels:
>          Custom - Crio - Pids:  pids
>    Status:
>      Conditions:
>        Last Transition Time:  2021-03-18T18:05:27Z
>        Message:               Error: invalid PidsLimit %!q(int64=-10)
>        Status:                False
>        Type:                  Failure
>      Observed Generation:     1
>    Events:                    <none>

Hi, here's the fix for the print issue: https://github.com/openshift/machine-config-operator/pull/2485. It's merged already.
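For background on the %!q(int64=-10) output: in Go, the %q verb applied to an integer formats it as a quoted rune literal, which is only defined for values in the valid Unicode code point range, so a negative int64 falls through to fmt's bad-verb marker. A minimal reproduction (switching to a plain integer verb is one way to fix it; I haven't restated the exact change from the PR here):

   package main

   import "fmt"

   func main() {
       limit := int64(-10)
       // %q treats an integer as a rune literal; values outside the valid
       // code point range (such as any negative number) produce the
       // bad-verb marker instead of the number:
       fmt.Printf("invalid PidsLimit %q\n", limit) // invalid PidsLimit %!q(int64=-10)
       // A plain integer verb prints the value as intended:
       fmt.Printf("invalid PidsLimit %d\n", limit) // invalid PidsLimit -10
   }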
Thanks Qi for the link.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days