1930636 – the invalid value in containerruntimeconfig may cause node get stuck in NotReady status

Bug 1930636 - the invalid value in containerruntimeconfig may cause node get stuck in NotReady status

Summary: the invalid value in containerruntimeconfig may cause node get stuck in NotRe...

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.7
Hardware:	x86_64
OS:	All
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Harshal Patil
QA Contact:	MinLi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-02-19 10:25 UTC by MinLi
Modified:	2023-09-15 01:01 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:	Cause: The MCO failed to check the corner case of overlaySize and pidsLimit. Consequence: The node status stays NotReady. Fix: MCO will fail if the overlaySize is negative and pidsLimit is less than 0. Result: Roll out correct CRDs to cluster and fail if the CRD has invalid configurations.
Clone Of:
Environment:
Last Closed:	2021-03-31 07:16:21 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2448	0	None	open	Bug 1930636: Fix overlaySize & pidslimit ctrcfg	2021-03-01 21:35:29 UTC
Github	openshift machine-config-operator pull 2485	0	None	open	Bug 1930636: fix error print verb	2021-03-23 16:54:21 UTC

Description MinLi 2021-02-19 10:25:30 UTC

Description of problem:
the invalid value in containerruntimeconfig may cause node get stuck in NotReady status, such as we set overlaySize to -8G.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-02-17-224627

How reproducible:
always

Steps to Reproduce:
1.oc label mcp worker custom-crio-overlay=overlay-size

2.oc create -f containerRuntimeConfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: overlay-size
spec:
  containerRuntimeConfig:
    overlaySize: -8G
  machineConfigPoolSelector:
    matchLabels:
      custom-crio-overlay: overlay-size


3.oc get ctrcfg overlay-size -o yaml 

4.oc get mcp 

Actual results:
3.show create success
  - lastTransitionTime: "2021-02-19T08:57:01Z"
    message: Success
    status: "True"
    type: Success

4.mcp worker kept rolling out to new mc all the time, but the node get stuck in NotReady status.
$ oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-b5197ba2e0836cecc7271b0ef2884dad   True      False      False      3              3                   3                     0                      28h
worker   rendered-worker-d273fbfcb7db5becb14ce943da4d1210   False     True       False      2              0                   0                     0                      28h

$ oc get node 
NAME                                                      STATUS                        ROLES    AGE   VERSION
minmli0218-cthhn-m-0.c.openshift-qe.internal              Ready                         master   28h   v1.20.0+ba45583
minmli0218-cthhn-m-1.c.openshift-qe.internal              Ready                         master   28h   v1.20.0+ba45583
minmli0218-cthhn-m-2.c.openshift-qe.internal              Ready                         master   28h   v1.20.0+ba45583
minmli0218-cthhn-worker-a-5sjmq.c.openshift-qe.internal   NotReady,SchedulingDisabled   worker   28h   v1.20.0+ba45583
minmli0218-cthhn-worker-b-xplxp.c.openshift-qe.internal   Ready                         worker   28h   v1.20.0+ba45583


Expected results:
3.tip the value -8G is invalid.

Additional info:
Not all the invalid value will cause node reboot fail,I tried with setting pidsLimit: 0 in ctrcfg, it also show creating success, but it didn't generate any new mc, so no mcp roll out. 

I think either we add validation for ctrcfg field or we add clarification in doc  to remind customer the risk of setting invalid value in ctrcfg.

Comment 3 MinLi 2021-03-10 03:08:22 UTC

not fixed on version: 4.8.0-0.nightly-2021-03-08-184701

for parameter "overlaySize", the ctrcfg CR can't recognize non-digital value(such as 9asadG) as invalid value, neither prompt error messages, nor trigger mcp roll out. I think this should refer to the way of patameter "pidsLimit".(when you input non-digital value, you can't save the ctrcfg cr file successfully):
spec:
  containerRuntimeConfig:
    overlaySize: 9asadG
  machineConfigPoolSelector:
    matchLabels:
      custom-crio-overlay: overlay-size
status:
  conditions:
  - lastTransitionTime: "2021-03-09T08:42:15Z"
    message: 'Error: invalid overlaySize "-3G", cannot be less than 0'
    status: "False"
    type: Failure
  observedGeneration: 1



And another issue is the historical lastTransitionTime should keep historical data, not latest time:
spec:
  containerRuntimeConfig:
    overlaySize: 0G
  machineConfigPoolSelector:
    matchLabels:
      custom-crio-overlay: overlay-size
status:
  conditions:
  - lastTransitionTime: "2021-03-09T08:44:47Z" // this time should keep historical data, but it sync to latest time as the next item.
    message: 'Error: invalid overlaySize "-3G", cannot be less than 0'
    status: "False"
    type: Failure
  - lastTransitionTime: "2021-03-09T08:44:47Z"
    message: Success
    status: "True"
    type: Success
  observedGeneration: 3



for parameter "pidsLimit", ctrcfg description won't show error messages:
$ oc get ctrcfg set-pids-limit-master -o yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  creationTimestamp: "2021-03-09T09:36:32Z"
  generation: 1
  managedFields:
...
    manager: oc
    operation: Update
    time: "2021-03-09T09:36:32Z"
  name: set-pids-limit-master
  resourceVersion: "75129"
  uid: 7416f64e-a60b-42a7-8d38-dca3d72a1d86
spec:
  containerRuntimeConfig:
    pidsLimit: -90
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: pidLimit
[lyman@localhost env]$ oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-ea560dfc69f54e1eb2b30821f68e56cc   True      False      False      3              3                   3                     0                      157m
worker   rendered-worker-c14e3bbca7d1434ad8024ce7587120c4   True      False      False      3              3                   3                     0                      157m


for parameter "log_size_max", whatever value I set, 0、positive or negative, the ctrcfg description will reset null for this item:
spec:
  containerRuntimeConfig: {} // this line show empty
  machineConfigPoolSelector:
    matchLabels:
      custom-crio-max: log_size_max

Comment 4 Qi Wang 2021-03-18 18:56:43 UTC

@minmli Could you double-check the pidsLimit, I tested the release payload I built from the MCO master, the error message showed up. (it has a print verb issue though, I will fix it).

$ oc describe containerruntimeconfig.machineconfiguration.openshift.io/set-pids

...
...
Spec:
  Container Runtime Config:
    Pids Limit:  -10
  Machine Config Pool Selector:
    Match Labels:
      Custom - Crio - Pids:  pids
Status:
  Conditions:
    Last Transition Time:  2021-03-18T18:05:27Z
    Message:               Error: invalid PidsLimit %!q(int64=-10)
    Status:                False
    Type:                  Failure
  Observed Generation:     1
Events:                    <none>

Could you provide the "log_size_max" specification? That parameter is logSizeMax, I checked the status can update after I applied the crd.

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
...
spec:
  containerRuntimeConfig:
    logSizeMax: 10K
  machineConfigPoolSelector:
    matchLabels:
      custom-crio-log: log

Comment 5 MinLi 2021-03-19 08:43:58 UTC

Hi, Qi Wang

I checked the pidsLimit and got the same results as you pasted. So my previous issue no longer exists.

As to "log_size_max", I refer to https://docs.openshift.com/container-platform/4.7/post_installation_configuration/machine-configuration-tasks.html#create-a-containerruntimeconfig_post-install-machine-configuration-tasks. This page doesn't introduce the correct paramter "logSizeMax", I will file a doc bug to fix it.

Comment 6 Tom Sweeney 2021-03-19 19:38:00 UTC

@minmli @qiwang given MinLi's last comment, can we set this bug to "ON_QA" again or perhaps verified?

Comment 7 Qi Wang 2021-03-22 22:48:23 UTC

The `Last Transition Time` seems to work fine, could you also double-check this?

Spec:
  Container Runtime Config:
    Pids Limit:  10
  Machine Config Pool Selector:
    Match Labels:
      Custom - Crio - Log:  log
Status:
  Conditions:
    Last Transition Time:  2021-03-22T22:21:01Z
    Message:               Success
    Status:                True
    Type:                  Success
    Last Transition Time:  2021-03-22T22:33:03Z
    Message:               Error: invalid PidsLimit '\n'
    Status:                False
    Type:                  Failure
  Observed Generation:     2
Events:                    <none>

Comment 8 MinLi 2021-03-23 08:42:41 UTC

@Tom Sweeney, this bug still has some flakes to fix, can not move it to on_qa

Comment 9 MinLi 2021-03-23 09:33:41 UTC

@Qi Wang, oh, I get the point, if the current parameter's value is invalid, then the kubelet will probe every minutes to check if the error value not exist any more, so the 'Last Transition Time' keep up to date, yet if the check is successful, then the 'Last Transition Time' will not update. 
So the 'Last Transition Time' really works fine.

Comment 10 Qi Wang 2021-03-23 16:58:18 UTC

(In reply to MinLi from comment #9)
> @Qi Wang, oh, I get the point, if the current parameter's value is invalid,
> then the kubelet will probe every minutes to check if the error value not
> exist any more, so the 'Last Transition Time' keep up to date, yet if the
> check is successful, then the 'Last Transition Time' will not update. 
> So the 'Last Transition Time' really works fine.

Thanks for the explanation!
Discussed the non-digital overlaySize bug, this is a WONTFIX. The crd decode validation has been taken out from the MCO code.
I could only set up a PR as a follow up of this BZ to fix the print verb issue.

Comment 11 Qi Wang 2021-03-23 19:24:56 UTC

The reason we dropped the support for validation of kubeletconfig https://github.com/openshift/machine-config-operator/issues/2357

Comment 13 MinLi 2021-03-24 08:11:14 UTC

@ Qi Wang,  according to  Comment 12, does it mean we can handle the non-digital overlaySize bug?

Comment 15 Qi Wang 2021-03-24 19:42:54 UTC

(In reply to MinLi from comment #13)
> @ Qi Wang,  according to  Comment 12, does it mean we can handle the
> non-digital overlaySize bug?

No. ContainerRuntimeConfig is defined locally, but the overlaySize field is defined as resource.Quantity that is from kubernetes/apimachinery/pkg/api/resource. So it can not be validated inside MCO.  @harpatil  Is my understanding correct?


If it is fixable, at least I saw the pod log has the error 

W0316 16:45:13.344669       1 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.ContainerRuntimeConfig ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: v1.ContainerRuntimeConfig.Spec: v1.ContainerRuntimeConfigSpec.MachineConfigPoolSelector: ContainerRuntimeConfig: v1.ContainerRuntimeConfiguration.OverlaySize: unmarshalerDecoder: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', error found in #10 byte of ...|\":\"9asadG\"},\"machine|..., bigger context ...|:

Maybe MCO can capture this error somewhere? but I got lost in the code for now to have this fix.

Comment 18 MinLi 2021-03-25 08:04:26 UTC

@Harshal Patil 
thanks for your explanation, to my understanding, we can fix the validation of overlaySize. cc:@Qi Wang

Comment 20 MinLi 2021-04-01 02:16:52 UTC

@ Harshal Patil, ok, got you, it's acceptable if we add specification about the validation of ContainerRuntimeConfig in doc.

Btw, will you fix the print verb issue which @Qi Wang mentioned in  Comment 10? If yes, we can reopen this bug to track it.

Spec:
  Container Runtime Config:
    Pids Limit:  -10
  Machine Config Pool Selector:
    Match Labels:
      Custom - Crio - Pids:  pids
Status:
  Conditions:
    Last Transition Time:  2021-03-18T18:05:27Z
    Message:               Error: invalid PidsLimit %!q(int64=-10)
    Status:                False
    Type:                  Failure
  Observed Generation:     1
Events:                    <none>

Comment 21 Qi Wang 2021-04-01 13:52:22 UTC

(In reply to MinLi from comment #20)
> @ Harshal Patil, ok, got you, it's acceptable if we add specification about
> the validation of ContainerRuntimeConfig in doc.
> 
> Btw, will you fix the print verb issue which @Qi Wang mentioned in  Comment
> 10? If yes, we can reopen this bug to track it.
> 
> Spec:
>   Container Runtime Config:
>     Pids Limit:  -10
>   Machine Config Pool Selector:
>     Match Labels:
>       Custom - Crio - Pids:  pids
> Status:
>   Conditions:
>     Last Transition Time:  2021-03-18T18:05:27Z
>     Message:               Error: invalid PidsLimit %!q(int64=-10)
>     Status:                False
>     Type:                  Failure
>   Observed Generation:     1
> Events:                    <none>

Hi, there's the fix for the print issue https://github.com/openshift/machine-config-operator/pull/2485. It's merged already.

Comment 22 Harshal Patil 2021-04-05 07:50:16 UTC

Thanks Qi for the link.

Comment 23 Red Hat Bugzilla 2023-09-15 01:01:42 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

Note You need to log in before you can comment on or make changes to this bug.