Description of problem:

After setting the pod eviction policy, the machineconfigpool worker is always in `Updating` status.

mac:~ jianzhang$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-28f3c042508f1b5b5769bfc9d83c8243   True      False      False      3              3                   3                     0                      23h
worker   rendered-worker-1176db00e4bede3824da402987ea9141   False     True       False      3              0                   0                     0                      23h

Version-Release number of selected component (if applicable):
Cluster version is 4.4.0-0.nightly-2020-02-18-093529

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.4.
2. Install an operator on the Web console, for example, CNV.
3. Set the pod eviction policy:

$ oc label machineconfigpool worker custom-kubelet=small-pods

mac:~ jianzhang$ cat pod_evication.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods
  kubeletConfig:
    evictionSoft:
      memory.available: "90%"
      nodefs.available: "90%"
      nodefs.inodesFree: "90%"
    evictionPressureTransitionPeriod: 0s

mac:~ jianzhang$ oc get kubeletconfig
NAME                AGE
worker-kubeconfig   97m

Actual results:
The machineconfigpool worker stays in `Updating` status, and one worker stays `NotReady` for a long time.

mac:~ jianzhang$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-28f3c042508f1b5b5769bfc9d83c8243   True      False      False      3              3                   3                     0                      23h
worker   rendered-worker-1176db00e4bede3824da402987ea9141   False     True       False      3              0                   0                     0                      23h

mac:~ jianzhang$ oc get nodes
NAME                                         STATUS                        ROLES    AGE    VERSION
ip-10-0-135-29.us-east-2.compute.internal    Ready                         master   23h    v1.17.1
ip-10-0-136-239.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   23h    v1.17.1
ip-10-0-153-231.us-east-2.compute.internal   Ready                         master   23h    v1.17.1
ip-10-0-158-243.us-east-2.compute.internal   Ready                         worker   126m   v1.17.1
ip-10-0-161-79.us-east-2.compute.internal    Ready                         worker   126m   v1.17.1
ip-10-0-171-176.us-east-2.compute.internal   Ready                         master   23h    v1.17.1

Expected results:
The machineconfigpool worker should be updated successfully.

Additional info:
1) Two clusteroperators are in False status because one worker is not ready and pods report "Insufficient cpu". I'm not sure whether this is the root cause of the machineconfigpool worker update failure, but either way it shouldn't happen.
mac:~ jianzhang$ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.4.0-0.nightly-2020-02-18-093529 True False False 23h cloud-credential 4.4.0-0.nightly-2020-02-18-093529 True False False 23h cluster-autoscaler 4.4.0-0.nightly-2020-02-18-093529 True False False 23h console 4.4.0-0.nightly-2020-02-18-093529 True False False 23h csi-snapshot-controller 4.4.0-0.nightly-2020-02-18-093529 True False False 23h dns 4.4.0-0.nightly-2020-02-18-093529 True False False 23h etcd 4.4.0-0.nightly-2020-02-18-093529 True False False 23h image-registry 4.4.0-0.nightly-2020-02-18-093529 True False False 81m ingress 4.4.0-0.nightly-2020-02-18-093529 True False False 80m insights 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-scheduler 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-storage-version-migrator 4.4.0-0.nightly-2020-02-18-093529 False False False 81m machine-api 4.4.0-0.nightly-2020-02-18-093529 True False False 23h machine-config 4.4.0-0.nightly-2020-02-18-093529 True False False 23h marketplace 4.4.0-0.nightly-2020-02-18-093529 True False False 23h monitoring 4.4.0-0.nightly-2020-02-18-093529 False True True 75m network 4.4.0-0.nightly-2020-02-18-093529 True True True 23h node-tuning 4.4.0-0.nightly-2020-02-18-093529 True False False 23h openshift-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 5h22m openshift-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h openshift-samples 4.4.0-0.nightly-2020-02-18-093529 True False False 23h operator-lifecycle-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h operator-lifecycle-manager-catalog 4.4.0-0.nightly-2020-02-18-093529 True False False 23h operator-lifecycle-manager-packageserver 4.4.0-0.nightly-2020-02-18-093529 True False False 5h22m service-ca 4.4.0-0.nightly-2020-02-18-093529 True False False 23h service-catalog-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 23h service-catalog-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h storage 4.4.0-0.nightly-2020-02-18-093529 True False False 23h mac:~ jianzhang$ oc get pods -n openshift-kube-storage-version-migrator NAME READY STATUS RESTARTS AGE migrator-54b9f4568d-n67p9 0/1 Pending 0 83m mac:~ jianzhang$ oc describe pods -n openshift-kube-storage-version-migrator Name: migrator-54b9f4568d-n67p9 Namespace: openshift-kube-storage-version-migrator Priority: 0 Node: <none> Labels: app=migrator ... Node-Selectors: <none> Tolerations: node.kubernetes.io/memory-pressure:NoSchedule node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. mac:~ jianzhang$ oc describe pods alertmanager-main-0 -n openshift-monitoring Name: alertmanager-main-0 Namespace: openshift-monitoring Priority: 2000000000 ... 
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. 2) Describe the failure node: mac:~ jianzhang$ oc describe nodes ip-10-0-158-243.us-east-2.compute.internal Name: ip-10-0-158-243.us-east-2.compute.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m4.large beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-east-2 failure-domain.beta.kubernetes.io/zone=us-east-2b kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-10-0-158-243 kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=m4.large node.openshift.io/os_id=rhcos topology.kubernetes.io/region=us-east-2 topology.kubernetes.io/zone=us-east-2b Annotations: machine.openshift.io/machine: openshift-machine-api/qe-jiazha5-5kmbv-worker-us-east-2b-bmtj2 machineconfiguration.openshift.io/currentConfig: rendered-worker-1176db00e4bede3824da402987ea9141 machineconfiguration.openshift.io/desiredConfig: rendered-worker-1176db00e4bede3824da402987ea9141 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Thu, 20 Feb 2020 11:46:56 +0800 Taints: <none> Unschedulable: false Lease: HolderIdentity: ip-10-0-158-243.us-east-2.compute.internal AcquireTime: <unset> RenewTime: Thu, 20 Feb 2020 13:40:51 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:46:56 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:46:56 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:46:56 +0800 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:47:57 +0800 KubeletReady kubelet is posting ready status Addresses: InternalIP: 10.0.158.243 Hostname: ip-10-0-158-243.us-east-2.compute.internal InternalDNS: ip-10-0-158-243.us-east-2.compute.internal Capacity: attachable-volumes-aws-ebs: 39 cpu: 2 ephemeral-storage: 125277164Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 8161840Ki pods: 250 Allocatable: attachable-volumes-aws-ebs: 39 cpu: 1500m ephemeral-storage: 114381692328 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 7010864Ki pods: 250 System Info: Machine ID: 62f290223949478488f198cbb669ddff System UUID: ec29c5ea-eacc-4812-62a2-66f37028a88f Boot ID: 3a8d0e50-6948-4754-b2b4-ef28ae8c06af Kernel Version: 4.18.0-147.5.1.el8_1.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 44.81.202002180730-0 (Ootpa) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.17.0-4.dev.rhaos4.4.gitc3436cc.el8 Kubelet Version: v1.17.1 Kube-Proxy Version: v1.17.1 ProviderID: aws:///us-east-2b/i-0ba66391e2c029246 Non-terminated Pods: (20 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- 
openshift-cluster-node-tuning-operator tuned-cvvjz 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 114m openshift-csi-snapshot-controller csi-snapshot-controller-669fcbbb8f-8mqcp 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 84m openshift-dns dns-default-698pg 110m (7%) 0 (0%) 70Mi (1%) 512Mi (7%) 113m openshift-image-registry node-ca-j52ph 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 114m openshift-ingress router-default-5cc767646d-dswmf 100m (6%) 0 (0%) 256Mi (3%) 0 (0%) 23h openshift-machine-config-operator machine-config-daemon-p6bv5 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 113m openshift-marketplace redhat-marketplace-5b9dfc7d66-qllqt 10m (0%) 0 (0%) 100Mi (1%) 0 (0%) 86m openshift-marketplace redhat-operators-55f686f4d9-r7f9n 10m (0%) 0 (0%) 100Mi (1%) 0 (0%) 86m openshift-monitoring alertmanager-main-2 110m (7%) 100m (6%) 245Mi (3%) 25Mi (0%) 86m openshift-monitoring grafana-755b7df4f9-rv7cn 110m (7%) 0 (0%) 120Mi (1%) 0 (0%) 23h openshift-monitoring node-exporter-4wtgw 112m (7%) 0 (0%) 200Mi (2%) 0 (0%) 114m openshift-monitoring prometheus-adapter-5cd6485bbb-8gf6b 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 86m openshift-monitoring prometheus-k8s-1 480m (32%) 200m (13%) 1234Mi (18%) 50Mi (0%) 23h openshift-monitoring telemeter-client-7c6587467-5vrff 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 86m openshift-multus multus-c5vjz 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 114m openshift-operators cdi-operator-67887974b-lwtl5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 86m openshift-operators hco-operator-54cd7db78c-6m5hb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 86m openshift-operators virt-operator-546775946c-lz4lt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 86m openshift-sdn ovs-4wxw7 200m (13%) 0 (0%) 400Mi (5%) 0 (0%) 114m openshift-sdn sdn-84ffv 100m (6%) 0 (0%) 200Mi (2%) 0 (0%) 114m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 1442m (96%) 300m (20%) memory 3325Mi (48%) 587Mi (8%) ephemeral-storage 0 (0%) 0 (0%) attachable-volumes-aws-ebs 0 0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Starting 114m kubelet, ip-10-0-158-243.us-east-2.compute.internal Starting kubelet. 
Normal NodeHasSufficientMemory 114m (x2 over 114m) kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeHasSufficientMemory Normal NodeHasNoDiskPressure 114m (x2 over 114m) kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeHasNoDiskPressure Normal NodeHasSufficientPID 114m (x2 over 114m) kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeHasSufficientPID Normal NodeAllocatableEnforced 114m kubelet, ip-10-0-158-243.us-east-2.compute.internal Updated Node Allocatable limit across pods Normal NodeReady 113m kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeReady 3) Describe the machineconfigpool worker mac:~ jianzhang$ oc get machineconfigpool worker -o yaml apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: creationTimestamp: "2020-02-19T06:07:52Z" generation: 3 labels: custom-kubelet: small-pods machineconfiguration.openshift.io/mco-built-in: "" name: worker resourceVersion: "650473" selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker uid: 30eb6291-9bea-4ef4-95c9-40ec289a2779 spec: configuration: name: rendered-worker-c9d267810c63b6a99557acc27bdfc847 source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-30eb6291-9bea-4ef4-95c9-40ec289a2779-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-30eb6291-9bea-4ef4-95c9-40ec289a2779-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh machineConfigSelector: matchLabels: machineconfiguration.openshift.io/role: worker nodeSelector: matchLabels: node-role.kubernetes.io/worker: "" paused: false status: conditions: - lastTransitionTime: "2020-02-19T06:08:24Z" message: "" reason: "" status: "False" type: RenderDegraded - lastTransitionTime: "2020-02-19T06:08:29Z" message: "" reason: "" status: "False" type: NodeDegraded - lastTransitionTime: "2020-02-19T06:08:29Z" message: "" reason: "" status: "False" type: Degraded - lastTransitionTime: "2020-02-20T04:15:01Z" message: "" reason: "" status: "False" type: Updated - lastTransitionTime: "2020-02-20T04:15:01Z" message: All nodes are updating to rendered-worker-c9d267810c63b6a99557acc27bdfc847 reason: "" status: "True" type: Updating configuration: name: rendered-worker-1176db00e4bede3824da402987ea9141 source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-30eb6291-9bea-4ef4-95c9-40ec289a2779-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh degradedMachineCount: 0 machineCount: 3 observedGeneration: 3 readyMachineCount: 0 unavailableMachineCount: 1 updatedMachineCount: 0
Some cluster operators (network, monitor) depends on each node healthy. Due to there is a NotReady node, they are in unsuccessful status. # oc get co/monitoring -oyaml message: 'Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 8, updated: 8, ready: 7, unavailable: 1)' reason: UpdatingnodeExporterFailed status: "True" type: Degraded mac:~ jianzhang$ oc get co network -o yaml apiVersion: config.openshift.io/v1 kind: ClusterOperator metadata: annotations: network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[{"Namespace":"openshift-sdn","Name":"sdn","LastSeenStatus":{"currentNumberScheduled":8,"numberMisscheduled":0,"desiredNumberScheduled":8,"numberReady":7,"observedGeneration":1,"updatedNumberScheduled":8,"numberAvailable":7,"numberUnavailable":1},"LastChangeTime":"2020-02-20T06:25:32.736963814Z"},{"Namespace":"openshift-multus","Name":"multus","LastSeenStatus":{"currentNumberScheduled":8,"numberMisscheduled":0,"desiredNumberScheduled":8,"numberReady":7,"observedGeneration":1,"updatedNumberScheduled":8,"numberAvailable":7,"numberUnavailable":1},"LastChangeTime":"2020-02-20T06:25:32.319305138Z"},{"Namespace":"openshift-sdn","Name":"ovs","LastSeenStatus":{"currentNumberScheduled":8,"numberMisscheduled":0,"desiredNumberScheduled":8,"numberReady":7,"observedGeneration":1,"updatedNumberScheduled":8,"numberAvailable":7,"numberUnavailable":1},"LastChangeTime":"2020-02-20T06:25:42.025978461Z"}],"DeploymentStates":[]}' creationTimestamp: "2020-02-19T06:03:12Z" generation: 1 name: network ... - lastTransitionTime: "2020-02-20T04:15:56Z" message: |- DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes) DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes) DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes) mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-135-29.us-east-2.compute.internal Ready master 25h v1.17.1 ip-10-0-136-239.us-east-2.compute.internal NotReady,SchedulingDisabled worker 25h v1.17.1 ip-10-0-148-217.us-east-2.compute.internal Ready worker 55m v1.17.1 ip-10-0-153-231.us-east-2.compute.internal Ready master 25h v1.17.1 ip-10-0-158-243.us-east-2.compute.internal Ready worker 3h32m v1.17.1 ip-10-0-158-34.us-east-2.compute.internal Ready worker 55m v1.17.1 ip-10-0-161-79.us-east-2.compute.internal Ready worker 3h32m v1.17.1 ip-10-0-171-176.us-east-2.compute.internal Ready master 25h v1.17.1 mac:~ jianzhang$ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.4.0-0.nightly-2020-02-18-093529 True False False 24h cloud-credential 4.4.0-0.nightly-2020-02-18-093529 True False False 25h cluster-autoscaler 4.4.0-0.nightly-2020-02-18-093529 True False False 24h console 4.4.0-0.nightly-2020-02-18-093529 True False False 24h csi-snapshot-controller 4.4.0-0.nightly-2020-02-18-093529 True False False 24h dns 4.4.0-0.nightly-2020-02-18-093529 True False False 25h etcd 4.4.0-0.nightly-2020-02-18-093529 True False False 24h image-registry 4.4.0-0.nightly-2020-02-18-093529 True False False 176m ingress 4.4.0-0.nightly-2020-02-18-093529 True False False 175m insights 4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-scheduler 
4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-storage-version-migrator 4.4.0-0.nightly-2020-02-18-093529 True False False 45m machine-api 4.4.0-0.nightly-2020-02-18-093529 True False False 25h machine-config 4.4.0-0.nightly-2020-02-18-093529 True False False 25h marketplace 4.4.0-0.nightly-2020-02-18-093529 True False False 25h monitoring 4.4.0-0.nightly-2020-02-18-093529 False True True 170m network 4.4.0-0.nightly-2020-02-18-093529 True True True 25h node-tuning 4.4.0-0.nightly-2020-02-18-093529 True False False 25h openshift-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 6h57m openshift-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h openshift-samples 4.4.0-0.nightly-2020-02-18-093529 True False False 24h operator-lifecycle-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h operator-lifecycle-manager-catalog 4.4.0-0.nightly-2020-02-18-093529 True False False 25h operator-lifecycle-manager-packageserver 4.4.0-0.nightly-2020-02-18-093529 True False False 6h57m service-ca 4.4.0-0.nightly-2020-02-18-093529 True False False 25h service-catalog-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 25h service-catalog-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h storage 4.4.0-0.nightly-2020-02-18-093529 True False False 25h
Besides, when I create a new pod, it still is scheduled to this NotReady node: ip-10-0-136-239.us-east-2.compute.internal mac:~ jianzhang$ oc get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES certified-operators-8d7796766-f229f 1/1 Running 0 5h41m 10.129.2.6 ip-10-0-161-79.us-east-2.compute.internal <none> <none> community-operators-9d887b488-wvlc9 1/1 Running 0 5h41m 10.129.2.11 ip-10-0-161-79.us-east-2.compute.internal <none> <none> marketplace-operator-6d9f75cc47-7lgxq 1/1 Running 0 27h 10.129.0.20 ip-10-0-153-231.us-east-2.compute.internal <none> <none> poll-test-ksbx7 0/1 Pending 0 10m <none> ip-10-0-136-239.us-east-2.compute.internal <none> <none> redhat-marketplace-5b9dfc7d66-qllqt 1/1 Running 0 5h41m 10.128.2.11 ip-10-0-158-243.us-east-2.compute.internal <none> <none> redhat-operators-b747df6b6-9pzgf 1/1 Running 0 42m 10.130.2.5 ip-10-0-158-34.us-east-2.compute.internal <none> <none> mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-135-29.us-east-2.compute.internal Ready master 27h v1.17.1 ip-10-0-136-239.us-east-2.compute.internal NotReady,SchedulingDisabled worker 27h v1.17.1 ip-10-0-148-217.us-east-2.compute.internal Ready worker 3h31m v1.17.1 ip-10-0-153-231.us-east-2.compute.internal Ready master 27h v1.17.1 ip-10-0-158-243.us-east-2.compute.internal Ready worker 6h9m v1.17.1 ip-10-0-158-34.us-east-2.compute.internal Ready worker 3h31m v1.17.1 ip-10-0-161-79.us-east-2.compute.internal Ready worker 6h9m v1.17.1 ip-10-0-171-176.us-east-2.compute.internal Ready master 27h v1.17.1 mac:~ jianzhang$ oc describe pods poll-test-ksbx7 Name: poll-test-ksbx7 Namespace: openshift-marketplace Priority: 0 Node: ip-10-0-136-239.us-east-2.compute.internal/ Labels: olm.catalogSource=poll-test Annotations: openshift.io/scc: anyuid Status: Pending IP: IPs: <none> Containers: registry-server: Image: quay.io/my-catalogs/my-catalog:master Port: 50051/TCP Host Port: 0/TCP Limits: cpu: 100m memory: 100Mi Requests: cpu: 10m memory: 50Mi Liveness: exec [grpc_health_probe -addr=localhost:50051] delay=10s timeout=1s period=10s #success=1 #failure=3 Readiness: exec [grpc_health_probe -addr=localhost:50051] delay=5s timeout=5s period=10s #success=1 #failure=3 Environment: <none> Mounts: /var/run/secrets/kubernetes.io/serviceaccount from default-token-z78cv (ro) Conditions: Type Status PodScheduled True Volumes: default-token-z78cv: Type: Secret (a volume populated by a Secret) SecretName: default-token-z78cv Optional: false QoS Class: Burstable Node-Selectors: beta.kubernetes.io/os=linux Tolerations: Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled <unknown> default-scheduler Successfully assigned openshift-marketplace/poll-test-ksbx7 to ip-10-0-136-239.us-east-2.compute.internal cannot gather this node logs. mac:~ jianzhang$ oc adm node-logs ip-10-0-136-239.us-east-2.compute.internal error: the server is currently unable to handle the request Error trying to reach service: 'dial tcp 10.0.136.239:10250: connect: connection refused'
mac:~ jianzhang$ oc adm cordon ip-10-0-136-239.us-east-2.compute.internal
node/ip-10-0-136-239.us-east-2.compute.internal already cordoned
Backported to 4.4 - the CPU reservations should be back to where they were https://github.com/openshift/machine-config-operator/pull/1476
Cluster version is 4.4.0-0.nightly-2020-03-08-235004 1, Install CNV operator and make sure they are running on the same worker. As follows mac:~ jianzhang$ oc adm cordon ip-10-0-153-230.us-east-2.compute.internal ip-10-0-168-220.us-east-2.compute.internal node/ip-10-0-153-230.us-east-2.compute.internal cordoned node/ip-10-0-168-220.us-east-2.compute.internal cordoned mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-131-77.us-east-2.compute.internal Ready worker 37m v1.17.1 ip-10-0-133-75.us-east-2.compute.internal Ready master 45m v1.17.1 ip-10-0-146-18.us-east-2.compute.internal Ready master 45m v1.17.1 ip-10-0-153-230.us-east-2.compute.internal Ready,SchedulingDisabled worker 39m v1.17.1 ip-10-0-168-220.us-east-2.compute.internal Ready,SchedulingDisabled worker 37m v1.17.1 ip-10-0-173-167.us-east-2.compute.internal Ready master 45m v1.17.1 mac:~ jianzhang$ oc get csv -n openshift-operators NAME DISPLAY VERSION REPLACES PHASE kubevirt-hyperconverged-operator.v2.2.0 Container-native virtualization 2.2.0 Succeeded mac:~ jianzhang$ oc get pods -n openshift-operators -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES cdi-operator-67887974b-vm85q 1/1 Running 0 4m34s 10.128.2.13 ip-10-0-131-77.us-east-2.compute.internal <none> <none> cluster-network-addons-operator-7c95c8659b-wkqwm 1/1 Running 0 4m34s 10.128.2.10 ip-10-0-131-77.us-east-2.compute.internal <none> <none> hco-operator-54cd7db78c-qm76t 1/1 Running 0 4m34s 10.128.2.11 ip-10-0-131-77.us-east-2.compute.internal <none> <none> hostpath-provisioner-operator-fb9cbc8b7-bsssd 1/1 Running 0 4m34s 10.128.2.16 ip-10-0-131-77.us-east-2.compute.internal <none> <none> kubevirt-ssp-operator-7885c98fb9-b6ztk 1/1 Running 0 4m34s 10.128.2.14 ip-10-0-131-77.us-east-2.compute.internal <none> <none> node-maintenance-operator-99556c65f-d6vzn 1/1 Running 0 4m34s 10.130.0.41 ip-10-0-146-18.us-east-2.compute.internal <none> <none> virt-operator-546775946c-5jhlj 1/1 Running 0 4m34s 10.128.2.15 ip-10-0-131-77.us-east-2.compute.internal <none> <none> virt-operator-546775946c-689dd 1/1 Running 0 4m34s 10.128.2.12 ip-10-0-131-77.us-east-2.compute.internal <none> <none> mac:~ jianzhang$ oc describe nodes ip-10-0-131-77.us-east-2.compute.internal Name: ip-10-0-131-77.us-east-2.compute.internal ... Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 1392m (92%) 200m (13%) memory 3010Mi (43%) 562Mi (8%) ephemeral-storage 0 (0%) 0 (0%) attachable-volumes-aws-ebs 0 0 2, Change back to uncordon. mac:~ jianzhang$ oc adm uncordon ip-10-0-153-230.us-east-2.compute.internal node/ip-10-0-153-230.us-east-2.compute.internal uncordoned mac:~ jianzhang$ oc adm uncordon ip-10-0-168-220.us-east-2.compute.internal node/ip-10-0-168-220.us-east-2.compute.internal uncordoned mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-131-77.us-east-2.compute.internal Ready worker 49m v1.17.1 ip-10-0-133-75.us-east-2.compute.internal Ready master 56m v1.17.1 ip-10-0-146-18.us-east-2.compute.internal Ready master 57m v1.17.1 ip-10-0-153-230.us-east-2.compute.internal Ready worker 51m v1.17.1 ip-10-0-168-220.us-east-2.compute.internal Ready worker 49m v1.17.1 ip-10-0-173-167.us-east-2.compute.internal Ready master 57m v1.17.1 3, Create the KubeletConfig object. 
mac:~ jianzhang$ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-e3490b51f062c3a94b508209b44089f6 True False False 3 3 3 0 56m worker rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 True False False 3 3 3 0 56m mac:~ jianzhang$ oc label machineconfigpool worker custom-kubelet=small-pods machineconfigpool.machineconfiguration.openshift.io/worker labeled mac:~ jianzhang$ cat pod_evication.yaml apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: worker-kubeconfig spec: machineConfigPoolSelector: matchLabels: custom-kubelet: small-pods kubeletConfig: evictionSoft: memory.available: "90%" nodefs.available: "90%" nodefs.inodesFree: "90%" evictionPressureTransitionPeriod: 0s mac:~ jianzhang$ oc create -f pod_evication.yaml The KubeletConfig "worker-kubeconfig" is invalid: * spec.kubeletConfig.apiVersion: Required value: must not be empty * spec.kubeletConfig.kind: Required value: must not be empty Report a bug for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1811493 4, I work around the above issue, but, the node ip-10-0-153-230.us-east-2.compute.internal is in NotReady status for hours. machineconfigpool work still in 'Updating' status. Change the bug status to ASSIGNED. mac:~ jianzhang$ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-e3490b51f062c3a94b508209b44089f6 True False False 3 3 3 0 94m worker rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 False True False 3 0 0 0 94m mac:~ jianzhang$ date Mon Mar 9 11:58:34 CST 2020 mac:~ jianzhang$ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-e3490b51f062c3a94b508209b44089f6 True False False 3 3 3 0 4h22m worker rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 False True False 3 0 0 0 4h22m mac:~ jianzhang$ date Mon Mar 9 14:41:22 CST 2020 mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-131-77.us-east-2.compute.internal Ready worker 4h19m v1.17.1 ip-10-0-133-75.us-east-2.compute.internal Ready master 4h26m v1.17.1 ip-10-0-146-18.us-east-2.compute.internal Ready master 4h27m v1.17.1 ip-10-0-153-230.us-east-2.compute.internal NotReady,SchedulingDisabled worker 4h21m v1.17.1 ip-10-0-168-220.us-east-2.compute.internal Ready worker 4h19m v1.17.1 ip-10-0-173-167.us-east-2.compute.internal Ready master 4h27m v1.17.1 mac:~ jianzhang$ oc adm node-logs ip-10-0-153-230.us-east-2.compute.internal error: the server is currently unable to handle the request Error trying to reach service: 'dial tcp 10.0.153.230:10250: connect: connection refused' mac:~ jianzhang$ oc describe nodes ip-10-0-153-230.us-east-2.compute.internal Name: ip-10-0-153-230.us-east-2.compute.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m4.large beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-east-2 failure-domain.beta.kubernetes.io/zone=us-east-2b kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-10-0-153-230 kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=m4.large node.openshift.io/os_id=rhcos topology.kubernetes.io/region=us-east-2 topology.kubernetes.io/zone=us-east-2b Annotations: machine.openshift.io/machine: openshift-machine-api/qe-jiazha39-m47lh-worker-us-east-2b-wkft5 
machineconfiguration.openshift.io/currentConfig: rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 machineconfiguration.openshift.io/desiredConfig: rendered-worker-8b585c5657a1115cd7e3c19c03e2070d machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Working volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Mon, 09 Mar 2020 10:22:44 +0800 Taints: node.kubernetes.io/unreachable:NoExecute node.kubernetes.io/unreachable:NoSchedule node.kubernetes.io/unschedulable:NoSchedule Unschedulable: true Lease: HolderIdentity: ip-10-0-153-230.us-east-2.compute.internal AcquireTime: <unset> RenewTime: Mon, 09 Mar 2020 11:51:09 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. DiskPressure Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. PIDPressure Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. Ready Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. Addresses: InternalIP: 10.0.153.230 Hostname: ip-10-0-153-230.us-east-2.compute.internal InternalDNS: ip-10-0-153-230.us-east-2.compute.internal Capacity: attachable-volumes-aws-ebs: 39 cpu: 2 ephemeral-storage: 125277164Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 8161848Ki pods: 250 Allocatable: attachable-volumes-aws-ebs: 39 cpu: 1500m ephemeral-storage: 114381692328 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 7010872Ki pods: 250 System Info: Machine ID: 58a6afd07a3348f9ba5c50892a718dd4 System UUID: ec2ac93f-0a21-6ff0-d744-acdff0bfdc64 Boot ID: 58f62364-7da5-4834-a02d-c2eda73c6c39 Kernel Version: 4.18.0-147.5.1.el8_1.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 44.81.202003081930-0 (Ootpa) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8 Kubelet Version: v1.17.1 Kube-Proxy Version: v1.17.1 ProviderID: aws:///us-east-2b/i-0a675ea932ed2145b Non-terminated Pods: (8 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- openshift-cluster-node-tuning-operator tuned-lrs2f 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h23m openshift-dns dns-default-c8jnb 110m (7%) 0 (0%) 70Mi (1%) 512Mi (7%) 4h22m openshift-image-registry node-ca-pp7zd 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h23m openshift-machine-config-operator machine-config-daemon-ms5tb 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 4h22m openshift-monitoring node-exporter-kmgjp 112m (7%) 0 (0%) 200Mi (2%) 0 (0%) 4h22m openshift-multus multus-69ctl 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 4h23m openshift-sdn ovs-b5x9c 200m (13%) 0 (0%) 400Mi (5%) 0 (0%) 4h23m openshift-sdn sdn-4w5vv 100m (6%) 0 (0%) 200Mi (2%) 0 (0%) 4h23m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) 
Resource Requests Limits -------- -------- ------ cpu 592m (39%) 0 (0%) memory 1180Mi (17%) 512Mi (7%) ephemeral-storage 0 (0%) 0 (0%) attachable-volumes-aws-ebs 0 0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal NodeNotSchedulable 175m (x2 over 3h44m) kubelet, ip-10-0-153-230.us-east-2.compute.internal Node ip-10-0-153-230.us-east-2.compute.internal status is now: NodeNotSchedulable mac:~ jianzhang$ oc get machineconfigpool worker -o yaml apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: creationTimestamp: "2020-03-09T02:18:17Z" generation: 3 labels: custom-kubelet: small-pods machineconfiguration.openshift.io/mco-built-in: "" name: worker resourceVersion: "49930" selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker uid: e548efb3-8f21-4385-b4a3-113d8b055ed5 spec: configuration: name: rendered-worker-8b585c5657a1115cd7e3c19c03e2070d source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-e548efb3-8f21-4385-b4a3-113d8b055ed5-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-e548efb3-8f21-4385-b4a3-113d8b055ed5-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh machineConfigSelector: matchLabels: machineconfiguration.openshift.io/role: worker nodeSelector: matchLabels: node-role.kubernetes.io/worker: "" paused: false status: conditions: - lastTransitionTime: "2020-03-09T02:18:39Z" message: "" reason: "" status: "False" type: RenderDegraded - lastTransitionTime: "2020-03-09T02:18:44Z" message: "" reason: "" status: "False" type: NodeDegraded - lastTransitionTime: "2020-03-09T02:18:44Z" message: "" reason: "" status: "False" type: Degraded - lastTransitionTime: "2020-03-09T03:51:01Z" message: "" reason: "" status: "False" type: Updated - lastTransitionTime: "2020-03-09T03:51:01Z" message: All nodes are updating to rendered-worker-8b585c5657a1115cd7e3c19c03e2070d reason: "" status: "True" type: Updating configuration: name: rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-e548efb3-8f21-4385-b4a3-113d8b055ed5-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh degradedMachineCount: 0 machineCount: 3 observedGeneration: 3 readyMachineCount: 0 unavailableMachineCount: 1 updatedMachineCount: 0
Hi, Ryan,

For the 'The KubeletConfig "worker-kubeconfig" is invalid' issue, I reported bug 1811493. I think bug 1811493 is a duplicate of bug 1811212, not of this bug. Reopening this bug since the issue still exists.
Hi Jian,

When you specify 'evictionSoft' it is necessary to specify 'evictionSoftGracePeriod' too. I modified your KubeletConfig to add 'evictionSoftGracePeriod', followed the steps below, and everything worked just fine:

1. Install OCP 4.4.
2. Install an operator on the Web console, for example, CNV.
3. oc label machineconfigpool worker custom-kubelet=small-pods
4. $ cat kubelet-fix.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig-fix
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods
  kubeletConfig:
    evictionSoft:
      memory.available: "90%"
      nodefs.available: "90%"
      nodefs.inodesFree: "90%"
    evictionSoftGracePeriod:
      memory.available: "1h"
      nodefs.available: "1h"
      nodefs.inodesFree: "1h"
    evictionPressureTransitionPeriod: 0s

Please let me know if that works for you.
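As a small follow-up sketch (assuming the corrected manifest above is saved as kubelet-fix.yaml; <worker-node> is a placeholder for any node in the worker pool), the rollout can be applied and watched like this:

# Apply the corrected KubeletConfig
oc create -f kubelet-fix.yaml

# Watch the worker pool; UPDATED should return to True once every worker
# has picked up the new rendered-worker-* configuration and rebooted
oc get machineconfigpool worker -w

# Per-node progress is visible in the MCO annotations
# (currentConfig and desiredConfig converge when a node is done)
oc describe node <worker-node> | grep machineconfiguration.openshift.io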
Hi Harshal,

Thanks for the information! It works after setting the `evictionSoftGracePeriod` field. If `evictionSoftGracePeriod` is mandatory when using soft eviction, then creating a kubeletconfig object that lacks `evictionSoftGracePeriod` should be rejected. Modifying the bug title and reopening it.
I will look into it in the coming sprint.
I have created https://github.com/openshift/machine-config-operator/pull/1880 to address this issue.
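With that validation in place, a KubeletConfig that defines evictionSoft without a matching evictionSoftGracePeriod should be flagged instead of silently wedging the worker pool. A minimal way to check this (assuming the object is named worker-kubeconfig as in the earlier examples):

# The validation error is reported as a Failure condition on the KubeletConfig status
oc get kubeletconfig worker-kubeconfig \
  -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'

# The worker pool should remain on its current rendered config rather than
# getting stuck in Updating
oc get machineconfigpool worker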
verified with version : 4.6.0-0.nightly-2020-07-05-192128 $ oc get kubeletconfig worker-kubeconfig -o yaml apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: creationTimestamp: "2020-07-06T04:04:58Z" generation: 3 managedFields: - apiVersion: machineconfiguration.openshift.io/v1 fieldsType: FieldsV1 fieldsV1: f:status: .: {} f:conditions: {} manager: machine-config-controller operation: Update time: "2020-07-06T04:08:40Z" - apiVersion: machineconfiguration.openshift.io/v1 fieldsType: FieldsV1 fieldsV1: f:spec: .: {} f:kubeletConfig: .: {} f:evictionPressureTransitionPeriod: {} f:evictionSoft: {} f:evictionSoftGracePeriod: {} f:machineConfigPoolSelector: .: {} f:matchLabels: .: {} f:custom-kubelet: {} manager: oc operation: Update time: "2020-07-06T04:08:40Z" name: worker-kubeconfig resourceVersion: "64208" selfLink: /apis/machineconfiguration.openshift.io/v1/kubeletconfigs/worker-kubeconfig uid: 4e8a2b9c-443c-4c57-9531-e1bd3f8d3ffd spec: kubeletConfig: evictionPressureTransitionPeriod: 0s evictionSoft: memory.available: 90% nodefs.available: 90% nodefs.inodesFree: 90% evictionSoftGracePeriod: memory.available: 1h nodefs.available: 1h machineConfigPoolSelector: matchLabels: custom-kubelet: small-pods status: conditions: - lastTransitionTime: "2020-07-06T04:04:58Z" message: 'Error: KubeletConfiguration: EvictionSoftGracePeriod must be set when evictionSoft is defined, evictionSoft: map[memory.available:90% nodefs.available:90% nodefs.inodesFree:90%]' status: "False" type: Failure - lastTransitionTime: "2020-07-06T04:07:37Z" message: 'Error: KubeletConfiguration: evictionSoft[nodefs.available] is defined but EvictionSoftGracePeriod[nodefs.available] is not set' status: "False" type: Failure - lastTransitionTime: "2020-07-06T04:08:40Z" message: 'Error: KubeletConfiguration: evictionSoft[nodefs.inodesFree] is defined but EvictionSoftGracePeriod[nodefs.inodesFree] is not set' status: "False" type: Failure
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196