Description of problem:

After setting the pod eviction policy, the machineconfigpool worker is always in `Updating` status.

mac:~ jianzhang$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-28f3c042508f1b5b5769bfc9d83c8243   True      False      False      3              3                   3                     0                      23h
worker   rendered-worker-1176db00e4bede3824da402987ea9141   False     True       False      3              0                   0                     0                      23h

Version-Release number of selected component (if applicable):
Cluster version is 4.4.0-0.nightly-2020-02-18-093529

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 4.4.
2. Install an operator on the Web console, for example, CNV.
3. Set the pod eviction policy:

$ oc label machineconfigpool worker custom-kubelet=small-pods

mac:~ jianzhang$ cat pod_evication.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods
  kubeletConfig:
    evictionSoft:
      memory.available: "90%"
      nodefs.available: "90%"
      nodefs.inodesFree: "90%"
    evictionPressureTransitionPeriod: 0s

mac:~ jianzhang$ oc get kubeletconfig
NAME                AGE
worker-kubeconfig   97m

Actual results:
The machineconfigpool worker stays in `Updating` status, and one worker stays `NotReady` for a long time.

mac:~ jianzhang$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-28f3c042508f1b5b5769bfc9d83c8243   True      False      False      3              3                   3                     0                      23h
worker   rendered-worker-1176db00e4bede3824da402987ea9141   False     True       False      3              0                   0                     0                      23h

mac:~ jianzhang$ oc get nodes
NAME                                         STATUS                        ROLES    AGE    VERSION
ip-10-0-135-29.us-east-2.compute.internal    Ready                         master   23h    v1.17.1
ip-10-0-136-239.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   23h    v1.17.1
ip-10-0-153-231.us-east-2.compute.internal   Ready                         master   23h    v1.17.1
ip-10-0-158-243.us-east-2.compute.internal   Ready                         worker   126m   v1.17.1
ip-10-0-161-79.us-east-2.compute.internal    Ready                         worker   126m   v1.17.1
ip-10-0-171-176.us-east-2.compute.internal   Ready                         master   23h    v1.17.1

Expected results:
The machineconfigpool worker should be updated successfully.

Additional info:
1) Two clusteroperators are in False status because one worker is not ready and pods report "Insufficient cpu". I'm not sure whether this is the root cause of the machineconfigpool worker update failure, but either way it shouldn't happen.
mac:~ jianzhang$ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.4.0-0.nightly-2020-02-18-093529 True False False 23h cloud-credential 4.4.0-0.nightly-2020-02-18-093529 True False False 23h cluster-autoscaler 4.4.0-0.nightly-2020-02-18-093529 True False False 23h console 4.4.0-0.nightly-2020-02-18-093529 True False False 23h csi-snapshot-controller 4.4.0-0.nightly-2020-02-18-093529 True False False 23h dns 4.4.0-0.nightly-2020-02-18-093529 True False False 23h etcd 4.4.0-0.nightly-2020-02-18-093529 True False False 23h image-registry 4.4.0-0.nightly-2020-02-18-093529 True False False 81m ingress 4.4.0-0.nightly-2020-02-18-093529 True False False 80m insights 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-scheduler 4.4.0-0.nightly-2020-02-18-093529 True False False 23h kube-storage-version-migrator 4.4.0-0.nightly-2020-02-18-093529 False False False 81m machine-api 4.4.0-0.nightly-2020-02-18-093529 True False False 23h machine-config 4.4.0-0.nightly-2020-02-18-093529 True False False 23h marketplace 4.4.0-0.nightly-2020-02-18-093529 True False False 23h monitoring 4.4.0-0.nightly-2020-02-18-093529 False True True 75m network 4.4.0-0.nightly-2020-02-18-093529 True True True 23h node-tuning 4.4.0-0.nightly-2020-02-18-093529 True False False 23h openshift-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 5h22m openshift-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h openshift-samples 4.4.0-0.nightly-2020-02-18-093529 True False False 23h operator-lifecycle-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h operator-lifecycle-manager-catalog 4.4.0-0.nightly-2020-02-18-093529 True False False 23h operator-lifecycle-manager-packageserver 4.4.0-0.nightly-2020-02-18-093529 True False False 5h22m service-ca 4.4.0-0.nightly-2020-02-18-093529 True False False 23h service-catalog-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 23h service-catalog-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 23h storage 4.4.0-0.nightly-2020-02-18-093529 True False False 23h mac:~ jianzhang$ oc get pods -n openshift-kube-storage-version-migrator NAME READY STATUS RESTARTS AGE migrator-54b9f4568d-n67p9 0/1 Pending 0 83m mac:~ jianzhang$ oc describe pods -n openshift-kube-storage-version-migrator Name: migrator-54b9f4568d-n67p9 Namespace: openshift-kube-storage-version-migrator Priority: 0 Node: <none> Labels: app=migrator ... Node-Selectors: <none> Tolerations: node.kubernetes.io/memory-pressure:NoSchedule node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. mac:~ jianzhang$ oc describe pods alertmanager-main-0 -n openshift-monitoring Name: alertmanager-main-0 Namespace: openshift-monitoring Priority: 2000000000 ... 
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. Warning FailedScheduling <unknown> default-scheduler 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate. 2) Describe the failure node: mac:~ jianzhang$ oc describe nodes ip-10-0-158-243.us-east-2.compute.internal Name: ip-10-0-158-243.us-east-2.compute.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m4.large beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-east-2 failure-domain.beta.kubernetes.io/zone=us-east-2b kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-10-0-158-243 kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=m4.large node.openshift.io/os_id=rhcos topology.kubernetes.io/region=us-east-2 topology.kubernetes.io/zone=us-east-2b Annotations: machine.openshift.io/machine: openshift-machine-api/qe-jiazha5-5kmbv-worker-us-east-2b-bmtj2 machineconfiguration.openshift.io/currentConfig: rendered-worker-1176db00e4bede3824da402987ea9141 machineconfiguration.openshift.io/desiredConfig: rendered-worker-1176db00e4bede3824da402987ea9141 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Done volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Thu, 20 Feb 2020 11:46:56 +0800 Taints: <none> Unschedulable: false Lease: HolderIdentity: ip-10-0-158-243.us-east-2.compute.internal AcquireTime: <unset> RenewTime: Thu, 20 Feb 2020 13:40:51 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:46:56 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:46:56 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:46:56 +0800 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Thu, 20 Feb 2020 13:36:12 +0800 Thu, 20 Feb 2020 11:47:57 +0800 KubeletReady kubelet is posting ready status Addresses: InternalIP: 10.0.158.243 Hostname: ip-10-0-158-243.us-east-2.compute.internal InternalDNS: ip-10-0-158-243.us-east-2.compute.internal Capacity: attachable-volumes-aws-ebs: 39 cpu: 2 ephemeral-storage: 125277164Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 8161840Ki pods: 250 Allocatable: attachable-volumes-aws-ebs: 39 cpu: 1500m ephemeral-storage: 114381692328 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 7010864Ki pods: 250 System Info: Machine ID: 62f290223949478488f198cbb669ddff System UUID: ec29c5ea-eacc-4812-62a2-66f37028a88f Boot ID: 3a8d0e50-6948-4754-b2b4-ef28ae8c06af Kernel Version: 4.18.0-147.5.1.el8_1.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 44.81.202002180730-0 (Ootpa) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.17.0-4.dev.rhaos4.4.gitc3436cc.el8 Kubelet Version: v1.17.1 Kube-Proxy Version: v1.17.1 ProviderID: aws:///us-east-2b/i-0ba66391e2c029246 Non-terminated Pods: (20 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- 
openshift-cluster-node-tuning-operator tuned-cvvjz 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 114m openshift-csi-snapshot-controller csi-snapshot-controller-669fcbbb8f-8mqcp 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 84m openshift-dns dns-default-698pg 110m (7%) 0 (0%) 70Mi (1%) 512Mi (7%) 113m openshift-image-registry node-ca-j52ph 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 114m openshift-ingress router-default-5cc767646d-dswmf 100m (6%) 0 (0%) 256Mi (3%) 0 (0%) 23h openshift-machine-config-operator machine-config-daemon-p6bv5 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 113m openshift-marketplace redhat-marketplace-5b9dfc7d66-qllqt 10m (0%) 0 (0%) 100Mi (1%) 0 (0%) 86m openshift-marketplace redhat-operators-55f686f4d9-r7f9n 10m (0%) 0 (0%) 100Mi (1%) 0 (0%) 86m openshift-monitoring alertmanager-main-2 110m (7%) 100m (6%) 245Mi (3%) 25Mi (0%) 86m openshift-monitoring grafana-755b7df4f9-rv7cn 110m (7%) 0 (0%) 120Mi (1%) 0 (0%) 23h openshift-monitoring node-exporter-4wtgw 112m (7%) 0 (0%) 200Mi (2%) 0 (0%) 114m openshift-monitoring prometheus-adapter-5cd6485bbb-8gf6b 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 86m openshift-monitoring prometheus-k8s-1 480m (32%) 200m (13%) 1234Mi (18%) 50Mi (0%) 23h openshift-monitoring telemeter-client-7c6587467-5vrff 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 86m openshift-multus multus-c5vjz 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 114m openshift-operators cdi-operator-67887974b-lwtl5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 86m openshift-operators hco-operator-54cd7db78c-6m5hb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 86m openshift-operators virt-operator-546775946c-lz4lt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 86m openshift-sdn ovs-4wxw7 200m (13%) 0 (0%) 400Mi (5%) 0 (0%) 114m openshift-sdn sdn-84ffv 100m (6%) 0 (0%) 200Mi (2%) 0 (0%) 114m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 1442m (96%) 300m (20%) memory 3325Mi (48%) 587Mi (8%) ephemeral-storage 0 (0%) 0 (0%) attachable-volumes-aws-ebs 0 0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Starting 114m kubelet, ip-10-0-158-243.us-east-2.compute.internal Starting kubelet. 
Normal NodeHasSufficientMemory 114m (x2 over 114m) kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeHasSufficientMemory Normal NodeHasNoDiskPressure 114m (x2 over 114m) kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeHasNoDiskPressure Normal NodeHasSufficientPID 114m (x2 over 114m) kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeHasSufficientPID Normal NodeAllocatableEnforced 114m kubelet, ip-10-0-158-243.us-east-2.compute.internal Updated Node Allocatable limit across pods Normal NodeReady 113m kubelet, ip-10-0-158-243.us-east-2.compute.internal Node ip-10-0-158-243.us-east-2.compute.internal status is now: NodeReady 3) Describe the machineconfigpool worker mac:~ jianzhang$ oc get machineconfigpool worker -o yaml apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: creationTimestamp: "2020-02-19T06:07:52Z" generation: 3 labels: custom-kubelet: small-pods machineconfiguration.openshift.io/mco-built-in: "" name: worker resourceVersion: "650473" selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker uid: 30eb6291-9bea-4ef4-95c9-40ec289a2779 spec: configuration: name: rendered-worker-c9d267810c63b6a99557acc27bdfc847 source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-30eb6291-9bea-4ef4-95c9-40ec289a2779-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-30eb6291-9bea-4ef4-95c9-40ec289a2779-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh machineConfigSelector: matchLabels: machineconfiguration.openshift.io/role: worker nodeSelector: matchLabels: node-role.kubernetes.io/worker: "" paused: false status: conditions: - lastTransitionTime: "2020-02-19T06:08:24Z" message: "" reason: "" status: "False" type: RenderDegraded - lastTransitionTime: "2020-02-19T06:08:29Z" message: "" reason: "" status: "False" type: NodeDegraded - lastTransitionTime: "2020-02-19T06:08:29Z" message: "" reason: "" status: "False" type: Degraded - lastTransitionTime: "2020-02-20T04:15:01Z" message: "" reason: "" status: "False" type: Updated - lastTransitionTime: "2020-02-20T04:15:01Z" message: All nodes are updating to rendered-worker-c9d267810c63b6a99557acc27bdfc847 reason: "" status: "True" type: Updating configuration: name: rendered-worker-1176db00e4bede3824da402987ea9141 source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-30eb6291-9bea-4ef4-95c9-40ec289a2779-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh degradedMachineCount: 0 machineCount: 3 observedGeneration: 3 readyMachineCount: 0 unavailableMachineCount: 1 updatedMachineCount: 0
Some cluster operators (network, monitor) depends on each node healthy. Due to there is a NotReady node, they are in unsuccessful status. # oc get co/monitoring -oyaml message: 'Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 8, updated: 8, ready: 7, unavailable: 1)' reason: UpdatingnodeExporterFailed status: "True" type: Degraded mac:~ jianzhang$ oc get co network -o yaml apiVersion: config.openshift.io/v1 kind: ClusterOperator metadata: annotations: network.operator.openshift.io/last-seen-state: '{"DaemonsetStates":[{"Namespace":"openshift-sdn","Name":"sdn","LastSeenStatus":{"currentNumberScheduled":8,"numberMisscheduled":0,"desiredNumberScheduled":8,"numberReady":7,"observedGeneration":1,"updatedNumberScheduled":8,"numberAvailable":7,"numberUnavailable":1},"LastChangeTime":"2020-02-20T06:25:32.736963814Z"},{"Namespace":"openshift-multus","Name":"multus","LastSeenStatus":{"currentNumberScheduled":8,"numberMisscheduled":0,"desiredNumberScheduled":8,"numberReady":7,"observedGeneration":1,"updatedNumberScheduled":8,"numberAvailable":7,"numberUnavailable":1},"LastChangeTime":"2020-02-20T06:25:32.319305138Z"},{"Namespace":"openshift-sdn","Name":"ovs","LastSeenStatus":{"currentNumberScheduled":8,"numberMisscheduled":0,"desiredNumberScheduled":8,"numberReady":7,"observedGeneration":1,"updatedNumberScheduled":8,"numberAvailable":7,"numberUnavailable":1},"LastChangeTime":"2020-02-20T06:25:42.025978461Z"}],"DeploymentStates":[]}' creationTimestamp: "2020-02-19T06:03:12Z" generation: 1 name: network ... - lastTransitionTime: "2020-02-20T04:15:56Z" message: |- DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes) DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes) DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes) mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-135-29.us-east-2.compute.internal Ready master 25h v1.17.1 ip-10-0-136-239.us-east-2.compute.internal NotReady,SchedulingDisabled worker 25h v1.17.1 ip-10-0-148-217.us-east-2.compute.internal Ready worker 55m v1.17.1 ip-10-0-153-231.us-east-2.compute.internal Ready master 25h v1.17.1 ip-10-0-158-243.us-east-2.compute.internal Ready worker 3h32m v1.17.1 ip-10-0-158-34.us-east-2.compute.internal Ready worker 55m v1.17.1 ip-10-0-161-79.us-east-2.compute.internal Ready worker 3h32m v1.17.1 ip-10-0-171-176.us-east-2.compute.internal Ready master 25h v1.17.1 mac:~ jianzhang$ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.4.0-0.nightly-2020-02-18-093529 True False False 24h cloud-credential 4.4.0-0.nightly-2020-02-18-093529 True False False 25h cluster-autoscaler 4.4.0-0.nightly-2020-02-18-093529 True False False 24h console 4.4.0-0.nightly-2020-02-18-093529 True False False 24h csi-snapshot-controller 4.4.0-0.nightly-2020-02-18-093529 True False False 24h dns 4.4.0-0.nightly-2020-02-18-093529 True False False 25h etcd 4.4.0-0.nightly-2020-02-18-093529 True False False 24h image-registry 4.4.0-0.nightly-2020-02-18-093529 True False False 176m ingress 4.4.0-0.nightly-2020-02-18-093529 True False False 175m insights 4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-scheduler 
4.4.0-0.nightly-2020-02-18-093529 True False False 25h kube-storage-version-migrator 4.4.0-0.nightly-2020-02-18-093529 True False False 45m machine-api 4.4.0-0.nightly-2020-02-18-093529 True False False 25h machine-config 4.4.0-0.nightly-2020-02-18-093529 True False False 25h marketplace 4.4.0-0.nightly-2020-02-18-093529 True False False 25h monitoring 4.4.0-0.nightly-2020-02-18-093529 False True True 170m network 4.4.0-0.nightly-2020-02-18-093529 True True True 25h node-tuning 4.4.0-0.nightly-2020-02-18-093529 True False False 25h openshift-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 6h57m openshift-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h openshift-samples 4.4.0-0.nightly-2020-02-18-093529 True False False 24h operator-lifecycle-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h operator-lifecycle-manager-catalog 4.4.0-0.nightly-2020-02-18-093529 True False False 25h operator-lifecycle-manager-packageserver 4.4.0-0.nightly-2020-02-18-093529 True False False 6h57m service-ca 4.4.0-0.nightly-2020-02-18-093529 True False False 25h service-catalog-apiserver 4.4.0-0.nightly-2020-02-18-093529 True False False 25h service-catalog-controller-manager 4.4.0-0.nightly-2020-02-18-093529 True False False 25h storage 4.4.0-0.nightly-2020-02-18-093529 True False False 25h
Besides, when I create a new pod, it still is scheduled to this NotReady node: ip-10-0-136-239.us-east-2.compute.internal mac:~ jianzhang$ oc get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES certified-operators-8d7796766-f229f 1/1 Running 0 5h41m 10.129.2.6 ip-10-0-161-79.us-east-2.compute.internal <none> <none> community-operators-9d887b488-wvlc9 1/1 Running 0 5h41m 10.129.2.11 ip-10-0-161-79.us-east-2.compute.internal <none> <none> marketplace-operator-6d9f75cc47-7lgxq 1/1 Running 0 27h 10.129.0.20 ip-10-0-153-231.us-east-2.compute.internal <none> <none> poll-test-ksbx7 0/1 Pending 0 10m <none> ip-10-0-136-239.us-east-2.compute.internal <none> <none> redhat-marketplace-5b9dfc7d66-qllqt 1/1 Running 0 5h41m 10.128.2.11 ip-10-0-158-243.us-east-2.compute.internal <none> <none> redhat-operators-b747df6b6-9pzgf 1/1 Running 0 42m 10.130.2.5 ip-10-0-158-34.us-east-2.compute.internal <none> <none> mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-135-29.us-east-2.compute.internal Ready master 27h v1.17.1 ip-10-0-136-239.us-east-2.compute.internal NotReady,SchedulingDisabled worker 27h v1.17.1 ip-10-0-148-217.us-east-2.compute.internal Ready worker 3h31m v1.17.1 ip-10-0-153-231.us-east-2.compute.internal Ready master 27h v1.17.1 ip-10-0-158-243.us-east-2.compute.internal Ready worker 6h9m v1.17.1 ip-10-0-158-34.us-east-2.compute.internal Ready worker 3h31m v1.17.1 ip-10-0-161-79.us-east-2.compute.internal Ready worker 6h9m v1.17.1 ip-10-0-171-176.us-east-2.compute.internal Ready master 27h v1.17.1 mac:~ jianzhang$ oc describe pods poll-test-ksbx7 Name: poll-test-ksbx7 Namespace: openshift-marketplace Priority: 0 Node: ip-10-0-136-239.us-east-2.compute.internal/ Labels: olm.catalogSource=poll-test Annotations: openshift.io/scc: anyuid Status: Pending IP: IPs: <none> Containers: registry-server: Image: quay.io/my-catalogs/my-catalog:master Port: 50051/TCP Host Port: 0/TCP Limits: cpu: 100m memory: 100Mi Requests: cpu: 10m memory: 50Mi Liveness: exec [grpc_health_probe -addr=localhost:50051] delay=10s timeout=1s period=10s #success=1 #failure=3 Readiness: exec [grpc_health_probe -addr=localhost:50051] delay=5s timeout=5s period=10s #success=1 #failure=3 Environment: <none> Mounts: /var/run/secrets/kubernetes.io/serviceaccount from default-token-z78cv (ro) Conditions: Type Status PodScheduled True Volumes: default-token-z78cv: Type: Secret (a volume populated by a Secret) SecretName: default-token-z78cv Optional: false QoS Class: Burstable Node-Selectors: beta.kubernetes.io/os=linux Tolerations: Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled <unknown> default-scheduler Successfully assigned openshift-marketplace/poll-test-ksbx7 to ip-10-0-136-239.us-east-2.compute.internal cannot gather this node logs. mac:~ jianzhang$ oc adm node-logs ip-10-0-136-239.us-east-2.compute.internal error: the server is currently unable to handle the request Error trying to reach service: 'dial tcp 10.0.136.239:10250: connect: connection refused'
mac:~ jianzhang$ oc adm cordon ip-10-0-136-239.us-east-2.compute.internal
node/ip-10-0-136-239.us-east-2.compute.internal already cordoned
Backported to 4.4 - the CPU reservations should be back to where they were https://github.com/openshift/machine-config-operator/pull/1476
Cluster version is 4.4.0-0.nightly-2020-03-08-235004 1, Install CNV operator and make sure they are running on the same worker. As follows mac:~ jianzhang$ oc adm cordon ip-10-0-153-230.us-east-2.compute.internal ip-10-0-168-220.us-east-2.compute.internal node/ip-10-0-153-230.us-east-2.compute.internal cordoned node/ip-10-0-168-220.us-east-2.compute.internal cordoned mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-131-77.us-east-2.compute.internal Ready worker 37m v1.17.1 ip-10-0-133-75.us-east-2.compute.internal Ready master 45m v1.17.1 ip-10-0-146-18.us-east-2.compute.internal Ready master 45m v1.17.1 ip-10-0-153-230.us-east-2.compute.internal Ready,SchedulingDisabled worker 39m v1.17.1 ip-10-0-168-220.us-east-2.compute.internal Ready,SchedulingDisabled worker 37m v1.17.1 ip-10-0-173-167.us-east-2.compute.internal Ready master 45m v1.17.1 mac:~ jianzhang$ oc get csv -n openshift-operators NAME DISPLAY VERSION REPLACES PHASE kubevirt-hyperconverged-operator.v2.2.0 Container-native virtualization 2.2.0 Succeeded mac:~ jianzhang$ oc get pods -n openshift-operators -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES cdi-operator-67887974b-vm85q 1/1 Running 0 4m34s 10.128.2.13 ip-10-0-131-77.us-east-2.compute.internal <none> <none> cluster-network-addons-operator-7c95c8659b-wkqwm 1/1 Running 0 4m34s 10.128.2.10 ip-10-0-131-77.us-east-2.compute.internal <none> <none> hco-operator-54cd7db78c-qm76t 1/1 Running 0 4m34s 10.128.2.11 ip-10-0-131-77.us-east-2.compute.internal <none> <none> hostpath-provisioner-operator-fb9cbc8b7-bsssd 1/1 Running 0 4m34s 10.128.2.16 ip-10-0-131-77.us-east-2.compute.internal <none> <none> kubevirt-ssp-operator-7885c98fb9-b6ztk 1/1 Running 0 4m34s 10.128.2.14 ip-10-0-131-77.us-east-2.compute.internal <none> <none> node-maintenance-operator-99556c65f-d6vzn 1/1 Running 0 4m34s 10.130.0.41 ip-10-0-146-18.us-east-2.compute.internal <none> <none> virt-operator-546775946c-5jhlj 1/1 Running 0 4m34s 10.128.2.15 ip-10-0-131-77.us-east-2.compute.internal <none> <none> virt-operator-546775946c-689dd 1/1 Running 0 4m34s 10.128.2.12 ip-10-0-131-77.us-east-2.compute.internal <none> <none> mac:~ jianzhang$ oc describe nodes ip-10-0-131-77.us-east-2.compute.internal Name: ip-10-0-131-77.us-east-2.compute.internal ... Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 1392m (92%) 200m (13%) memory 3010Mi (43%) 562Mi (8%) ephemeral-storage 0 (0%) 0 (0%) attachable-volumes-aws-ebs 0 0 2, Change back to uncordon. mac:~ jianzhang$ oc adm uncordon ip-10-0-153-230.us-east-2.compute.internal node/ip-10-0-153-230.us-east-2.compute.internal uncordoned mac:~ jianzhang$ oc adm uncordon ip-10-0-168-220.us-east-2.compute.internal node/ip-10-0-168-220.us-east-2.compute.internal uncordoned mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-131-77.us-east-2.compute.internal Ready worker 49m v1.17.1 ip-10-0-133-75.us-east-2.compute.internal Ready master 56m v1.17.1 ip-10-0-146-18.us-east-2.compute.internal Ready master 57m v1.17.1 ip-10-0-153-230.us-east-2.compute.internal Ready worker 51m v1.17.1 ip-10-0-168-220.us-east-2.compute.internal Ready worker 49m v1.17.1 ip-10-0-173-167.us-east-2.compute.internal Ready master 57m v1.17.1 3, Create the KubeletConfig object. 
mac:~ jianzhang$ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-e3490b51f062c3a94b508209b44089f6 True False False 3 3 3 0 56m worker rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 True False False 3 3 3 0 56m mac:~ jianzhang$ oc label machineconfigpool worker custom-kubelet=small-pods machineconfigpool.machineconfiguration.openshift.io/worker labeled mac:~ jianzhang$ cat pod_evication.yaml apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: worker-kubeconfig spec: machineConfigPoolSelector: matchLabels: custom-kubelet: small-pods kubeletConfig: evictionSoft: memory.available: "90%" nodefs.available: "90%" nodefs.inodesFree: "90%" evictionPressureTransitionPeriod: 0s mac:~ jianzhang$ oc create -f pod_evication.yaml The KubeletConfig "worker-kubeconfig" is invalid: * spec.kubeletConfig.apiVersion: Required value: must not be empty * spec.kubeletConfig.kind: Required value: must not be empty Report a bug for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1811493 4, I work around the above issue, but, the node ip-10-0-153-230.us-east-2.compute.internal is in NotReady status for hours. machineconfigpool work still in 'Updating' status. Change the bug status to ASSIGNED. mac:~ jianzhang$ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-e3490b51f062c3a94b508209b44089f6 True False False 3 3 3 0 94m worker rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 False True False 3 0 0 0 94m mac:~ jianzhang$ date Mon Mar 9 11:58:34 CST 2020 mac:~ jianzhang$ oc get machineconfigpool NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-e3490b51f062c3a94b508209b44089f6 True False False 3 3 3 0 4h22m worker rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 False True False 3 0 0 0 4h22m mac:~ jianzhang$ date Mon Mar 9 14:41:22 CST 2020 mac:~ jianzhang$ oc get nodes NAME STATUS ROLES AGE VERSION ip-10-0-131-77.us-east-2.compute.internal Ready worker 4h19m v1.17.1 ip-10-0-133-75.us-east-2.compute.internal Ready master 4h26m v1.17.1 ip-10-0-146-18.us-east-2.compute.internal Ready master 4h27m v1.17.1 ip-10-0-153-230.us-east-2.compute.internal NotReady,SchedulingDisabled worker 4h21m v1.17.1 ip-10-0-168-220.us-east-2.compute.internal Ready worker 4h19m v1.17.1 ip-10-0-173-167.us-east-2.compute.internal Ready master 4h27m v1.17.1 mac:~ jianzhang$ oc adm node-logs ip-10-0-153-230.us-east-2.compute.internal error: the server is currently unable to handle the request Error trying to reach service: 'dial tcp 10.0.153.230:10250: connect: connection refused' mac:~ jianzhang$ oc describe nodes ip-10-0-153-230.us-east-2.compute.internal Name: ip-10-0-153-230.us-east-2.compute.internal Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m4.large beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-east-2 failure-domain.beta.kubernetes.io/zone=us-east-2b kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-10-0-153-230 kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=m4.large node.openshift.io/os_id=rhcos topology.kubernetes.io/region=us-east-2 topology.kubernetes.io/zone=us-east-2b Annotations: machine.openshift.io/machine: openshift-machine-api/qe-jiazha39-m47lh-worker-us-east-2b-wkft5 
machineconfiguration.openshift.io/currentConfig: rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 machineconfiguration.openshift.io/desiredConfig: rendered-worker-8b585c5657a1115cd7e3c19c03e2070d machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Working volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Mon, 09 Mar 2020 10:22:44 +0800 Taints: node.kubernetes.io/unreachable:NoExecute node.kubernetes.io/unreachable:NoSchedule node.kubernetes.io/unschedulable:NoSchedule Unschedulable: true Lease: HolderIdentity: ip-10-0-153-230.us-east-2.compute.internal AcquireTime: <unset> RenewTime: Mon, 09 Mar 2020 11:51:09 +0800 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. DiskPressure Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. PIDPressure Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. Ready Unknown Mon, 09 Mar 2020 11:46:18 +0800 Mon, 09 Mar 2020 11:51:50 +0800 NodeStatusUnknown Kubelet stopped posting node status. Addresses: InternalIP: 10.0.153.230 Hostname: ip-10-0-153-230.us-east-2.compute.internal InternalDNS: ip-10-0-153-230.us-east-2.compute.internal Capacity: attachable-volumes-aws-ebs: 39 cpu: 2 ephemeral-storage: 125277164Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 8161848Ki pods: 250 Allocatable: attachable-volumes-aws-ebs: 39 cpu: 1500m ephemeral-storage: 114381692328 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 7010872Ki pods: 250 System Info: Machine ID: 58a6afd07a3348f9ba5c50892a718dd4 System UUID: ec2ac93f-0a21-6ff0-d744-acdff0bfdc64 Boot ID: 58f62364-7da5-4834-a02d-c2eda73c6c39 Kernel Version: 4.18.0-147.5.1.el8_1.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 44.81.202003081930-0 (Ootpa) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8 Kubelet Version: v1.17.1 Kube-Proxy Version: v1.17.1 ProviderID: aws:///us-east-2b/i-0a675ea932ed2145b Non-terminated Pods: (8 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- openshift-cluster-node-tuning-operator tuned-lrs2f 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h23m openshift-dns dns-default-c8jnb 110m (7%) 0 (0%) 70Mi (1%) 512Mi (7%) 4h22m openshift-image-registry node-ca-pp7zd 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h23m openshift-machine-config-operator machine-config-daemon-ms5tb 40m (2%) 0 (0%) 100Mi (1%) 0 (0%) 4h22m openshift-monitoring node-exporter-kmgjp 112m (7%) 0 (0%) 200Mi (2%) 0 (0%) 4h22m openshift-multus multus-69ctl 10m (0%) 0 (0%) 150Mi (2%) 0 (0%) 4h23m openshift-sdn ovs-b5x9c 200m (13%) 0 (0%) 400Mi (5%) 0 (0%) 4h23m openshift-sdn sdn-4w5vv 100m (6%) 0 (0%) 200Mi (2%) 0 (0%) 4h23m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) 
Resource Requests Limits -------- -------- ------ cpu 592m (39%) 0 (0%) memory 1180Mi (17%) 512Mi (7%) ephemeral-storage 0 (0%) 0 (0%) attachable-volumes-aws-ebs 0 0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal NodeNotSchedulable 175m (x2 over 3h44m) kubelet, ip-10-0-153-230.us-east-2.compute.internal Node ip-10-0-153-230.us-east-2.compute.internal status is now: NodeNotSchedulable mac:~ jianzhang$ oc get machineconfigpool worker -o yaml apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfigPool metadata: creationTimestamp: "2020-03-09T02:18:17Z" generation: 3 labels: custom-kubelet: small-pods machineconfiguration.openshift.io/mco-built-in: "" name: worker resourceVersion: "49930" selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker uid: e548efb3-8f21-4385-b4a3-113d8b055ed5 spec: configuration: name: rendered-worker-8b585c5657a1115cd7e3c19c03e2070d source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-e548efb3-8f21-4385-b4a3-113d8b055ed5-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-e548efb3-8f21-4385-b4a3-113d8b055ed5-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh machineConfigSelector: matchLabels: machineconfiguration.openshift.io/role: worker nodeSelector: matchLabels: node-role.kubernetes.io/worker: "" paused: false status: conditions: - lastTransitionTime: "2020-03-09T02:18:39Z" message: "" reason: "" status: "False" type: RenderDegraded - lastTransitionTime: "2020-03-09T02:18:44Z" message: "" reason: "" status: "False" type: NodeDegraded - lastTransitionTime: "2020-03-09T02:18:44Z" message: "" reason: "" status: "False" type: Degraded - lastTransitionTime: "2020-03-09T03:51:01Z" message: "" reason: "" status: "False" type: Updated - lastTransitionTime: "2020-03-09T03:51:01Z" message: All nodes are updating to rendered-worker-8b585c5657a1115cd7e3c19c03e2070d reason: "" status: "True" type: Updating configuration: name: rendered-worker-9bf0b68b8e317cb5aaccf7b61b2d13f5 source: - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 00-worker - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-container-runtime - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 01-worker-kubelet - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-e548efb3-8f21-4385-b4a3-113d8b055ed5-registries - apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig name: 99-worker-ssh degradedMachineCount: 0 machineCount: 3 observedGeneration: 3 readyMachineCount: 0 unavailableMachineCount: 1 updatedMachineCount: 0
Hi, Ryan,

For the 'The KubeletConfig "worker-kubeconfig" is invalid' issue, I reported bug 1811493. I think bug 1811493 is a duplicate of bug 1811212, not of this bug. Reopening this bug since the issue still exists.
Hi Jian,

When you specify 'evictionSoft' it is necessary to specify 'evictionSoftGracePeriod' too. I modified your KubeletConfig to add 'evictionSoftGracePeriod', followed the steps below, and everything worked just fine:

1. Install OCP 4.4.
2. Install an operator on the Web console, for example, CNV.
3. oc label machineconfigpool worker custom-kubelet=small-pods
4. $ cat kubelet-fix.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig-fix
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods
  kubeletConfig:
    evictionSoft:
      memory.available: "90%"
      nodefs.available: "90%"
      nodefs.inodesFree: "90%"
    evictionSoftGracePeriod:
      memory.available: "1h"
      nodefs.available: "1h"
      nodefs.inodesFree: "1h"
    evictionPressureTransitionPeriod: 0s

Please let me know if that works for you.
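As a small follow-up sketch (assuming the corrected manifest above is saved as kubelet-fix.yaml; <worker-node> is a placeholder for any node in the worker pool), the rollout can be applied and watched like this:

# Apply the corrected KubeletConfig
oc create -f kubelet-fix.yaml

# Watch the worker pool; UPDATED should return to True once every worker
# has picked up the new rendered-worker-* configuration and rebooted
oc get machineconfigpool worker -w

# Per-node progress is visible in the MCO annotations
# (currentConfig and desiredConfig converge when a node is done)
oc describe node <worker-node> | grep machineconfiguration.openshift.io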
Hi Harshal,

Thanks for the information! It works after setting the `evictionSoftGracePeriod` field. If `evictionSoftGracePeriod` is mandatory when using soft eviction, then creating a kubeletconfig object that lacks `evictionSoftGracePeriod` should be rejected. Modifying the bug title and reopening it.
I will look into it in the coming sprint.
I have created https://github.com/openshift/machine-config-operator/pull/1880 to address this issue.
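With that validation in place, a KubeletConfig that defines evictionSoft without a matching evictionSoftGracePeriod should be flagged instead of silently wedging the worker pool. A minimal way to check this (assuming the object is named worker-kubeconfig as in the earlier examples):

# The validation error is reported as a Failure condition on the KubeletConfig status
oc get kubeletconfig worker-kubeconfig \
  -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'

# The worker pool should remain on its current rendered config rather than
# getting stuck in Updating
oc get machineconfigpool worker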
verified with version : 4.6.0-0.nightly-2020-07-05-192128 $ oc get kubeletconfig worker-kubeconfig -o yaml apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: creationTimestamp: "2020-07-06T04:04:58Z" generation: 3 managedFields: - apiVersion: machineconfiguration.openshift.io/v1 fieldsType: FieldsV1 fieldsV1: f:status: .: {} f:conditions: {} manager: machine-config-controller operation: Update time: "2020-07-06T04:08:40Z" - apiVersion: machineconfiguration.openshift.io/v1 fieldsType: FieldsV1 fieldsV1: f:spec: .: {} f:kubeletConfig: .: {} f:evictionPressureTransitionPeriod: {} f:evictionSoft: {} f:evictionSoftGracePeriod: {} f:machineConfigPoolSelector: .: {} f:matchLabels: .: {} f:custom-kubelet: {} manager: oc operation: Update time: "2020-07-06T04:08:40Z" name: worker-kubeconfig resourceVersion: "64208" selfLink: /apis/machineconfiguration.openshift.io/v1/kubeletconfigs/worker-kubeconfig uid: 4e8a2b9c-443c-4c57-9531-e1bd3f8d3ffd spec: kubeletConfig: evictionPressureTransitionPeriod: 0s evictionSoft: memory.available: 90% nodefs.available: 90% nodefs.inodesFree: 90% evictionSoftGracePeriod: memory.available: 1h nodefs.available: 1h machineConfigPoolSelector: matchLabels: custom-kubelet: small-pods status: conditions: - lastTransitionTime: "2020-07-06T04:04:58Z" message: 'Error: KubeletConfiguration: EvictionSoftGracePeriod must be set when evictionSoft is defined, evictionSoft: map[memory.available:90% nodefs.available:90% nodefs.inodesFree:90%]' status: "False" type: Failure - lastTransitionTime: "2020-07-06T04:07:37Z" message: 'Error: KubeletConfiguration: evictionSoft[nodefs.available] is defined but EvictionSoftGracePeriod[nodefs.available] is not set' status: "False" type: Failure - lastTransitionTime: "2020-07-06T04:08:40Z" message: 'Error: KubeletConfiguration: evictionSoft[nodefs.inodesFree] is defined but EvictionSoftGracePeriod[nodefs.inodesFree] is not set' status: "False" type: Failure
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196