Description of problem:
After a 'node crash' test on AWS, the machine-config operator is not available. The worker node is in NotReady status.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-25-093627

How reproducible:
So far one run - 100%

Steps to Reproduce:
1. Clone the git repository:
$ git clone https://github.com/openshift-scale/kraken.git
$ cd kraken
2. Edit the configuration file:
$ vim config/config.yaml
   - update the kubeconfig path
   - from `chaos_scenarios` leave only the node scenarios
3. Run kraken:
$ python3 run_kraken.py --config config/config.yaml

This is an automated crash test. In general it will run the following command on workers (a rough manual approximation is sketched below, after the Expected results):
$ oc debug node/$worker_node -- chroot /host -- dd if=/dev/urandom of=/proc/sysrq-trigger

Actual results:
$ oc get nodes
NAME                                         STATUS     ROLES    AGE     VERSION
ip-10-0-139-15.us-east-2.compute.internal    Ready      master   6h3m    v1.22.0-rc.0+5c2f7cd
ip-10-0-149-240.us-east-2.compute.internal   NotReady   worker   5h52m   v1.22.0-rc.0+5c2f7cd
ip-10-0-177-74.us-east-2.compute.internal    Ready      worker   5h51m   v1.22.0-rc.0+5c2f7cd
ip-10-0-182-231.us-east-2.compute.internal   Ready      master   6h3m    v1.22.0-rc.0+5c2f7cd
ip-10-0-200-25.us-east-2.compute.internal    Ready      worker   5h51m   v1.22.0-rc.0+5c2f7cd
ip-10-0-201-102.us-east-2.compute.internal   Ready      master   6h3m    v1.22.0-rc.0+5c2f7cd

$ oc get co machine-config
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.9.0-0.nightly-2021-08-25-093627   False       False         True       36m     Cluster not available for 4.9.0-0.nightly-2021-08-25-093627

$ oc describe node ip-10-0-149-240.us-east-2.compute.internal
Name:               ip-10-0-149-240.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-149-240.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.large
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-019458366e30d5e26"}
                    k8s.ovn.org/host-addresses: ["10.0.149.240"]
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-149-240.us-east-2.compute.internal","mac-address":"02:81:9b:f4:23:fc","ip-addres...
                    k8s.ovn.org/node-chassis-id: 10079734-5849-4928-8c5c-07a366df1c47
                    k8s.ovn.org/node-mgmt-port-mac-address: 86:3e:3a:b8:d2:22
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.149.240/19"}
                    k8s.ovn.org/node-subnets: {"default":"10.131.0.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/skordas826a-nwzxp-worker-us-east-2a-cq88n
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d8d7f496556446a0a6443be920044ad3
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-d8d7f496556446a0a6443be920044ad3
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 26 Aug 2021 07:33:46 -0400
Taints:             node.kubernetes.io/unreachable:NoExecute
                    k8s.ovn.org/network-unavailable:NoSchedule
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-149-240.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Thu, 26 Aug 2021 12:41:25 -0400
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.149.240
  Hostname:     ip-10-0-149-240.us-east-2.compute.internal
  InternalDNS:  ip-10-0-149-240.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           125293548Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7935212Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1500m
  ephemeral-storage:           115470533646
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      6784236Ki
  pods:                        250
System Info:
  Machine ID:                 ec291f631345c087942a9fea8fe2a126
  System UUID:                ec291f63-1345-c087-942a-9fea8fe2a126
  Boot ID:                    c8dc3152-f60e-48b7-83ec-a2bd57720055
  Kernel Version:             4.18.0-305.12.1.el8_4.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 49.84.202108221651-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.22.0-53.rhaos4.9.git2d289a2.el8
  Kubelet Version:            v1.22.0-rc.0+5c2f7cd
  Kube-Proxy Version:         v1.22.0-rc.0+5c2f7cd
ProviderID:                   aws:///us-east-2a/i-019458366e30d5e26
Non-terminated Pods:          (25 in total)
  Namespace                                Name                                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                                ----                                             ------------  ----------  ---------------  -------------  ---
  default                                  ip-10-0-149-240us-east-2computeinternal-debug   0 (0%)        0 (0%)      0 (0%)           0 (0%)         48m
  openshift-cluster-csi-drivers            aws-ebs-csi-driver-node-wch9k                    30m (2%)      0 (0%)      150Mi (2%)       0 (0%)         5h52m
  openshift-cluster-node-tuning-operator   tuned-jjvtw                                      10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         5h52m
  openshift-dns                            dns-default-dcxtb                                60m (4%)      0 (0%)      110Mi (1%)       0 (0%)         5h51m
  openshift-dns                            node-resolver-pvbxx                              5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         5h52m
  openshift-image-registry                 image-registry-568957b9d6-rdkdx                  100m (6%)     0 (0%)      256Mi (3%)       0 (0%)         5h52m
  openshift-image-registry                 node-ca-875ck                                    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         5h52m
  openshift-ingress-canary                 ingress-canary-8zj6c                             10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         5h51m
  openshift-ingress                        router-default-65bdc775fd-dwrbr                  100m (6%)     0 (0%)      256Mi (3%)       0 (0%)         5h51m
  openshift-machine-config-operator        machine-config-daemon-ngrhx                      40m (2%)      0 (0%)      100Mi (1%)       0 (0%)         5h52m
  openshift-marketplace                    certified-operators-f6smg                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         68m
  openshift-marketplace                    community-operators-rrmq2                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         5h59m
  openshift-marketplace                    redhat-marketplace-vck2g                         10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         5h59m
  openshift-marketplace                    redhat-operators-n2kvh                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         92m
  openshift-monitoring                     alertmanager-main-2                              8m (0%)       0 (0%)      105Mi (1%)       0 (0%)         5h49m
  openshift-monitoring                     kube-state-metrics-59b87859b8-kgw7r              4m (0%)       0 (0%)      110Mi (1%)       0 (0%)         6h
  openshift-monitoring                     node-exporter-qm46v                              9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         5h52m
  openshift-monitoring                     openshift-state-metrics-66585c8c7c-hdv4x         3m (0%)       0 (0%)      72Mi (1%)        0 (0%)         6h
  openshift-monitoring                     telemeter-client-6bd9bc5f84-557hq                3m (0%)       0 (0%)      70Mi (1%)        0 (0%)         6h
  openshift-multus                         multus-additional-cni-plugins-pg4qm              10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         5h52m
  openshift-multus                         multus-bx467                                     10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         5h52m
  openshift-multus                         network-metrics-daemon-c447r                     20m (1%)      0 (0%)      120Mi (1%)       0 (0%)         5h52m
  openshift-network-diagnostics            network-check-source-75749bc6b4-c8jrf            10m (0%)      0 (0%)      40Mi (0%)        0 (0%)         6h4m
  openshift-network-diagnostics            network-check-target-tm2fs                       10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         5h52m
  openshift-ovn-kubernetes                 ovnkube-node-p2bmb                               40m (2%)      0 (0%)      640Mi (9%)       0 (0%)         5h52m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         532m (35%)    0 (0%)
  memory                      2467Mi (37%)  0 (0%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type     Reason                   Age                From     Message
  ----     ------                   ----               ----     -------
  Normal   Starting                 49m                kubelet  Starting kubelet.
  Normal   NodeAllocatableEnforced  49m                kubelet  Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  49m (x2 over 49m)  kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    49m (x2 over 49m)  kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     49m (x2 over 49m)  kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Warning  Rebooted                 49m                kubelet  Node ip-10-0-149-240.us-east-2.compute.internal has been rebooted, boot id: 19ae10f2-e530-4b99-b4e5-0ab011386454
  Normal   NodeReady                49m                kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeReady
  Normal   Starting                 45m                kubelet  Starting kubelet.
  Normal   NodeAllocatableEnforced  45m                kubelet  Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  45m (x2 over 45m)  kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    45m (x2 over 45m)  kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     45m (x2 over 45m)  kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Warning  Rebooted                 45m                kubelet  Node ip-10-0-149-240.us-east-2.compute.internal has been rebooted, boot id: c8dc3152-f60e-48b7-83ec-a2bd57720055
  Normal   NodeReady                45m                kubelet  Node ip-10-0-149-240.us-east-2.compute.internal status is now: NodeReady

$ oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-08-26T11:20:51Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
        f:ownerReferences:
          .:
          k:{"uid":"4ae87e73-9fde-4bb0-990c-c6045f5299c6"}:
      f:spec:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-08-26T11:20:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
    Manager:      cluster-version-operator
    Operation:    Update
    Subresource:  status
    Time:         2021-08-26T11:20:52Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:      machine-config-operator
    Operation:    Update
    Subresource:  status
    Time:         2021-08-26T11:28:22Z
  Owner References:
    API Version:  config.openshift.io/v1
    Kind:         ClusterVersion
    Name:         version
    UID:          4ae87e73-9fde-4bb0-990c-c6045f5299c6
  Resource Version:  139116
  UID:               512d0dab-67b1-452b-bce9-834c773d47fb
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-08-26T11:28:23Z
    Message:               Cluster version is 4.9.0-0.nightly-2021-08-25-093627
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-08-26T16:49:53Z
    Message:               One or more machine config pools are updating, please see `oc get mcp` for further details
    Reason:                PoolUpdating
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2021-08-26T16:49:53Z
    Message:               Failed to resync 4.9.0-0.nightly-2021-08-25-093627 because: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-08-26T16:49:53Z
    Message:               Cluster not available for 4.9.0-0.nightly-2021-08-25-093627
    Status:                False
    Type:                  Available
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-099495b629c595198096d3816ac65c45
    Worker:  3 (ready 2) out of 3 nodes are updating to latest configuration rendered-worker-d8d7f496556446a0a6443be920044ad3
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.9.0-0.nightly-2021-08-25-093627
Events:  <none>

Expected results:
Node should reboot and be ready.
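For reference, a rough manual approximation of the crash step that kraken performs, plus the checks run afterwards (a sketch only; the label selector, the WORKER variable, and the follow-up commands are illustrative and not taken from the kraken scenario):

$ # pick one worker node to crash (kraken selects nodes according to its scenario config)
$ WORKER=$(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}')
$ # writing random bytes to /proc/sysrq-trigger fires random sysrq actions; in practice this crashes/reboots the node
$ oc debug node/$WORKER -- chroot /host -- dd if=/dev/urandom of=/proc/sysrq-trigger
$ # the node is expected to go NotReady, reboot, and come back Ready
$ oc get nodes -w
$ # afterwards, check the operator and the worker pool that the degraded message points at
$ oc get co machine-config
$ oc get mcp worker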
The kubelet status from your node describe makes it look like you have more problems than just the MCO being degraded, and that node ip-10-0-149-240.us-east-2.compute.internal might not be okay:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Thu, 26 Aug 2021 12:41:25 -0400   Thu, 26 Aug 2021 12:42:08 -0400   NodeStatusUnknown   Kubelet stopped posting node status.

Also, with the exception of tuned, it looks like none of your pods on ip-10-0-149-240.us-east-2.compute.internal are ready:

[jkyros@jkyros-t590 masters]$ omg get pods -A -o wide | grep 'ip-10-0-149-240.us-east-2.compute.internal'
default                                  ip-10-0-149-240us-east-2computeinternal-debug   0/1  Pending  0  35m    10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node-wch9k                    0/3  Running  2  5h41m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-cluster-node-tuning-operator   tuned-jjvtw                                      1/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-dns                            dns-default-dcxtb                                0/2  Running  2  5h39m  ip-10-0-149-240.us-east-2.compute.internal
openshift-dns                            node-resolver-pvbxx                              0/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-image-registry                 image-registry-568957b9d6-rdkdx                  0/1  Running  2  5h40m  ip-10-0-149-240.us-east-2.compute.internal
openshift-image-registry                 node-ca-875ck                                    0/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-ingress-canary                 ingress-canary-8zj6c                             0/1  Running  2  5h38m  ip-10-0-149-240.us-east-2.compute.internal
openshift-ingress                        router-default-65bdc775fd-dwrbr                  0/1  Running  2  5h38m  ip-10-0-149-240.us-east-2.compute.internal
openshift-machine-config-operator        machine-config-daemon-ngrhx                      0/2  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    certified-operators-f6smg                        0/1  Running  2  55m    ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    community-operators-rrmq2                        0/1  Running  2  5h47m  ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    redhat-marketplace-vck2g                         0/1  Running  2  5h47m  ip-10-0-149-240.us-east-2.compute.internal
openshift-marketplace                    redhat-operators-n2kvh                           0/1  Running  2  1h20m  ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     alertmanager-main-2                              0/5  Running  2  5h37m  ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     kube-state-metrics-59b87859b8-kgw7r              0/3  Running  2  5h47m  ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     node-exporter-qm46v                              0/2  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     openshift-state-metrics-66585c8c7c-hdv4x         0/3  Running  2  5h47m  ip-10-0-149-240.us-east-2.compute.internal
openshift-monitoring                     telemeter-client-6bd9bc5f84-557hq                0/3  Running  2  5h47m  ip-10-0-149-240.us-east-2.compute.internal
openshift-multus                         multus-additional-cni-plugins-pg4qm              0/1  Pending  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-multus                         multus-bx467                                     0/1  Running  2  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal
openshift-multus                         network-metrics-daemon-c447r                     0/2  Running  2  5h40m  ip-10-0-149-240.us-east-2.compute.internal
openshift-network-diagnostics            network-check-source-75749bc6b4-c8jrf            0/1  Running  2  5h51m  ip-10-0-149-240.us-east-2.compute.internal
openshift-network-diagnostics            network-check-target-tm2fs                       0/1  Running  2  5h40m  ip-10-0-149-240.us-east-2.compute.internal
openshift-ovn-kubernetes                 ovnkube-node-p2bmb                               0/4  Running  3  5h40m  10.0.149.240  ip-10-0-149-240.us-east-2.compute.internal

This doesn't look like the MCO is the cause -- it looks more like the MCO is a victim here and is complaining about it. Can you still get into the "problem node" ip-10-0-149-240.us-east-2.compute.internal? Would you be able to upload the journal logs or take a sosreport of that node so we can see what's going on in there?
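If it helps, the journal can usually be pulled with something like the following (a sketch; the output file name is illustrative):

$ oc adm node-logs ip-10-0-149-240.us-east-2.compute.internal > node-journal.log

or, if `oc debug` still works against the node:

$ oc debug node/ip-10-0-149-240.us-east-2.compute.internal -- chroot /host journalctl --no-pager > node-journal.log

For a sosreport on RHCOS, the usual flow is to open a debug shell on the node (`oc debug node/<name>`), `chroot /host`, run `toolbox`, and then run `sosreport` inside the toolbox container.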
This hit in the middle of whatever stability issues were happening with [1], and it appeared to resolve with later nightlies, so I suspect the root cause here was similar (and, regardless, outside the MCO). I'm going to close this, as I believe the underlying problems have been resolved, but if you manage to reproduce it on a current nightly, please reopen it. Thanks!

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1997905
Unfortunately I'm getting the same results with version 4.10.0-0.nightly-2022-03-09-224546.

$ oc version
Client Version: 4.9.0-0.nightly-2021-07-20-014024
Server Version: 4.10.0-0.nightly-2022-03-09-224546
Kubernetes Version: v1.23.3+e419edf
Simon, can you clarify what exactly the bug is? Is it that the MCO goes degraded when a node crashes and is unavailable (which would be expected behaviour), or something else? Secondly, can you explain what you are doing to your clusters? This is a crash test, so it seems like the node crashed as intended. We need some extra details here so we can figure out what exactly is going on and whether this is a bug or not. Thanks!
Sorry for this one. That was a retest. Next time I see the same issue, I'll start from the node...