Description of problem:
The node got stuck in SchedulingDisabled after applying a PerformanceProfile, and the mcp stayed in Updating status.

cat performance_profile.yaml:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: perf-example
spec:
  cpu:
    isolated: "2-31"
    reserved: "0,1"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 10
      size: 1G
      node: 0
  # for 3 node converged master/worker and SNO clusters we use the masters as a selector
  nodeSelector:
    node-role.kubernetes.io/master: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  numa:
    topologyPolicy: "restricted"
  realTimeKernel:
    enabled: true

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-05-26-021757

How reproducible:
always

Steps to Reproduce:
1. Install an SNO cluster on BM and deploy the PAO operator
2. Check the PAO operator status
# oc get all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/performance-operator-d74df7b97-8sjmk   1/1     Running   0          4m45s

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/performance-operator-service   ClusterIP   172.30.151.212   <none>        443/TCP   4m45s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/performance-operator   1/1     1            1           4m45s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/performance-operator-d74df7b97   1         1         1       4m45s

3. Create the above performanceprofile
# oc create -f performance_profile.yaml

4. Check the performanceprofile status
# oc get performanceprofile perf-example -o yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2021-05-27T08:38:01Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: perf-example
  resourceVersion: "392237"
  uid: ba00d096-70ef-4491-a3da-d78bf0d33e28
spec:
  cpu:
    isolated: 2-31
    reserved: 0,1
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 10
      node: 0
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
status:
  conditions:
  - lastHeartbeatTime: "2021-05-27T08:38:01Z"
    lastTransitionTime: "2021-05-27T08:38:01Z"
    status: "True"
    type: Available
  - lastHeartbeatTime: "2021-05-27T08:38:01Z"
    lastTransitionTime: "2021-05-27T08:38:01Z"
    status: "True"
    type: Upgradeable
  - lastHeartbeatTime: "2021-05-27T08:38:01Z"
    lastTransitionTime: "2021-05-27T08:38:01Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2021-05-27T08:38:01Z"
    lastTransitionTime: "2021-05-27T08:38:01Z"
    status: "False"
    type: Degraded
  runtimeClass: performance-perf-example
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-perf-example

5. Wait 20 minutes or longer, then check the status of the mcp and the node.

Actual results:
The node got stuck in SchedulingDisabled, and the mcp kept the Updating status.
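For anyone re-running these steps, a minimal way to watch the rollout while waiting (standard oc/MCO commands, not part of the original report; the label selector assumes the stock machine-config-daemon daemonset):

# oc get mcp master -w
# oc get node -w
# oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=50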
We also note that the node's currentConfig is rendered-master-40426302aec772a0212c5b09fb217b9d and its desiredConfig is rendered-master-bfe59d660cb6600da4052939d0e40b8b, but the master mcp shows "All nodes are updating to rendered-master-40426302aec772a0212c5b09fb217b9d".

# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
00-worker                                          6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
01-master-container-runtime                        6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
01-master-kubelet                                  6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
01-worker-container-runtime                        6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
01-worker-kubelet                                  6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
50-nto-master                                                                                 3.2.0             3m48s
99-master-generated-kubelet                        6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             3m55s
99-master-generated-registries                     6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
99-master-ssh                                                                                 3.2.0             22h
99-worker-generated-registries                     6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
99-worker-ssh                                                                                 3.2.0             22h
performance-perf-example                                                                      2.2.0             3m55s
rendered-master-40426302aec772a0212c5b09fb217b9d   6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             3m50s   // ****
rendered-master-bfe59d660cb6600da4052939d0e40b8b   6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             3m43s   // ****
rendered-master-dc0efdf72fc26d95bcdcdce4b838607e   6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h
rendered-worker-867a6c3e0478ba859fb4d4f6a6831c9e   6b177afdbfa39ff9f3757740af10d1ca998d5e76   3.2.0             22h

# oc get node -o wide
NAME      STATUS                     ROLES           AGE   VERSION                INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
sno-0-0   Ready,SchedulingDisabled   master,worker   22h   v1.21.0-rc.0+9acd492   192.168.123.136   <none>        Red Hat Enterprise Linux CoreOS 48.84.202105251733-0 (Ootpa)   4.18.0-305.rt7.72.el8.x86_64   cri-o://1.21.0-98.rhaos4.8.git1f3c5cb.el8

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-dc0efdf72fc26d95bcdcdce4b838607e   False     True       False      1              0                   0                      0                      22h
worker   rendered-worker-867a6c3e0478ba859fb4d4f6a6831c9e   True      False      False      0              0                   0                      0                      22h

Expected results:
The node is schedulable, and the mcp finishes updating.
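A quick way to compare the node's MCD annotations against what the pool reports, which is where the mismatch above shows up (a sketch; it assumes the standard machineconfiguration.openshift.io annotations and the node name used in this report):

# oc get node sno-0-0 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}'
# oc get node sno-0-0 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'
# oc get node sno-0-0 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
# oc get mcp master -o jsonpath='{.spec.configuration.name}{" -> "}{.status.configuration.name}{"\n"}'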
Additional info:
# oc get mcp master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-05-26T10:33:25Z"
  generation: 4
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
    pools.operator.machineconfiguration.openshift.io/master: ""
  name: master
  resourceVersion: "397895"
  uid: 76c06d9a-e86f-43bf-b1b7-51888dac066a
spec:
  configuration:
    name: rendered-master-bfe59d660cb6600da4052939d0e40b8b
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 50-nto-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: performance-perf-example
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-05-26T10:37:39Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-05-26T10:37:56Z"
    message: ""
    reason: ""
    status: "False"
    type: NodeDegraded
  - lastTransitionTime: "2021-05-26T10:37:56Z"
    message: ""
    reason: ""
    status: "False"
    type: Degraded
  - lastTransitionTime: "2021-05-27T08:38:12Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-05-27T08:38:12Z"
    message: All nodes are updating to rendered-master-40426302aec772a0212c5b09fb217b9d
    reason: ""
    status: "True"
    type: Updating
  configuration:
    name: rendered-master-dc0efdf72fc26d95bcdcdce4b838607e
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  degradedMachineCount: 0
  machineCount: 1
  observedGeneration: 4
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0

# oc get node -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Node
  metadata:
    annotations:
      machineconfiguration.openshift.io/controlPlaneTopology: SingleReplica
      machineconfiguration.openshift.io/currentConfig: rendered-master-40426302aec772a0212c5b09fb217b9d
      machineconfiguration.openshift.io/desiredConfig: rendered-master-bfe59d660cb6600da4052939d0e40b8b
      machineconfiguration.openshift.io/reason: ""
      machineconfiguration.openshift.io/state: Working
      volumes.kubernetes.io/controller-managed-attach-detach: "true"
    creationTimestamp: "2021-05-26T10:26:20Z"
    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/os: linux
      kubernetes.io/arch: amd64
      kubernetes.io/hostname: sno-0-0
      kubernetes.io/os: linux
      node-role.kubernetes.io/master: ""
      node-role.kubernetes.io/worker: ""
      node.openshift.io/os_id: rhcos
    name: sno-0-0
    resourceVersion: "403745"
    uid: 62309070-661a-4f18-92c0-de2a4274b865
  spec:
    taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      timeAdded: "2021-05-27T08:52:42Z"
    unschedulable: true
  status:
    addresses:
    - address: 192.168.123.136
      type: InternalIP
    - address: sno-0-0
      type: Hostname
    allocatable:
      cpu: "30"
      ephemeral-storage: "127066945326"
      hugepages-1Gi: 10Gi
      hugepages-2Mi: "0"
      memory: 21303916Ki
      pods: "250"
    capacity:
      cpu: "32"
      ephemeral-storage: 137876460Ki
      hugepages-1Gi: 10Gi
      hugepages-2Mi: "0"
      memory: 32916076Ki
      pods: "250"
    conditions:
    - lastHeartbeatTime: "2021-05-27T09:03:47Z"
      lastTransitionTime: "2021-05-26T10:26:20Z"
      message: kubelet has sufficient memory available
      reason: KubeletHasSufficientMemory
      status: "False"
      type: MemoryPressure
    - lastHeartbeatTime: "2021-05-27T09:03:47Z"
      lastTransitionTime: "2021-05-26T10:26:20Z"
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure
    - lastHeartbeatTime: "2021-05-27T09:03:47Z"
      lastTransitionTime: "2021-05-26T10:26:20Z"
      message: kubelet has sufficient PID available
      reason: KubeletHasSufficientPID
      status: "False"
      type: PIDPressure
    - lastHeartbeatTime: "2021-05-27T09:03:47Z"
      lastTransitionTime: "2021-05-26T10:28:41Z"
      message: kubelet is posting ready status
      reason: KubeletReady
      status: "True"
      type: Ready
    daemonEndpoints:
      kubeletEndpoint:
        Port: 10250
    images:
    - names:
      - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2943b9589c7bbb3144e7650e12dddd0f968d7ed507d77d01cb7b956ad554da2a
      sizeBytes: 1056940795
    - names:
      - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1229e5e2f3a9978f4403857f0bbda6373500ce9577b250bb60240aa684f69ae0
      sizeBytes: 749678671
    - names:
      - registry.redhat.io/redhat/redhat-operator-index@sha256:4232c4766e0185ff4f785e76cf3e907c4d27c7cb2593a2364198f9abb1ba694d
      - registry.redhat.io/redhat/redhat-operator-index@sha256:adbb1da2d832dfca10ef1287b8f930be56411b05ef5bf4b194fc2b84ce2904a4
      - registry.redhat.io/redhat/redhat-operator-index:v4.8
      sizeBytes: 702108552
    - names:
      ...
    - names:
      - quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e690d1294aaa109f549c9f6f8aa635f6545eee742ea4c35e31701b4e523a4fb
      sizeBytes: 351766397
    nodeInfo:
      architecture: amd64
      bootID: ef281a98-673e-48b5-bf5e-bdbdb5c7f019
      containerRuntimeVersion: cri-o://1.21.0-98.rhaos4.8.git1f3c5cb.el8
      kernelVersion: 4.18.0-305.rt7.72.el8.x86_64
      kubeProxyVersion: v1.21.0-rc.0+9acd492
      kubeletVersion: v1.21.0-rc.0+9acd492
      machineID: 99b18607874147a986bc68477d3a2007
      operatingSystem: linux
      osImage: Red Hat Enterprise Linux CoreOS 48.84.202105251733-0 (Ootpa)
      systemUUID: 99b18607-8741-47a9-86bc-68477d3a2007
Then I made the node schedulable with oc adm uncordon. But when I create any pod, it gets stuck in Pending status and does not start on the SNO node, and I didn't find any relevant errors in the crio or kubelet logs.
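For completeness, two quick checks after uncordoning to confirm nothing is still blocking the node (plain oc commands, not from the original report):

# oc describe node sno-0-0 | grep -A 3 Taints
# oc get events -n default --sort-by=.lastTimestamp | tail -20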
Hi Ryan, Harshal, I tried with version 4.8.0-0.nightly-2021-06-01-043518. The node is normal after applying a performanceprofile, and the master mcp rolls out successfully. But when I create any pod, it gets stuck in Pending. I will modify the title.
# oc describe pod hello-pod-1
Name:         hello-pod-1
Namespace:    default
Priority:     0
Node:         sno-0-0/
Labels:       name=hello-pod
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  hello-pod:
    Image:        docker.io/ocpqe/hello-pod:latest
    Port:         8080/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-w78mf (ro)
Conditions:
  Type          Status
  PodScheduled  True
Volumes:
  default-token-w78mf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-w78mf
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  25s   default-scheduler  Successfully assigned default/hello-pod-1 to sno-0-0
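Since the pod is assigned (PodScheduled is True) but reports no container status, the node-side state can be inspected with crictl from a debug shell (a sketch; pod and node names are the ones from the output above):

# oc debug node/sno-0-0 -- chroot /host crictl pods --name hello-pod-1
# oc debug node/sno-0-0 -- chroot /host crictl ps -a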
I can't collect a must-gather, because the must-gather pod also stays Pending. I grepped some error logs from crio and kubelet.

kubelet:
Jun 01 11:52:56 sno-0-0 hyperkube[2265]: E0601 11:52:56.016498 2265 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = command error: time=\"2021-06-01T11:52:56Z\" level=error msg=\"exec failed: container_linux.go:367: starting container process caused: process_linux.go:152: writing config to pipe caused: write init-p: broken pipe\"\n, stdout: , stderr: , exit code -1" containerID="e94bd648b18cee1c2c23ebba62e1a2a17b992f7f18c4cb19f2fe8b37611e0ea1" cmd=[grpc_health_probe -addr=:50051]
Jun 01 11:52:56 sno-0-0 hyperkube[2265]: E0601 11:52:56.908910 2265 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = command error: time=\"2021-06-01T11:52:56Z\" level=error msg=\"exec failed: container_linux.go:367: starting container process caused: process_linux.go:152: writing config to pipe caused: write init-p: broken pipe\"\n, stdout: , stderr: , exit code -1" containerID="e94bd648b18cee1c2c23ebba62e1a2a17b992f7f18c4cb19f2fe8b37611e0ea1" cmd=[grpc_health_probe -addr=:50051]
Jun 01 11:52:56 sno-0-0 hyperkube[2265]: E0601 11:52:56.909666 2265 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = container is not created or running: checking if PID of e94bd648b18cee1c2c23ebba62e1a2a17b992f7f18c4cb19f2fe8b37611e0ea1 is running failed: open /proc/43140/stat: no such file or directory: container process not found" containerID="e94bd648b18cee1c2c23ebba62e1a2a17b992f7f18c4cb19f2fe8b37611e0ea1" cmd=[grpc_health_probe -addr=:50051]
Jun 01 11:52:56 sno-0-0 hyperkube[2265]: E0601 11:52:56.910118 2265 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = NotFound desc = container is not created or running: checking if PID of e94bd648b18cee1c2c23ebba62e1a2a17b992f7f18c4cb19f2fe8b37611e0ea1 is running failed: open /proc/43140/stat: no such file or directory: container process not found" containerID="e94bd648b18cee1c2c23ebba62e1a2a17b992f7f18c4cb19f2fe8b37611e0ea1" cmd=[grpc_health_probe -addr=:50051]

crio:
Jun 01 11:59:29 sno-0-0 crio[2226]: time="2021-06-01 11:59:29.871447734Z" level=error msg="Container creation error: time=\"2021-06-01T11:58:47Z\" level=warning msg=\"Timed out while waiting for StopUnit(crio-bc18714b79b9f63dbacbde40fc20e571a1f980d516c74d57ef22090a33723a7b.scope) completion signal from dbus. Continuing...\"\ntime=\"2021-06-01T11:59:29Z\" level=error msg=\"container_linux.go:367: starting container process caused: process_linux.go:340: applying cgroup configuration for process caused: Timeout waiting for systemd to create crio-bc18714b79b9f63dbacbde40fc20e571a1f980d516c74d57ef22090a33723a7b.scope\"\n" id=26774bf9-342e-4bed-9a49-b9f4c8a36d30 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jun 01 11:59:54 sno-0-0 crio[2226]: time="2021-06-01 11:59:54.586000422Z" level=error msg="Container creation error: time=\"2021-06-01T11:59:54Z\" level=error msg=\"container_linux.go:367: starting container process caused: process_linux.go:340: applying cgroup configuration for process caused: Timeout waiting for systemd to create crio-b9578589b07ff943fb52e46a4a5dfad7c6abd95da69cb4b4c417bdddc1dfb50c.scope\"\n" id=af0b5fa4-2eaa-4d68-ace2-419e9808536c name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jun 01 12:03:27 sno-0-0 crio[2226]: time="2021-06-01 12:03:27.292972083Z" level=warning msg="Error reserving ctr name k8s_router_router-default-655675cbb9-4vbjr_openshift-ingress_a12d28ab-8d36-4073-9ebf-c5ef03d61afa_5 for id 365fdf6d14c13710833a5acb2a923bd427945e658cee2903a8602b7027226036: name is reserved"
Jun 01 12:07:08 sno-0-0 crio[2226]: time="2021-06-01 12:07:08.650944276Z" level=error msg="Container creation error: time=\"2021-06-01T12:06:20Z\" level=warning msg=\"Timed out while waiting for StopUnit(crio-a4c757934c2c16ff705ace1c3143ebc706398d4d6f78f4a649973e457f696a7f.scope) completion signal from dbus. Continuing...\"\ntime=\"2021-06-01T12:07:08Z\" level=error msg=\"container_linux.go:367: starting container process caused: process_linux.go:340: applying cgroup configuration for process caused: Timeout waiting for systemd to create crio-a4c757934c2c16ff705ace1c3143ebc706398d4d6f78f4a649973e457f696a7f.scope\"\n" id=cae96ed7-c200-4b74-8008-e4d1e4d0600b name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jun 01 12:07:08 sno-0-0 crio[2226]: time="2021-06-01 12:07:08.651228911Z" level=error msg="Container creation error: time=\"2021-06-01T12:06:20Z\" level=warning msg=\"Timed out while waiting for StopUnit(crio-e150e6d058ae81c14db8d39b7a4a6289172c65c96dfec6461d2cea926dadc4ee.scope) completion signal from dbus. Continuing...\"\ntime=\"2021-06-01T12:07:08Z\" level=error msg=\"container_linux.go:367: starting container process caused: process_linux.go:340: applying cgroup configuration for process caused: Timeout waiting for systemd to create crio-e150e6d058ae81c14db8d39b7a4a6289172c65c96dfec6461d2cea926dadc4ee.scope\"\n" id=fa99d53b-67d4-48be-adcd-ae1864082b1f name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jun 01 12:12:39 sno-0-0 crio[2226]: time="2021-06-01 12:12:39.474289420Z" level=error msg="Container creation error: time=\"2021-06-01T12:12:39Z\" level=error msg=\"container_linux.go:367: starting container process caused: process_linux.go:340: applying cgroup configuration for process caused: Timeout waiting for systemd to create crio-49c3b625e5a0e5855cde159a44a7c3719863498c2c433d9d8846af8e71905044.scope\"\n" id=88fe2087-6c7f-434e-90c1-1267273dac1b name=/runtime.v1alpha2.RuntimeService/CreateContainer
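Since must-gather cannot run while the node is in this state, the same journals can still be pulled through the API server (standard oc commands, shown here as a sketch):

# oc adm node-logs sno-0-0 -u crio > crio.log
# oc adm node-logs sno-0-0 -u kubelet > kubelet.log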
Created attachment 1788679 [details] kubelet log
Hi MinLi, I have the original issue as you stated ("node got stuck in SchedulingDisabled after applying a performanceprofile"); I followed exactly the same steps on my BM SNO env.

OCP - 4.8.0-fc.3
PAO - performance-addon-operator.v4.7.3
PAO profile - quite similar to yours, with RT enabled as well
Kernel - 4.18.0-293.rt7.59.el8.x86_64
CRI-O - cri-o-1.21.0-90.rhaos4.8.git07becf8.el8.x86_64

After the reboot, many system pods stay in "CreateContainerError" state forever; a few of the events are below.

authentication-operator pod
Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Normal   Pulled  60m (x407 over 18h)    kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8a6f73297d8394136057c3424a155c9518156c633ca0dba3e7c41bfb5d2e5168" already present on machine
  Warning  Failed  32m (x52 over 16h)     kubelet  (combined from similar events): Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_authentication-operator_authentication-operator-7ffcf8797c-mgpb9_openshift-authentication-operator_2de8f166-bb9d-40b6-97d5-3c2863944f0f_10 for id e254e9d134cd7420593a2ce547f2ab2d602fd09b712dcce144feb97816d43025: name is reserved
  Warning  Failed  3m40s (x366 over 18h)  kubelet  Error: context deadline exceeded

cluster-baremetal-operator pod
Events:
  Type     Reason  Age                   From     Message
  ----     ------  ----                  ----     -------
  Warning  Failed  102m (x723 over 14h)  kubelet  Error: failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable
  Normal   Pulled  17m (x923 over 18h)   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f46f210d6023bec16e68340b484a8881ce46d5e6749f7c7f86f9840709d2e080" already present on machine
  Normal   Pulled  12m (x931 over 18h)   kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dc3269f59867950fd684e08b4920195e5557569a0bde3e1bfec999eeb424861c" already present on machine
  Warning  Failed  2m9s (x846 over 14h)  kubelet  Error: failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable

I can reproduce this often (not always); very occasionally it takes longer (>3 hours) to reach "Ready", but usually it gets stuck in the "SchedulingDisabled" state. I assume that if you rebuild and follow the steps you might hit the same problem again, so I think we may need to stick to the old title. Thanks!
How's this bug related to https://bugzilla.redhat.com/show_bug.cgi?id=1962074
It does look like this bug is related to https://bugzilla.redhat.com/show_bug.cgi?id=1962074, though we're getting slightly different results.

Something worth noting in the original bug description:
```
cat performance_profile.yaml:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: perf-example
spec:
  cpu:
    isolated: "2-31"
    reserved: "0,1"
```
2 reserved CPUs vs 30 isolated doesn't seem good. The reserved CPUs host (at least) systemd, crio, kubelet, and all pause containers. I don't think this is a sustainable ratio (though I don't really know what is).

Let's move the discussion of SNO container-start issues to 1962074; there we can collect all of the symptoms and see if we can alleviate them.

*** This bug has been marked as a duplicate of bug 1962074 ***
The bug is marked as a duplicate; I'm replying here to clear the needinfo request.
This was incorrectly marked as a duplicate, as the symptoms are slightly different from https://bugzilla.redhat.com/show_bug.cgi?id=1962074. Reopening.
@minmli I have a couple of questions:

1. After the PerformanceProfile is applied and the node is ready, can we wait till all of the infra pods have started up? I bet there's a delay between when NodeReady happens and when the node is actually ready (multus is up, KAS is serving requests, all operators have been created, etc.). Then, once most of the pods are ready, can we create the workload and time how long it takes to be created?

2. Are you able to get me access to a node that has a performance profile applied to it? It would help me and my team a lot to have something to play around with.

Thanks
(In reply to Peter Hunt from comment #13)
> @minmli I have a couple of questions:
>
> 1. After the PerformanceProfile is applied and the node is ready, can we
> wait till all of the infra pods have started up? I bet there's a delay
> between when NodeReady happens and when the node is actually ready (multus
> is up, KAS is serving requests, all operators have been created, etc.) Then,
> once most of the pods are ready, can we create the workload and time how
> long it takes to be created?
>
> 2. Are you able to get me access to a node that has a performance profile
> applied to it? It would help me and my team a lot to have something to play
> around with
>
> Thanks

Hi, Peter:

1. When I created the workload, I checked the pod and co status: all cluster operators were Available, without Progressing or Degraded, and some pods in the openshift-marketplace namespace were in non-Running states (I don't think these are basic functional pods). There is also one pod, cluster-baremetal-operator-8674588c96-dzbpv, that may be affected. Let me paste the pod results below; it took 8 minutes for the workload pod to start.

# oc get pod -A | grep -v Running
NAMESPACE                           NAME                                                              READY   STATUS        RESTARTS   AGE
default                             hello-pod-1                                                       0/1     Pending       0          4m38s   // the workload pod kept Pending for about 8 mins
openshift-etcd                      installer-2-sno-0-0                                               0/1     Completed     0          4h43m
openshift-etcd                      revision-pruner-2-sno-0-0                                         0/1     Completed     0          4h38m
openshift-kube-apiserver            installer-2-sno-0-0                                               0/1     Completed     0          4h38m
openshift-kube-apiserver            installer-4-sno-0-0                                               0/1     Completed     0          4h35m
openshift-kube-apiserver            installer-5-sno-0-0                                               0/1     Completed     0          4h35m
openshift-kube-apiserver            installer-6-sno-0-0                                               0/1     Completed     0          4h30m
openshift-kube-apiserver            installer-7-retry-1-sno-0-0                                       0/1     Completed     0          4h27m
openshift-kube-apiserver            installer-7-sno-0-0                                               0/1     Error         0          4h29m
openshift-kube-apiserver            installer-8-sno-0-0                                               0/1     Completed     0          4h26m
openshift-kube-apiserver            revision-pruner-5-sno-0-0                                         0/1     Completed     0          4h31m
openshift-kube-apiserver            revision-pruner-7-sno-0-0                                         0/1     Completed     0          4h27m
openshift-kube-apiserver            revision-pruner-8-sno-0-0                                         0/1     Completed     0          4h22m
openshift-kube-controller-manager   installer-2-sno-0-0                                               0/1     Completed     1          4h38m
openshift-kube-controller-manager   installer-3-sno-0-0                                               0/1     Completed     0          4h35m
openshift-kube-controller-manager   installer-4-sno-0-0                                               0/1     Completed     0          4h31m
openshift-kube-controller-manager   installer-5-sno-0-0                                               0/1     Completed     0          4h30m
openshift-kube-controller-manager   installer-6-sno-0-0                                               0/1     Completed     0          4h26m
openshift-kube-controller-manager   revision-pruner-2-sno-0-0                                         0/1     Completed     0          4h35m
openshift-kube-controller-manager   revision-pruner-3-sno-0-0                                         0/1     Completed     0          4h31m
openshift-kube-controller-manager   revision-pruner-4-sno-0-0                                         0/1     Completed     0          4h30m
openshift-kube-controller-manager   revision-pruner-5-sno-0-0                                         0/1     Completed     0          4h26m
openshift-kube-controller-manager   revision-pruner-6-sno-0-0                                         0/1     Completed     0          4h26m
openshift-kube-scheduler            installer-2-sno-0-0                                               0/1     Completed     0          4h43m
openshift-kube-scheduler            installer-3-sno-0-0                                               0/1     Completed     0          4h36m
openshift-kube-scheduler            installer-4-sno-0-0                                               0/1     Completed     0          4h36m
openshift-kube-scheduler            installer-5-sno-0-0                                               0/1     Completed     0          4h36m
openshift-kube-scheduler            installer-6-sno-0-0                                               0/1     Completed     0          4h30m
openshift-kube-scheduler            installer-7-sno-0-0                                               0/1     Completed     0          4h30m
openshift-kube-scheduler            installer-8-sno-0-0                                               0/1     Completed     0          4h26m
openshift-kube-scheduler            revision-pruner-2-sno-0-0                                         0/1     Completed     0          4h38m
openshift-kube-scheduler            revision-pruner-5-sno-0-0                                         0/1     Completed     0          4h35m
openshift-kube-scheduler            revision-pruner-7-sno-0-0                                         0/1     Completed     0          4h26m
openshift-kube-scheduler            revision-pruner-8-sno-0-0                                         0/1     Completed     0          4h22m
openshift-machine-api               cluster-baremetal-operator-8674588c96-dzbpv                       1/2     Error         10         4h35m   // the pod from openshift-machine-api is in Error
openshift-marketplace               b4280e990e92eb505c5556482ed7ec4789cc315c02c7d13c2a6cb9977f8fs6z   0/1     Completed     0          71m
openshift-marketplace               certified-operators-5wq9g                                         0/1     Pending       0          7m29s
openshift-marketplace               certified-operators-vgdmq                                         1/1     Terminating   5          4h36m
openshift-marketplace               community-operators-pt8gh                                         1/1     Terminating   1          28m
openshift-marketplace               redhat-marketplace-827t5                                          1/1     Terminating   1          18m
openshift-marketplace               redhat-operators-wpxtp                                            1/1     Terminating   2          18m

2. Please access the node in the way we discussed (I will send you the password via Slack):
ssh root.eng.tlv2.redhat.com
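For question 1, one way to confirm the control plane has settled before creating the workload, instead of eyeballing the pod list (a sketch using standard oc commands; the timeout value is arbitrary):

# oc wait clusteroperators --all --for=condition=Available --timeout=10m
# oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded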
You can get the node's must-gather from this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1962074
I have deleted and recreated the pod a couple of times, and it seems this isn't reproducible after the first pod creation.

On a typical node, when a worker goes NodeReady it's pretty much ready to accept pods, as there are no infra pods that need bootstrapping. This is obviously not the case for SNO. I currently suspect the 8 minute delay is the control plane and node processes bootstrapping. NodeReady only means the kubelet has started, but there are many other things that must be done before the node will start user workloads.

To verify, I would make sure that after the initial 8 minute delay pods start coming up faster. If so, the 8 minute delay may be a quirk of SNO, not a bug.
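A rough way to check that hypothesis is to time a few pod creations in a row (a sketch; the hello-timing-* pod names are made up for this test, and the image is the one used earlier in this report):

for i in 1 2 3; do
  oc run hello-timing-$i --image=docker.io/ocpqe/hello-pod:latest --restart=Never
  time oc wait pod/hello-timing-$i --for=condition=Ready --timeout=15m
done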
Moving to MODIFIED because I believe this is not a bug, but I would like QA validation.
Tested on 4.8.0-0.nightly-2021-07-03-115702: with more CPUs assigned to reserved (isolated: 8-31, reserved: 0-7), newly created pods reach the Running status quickly.
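For reference, the verified change above amounts to widening the reserved set; a minimal sketch of applying it to the existing profile with a merge patch (equivalent to editing spec.cpu by hand):

# oc patch performanceprofile perf-example --type merge -p '{"spec":{"cpu":{"isolated":"8-31","reserved":"0-7"}}}'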
Closed as stale, please re-open if this issue is still active.