Description of problem:
Many pods get stuck in CreateContainerError status with the message "failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable" after a performance profile is applied.

cat performance_profile.yaml:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: perf-example
spec:
  cpu:
    isolated: "16-29"
    reserved: "0-15,30,31"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 10
      size: 1G
      node: 0
  # for 3 node converged master/worker and SNO clusters we use the masters as a selector
  nodeSelector:
    node-role.kubernetes.io/master: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  numa:
    topologyPolicy: "restricted"
  realTimeKernel:
    # For CU should be false
    enabled: true

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-05-18-205323

How reproducible:

Steps to Reproduce:
1. Install an SNO cluster on bare metal and deploy the Performance Addon Operator (PAO).
2. Check the PAO operator status:
# oc get all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/performance-operator-d74df7b97-8sjmk   1/1     Running   0          4m45s

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/performance-operator-service   ClusterIP   172.30.151.212   <none>        443/TCP   4m45s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/performance-operator   1/1     1            1           4m45s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/performance-operator-d74df7b97   1         1         1       4m45s

3. Create the performance profile shown above:
# oc create -f performance_profile.yaml
4. Check the performance profile status:
# oc get performanceprofile perf-example -o yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2021-05-19T07:19:44Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: perf-example
  resourceVersion: "85720"
  uid: 83b89352-da35-4593-9184-d553e933d7da
spec:
  cpu:
    isolated: 16-29
    reserved: 0-15,30,31
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 10
      node: 0
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
status:
  conditions:
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "True"
    type: Available
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "True"
    type: Upgradeable
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "False"
    type: Degraded
  runtimeClass: performance-perf-example
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-perf-example

5. Wait 20 minutes or more, then check the MCP and the node.

Actual results:
5. The master MCP stays in UPDATING even though the SNO node is Ready, and the network cluster operator stays in PROGRESSING. Many pods are stuck in CreateContainerError, and any newly created pod (for example a debug pod or a must-gather pod) stays Pending, so I cannot debug the node to check the kubelet/crio logs or collect must-gather.
Expected results:

Additional info:
# oc get mcp
NAME     CONFIG                                              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-eda807b096850318a467710061d40fae   False     True       False      1              0                   0                     0                      5h44m
worker   rendered-worker-d89f4b2965a80d86c8aa31cd50817b95   True      False      False      0              0                   0                     0                      5h44m

# oc get node -o wide
NAME      STATUS   ROLES           AGE     VERSION                INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
sno-0-0   Ready    master,worker   4h33m   v1.21.0-rc.0+9d99e1c   192.168.123.132   <none>        Red Hat Enterprise Linux CoreOS 48.84.202105180118-0 (Ootpa)   4.18.0-293.rt7.59.el8.x86_64   cri-o://1.21.0-93.rhaos4.8.git8f8bcd9.el8

# oc describe node sno-0-0
Name:               sno-0-0
Roles:              master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=sno-0-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
...
CreationTimestamp:  Wed, 19 May 2021 06:18:27 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  sno-0-0
  AcquireTime:     <unset>
  RenewTime:       Wed, 19 May 2021 10:51:05 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:25:00 +0300   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.123.132
  Hostname:    sno-0-0
Capacity:
  cpu:                32
  ephemeral-storage:  137876460Ki
  hugepages-1Gi:      10Gi
  hugepages-2Mi:      0
  memory:             32916072Ki
  pods:               250
Allocatable:
  cpu:                14
  ephemeral-storage:  127066945326
  hugepages-1Gi:      10Gi
  hugepages-2Mi:      0
  memory:             21303912Ki
  pods:               250
# oc get pod -A | grep -v Running | grep -v Completed
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-apiserver-operator openshift-apiserver-operator-c4b66d4b8-k6v77 0/1 CreateContainerError 4 5h9m
openshift-apiserver apiserver-59d6966f55-ln9pk 0/2 Init:CreateContainerError 0 4h52m
openshift-authentication-operator authentication-operator-79458867d7-j5cnt 0/1 CreateContainerError 4 5h9m
openshift-authentication oauth-openshift-6497b479c7-hvfkt 0/1 CreateContainerError 0 4h47m
openshift-cloud-credential-operator cloud-credential-operator-54b95bb96c-7vjfd 0/2 CreateContainerError 0 5h9m
openshift-cluster-machine-approver machine-approver-797899769d-48c6l 1/2 CreateContainerError 1 5h9m
openshift-cluster-node-tuning-operator cluster-node-tuning-operator-968d8f695-cl5k9 0/1 CreateContainerError 0 5h4m
openshift-cluster-samples-operator cluster-samples-operator-7c4f8d65f9-r4nnd 0/2 CreateContainerError 0 5h4m
openshift-cluster-storage-operator cluster-storage-operator-9bb96976c-nrkz8 0/1 CreateContainerError 3 5h4m
openshift-cluster-storage-operator csi-snapshot-controller-65b6b6f94d-b4z5w 0/1 CreateContainerError 5 5h4m
openshift-cluster-storage-operator csi-snapshot-controller-operator-dd84677f4-vwnjp 0/1 CreateContainerError 3 5h4m
openshift-cluster-storage-operator csi-snapshot-webhook-d5958544f-kjr55 0/1 CreateContainerError 0 5h4m
openshift-config-operator openshift-config-operator-6ddfccb5b7-84l9n 0/1 CreateContainerError 6 5h9m
openshift-console-operator console-operator-656959b66-5q2gw 0/1 CreateContainerError 3 4h59m
openshift-console console-6c5947c444-f6xvh 0/1 CreateContainerError 0 4h52m
openshift-console downloads-894b6fd6d-472dj 0/1 CreateContainerError 0 4h59m
openshift-controller-manager-operator openshift-controller-manager-operator-77f98c55f5-5dl69 0/1 CreateContainerError 4 5h9m
openshift-controller-manager controller-manager-pr27x 0/1 CreateContainerError 1 4h51m
openshift-dns-operator dns-operator-6c5b489f4b-gpnzz 0/2 CreateContainerError 0 5h9m
openshift-dns dns-default-lr7kt 0/2 CreateContainerError 0 5h5m
openshift-etcd-operator etcd-operator-6bcd8b5669-5f875 0/1 CreateContainerError 4 5h9m
openshift-image-registry cluster-image-registry-operator-845bd756b6-p26f5 0/1 CreateContainerError 3 5h4m
openshift-image-registry image-registry-f796bb59c-kzrjn 0/1 CreateContainerError 0 4h52m
openshift-ingress-canary ingress-canary-r6hnc 0/1 CreateContainerError 0 4h59m
openshift-ingress-operator ingress-operator-78b8fdb7cf-mrbh2 0/2 CreateContainerError 7 5h4m
openshift-ingress router-default-77c7f6699c-jnbpt 0/1 CreateContainerConfigError 1 4h59m
openshift-insights insights-operator-6b46f5bd76-8gnlt 0/1 CreateContainerError 0 5h4m
openshift-kube-apiserver-operator kube-apiserver-operator-75df466f75-7wkwj 0/1 CreateContainerError 3 5h4m
openshift-kube-controller-manager-operator kube-controller-manager-operator-55bf67d689-x9rgb 0/1 CreateContainerError 4 5h9m
openshift-kube-scheduler-operator openshift-kube-scheduler-operator-84b6488c49-kdq29 0/1 CreateContainerError 4 5h9m
openshift-kube-storage-version-migrator-operator kube-storage-version-migrator-operator-6b565f5845-6zkr9 0/1 CreateContainerError 4 5h9m
openshift-kube-storage-version-migrator migrator-b5574d49c-8j2ql 0/1 CreateContainerError 0 5h6m
openshift-machine-api cluster-autoscaler-operator-5f4b4f8cdb-x7nr7 0/2 CreateContainerError 0 5h4m
openshift-machine-api cluster-baremetal-operator-5c94899f6c-lcmnh 0/2 CreateContainerError 0 5h4m
openshift-machine-api machine-api-operator-7849998dd5-lpq7j 0/2 CreateContainerError 3 5h4m
openshift-machine-config-operator machine-config-controller-84974d8779-5bgq8 0/1 CreateContainerError 3 5h4m
openshift-machine-config-operator machine-config-operator-6f4d57f75f-66b9l 0/1 CreateContainerError 3 5h4m
openshift-marketplace certified-operators-chcjq 0/1 CreateContainerError 0 5h4m
openshift-marketplace community-operators-8s75q 0/1 CreateContainerError 0 130m
openshift-marketplace marketplace-operator-6cb74c86fd-cmbms 0/1 CreateContainerError 0 5h9m
openshift-marketplace redhat-marketplace-sdxmp 0/1 CreateContainerError 0 5h4m
openshift-marketplace redhat-operators-w4z5r 0/1 ContainerCreating 0 66m
openshift-marketplace redhat-operators-z98tp 0/1 CreateContainerError 0 4h6m
openshift-monitoring alertmanager-main-0 0/5 CreateContainerError 0 4h54m
openshift-monitoring cluster-monitoring-operator-6b87ccc7bb-xg4s5 0/2 CreateContainerError 3 5h9m
openshift-monitoring grafana-86dd7559df-blc6s 0/2 CreateContainerError 0 4h54m
openshift-monitoring kube-state-metrics-69bf796889-bdf97 0/3 CreateContainerError 0 5h6m
openshift-monitoring node-exporter-nlj4n 1/2 CreateContainerError 1 5h6m
openshift-monitoring openshift-state-metrics-77c86d55b9-5zsj4 0/3 CreateContainerError 0 5h6m
openshift-monitoring prometheus-adapter-946bbf6c6-zztk6 0/1 CreateContainerError 0 5h
openshift-monitoring prometheus-k8s-0 0/7 CreateContainerError 1 4h54m
openshift-monitoring prometheus-operator-59b4957975-q2d2b 0/2 CreateContainerError 0 4h55m
openshift-monitoring telemeter-client-64d4467c98-hqb67 0/3 CreateContainerError 0 5h6m
openshift-monitoring thanos-querier-5dd4b66587-jmbz8 0/5 CreateContainerError 0 4h54m
openshift-multus multus-7drwt 0/1 Init:CreateContainerError 3 5h8m
openshift-multus multus-admission-controller-wbhkg 0/2 CreateContainerError 0 5h7m
openshift-multus network-metrics-daemon-55xnx 0/2 CreateContainerError 0 5h8m
openshift-network-diagnostics network-check-source-7d77bd595b-n8s2d 0/1 CreateContainerError 0 5h8m
openshift-network-diagnostics network-check-target-9pxnt 0/1 CreateContainerError 0 5h8m
openshift-oauth-apiserver apiserver-66dd9ff6d-wjsph 0/1 Init:CreateContainerError 0 5h6m
openshift-operator-lifecycle-manager catalog-operator-5c8dc876fc-v9wpk 0/1 CreateContainerError 0 5h4m
openshift-operator-lifecycle-manager olm-operator-6487f89f75-njq75 0/1 CreateContainerError 0 5h9m
openshift-operator-lifecycle-manager packageserver-5f4bbd9748-8jg8v 0/1 CreateContainerError 0 5h4m
openshift-operator-lifecycle-manager packageserver-5f4bbd9748-bxzfg 0/1 CreateContainerError 0 5h4m
openshift-performance-addon-operator performance-operator-d74df7b97-8sjmk 0/1 CreateContainerError 0 116m
openshift-service-ca-operator service-ca-operator-7f78466ccb-lj26n 0/1 CreateContainerError 4 5h9m
openshift-service-ca service-ca-7fb77576f-4wfvp 0/1 CreateContainerError 3 5h6m

# oc get pod performance-operator-d74df7b97-8sjmk -o yaml -n openshift-performance-addon-operator
apiVersion: v1
kind: Pod
metadata:
  annotations:
    alm-examples: |-
      [ { "apiVersion": "performance.openshift.io/v1", "kind": "PerformanceProfile", "metadata": { "name": "example-performanceprofile" }, "spec": { "additionalKernelArgs": [ "nmi_watchdog=0", "audit=0", "mce=off", "processor.max_cstate=1", "idle=poll", "intel_idle.max_cstate=0" ], "cpu": { "isolated": "2-3", "reserved": "0-1" }, "hugepages": { "defaultHugepagesSize": "1G", "pages": [ { "count": 2, "node": 0, "size": "1G" } ] }, "nodeSelector": { "node-role.kubernetes.io/performance": "" }, "realTimeKernel": { "enabled": true } } },
        { "apiVersion": "performance.openshift.io/v2", "kind": "PerformanceProfile", "metadata": { "name": "example-performanceprofile" }, "spec": { "additionalKernelArgs": [ "nmi_watchdog=0", "audit=0", "mce=off", "processor.max_cstate=1", "idle=poll", "intel_idle.max_cstate=0" ], "cpu": { "isolated": "2-3", "reserved": "0-1" }, "hugepages": { "defaultHugepagesSize": "1G", "pages": [ { "count": 2, "node": 0, "size": "1G" } ] }, "nodeSelector": { "node-role.kubernetes.io/performance": "" }, "realTimeKernel": { "enabled": true } } } ]
    capabilities: Basic Install
    categories: OpenShift Optional
    certified: "false"
    containerImage: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
    description: Operator to optimize OpenShift clusters for applications sensitive to CPU and network latency.
    k8s.v1.cni.cncf.io/network-status: |-
      [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.4" ], "default": true, "dns": {} }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.4" ], "default": true, "dns": {} }]
    olm.operatorGroup: openshift-performance-addon-operator
    olm.operatorNamespace: openshift-performance-addon-operator
    olm.skipRange: '>=4.6.0 <4.7.3'
    olm.targetNamespaces: ""
    olmcahash: 2d4f1f5ab3354c79ac434d17a0ffbf7deb9cd3b3757349d18946d76a3a90f233
    operatorframework.io/properties: '{"properties":[{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v1"}},{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v1alpha1"}},{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v2"}},{"type":"olm.package","value":{"packageName":"performance-addon-operator","version":"4.7.3"}}]}'
    operators.operatorframework.io/builder: operator-sdk-v1.0.0
    operators.operatorframework.io/project_layout: go.kubebuilder.io/v2
    repository: https://github.com/openshift-kni/performance-addon-operators
    support: Red Hat
  creationTimestamp: "2021-05-19T06:31:36Z"
  generateName: performance-operator-d74df7b97-
  labels:
    name: performance-operator
    pod-template-hash: d74df7b97
  name: performance-operator-d74df7b97-8sjmk
  namespace: openshift-performance-addon-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: performance-operator-d74df7b97
    uid: cc929f92-9292-4412-b09b-977de33ae1c1
  resourceVersion: "92163"
  uid: 4e864ea2-72dc-4ca2-abe9-5f7094c032bf
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
  containers:
  - command:
    - performance-operator
    env:
    - name: WATCH_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations['olm.targetNamespaces']
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: OPERATOR_NAME
      value: performance-operator
    - name: OPERATOR_CONDITION_NAME
      value: performance-addon-operator.v4.7.3
    image: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
    imagePullPolicy: Always
    name: performance-operator
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /apiserver.local.config/certificates
      name: apiservice-cert
    - mountPath: /tmp/k8s-webhook-server/serving-certs
      name: webhook-cert
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: performance-operator-token-ggl2t
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: performance-operator-dockercfg-2lgmm
  nodeName: sno-0-0
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: performance-operator
  serviceAccountName: performance-operator
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: apiservice-cert
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: apiserver.crt
      - key: tls.key
        path: apiserver.key
      secretName: performance-operator-service-cert
  - name: webhook-cert
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: performance-operator-service-cert
  - name: performance-operator-token-ggl2t
    secret:
      defaultMode: 420
      secretName: performance-operator-token-ggl2t
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T06:31:36Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T07:26:38Z"
    message: 'containers with unready status: [performance-operator]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T07:26:38Z"
    message: 'containers with unready status: [performance-operator]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T06:31:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
    imageID: ""
    lastState: {}
    name: performance-operator
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable'
        reason: CreateContainerError
  hostIP: 192.168.123.132
  phase: Pending
  podIP: 10.128.0.4
  podIPs:
  - ip: 10.128.0.4
  qosClass: BestEffort
  startTime: "2021-05-19T06:31:36Z"

# oc get pod cluster-monitoring-operator-6b87ccc7bb-xg4s5 -o yaml -n openshift-monitoring
  containerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b71921d67098b1d618e6abc7d9343c9ef74045782fca4cc4c8122cc0654b9d94
    imageID: ""
    lastState: {}
    name: cluster-monitoring-operator
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable'
        reason: CreateContainerError
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4e9ead3ea46f1a71ad774ade46b8853224f0368056f4d5f8b6622927a9b71a8e
    imageID: ""
    lastState:
      terminated:
        containerID: cri-o://6d215c58cca264ee5a60347ccdc79ae2f16ff48392164fc74f7c809ae685833f
        exitCode: 255
        finishedAt: "2021-05-19T03:21:08Z"
        message: "I0519 03:21:08.619440 1 main.go:178] Valid token audiences: \nI0519 03:21:08.619570 1 main.go:271] Reading certificate files\nF0519 03:21:08.619592 1 main.go:275] Failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory\ngoroutine 1 [running]:\nk8s.io/klog/v2.stacks(0xc000010001, 0xc000700000, 0xc6, 0x1c8)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:996 +0xb9\nk8s.io/klog/v2.(*loggingT).output(0x2292280, 0xc000000003, 0x0, 0x0, 0xc0003d6690, 0x1bf960e, 0x7, 0x113, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:945 +0x191\nk8s.io/klog/v2.(*loggingT).printf(0x2292280, 0x3, 0x0, 0x0, 0x17681a5, 0x2d, 0xc000515c78, 0x1, 0x1)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733 +0x17a\nk8s.io/klog/v2.Fatalf(...)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1463\nmain.main()\n\t/go/src/github.com/brancz/kube-rbac-proxy/main.go:275 +0x1e18\n\ngoroutine 6 [chan receive]:\nk8s.io/klog/v2.(*loggingT).flushDaemon(0x2292280)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1131 +0x8b\ncreated by k8s.io/klog/v2.init.0\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:416 +0xd8\n"
        reason: Error
        startedAt: "2021-05-19T03:21:08Z"
    name: kube-rbac-proxy
    ready: false
    restartCount: 3
    started: false
    state:
      waiting:
        message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable'
        reason: CreateContainerError
  hostIP: 192.168.123.132
  phase: Pending
  podIP: 10.128.0.53
  podIPs:
  - ip: 10.128.0.53
  qosClass: Burstable
  startTime: "2021-05-19T03:20:38Z"
Hm, we should retry that. PR attached.
Though it's possible dbus is just hosed and retrying won't actually help. I may make that PR back off exponentially, but there will still be container creation errors unless dbus can keep up.
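For illustration only, here is a minimal Go sketch of that idea; this is not the actual CRI-O patch, and the socket path, attempt count, and delays are assumptions:

package main

import (
	"fmt"
	"net"
	"time"
)

// dialSystemdWithBackoff retries the dial with exponentially growing delays so
// a busy dbus/systemd gets time to drain its queue instead of being hammered
// with immediate reconnect attempts.
func dialSystemdWithBackoff(socket string, maxAttempts int) (net.Conn, error) {
	delay := 100 * time.Millisecond
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		conn, err := net.Dial("unix", socket)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		time.Sleep(delay)
		delay *= 2 // exponential back-off between attempts
	}
	return nil, fmt.Errorf("dial %s after %d attempts: %w", socket, maxAttempts, lastErr)
}

func main() {
	// /run/systemd/private is the socket named in the error message; dialing
	// it normally requires root.
	conn, err := dialSystemdWithBackoff("/run/systemd/private", 5)
	if err != nil {
		fmt.Println("still failing after retries:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to systemd private socket")
}

The point is only to show the back-off shape; as noted above, if dbus stays saturated the final attempt still fails and the container creation error remains.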
Hi Peter, this bug happened on an SNO bare-metal cluster. This is my deploy job: https://auto-jenkins-csb-kniqe.apps.ocp4.prod.psi.redhat.com/job/ocp-sno-virt-e2e/134/ You need a virtual server and have to fill it in the HOST parameter; you can rebuild my job if you don't have one. Note that the job always uses the latest nightly build, so it won't necessarily reproduce this issue on an older build.
I also hit the issue in bz https://bugzilla.redhat.com/show_bug.cgi?id=1965983 with the latest nightly build. FYI.
*** Bug 1965983 has been marked as a duplicate of this bug. ***
I am still looking for a way to get access to the setup so I can see if my changes help. For context on what's on my mind: it seems this issue is happening because dbus does not have the time to handle all of the active connections from crio and the kubelet. I suspect (but am not certain) that this is because the performance profile is not giving enough CPUs to reservedCPUs. I am also not certain my changes will mitigate this issue: even if cri-o retries the dbus connection, nothing is fixed if dbus doesn't have the time to handle these requests. So I'd like to have an installation I can fuss around with to test against.
Attached is another PR that reuses a single dbus connection rather than creating a new one each time we create a container. Hopefully this helps as well.
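For context, a minimal Go sketch of that idea (illustrative only; the type and method names are made up here and this is not the CRI-O change itself): cache one connection to the systemd private socket behind a lock and re-dial only after it has been dropped.

package crioclient

import (
	"net"
	"sync"
)

// systemdClient keeps one long-lived connection to systemd's private socket
// instead of dialing it for every container that is created.
type systemdClient struct {
	mu     sync.Mutex
	conn   net.Conn
	socket string // e.g. /run/systemd/private
}

// connection returns the cached connection, dialing only on first use or
// after reset() has dropped a broken connection.
func (c *systemdClient) connection() (net.Conn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.conn != nil {
		return c.conn, nil
	}
	conn, err := net.Dial("unix", c.socket)
	if err != nil {
		return nil, err
	}
	c.conn = conn
	return c.conn, nil
}

// reset closes and forgets the cached connection so the next caller re-dials,
// for example after systemd closes its side.
func (c *systemdClient) reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.conn != nil {
		c.conn.Close()
		c.conn = nil
	}
}

With something like this, container creations share one connection instead of each opening a fresh one, which is what appears to be overwhelming dbus here.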
Min, can you try this reproducer with more reserved CPUs? I am wondering if that is a way to mitigate this problem.
Attached is the 1.21 variant of the fix, which is merged.
Tested on 4.8.0-0.nightly-2021-06-14-145150: the master MCP rolled out successfully and the SNO node became Ready. However, when I create a pod it takes about 8 minutes to start the container, which is abnormal. It seems the node is busy with some system load and can't respond to customer workloads.

Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       22m   default-scheduler  Successfully assigned default/hello-pod-1 to sno-0-0
  Normal  AddedInterface  15m   multus             Add eth0 [10.128.0.70/23] from openshift-sdn
  Normal  Pulling         15m   kubelet            Pulling image "docker.io/ocpqe/hello-pod:latest"
  Normal  Pulled          15m   kubelet            Successfully pulled image "docker.io/ocpqe/hello-pod:latest" in 20.058205479s
  Normal  Created         14m   kubelet            Created container hello-pod
  Normal  Started         14m   kubelet            Started container hello-pod

I also saw a warning on pods that stayed Pending for a long period (though they became Running after a while), for example:

# oc get pod cluster-baremetal-operator-8674588c96-dzbpv -o yaml -n openshift-machine-api
apiVersion: v1
kind: Pod
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    k8s.v1.cni.cncf.io/network-status: |-
      [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.128.0.59" ], "default": true, "dns": {} }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.128.0.59" ], "default": true, "dns": {} }]
    openshift.io/scc: anyuid
    workload.openshift.io/warning: the node "sno-0-0" does not have resource "management.workload.openshift.io/cores"   // warning
I also saw some errors at pod creation time:

Jun 15 09:23:47 sno-0-0 hyperkube[2314]: E0615 09:23:47.481030 2314 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode30809a2_63ee_48d2_a782_b076bbd66fc0.slice/crio-e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc.scope: Error finding container e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc: Status 404 returned error &{%!s(*http.body=&{0xc0074fd9b0 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77aa80) %!s(func() error=0x77aa00)}
Jun 15 09:23:47 sno-0-0 hyperkube[2314]: E0615 09:23:47.871493 2314 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod880ad41e_85d1_42b3_88cb_4016cb531521.slice/crio-0553fc81a326a452c24f6db062fb6089ec62eac15639233f0a92b406a7753822.scope: Error finding container 0553fc81a326a452c24f6db062fb6089ec62eac15639233f0a92b406a7753822: Status 404 returned error &{%!s(*http.body=&{0xc007e024e0 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77aa80) %!s(func() error=0x77aa00)}
Jun 15 09:23:48 sno-0-0 hyperkube[2314]: W0615 09:23:48.319432 2314 manager.go:696] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode30809a2_63ee_48d2_a782_b076bbd66fc0.slice/crio-e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc.scope because of race condition
Jun 15 09:23:56 sno-0-0 hyperkube[2314]: E0615 09:23:56.718594 2314 cpu_manager.go:435] "ReconcileState: failed to update container" err="rpc error: code = Unknown desc = updating resources for container \"6e113d9bf33273c205bd6d78bb79ceedc2f9e1f5f65a2b378ff70b3bdd8499e9\" failed: time=\"2021-06-15T09:23:56Z\" level=error msg=\"container not running\"\n (exit status 1)" pod="openshift-ingress/router-default-6fd885f48c-cn4ws" containerName="router" containerID="6e113d9bf33273c205bd6d78bb79ceedc2f9e1f5f65a2b378ff70b3bdd8499e9" cpuSet="0-31"

I will upload the must-gather.
Created attachment 1791219 [details] must-gather
Ah, I see why this was reopened: this looks like https://bugzilla.redhat.com/show_bug.cgi?id=1965983, which was closed as a dup of this. That's my mistake. I think we should leave this one closed, assuming it's verified, since we did fix the one problem, and reopen the other one. Also, I see you've mentioned the pod does start running eventually
... I see I posted an incomplete sentence: Also, I see you've mentioned the pod does start running eventually, so I am not sure the newly reopened one will be a blocker
Verified with 4.8.0-0.nightly-2021-06-14-145150.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438