Bug 1962074 - SNO: pods get stuck in CreateContainerError and prompt "failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable" after adding a PerformanceProfile
Summary: SNO: pods get stuck in CreateContainerError and prompt "failed to add conmo...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-19 09:22 UTC by MinLi
Modified: 2021-07-27 23:09 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:09:09 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather (8.85 MB, application/gzip)
2021-06-15 10:52 UTC, MinLi


Links
Github cri-o/cri-o pull 4921 (closed): try again on EAGAIN from dbus. Last updated 2021-06-02 14:34:52 UTC
Github cri-o/cri-o pull 4974 (closed): cgmgr: reuse dbus connection. Last updated 2021-06-15 13:33:18 UTC
Github cri-o/cri-o pull 4986 (closed): [1.21] cgmgr: reuse dbus connection. Last updated 2021-06-09 13:24:28 UTC
Red Hat Product Errata RHSA-2021:2438. Last updated 2021-07-27 23:09:28 UTC

Description MinLi 2021-05-19 09:22:15 UTC
Description of problem:
Many pods get stuck in CreateContainerError status with the message "failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable" after adding a PerformanceProfile.

cat performance_profile.yaml:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: perf-example
spec:
  cpu:
    isolated: "16-29"
    reserved: "0-15,30,31"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 10
      size: 1G
      node: 0
  # for 3 node converged master/worker and SNO clusters we use the masters as a selector
  nodeSelector:
    node-role.kubernetes.io/master: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  numa:
    topologyPolicy: "restricted"
  realTimeKernel:
    # For CU should be false
    enabled: true


Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-05-18-205323

How reproducible:


Steps to Reproduce:
1. Install an SNO cluster on bare metal and deploy the Performance Addon Operator (PAO).

2. Check the PAO operator status:
# oc get all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/performance-operator-d74df7b97-8sjmk   1/1     Running   0          4m45s

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/performance-operator-service   ClusterIP   172.30.151.212   <none>        443/TCP   4m45s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/performance-operator   1/1     1            1           4m45s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/performance-operator-d74df7b97   1         1         1       4m45s

3. Create the PerformanceProfile above:
# oc create -f performance_profile.yaml

4. Check the PerformanceProfile status:
# oc get performanceprofile perf-example -o yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2021-05-19T07:19:44Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: perf-example
  resourceVersion: "85720"
  uid: 83b89352-da35-4593-9184-d553e933d7da
spec:
  cpu:
    isolated: 16-29
    reserved: 0-15,30,31
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 10
      node: 0
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
status:
  conditions:
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "True"
    type: Available
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "True"
    type: Upgradeable
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "False"
    type: Degraded
  runtimeClass: performance-perf-example
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-perf-example

5. Wait 20 minutes or more, then check the MCP and the node.


Actual results:
5. The master MCP stays in UPDATING status indefinitely even though the SNO node is Ready.
The network cluster operator is stuck in PROGRESSING status, and many pods get stuck in CreateContainerError.
Meanwhile, any newly created pod stays Pending, including debug pods and the must-gather pod, so I can't debug the node to check the kubelet/CRI-O logs or collect must-gather logs.

Expected results:


Additional info:
# oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-eda807b096850318a467710061d40fae   False     True       False      1              0                   0                     0                      5h44m
worker   rendered-worker-d89f4b2965a80d86c8aa31cd50817b95   True      False      False      0              0                   0                     0                      5h44m

# oc get node -o wide 
NAME      STATUS   ROLES           AGE     VERSION                INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
sno-0-0   Ready    master,worker   4h33m   v1.21.0-rc.0+9d99e1c   192.168.123.132   <none>        Red Hat Enterprise Linux CoreOS 48.84.202105180118-0 (Ootpa)   4.18.0-293.rt7.59.el8.x86_64   cri-o://1.21.0-93.rhaos4.8.git8f8bcd9.el8

# oc describe node sno-0-0 
Name:               sno-0-0
Roles:              master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=sno-0-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
...
CreationTimestamp:  Wed, 19 May 2021 06:18:27 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  sno-0-0
  AcquireTime:     <unset>
  RenewTime:       Wed, 19 May 2021 10:51:05 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:25:00 +0300   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.123.132
  Hostname:    sno-0-0

Capacity:
  cpu:                32
  ephemeral-storage:  137876460Ki
  hugepages-1Gi:      10Gi
  hugepages-2Mi:      0
  memory:             32916072Ki
  pods:               250
Allocatable:
  cpu:                14
  ephemeral-storage:  127066945326
  hugepages-1Gi:      10Gi
  hugepages-2Mi:      0
  memory:             21303912Ki
  pods:               250

Comment 1 MinLi 2021-05-19 09:32:10 UTC
# oc get pod -A | grep -v Running | grep -v Completed
NAMESPACE                                          NAME                                                              READY   STATUS                       RESTARTS   AGE
openshift-apiserver-operator                       openshift-apiserver-operator-c4b66d4b8-k6v77                      0/1     CreateContainerError         4          5h9m
openshift-apiserver                                apiserver-59d6966f55-ln9pk                                        0/2     Init:CreateContainerError    0          4h52m
openshift-authentication-operator                  authentication-operator-79458867d7-j5cnt                          0/1     CreateContainerError         4          5h9m
openshift-authentication                           oauth-openshift-6497b479c7-hvfkt                                  0/1     CreateContainerError         0          4h47m
openshift-cloud-credential-operator                cloud-credential-operator-54b95bb96c-7vjfd                        0/2     CreateContainerError         0          5h9m
openshift-cluster-machine-approver                 machine-approver-797899769d-48c6l                                 1/2     CreateContainerError         1          5h9m
openshift-cluster-node-tuning-operator             cluster-node-tuning-operator-968d8f695-cl5k9                      0/1     CreateContainerError         0          5h4m
openshift-cluster-samples-operator                 cluster-samples-operator-7c4f8d65f9-r4nnd                         0/2     CreateContainerError         0          5h4m
openshift-cluster-storage-operator                 cluster-storage-operator-9bb96976c-nrkz8                          0/1     CreateContainerError         3          5h4m
openshift-cluster-storage-operator                 csi-snapshot-controller-65b6b6f94d-b4z5w                          0/1     CreateContainerError         5          5h4m
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-dd84677f4-vwnjp                  0/1     CreateContainerError         3          5h4m
openshift-cluster-storage-operator                 csi-snapshot-webhook-d5958544f-kjr55                              0/1     CreateContainerError         0          5h4m
openshift-config-operator                          openshift-config-operator-6ddfccb5b7-84l9n                        0/1     CreateContainerError         6          5h9m
openshift-console-operator                         console-operator-656959b66-5q2gw                                  0/1     CreateContainerError         3          4h59m
openshift-console                                  console-6c5947c444-f6xvh                                          0/1     CreateContainerError         0          4h52m
openshift-console                                  downloads-894b6fd6d-472dj                                         0/1     CreateContainerError         0          4h59m
openshift-controller-manager-operator              openshift-controller-manager-operator-77f98c55f5-5dl69            0/1     CreateContainerError         4          5h9m
openshift-controller-manager                       controller-manager-pr27x                                          0/1     CreateContainerError         1          4h51m
openshift-dns-operator                             dns-operator-6c5b489f4b-gpnzz                                     0/2     CreateContainerError         0          5h9m
openshift-dns                                      dns-default-lr7kt                                                 0/2     CreateContainerError         0          5h5m
openshift-etcd-operator                            etcd-operator-6bcd8b5669-5f875                                    0/1     CreateContainerError         4          5h9m
openshift-image-registry                           cluster-image-registry-operator-845bd756b6-p26f5                  0/1     CreateContainerError         3          5h4m
openshift-image-registry                           image-registry-f796bb59c-kzrjn                                    0/1     CreateContainerError         0          4h52m
openshift-ingress-canary                           ingress-canary-r6hnc                                              0/1     CreateContainerError         0          4h59m
openshift-ingress-operator                         ingress-operator-78b8fdb7cf-mrbh2                                 0/2     CreateContainerError         7          5h4m
openshift-ingress                                  router-default-77c7f6699c-jnbpt                                   0/1     CreateContainerConfigError   1          4h59m
openshift-insights                                 insights-operator-6b46f5bd76-8gnlt                                0/1     CreateContainerError         0          5h4m
openshift-kube-apiserver-operator                  kube-apiserver-operator-75df466f75-7wkwj                          0/1     CreateContainerError         3          5h4m
openshift-kube-controller-manager-operator         kube-controller-manager-operator-55bf67d689-x9rgb                 0/1     CreateContainerError         4          5h9m
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-84b6488c49-kdq29                0/1     CreateContainerError         4          5h9m
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-6b565f5845-6zkr9           0/1     CreateContainerError         4          5h9m
openshift-kube-storage-version-migrator            migrator-b5574d49c-8j2ql                                          0/1     CreateContainerError         0          5h6m
openshift-machine-api                              cluster-autoscaler-operator-5f4b4f8cdb-x7nr7                      0/2     CreateContainerError         0          5h4m
openshift-machine-api                              cluster-baremetal-operator-5c94899f6c-lcmnh                       0/2     CreateContainerError         0          5h4m
openshift-machine-api                              machine-api-operator-7849998dd5-lpq7j                             0/2     CreateContainerError         3          5h4m
openshift-machine-config-operator                  machine-config-controller-84974d8779-5bgq8                        0/1     CreateContainerError         3          5h4m
openshift-machine-config-operator                  machine-config-operator-6f4d57f75f-66b9l                          0/1     CreateContainerError         3          5h4m
openshift-marketplace                              certified-operators-chcjq                                         0/1     CreateContainerError         0          5h4m
openshift-marketplace                              community-operators-8s75q                                         0/1     CreateContainerError         0          130m
openshift-marketplace                              marketplace-operator-6cb74c86fd-cmbms                             0/1     CreateContainerError         0          5h9m
openshift-marketplace                              redhat-marketplace-sdxmp                                          0/1     CreateContainerError         0          5h4m
openshift-marketplace                              redhat-operators-w4z5r                                            0/1     ContainerCreating            0          66m
openshift-marketplace                              redhat-operators-z98tp                                            0/1     CreateContainerError         0          4h6m
openshift-monitoring                               alertmanager-main-0                                               0/5     CreateContainerError         0          4h54m
openshift-monitoring                               cluster-monitoring-operator-6b87ccc7bb-xg4s5                      0/2     CreateContainerError         3          5h9m
openshift-monitoring                               grafana-86dd7559df-blc6s                                          0/2     CreateContainerError         0          4h54m
openshift-monitoring                               kube-state-metrics-69bf796889-bdf97                               0/3     CreateContainerError         0          5h6m
openshift-monitoring                               node-exporter-nlj4n                                               1/2     CreateContainerError         1          5h6m
openshift-monitoring                               openshift-state-metrics-77c86d55b9-5zsj4                          0/3     CreateContainerError         0          5h6m
openshift-monitoring                               prometheus-adapter-946bbf6c6-zztk6                                0/1     CreateContainerError         0          5h
openshift-monitoring                               prometheus-k8s-0                                                  0/7     CreateContainerError         1          4h54m
openshift-monitoring                               prometheus-operator-59b4957975-q2d2b                              0/2     CreateContainerError         0          4h55m
openshift-monitoring                               telemeter-client-64d4467c98-hqb67                                 0/3     CreateContainerError         0          5h6m
openshift-monitoring                               thanos-querier-5dd4b66587-jmbz8                                   0/5     CreateContainerError         0          4h54m
openshift-multus                                   multus-7drwt                                                      0/1     Init:CreateContainerError    3          5h8m
openshift-multus                                   multus-admission-controller-wbhkg                                 0/2     CreateContainerError         0          5h7m
openshift-multus                                   network-metrics-daemon-55xnx                                      0/2     CreateContainerError         0          5h8m
openshift-network-diagnostics                      network-check-source-7d77bd595b-n8s2d                             0/1     CreateContainerError         0          5h8m
openshift-network-diagnostics                      network-check-target-9pxnt                                        0/1     CreateContainerError         0          5h8m
openshift-oauth-apiserver                          apiserver-66dd9ff6d-wjsph                                         0/1     Init:CreateContainerError    0          5h6m
openshift-operator-lifecycle-manager               catalog-operator-5c8dc876fc-v9wpk                                 0/1     CreateContainerError         0          5h4m
openshift-operator-lifecycle-manager               olm-operator-6487f89f75-njq75                                     0/1     CreateContainerError         0          5h9m
openshift-operator-lifecycle-manager               packageserver-5f4bbd9748-8jg8v                                    0/1     CreateContainerError         0          5h4m
openshift-operator-lifecycle-manager               packageserver-5f4bbd9748-bxzfg                                    0/1     CreateContainerError         0          5h4m
openshift-performance-addon-operator               performance-operator-d74df7b97-8sjmk                              0/1     CreateContainerError         0          116m
openshift-service-ca-operator                      service-ca-operator-7f78466ccb-lj26n                              0/1     CreateContainerError         4          5h9m
openshift-service-ca                               service-ca-7fb77576f-4wfvp                                        0/1     CreateContainerError         3          5h6m

# oc get pod performance-operator-d74df7b97-8sjmk -o yaml -n   openshift-performance-addon-operator 
apiVersion: v1
kind: Pod
metadata:
  annotations:
    alm-examples: |-
      [
        {
          "apiVersion": "performance.openshift.io/v1",
          "kind": "PerformanceProfile",
          "metadata": {
            "name": "example-performanceprofile"
          },
          "spec": {
            "additionalKernelArgs": [
              "nmi_watchdog=0",
              "audit=0",
              "mce=off",
              "processor.max_cstate=1",
              "idle=poll",
              "intel_idle.max_cstate=0"
            ],
            "cpu": {
              "isolated": "2-3",
              "reserved": "0-1"
            },
            "hugepages": {
              "defaultHugepagesSize": "1G",
              "pages": [
                {
                  "count": 2,
                  "node": 0,
                  "size": "1G"
                }
              ]
            },
            "nodeSelector": {
              "node-role.kubernetes.io/performance": ""
            },
            "realTimeKernel": {
              "enabled": true
            }
          }
        },
        {
          "apiVersion": "performance.openshift.io/v2",
          "kind": "PerformanceProfile",
          "metadata": {
            "name": "example-performanceprofile"
          },
          "spec": {
            "additionalKernelArgs": [
              "nmi_watchdog=0",
              "audit=0",
              "mce=off",
              "processor.max_cstate=1",
              "idle=poll",
              "intel_idle.max_cstate=0"
            ],
            "cpu": {
              "isolated": "2-3",
              "reserved": "0-1"
            },
            "hugepages": {
              "defaultHugepagesSize": "1G",
              "pages": [
                {
                  "count": 2,
                  "node": 0,
                  "size": "1G"
                }
              ]
            },
            "nodeSelector": {
              "node-role.kubernetes.io/performance": ""
            },
            "realTimeKernel": {
              "enabled": true
            }
          }
        }
      ]
    capabilities: Basic Install
    categories: OpenShift Optional
    certified: "false"
    containerImage: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
    description: Operator to optimize OpenShift clusters for applications sensitive
      to CPU and network latency.
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.128.0.4"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.128.0.4"
          ],
          "default": true,
          "dns": {}
      }]
    olm.operatorGroup: openshift-performance-addon-operator
    olm.operatorNamespace: openshift-performance-addon-operator
    olm.skipRange: '>=4.6.0 <4.7.3'
    olm.targetNamespaces: ""
    olmcahash: 2d4f1f5ab3354c79ac434d17a0ffbf7deb9cd3b3757349d18946d76a3a90f233
    operatorframework.io/properties: '{"properties":[{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v1"}},{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v1alpha1"}},{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v2"}},{"type":"olm.package","value":{"packageName":"performance-addon-operator","version":"4.7.3"}}]}'
    operators.operatorframework.io/builder: operator-sdk-v1.0.0
    operators.operatorframework.io/project_layout: go.kubebuilder.io/v2
    repository: https://github.com/openshift-kni/performance-addon-operators
    support: Red Hat
  creationTimestamp: "2021-05-19T06:31:36Z"
  generateName: performance-operator-d74df7b97-
  labels:
    name: performance-operator
    pod-template-hash: d74df7b97
  name: performance-operator-d74df7b97-8sjmk
  namespace: openshift-performance-addon-operator
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: performance-operator-d74df7b97
    uid: cc929f92-9292-4412-b09b-977de33ae1c1
  resourceVersion: "92163"
  uid: 4e864ea2-72dc-4ca2-abe9-5f7094c032bf
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: Exists
  containers:
  - command:
    - performance-operator
    env:
    - name: WATCH_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations['olm.targetNamespaces']
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: OPERATOR_NAME
      value: performance-operator
    - name: OPERATOR_CONDITION_NAME
      value: performance-addon-operator.v4.7.3
    image: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
    imagePullPolicy: Always
    name: performance-operator
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /apiserver.local.config/certificates
      name: apiservice-cert
    - mountPath: /tmp/k8s-webhook-server/serving-certs
      name: webhook-cert
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: performance-operator-token-ggl2t
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: performance-operator-dockercfg-2lgmm
  nodeName: sno-0-0
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: performance-operator
  serviceAccountName: performance-operator
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: apiservice-cert
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: apiserver.crt
      - key: tls.key
        path: apiserver.key
      secretName: performance-operator-service-cert
  - name: webhook-cert
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: performance-operator-service-cert
  - name: performance-operator-token-ggl2t
    secret:
      defaultMode: 420
      secretName: performance-operator-token-ggl2t
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T06:31:36Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T07:26:38Z"
    message: 'containers with unready status: [performance-operator]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T07:26:38Z"
    message: 'containers with unready status: [performance-operator]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-05-19T06:31:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
    imageID: ""
    lastState: {}
    name: performance-operator
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private:
          connect: resource temporarily unavailable'
        reason: CreateContainerError
  hostIP: 192.168.123.132
  phase: Pending
  podIP: 10.128.0.4
  podIPs:
  - ip: 10.128.0.4
  qosClass: BestEffort
  startTime: "2021-05-19T06:31:36Z"

# oc get pod cluster-monitoring-operator-6b87ccc7bb-xg4s5 -o yaml -n openshift-monitoring
  containerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b71921d67098b1d618e6abc7d9343c9ef74045782fca4cc4c8122cc0654b9d94
    imageID: ""
    lastState: {}
    name: cluster-monitoring-operator
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private:
          connect: resource temporarily unavailable'
        reason: CreateContainerError
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4e9ead3ea46f1a71ad774ade46b8853224f0368056f4d5f8b6622927a9b71a8e
    imageID: ""
    lastState:
      terminated:
        containerID: cri-o://6d215c58cca264ee5a60347ccdc79ae2f16ff48392164fc74f7c809ae685833f
        exitCode: 255
        finishedAt: "2021-05-19T03:21:08Z"
        message: "I0519 03:21:08.619440       1 main.go:178] Valid token audiences:
          \nI0519 03:21:08.619570       1 main.go:271] Reading certificate files\nF0519
          03:21:08.619592       1 main.go:275] Failed to initialize certificate reloader:
          error loading certificates: error loading certificate: open /etc/tls/private/tls.crt:
          no such file or directory\ngoroutine 1 [running]:\nk8s.io/klog/v2.stacks(0xc000010001,
          0xc000700000, 0xc6, 0x1c8)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:996
          +0xb9\nk8s.io/klog/v2.(*loggingT).output(0x2292280, 0xc000000003, 0x0, 0x0,
          0xc0003d6690, 0x1bf960e, 0x7, 0x113, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:945
          +0x191\nk8s.io/klog/v2.(*loggingT).printf(0x2292280, 0x3, 0x0, 0x0, 0x17681a5,
          0x2d, 0xc000515c78, 0x1, 0x1)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733
          +0x17a\nk8s.io/klog/v2.Fatalf(...)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1463\nmain.main()\n\t/go/src/github.com/brancz/kube-rbac-proxy/main.go:275
          +0x1e18\n\ngoroutine 6 [chan receive]:\nk8s.io/klog/v2.(*loggingT).flushDaemon(0x2292280)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1131
          +0x8b\ncreated by k8s.io/klog/v2.init.0\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:416
          +0xd8\n"
        reason: Error
        startedAt: "2021-05-19T03:21:08Z"
    name: kube-rbac-proxy
    ready: false
    restartCount: 3
    started: false
    state:
      waiting:
        message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private:
          connect: resource temporarily unavailable'
        reason: CreateContainerError
  hostIP: 192.168.123.132
  phase: Pending
  podIP: 10.128.0.53
  podIPs:
  - ip: 10.128.0.53
  qosClass: Burstable
  startTime: "2021-05-19T03:20:38Z"

Comment 2 Peter Hunt 2021-05-19 15:11:55 UTC
Hm, we should retry that.

PR attached.

Comment 3 Peter Hunt 2021-05-19 15:13:19 UTC
Though it's possible dbus is just hosed and retrying won't actually help. I may make that PR back off exponentially, but there will still be container creation errors unless dbus can keep up.
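
To illustrate the retry approach, here is a minimal Go sketch of dialing the systemd private bus with exponential backoff on EAGAIN. This is not the actual cri-o patch: the connectWithRetry helper, the backoff values, and the cgmgr package name are assumptions; it uses the github.com/coreos/go-systemd/v22/dbus package.

package cgmgr

import (
	"errors"
	"fmt"
	"syscall"
	"time"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
)

// connectWithRetry dials the systemd private bus, backing off exponentially
// when the dial fails with EAGAIN ("resource temporarily unavailable")
// instead of failing container creation on the first attempt.
func connectWithRetry(maxRetries int) (*systemddbus.Conn, error) {
	backoff := 100 * time.Millisecond
	var err error
	for i := 0; i < maxRetries; i++ {
		var conn *systemddbus.Conn
		conn, err = systemddbus.New()
		if err == nil {
			return conn, nil
		}
		// Only EAGAIN is worth retrying; any other error is returned immediately.
		if !errors.Is(err, syscall.EAGAIN) {
			return nil, err
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return nil, fmt.Errorf("systemd dbus still unavailable after %d retries: %w", maxRetries, err)
}

Even with such a retry, as noted above, container creation will still fail if dbus stays saturated longer than the retry budget.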

Comment 5 MinLi 2021-06-03 04:17:17 UTC
Hi Peter,

This bug happened on an SNO bare-metal cluster.

This is my deploy job: https://auto-jenkins-csb-kniqe.apps.ocp4.prod.psi.redhat.com/job/ocp-sno-virt-e2e/134/

You need a virtual server and fill it in the HOST parameter. You can rebuild my job if you don't have one.
But the job always uses the latest nightly build, so it doesn't necessarily reproduce this issue on an older build.

Comment 6 MinLi 2021-06-03 04:18:34 UTC
I also hit the issue in bz https://bugzilla.redhat.com/show_bug.cgi?id=1965983 with the latest nightly build. FYI.

Comment 7 Peter Hunt 2021-06-03 16:01:58 UTC
*** Bug 1965983 has been marked as a duplicate of this bug. ***

Comment 8 Peter Hunt 2021-06-07 14:05:43 UTC
I am still looking for a way to get access to the setup so I can see whether my changes help.

For context on what's on my mind: it seems this issue is happening because dbus does not have time to handle all of the active connections from cri-o and the kubelet. I suspect this is because the performance profile is not giving enough CPUs to reservedCPUs (this is a suspicion). I am not certain my changes will be able to mitigate this issue: even if cri-o retries the dbus connection, nothing is fixed if dbus never gets time to service it. So I'd like an installation I can fuss around with to test against.

Comment 9 Peter Hunt 2021-06-07 17:00:23 UTC
Attached is another PR that reuses a single dbus connection, rather than creating a new one each time we create a container. Hopefully this helps as well.
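
As a rough Go sketch of that shape (assumed names, not the actual PR 4974 code), the cgroup manager can cache one connection and hand it out under a lock:

package cgmgr

import (
	"sync"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
)

// systemdManager sketches a cgroup manager that keeps a single dbus
// connection alive instead of dialing /run/systemd/private for every
// container creation.
type systemdManager struct {
	mu       sync.Mutex
	dbusConn *systemddbus.Conn
}

// conn returns the cached dbus connection, dialing it only on first use.
func (m *systemdManager) conn() (*systemddbus.Conn, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.dbusConn != nil {
		return m.dbusConn, nil
	}
	c, err := systemddbus.New()
	if err != nil {
		return nil, err
	}
	m.dbusConn = c
	return c, nil
}

A connection that dies would still need to be detected and redialed, so the retry logic sketched earlier stays relevant.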

Comment 10 Peter Hunt 2021-06-08 02:24:01 UTC
Min, can you try this reproducer with more reserved CPUs? I am wondering whether that would mitigate the problem.

Comment 11 Peter Hunt 2021-06-09 13:24:28 UTC
Attached is the 1.21 variant of the fix, which is merged.

Comment 14 MinLi 2021-06-15 10:39:56 UTC
Tested on 4.8.0-0.nightly-2021-06-14-145150: the master MCP rolled out successfully and the SNO node became Ready.
Yet when I create a pod, it takes 8 minutes to start the container, which is abnormal. It seems the node is busy with some system load and can't respond to customer workloads.

Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       22m   default-scheduler  Successfully assigned default/hello-pod-1 to sno-0-0
  Normal  AddedInterface  15m   multus             Add eth0 [10.128.0.70/23] from openshift-sdn
  Normal  Pulling         15m   kubelet            Pulling image "docker.io/ocpqe/hello-pod:latest"
  Normal  Pulled          15m   kubelet            Successfully pulled image "docker.io/ocpqe/hello-pod:latest" in 20.058205479s
  Normal  Created         14m   kubelet            Created container hello-pod
  Normal  Started         14m   kubelet            Started container hello-pod

I also saw a warning on pods that stayed Pending for a long time (though they became Running after a while), for example:
# oc get pod cluster-baremetal-operator-8674588c96-dzbpv -o yaml -n openshift-machine-api
apiVersion: v1
kind: Pod
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.128.0.59"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.128.0.59"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: anyuid
    workload.openshift.io/warning: the node "sno-0-0" does not have resource "management.workload.openshift.io/cores" // warning

Comment 15 MinLi 2021-06-15 10:45:47 UTC
I also saw some errors at pod creation time:
Jun 15 09:23:47 sno-0-0 hyperkube[2314]: E0615 09:23:47.481030    2314 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode30809a2_63ee_48d2_a782_b076bbd66fc0.slice/crio-e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc.scope: Error finding container e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc: Status 404 returned error &{%!s(*http.body=&{0xc0074fd9b0 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77aa80) %!s(func() error=0x77aa00)}
Jun 15 09:23:47 sno-0-0 hyperkube[2314]: E0615 09:23:47.871493    2314 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod880ad41e_85d1_42b3_88cb_4016cb531521.slice/crio-0553fc81a326a452c24f6db062fb6089ec62eac15639233f0a92b406a7753822.scope: Error finding container 0553fc81a326a452c24f6db062fb6089ec62eac15639233f0a92b406a7753822: Status 404 returned error &{%!s(*http.body=&{0xc007e024e0 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77aa80) %!s(func() error=0x77aa00)}

Jun 15 09:23:48 sno-0-0 hyperkube[2314]: W0615 09:23:48.319432    2314 manager.go:696] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode30809a2_63ee_48d2_a782_b076bbd66fc0.slice/crio-e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc.scope because of race condition

Jun 15 09:23:56 sno-0-0 hyperkube[2314]: E0615 09:23:56.718594    2314 cpu_manager.go:435] "ReconcileState: failed to update container" err="rpc error: code = Unknown desc = updating resources for container \"6e113d9bf33273c205bd6d78bb79ceedc2f9e1f5f65a2b378ff70b3bdd8499e9\" failed: time=\"2021-06-15T09:23:56Z\" level=error msg=\"container not running\"\n  (exit status 1)" pod="openshift-ingress/router-default-6fd885f48c-cn4ws" containerName="router" containerID="6e113d9bf33273c205bd6d78bb79ceedc2f9e1f5f65a2b378ff70b3bdd8499e9" cpuSet="0-31"

I will upload the must-gather.

Comment 16 MinLi 2021-06-15 10:52:47 UTC
Created attachment 1791219 [details]
must-gather

Comment 18 Peter Hunt 2021-06-15 14:02:12 UTC
Ah, I see why this was reopened: this looks like https://bugzilla.redhat.com/show_bug.cgi?id=1965983, which was closed as a dup of this. That's my mistake.

I think we should leave this one closed, assuming it's verified, since we did fix the one problem, and reopen the other one.

Also, I see you've mentioned the pod does start running eventually

Comment 21 Peter Hunt 2021-06-15 15:32:16 UTC
... I see I posted an incomplete sentence:

Also, I see you've mentioned the pod does start running eventually, so I am not sure the newly reopened one will be a blocker

Comment 23 MinLi 2021-06-16 03:02:51 UTC
Verified with 4.8.0-0.nightly-2021-06-14-145150.

Comment 25 errata-xmlrpc 2021-07-27 23:09:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

