Bug 1962074
| Summary: | SNO: pods get stuck in CreateContainerError with "failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable" after adding a PerformanceProfile |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Node |
| Node sub component: | CRI-O |
| Version: | 4.8 |
| Target Release: | 4.8.0 |
| Hardware: | x86_64 |
| OS: | Unspecified |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | medium |
| Type: | Bug |
| Reporter: | MinLi <minmli> |
| Assignee: | Peter Hunt <pehunt> |
| QA Contact: | Sunil Choudhary <schoudha> |
| CC: | aos-bugs, dblack, mkarg, murali, nagrawal, pehunt |
| Last Closed: | 2021-07-27 23:09:09 UTC |
# oc get pod -A | grep -v Running | grep -v Completed
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-apiserver-operator openshift-apiserver-operator-c4b66d4b8-k6v77 0/1 CreateContainerError 4 5h9m
openshift-apiserver apiserver-59d6966f55-ln9pk 0/2 Init:CreateContainerError 0 4h52m
openshift-authentication-operator authentication-operator-79458867d7-j5cnt 0/1 CreateContainerError 4 5h9m
openshift-authentication oauth-openshift-6497b479c7-hvfkt 0/1 CreateContainerError 0 4h47m
openshift-cloud-credential-operator cloud-credential-operator-54b95bb96c-7vjfd 0/2 CreateContainerError 0 5h9m
openshift-cluster-machine-approver machine-approver-797899769d-48c6l 1/2 CreateContainerError 1 5h9m
openshift-cluster-node-tuning-operator cluster-node-tuning-operator-968d8f695-cl5k9 0/1 CreateContainerError 0 5h4m
openshift-cluster-samples-operator cluster-samples-operator-7c4f8d65f9-r4nnd 0/2 CreateContainerError 0 5h4m
openshift-cluster-storage-operator cluster-storage-operator-9bb96976c-nrkz8 0/1 CreateContainerError 3 5h4m
openshift-cluster-storage-operator csi-snapshot-controller-65b6b6f94d-b4z5w 0/1 CreateContainerError 5 5h4m
openshift-cluster-storage-operator csi-snapshot-controller-operator-dd84677f4-vwnjp 0/1 CreateContainerError 3 5h4m
openshift-cluster-storage-operator csi-snapshot-webhook-d5958544f-kjr55 0/1 CreateContainerError 0 5h4m
openshift-config-operator openshift-config-operator-6ddfccb5b7-84l9n 0/1 CreateContainerError 6 5h9m
openshift-console-operator console-operator-656959b66-5q2gw 0/1 CreateContainerError 3 4h59m
openshift-console console-6c5947c444-f6xvh 0/1 CreateContainerError 0 4h52m
openshift-console downloads-894b6fd6d-472dj 0/1 CreateContainerError 0 4h59m
openshift-controller-manager-operator openshift-controller-manager-operator-77f98c55f5-5dl69 0/1 CreateContainerError 4 5h9m
openshift-controller-manager controller-manager-pr27x 0/1 CreateContainerError 1 4h51m
openshift-dns-operator dns-operator-6c5b489f4b-gpnzz 0/2 CreateContainerError 0 5h9m
openshift-dns dns-default-lr7kt 0/2 CreateContainerError 0 5h5m
openshift-etcd-operator etcd-operator-6bcd8b5669-5f875 0/1 CreateContainerError 4 5h9m
openshift-image-registry cluster-image-registry-operator-845bd756b6-p26f5 0/1 CreateContainerError 3 5h4m
openshift-image-registry image-registry-f796bb59c-kzrjn 0/1 CreateContainerError 0 4h52m
openshift-ingress-canary ingress-canary-r6hnc 0/1 CreateContainerError 0 4h59m
openshift-ingress-operator ingress-operator-78b8fdb7cf-mrbh2 0/2 CreateContainerError 7 5h4m
openshift-ingress router-default-77c7f6699c-jnbpt 0/1 CreateContainerConfigError 1 4h59m
openshift-insights insights-operator-6b46f5bd76-8gnlt 0/1 CreateContainerError 0 5h4m
openshift-kube-apiserver-operator kube-apiserver-operator-75df466f75-7wkwj 0/1 CreateContainerError 3 5h4m
openshift-kube-controller-manager-operator kube-controller-manager-operator-55bf67d689-x9rgb 0/1 CreateContainerError 4 5h9m
openshift-kube-scheduler-operator openshift-kube-scheduler-operator-84b6488c49-kdq29 0/1 CreateContainerError 4 5h9m
openshift-kube-storage-version-migrator-operator kube-storage-version-migrator-operator-6b565f5845-6zkr9 0/1 CreateContainerError 4 5h9m
openshift-kube-storage-version-migrator migrator-b5574d49c-8j2ql 0/1 CreateContainerError 0 5h6m
openshift-machine-api cluster-autoscaler-operator-5f4b4f8cdb-x7nr7 0/2 CreateContainerError 0 5h4m
openshift-machine-api cluster-baremetal-operator-5c94899f6c-lcmnh 0/2 CreateContainerError 0 5h4m
openshift-machine-api machine-api-operator-7849998dd5-lpq7j 0/2 CreateContainerError 3 5h4m
openshift-machine-config-operator machine-config-controller-84974d8779-5bgq8 0/1 CreateContainerError 3 5h4m
openshift-machine-config-operator machine-config-operator-6f4d57f75f-66b9l 0/1 CreateContainerError 3 5h4m
openshift-marketplace certified-operators-chcjq 0/1 CreateContainerError 0 5h4m
openshift-marketplace community-operators-8s75q 0/1 CreateContainerError 0 130m
openshift-marketplace marketplace-operator-6cb74c86fd-cmbms 0/1 CreateContainerError 0 5h9m
openshift-marketplace redhat-marketplace-sdxmp 0/1 CreateContainerError 0 5h4m
openshift-marketplace redhat-operators-w4z5r 0/1 ContainerCreating 0 66m
openshift-marketplace redhat-operators-z98tp 0/1 CreateContainerError 0 4h6m
openshift-monitoring alertmanager-main-0 0/5 CreateContainerError 0 4h54m
openshift-monitoring cluster-monitoring-operator-6b87ccc7bb-xg4s5 0/2 CreateContainerError 3 5h9m
openshift-monitoring grafana-86dd7559df-blc6s 0/2 CreateContainerError 0 4h54m
openshift-monitoring kube-state-metrics-69bf796889-bdf97 0/3 CreateContainerError 0 5h6m
openshift-monitoring node-exporter-nlj4n 1/2 CreateContainerError 1 5h6m
openshift-monitoring openshift-state-metrics-77c86d55b9-5zsj4 0/3 CreateContainerError 0 5h6m
openshift-monitoring prometheus-adapter-946bbf6c6-zztk6 0/1 CreateContainerError 0 5h
openshift-monitoring prometheus-k8s-0 0/7 CreateContainerError 1 4h54m
openshift-monitoring prometheus-operator-59b4957975-q2d2b 0/2 CreateContainerError 0 4h55m
openshift-monitoring telemeter-client-64d4467c98-hqb67 0/3 CreateContainerError 0 5h6m
openshift-monitoring thanos-querier-5dd4b66587-jmbz8 0/5 CreateContainerError 0 4h54m
openshift-multus multus-7drwt 0/1 Init:CreateContainerError 3 5h8m
openshift-multus multus-admission-controller-wbhkg 0/2 CreateContainerError 0 5h7m
openshift-multus network-metrics-daemon-55xnx 0/2 CreateContainerError 0 5h8m
openshift-network-diagnostics network-check-source-7d77bd595b-n8s2d 0/1 CreateContainerError 0 5h8m
openshift-network-diagnostics network-check-target-9pxnt 0/1 CreateContainerError 0 5h8m
openshift-oauth-apiserver apiserver-66dd9ff6d-wjsph 0/1 Init:CreateContainerError 0 5h6m
openshift-operator-lifecycle-manager catalog-operator-5c8dc876fc-v9wpk 0/1 CreateContainerError 0 5h4m
openshift-operator-lifecycle-manager olm-operator-6487f89f75-njq75 0/1 CreateContainerError 0 5h9m
openshift-operator-lifecycle-manager packageserver-5f4bbd9748-8jg8v 0/1 CreateContainerError 0 5h4m
openshift-operator-lifecycle-manager packageserver-5f4bbd9748-bxzfg 0/1 CreateContainerError 0 5h4m
openshift-performance-addon-operator performance-operator-d74df7b97-8sjmk 0/1 CreateContainerError 0 116m
openshift-service-ca-operator service-ca-operator-7f78466ccb-lj26n 0/1 CreateContainerError 4 5h9m
openshift-service-ca service-ca-7fb77576f-4wfvp 0/1 CreateContainerError 3 5h6m
# oc get pod performance-operator-d74df7b97-8sjmk -o yaml -n openshift-performance-addon-operator
apiVersion: v1
kind: Pod
metadata:
annotations:
alm-examples: |-
[
{
"apiVersion": "performance.openshift.io/v1",
"kind": "PerformanceProfile",
"metadata": {
"name": "example-performanceprofile"
},
"spec": {
"additionalKernelArgs": [
"nmi_watchdog=0",
"audit=0",
"mce=off",
"processor.max_cstate=1",
"idle=poll",
"intel_idle.max_cstate=0"
],
"cpu": {
"isolated": "2-3",
"reserved": "0-1"
},
"hugepages": {
"defaultHugepagesSize": "1G",
"pages": [
{
"count": 2,
"node": 0,
"size": "1G"
}
]
},
"nodeSelector": {
"node-role.kubernetes.io/performance": ""
},
"realTimeKernel": {
"enabled": true
}
}
},
{
"apiVersion": "performance.openshift.io/v2",
"kind": "PerformanceProfile",
"metadata": {
"name": "example-performanceprofile"
},
"spec": {
"additionalKernelArgs": [
"nmi_watchdog=0",
"audit=0",
"mce=off",
"processor.max_cstate=1",
"idle=poll",
"intel_idle.max_cstate=0"
],
"cpu": {
"isolated": "2-3",
"reserved": "0-1"
},
"hugepages": {
"defaultHugepagesSize": "1G",
"pages": [
{
"count": 2,
"node": 0,
"size": "1G"
}
]
},
"nodeSelector": {
"node-role.kubernetes.io/performance": ""
},
"realTimeKernel": {
"enabled": true
}
}
}
]
capabilities: Basic Install
categories: OpenShift Optional
certified: "false"
containerImage: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
description: Operator to optimize OpenShift clusters for applications sensitive
to CPU and network latency.
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "",
"interface": "eth0",
"ips": [
"10.128.0.4"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "",
"interface": "eth0",
"ips": [
"10.128.0.4"
],
"default": true,
"dns": {}
}]
olm.operatorGroup: openshift-performance-addon-operator
olm.operatorNamespace: openshift-performance-addon-operator
olm.skipRange: '>=4.6.0 <4.7.3'
olm.targetNamespaces: ""
olmcahash: 2d4f1f5ab3354c79ac434d17a0ffbf7deb9cd3b3757349d18946d76a3a90f233
operatorframework.io/properties: '{"properties":[{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v1"}},{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v1alpha1"}},{"type":"olm.gvk","value":{"group":"performance.openshift.io","kind":"PerformanceProfile","version":"v2"}},{"type":"olm.package","value":{"packageName":"performance-addon-operator","version":"4.7.3"}}]}'
operators.operatorframework.io/builder: operator-sdk-v1.0.0
operators.operatorframework.io/project_layout: go.kubebuilder.io/v2
repository: https://github.com/openshift-kni/performance-addon-operators
support: Red Hat
creationTimestamp: "2021-05-19T06:31:36Z"
generateName: performance-operator-d74df7b97-
labels:
name: performance-operator
pod-template-hash: d74df7b97
name: performance-operator-d74df7b97-8sjmk
namespace: openshift-performance-addon-operator
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: performance-operator-d74df7b97
uid: cc929f92-9292-4412-b09b-977de33ae1c1
resourceVersion: "92163"
uid: 4e864ea2-72dc-4ca2-abe9-5f7094c032bf
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-role.kubernetes.io/master
operator: Exists
containers:
- command:
- performance-operator
env:
- name: WATCH_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.annotations['olm.targetNamespaces']
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: OPERATOR_NAME
value: performance-operator
- name: OPERATOR_CONDITION_NAME
value: performance-addon-operator.v4.7.3
image: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
imagePullPolicy: Always
name: performance-operator
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /apiserver.local.config/certificates
name: apiservice-cert
- mountPath: /tmp/k8s-webhook-server/serving-certs
name: webhook-cert
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: performance-operator-token-ggl2t
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
imagePullSecrets:
- name: performance-operator-dockercfg-2lgmm
nodeName: sno-0-0
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: performance-operator
serviceAccountName: performance-operator
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: apiservice-cert
secret:
defaultMode: 420
items:
- key: tls.crt
path: apiserver.crt
- key: tls.key
path: apiserver.key
secretName: performance-operator-service-cert
- name: webhook-cert
secret:
defaultMode: 420
items:
- key: tls.crt
path: tls.crt
- key: tls.key
path: tls.key
secretName: performance-operator-service-cert
- name: performance-operator-token-ggl2t
secret:
defaultMode: 420
secretName: performance-operator-token-ggl2t
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-05-19T06:31:36Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-05-19T07:26:38Z"
message: 'containers with unready status: [performance-operator]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-05-19T07:26:38Z"
message: 'containers with unready status: [performance-operator]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-05-19T06:31:36Z"
status: "True"
type: PodScheduled
containerStatuses:
- image: registry.redhat.io/openshift4/performance-addon-rhel8-operator@sha256:08867d1ddc1bafc56a56c0a5f211c3000ba12c936772931b0728cc35409ddd94
imageID: ""
lastState: {}
name: performance-operator
ready: false
restartCount: 0
started: false
state:
waiting:
message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private:
connect: resource temporarily unavailable'
reason: CreateContainerError
hostIP: 192.168.123.132
phase: Pending
podIP: 10.128.0.4
podIPs:
- ip: 10.128.0.4
qosClass: BestEffort
startTime: "2021-05-19T06:31:36Z"
# oc get pod cluster-monitoring-operator-6b87ccc7bb-xg4s5 -o yaml -n openshift-monitoring
containerStatuses:
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b71921d67098b1d618e6abc7d9343c9ef74045782fca4cc4c8122cc0654b9d94
imageID: ""
lastState: {}
name: cluster-monitoring-operator
ready: false
restartCount: 0
started: false
state:
waiting:
message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private:
connect: resource temporarily unavailable'
reason: CreateContainerError
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4e9ead3ea46f1a71ad774ade46b8853224f0368056f4d5f8b6622927a9b71a8e
imageID: ""
lastState:
terminated:
containerID: cri-o://6d215c58cca264ee5a60347ccdc79ae2f16ff48392164fc74f7c809ae685833f
exitCode: 255
finishedAt: "2021-05-19T03:21:08Z"
message: "I0519 03:21:08.619440 1 main.go:178] Valid token audiences:
\nI0519 03:21:08.619570 1 main.go:271] Reading certificate files\nF0519
03:21:08.619592 1 main.go:275] Failed to initialize certificate reloader:
error loading certificates: error loading certificate: open /etc/tls/private/tls.crt:
no such file or directory\ngoroutine 1 [running]:\nk8s.io/klog/v2.stacks(0xc000010001,
0xc000700000, 0xc6, 0x1c8)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:996
+0xb9\nk8s.io/klog/v2.(*loggingT).output(0x2292280, 0xc000000003, 0x0, 0x0,
0xc0003d6690, 0x1bf960e, 0x7, 0x113, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:945
+0x191\nk8s.io/klog/v2.(*loggingT).printf(0x2292280, 0x3, 0x0, 0x0, 0x17681a5,
0x2d, 0xc000515c78, 0x1, 0x1)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733
+0x17a\nk8s.io/klog/v2.Fatalf(...)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1463\nmain.main()\n\t/go/src/github.com/brancz/kube-rbac-proxy/main.go:275
+0x1e18\n\ngoroutine 6 [chan receive]:\nk8s.io/klog/v2.(*loggingT).flushDaemon(0x2292280)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1131
+0x8b\ncreated by k8s.io/klog/v2.init.0\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:416
+0xd8\n"
reason: Error
startedAt: "2021-05-19T03:21:08Z"
name: kube-rbac-proxy
ready: false
restartCount: 3
started: false
state:
waiting:
message: 'failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private:
connect: resource temporarily unavailable'
reason: CreateContainerError
hostIP: 192.168.123.132
phase: Pending
podIP: 10.128.0.53
podIPs:
- ip: 10.128.0.53
qosClass: Burstable
startTime: "2021-05-19T03:20:38Z"
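For context on what the error message means: to create a container, CRI-O asks systemd over D-Bus to place the conmon monitor process into a transient scope, and that request travels over systemd's private socket at /run/systemd/private. The sketch below is a minimal illustration, not CRI-O's actual code: it uses github.com/coreos/go-systemd/v22, and the scopeManager type and AddPidToScope helper are names invented here. It shows the shape of the two mitigations discussed in the comments that follow, reusing one cached connection instead of dialing per container, and backing off exponentially when systemd is too busy to answer.

```go
// Minimal sketch under the assumptions above; not CRI-O's implementation.
package conmonscope

import (
	"fmt"
	"sync"
	"time"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

type scopeManager struct {
	mu   sync.Mutex
	conn *systemdDbus.Conn
}

// connection dials systemd's private socket (/run/systemd/private) once and
// caches the result. The failure mode in this bug involved opening a fresh
// connection for every container, each dial competing for systemd's attention.
// (A real implementation would also reset a broken cached connection.)
func (m *scopeManager) connection() (*systemdDbus.Conn, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.conn == nil {
		conn, err := systemdDbus.New() // connect(2) to /run/systemd/private
		if err != nil {
			return nil, err
		}
		m.conn = conn
	}
	return m.conn, nil
}

// AddPidToScope asks systemd to move pid into a transient scope under slice,
// retrying with exponential backoff when systemd/dbus is too busy to accept
// the request (the EAGAIN, "resource temporarily unavailable", case).
func (m *scopeManager) AddPidToScope(pid uint32, slice, scope string) error {
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		conn, err := m.connection()
		if err == nil {
			props := []systemdDbus.Property{
				systemdDbus.PropSlice(slice),
				systemdDbus.PropPids(pid),
			}
			done := make(chan string, 1)
			if _, err = conn.StartTransientUnit(scope, "replace", props, done); err == nil {
				<-done // wait for systemd to report the job result
				return nil
			}
		}
		lastErr = err
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("failed to add conmon to systemd sandbox cgroup: %w", lastErr)
}
```

Note the caveat raised in the comments below: even with a shared connection and retries, container creation can still fail if dbus cannot keep up with the request rate.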
Hm, we should retry that. PR attached, though it's possible dbus is just hosed and retrying won't actually help. I may make that PR back off exponentially, but there will still be container creation errors unless dbus can keep up.

Hi, Peter. This bug happened in an SNO bare-metal cluster. This is my deploy job: https://auto-jenkins-csb-kniqe.apps.ocp4.prod.psi.redhat.com/job/ocp-sno-virt-e2e/134/ You need a virtual server and fill it in the HOST parameter. You can rebuild my job if you don't have one. But the job always uses the latest nightly build, so it doesn't necessarily reproduce this issue on an older build. And I met the issue in bz https://bugzilla.redhat.com/show_bug.cgi?id=1965983 with the latest nightly build. FYI.

*** Bug 1965983 has been marked as a duplicate of this bug. ***

I am still looking for a way to get access to the setup so I can see whether my changes help. For context on what's on my mind: it seems this issue is happening because dbus does not have the time to handle all of the active connections from cri-o and the kubelet. I suspect this is because the performance profile is not giving enough CPUs to reservedCPUs (this is a suspicion). I am not certain my changes will be able to mitigate this issue: even if cri-o retries the dbus connection, if dbus doesn't have the time to handle these things, nothing will be fixed. So I'd like to have an installation I can fuss around with to be able to test against.

Attached is another PR that reuses a single dbus connection, rather than creating a new one each time we create a container. Hopefully this helps as well.

Min, can you try this reproducer with more reserved CPUs? I am wondering if that is a way to mitigate this problem.

Attached is the 1.21 variant of the fix, which is merged.

Tested on 4.8.0-0.nightly-2021-06-14-145150: the mcp master rolled out successfully and the SNO node became Ready.
Yet when I create a pod, it takes 8 minutes to start the container, which is abnormal. It seems the node is busy with some system load, and can't respond to customer workload.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22m default-scheduler Successfully assigned default/hello-pod-1 to sno-0-0
Normal AddedInterface 15m multus Add eth0 [10.128.0.70/23] from openshift-sdn
Normal Pulling 15m kubelet Pulling image "docker.io/ocpqe/hello-pod:latest"
Normal Pulled 15m kubelet Successfully pulled image "docker.io/ocpqe/hello-pod:latest" in 20.058205479s
Normal Created 14m kubelet Created container hello-pod
Normal Started 14m kubelet Started container hello-pod
Also I saw some warnings on pods that stayed Pending for a long period (though they became Running after a great while), for example:
# oc get pod cluster-baremetal-operator-8674588c96-dzbpv -o yaml -n openshift-machine-api
apiVersion: v1
kind: Pod
metadata:
annotations:
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
k8s.v1.cni.cncf.io/network-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.128.0.59"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.128.0.59"
],
"default": true,
"dns": {}
}]
openshift.io/scc: anyuid
workload.openshift.io/warning: the node "sno-0-0" does not have resource "management.workload.openshift.io/cores" // warning
And I also saw some errors at pod creation time:
Jun 15 09:23:47 sno-0-0 hyperkube[2314]: E0615 09:23:47.481030 2314 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode30809a2_63ee_48d2_a782_b076bbd66fc0.slice/crio-e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc.scope: Error finding container e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc: Status 404 returned error &{%!s(*http.body=&{0xc0074fd9b0 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77aa80) %!s(func() error=0x77aa00)}
Jun 15 09:23:47 sno-0-0 hyperkube[2314]: E0615 09:23:47.871493 2314 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod880ad41e_85d1_42b3_88cb_4016cb531521.slice/crio-0553fc81a326a452c24f6db062fb6089ec62eac15639233f0a92b406a7753822.scope: Error finding container 0553fc81a326a452c24f6db062fb6089ec62eac15639233f0a92b406a7753822: Status 404 returned error &{%!s(*http.body=&{0xc007e024e0 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77aa80) %!s(func() error=0x77aa00)}
Jun 15 09:23:48 sno-0-0 hyperkube[2314]: W0615 09:23:48.319432 2314 manager.go:696] Error getting data for container /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode30809a2_63ee_48d2_a782_b076bbd66fc0.slice/crio-e78157c009fada3a162896463028515d8b38463d3adc001351ee11a3643288cc.scope because of race condition
Jun 15 09:23:56 sno-0-0 hyperkube[2314]: E0615 09:23:56.718594 2314 cpu_manager.go:435] "ReconcileState: failed to update container" err="rpc error: code = Unknown desc = updating resources for container \"6e113d9bf33273c205bd6d78bb79ceedc2f9e1f5f65a2b378ff70b3bdd8499e9\" failed: time=\"2021-06-15T09:23:56Z\" level=error msg=\"container not running\"\n (exit status 1)" pod="openshift-ingress/router-default-6fd885f48c-cn4ws" containerName="router" containerID="6e113d9bf33273c205bd6d78bb79ceedc2f9e1f5f65a2b378ff70b3bdd8499e9" cpuSet="0-31"
I will upload the must-gather
Created attachment 1791219 [details]
must-gather
Ah, I see why this was reopened: this looks like https://bugzilla.redhat.com/show_bug.cgi?id=1965983, which was closed as a dup of this. That's my mistake. I think we should leave this one closed, assuming it's verified, as we did fix the one problem, and reopen the other one. Also, I see you've mentioned the pod does start running eventually ...

I see I posted an incomplete sentence: also, I see you've mentioned the pod does start running eventually, so I am not sure the newly reopened one will be a blocker.

Verified with 4.8.0-0.nightly-2021-06-14-145150.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
Description of problem: many pods get stuck in CreateContainerError status with "failed to add conmon to systemd sandbox cgroup: dial unix /run/systemd/private: connect: resource temporarily unavailable" after adding a PerformanceProfile.

cat performance_profile.yaml:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: perf-example
spec:
  cpu:
    isolated: "16-29"
    reserved: "0-15,30,31"
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: 10
        size: 1G
        node: 0
  # for 3-node converged master/worker and SNO clusters we use the masters as a selector
  nodeSelector:
    node-role.kubernetes.io/master: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  numa:
    topologyPolicy: "restricted"
  realTimeKernel:
    # For CU should be false
    enabled: true

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-05-18-205323

How reproducible:

Steps to Reproduce:
1. Install an SNO cluster on BM, deploy the POA operator.
2. Check the POA operator status:

# oc get all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/performance-operator-d74df7b97-8sjmk   1/1     Running   0          4m45s

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/performance-operator-service   ClusterIP   172.30.151.212   <none>        443/TCP   4m45s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/performance-operator   1/1     1            1           4m45s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/performance-operator-d74df7b97   1         1         1       4m45s

3. Create the above performanceprofile:

# oc create -f performance_profile.yaml

4. Check the performanceprofile status:

# oc get performanceprofile perf-example -o yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  creationTimestamp: "2021-05-19T07:19:44Z"
  finalizers:
  - foreground-deletion
  generation: 1
  name: perf-example
  resourceVersion: "85720"
  uid: 83b89352-da35-4593-9184-d553e933d7da
spec:
  cpu:
    isolated: 16-29
    reserved: 0-15,30,31
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 10
      node: 0
      size: 1G
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: true
status:
  conditions:
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "True"
    type: Available
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "True"
    type: Upgradeable
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "False"
    type: Progressing
  - lastHeartbeatTime: "2021-05-19T07:19:45Z"
    lastTransitionTime: "2021-05-19T07:19:45Z"
    status: "False"
    type: Degraded
  runtimeClass: performance-perf-example
  tuned: openshift-cluster-node-tuning-operator/openshift-node-performance-perf-example

5. Wait 20 minutes or more, then check the mcp and node.

Actual results:
5. The mcp master kept UPDATING status the whole time, even though the SNO node is Ready. co network is in PROGRESSING status. Many pods get stuck in CreateContainerError. Meanwhile, any newly created pod stays Pending, such as the debug pod and the must-gather pod, so I can't debug the node to check kubelet/crio logs or collect must-gather logs.
Expected results:

Additional info:

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-eda807b096850318a467710061d40fae   False     True       False      1              0                   0                     0                      5h44m
worker   rendered-worker-d89f4b2965a80d86c8aa31cd50817b95   True      False      False      0              0                   0                     0                      5h44m

# oc get node -o wide
NAME      STATUS   ROLES           AGE     VERSION                INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
sno-0-0   Ready    master,worker   4h33m   v1.21.0-rc.0+9d99e1c   192.168.123.132   <none>        Red Hat Enterprise Linux CoreOS 48.84.202105180118-0 (Ootpa)   4.18.0-293.rt7.59.el8.x86_64   cri-o://1.21.0-93.rhaos4.8.git8f8bcd9.el8

# oc describe node sno-0-0
Name:               sno-0-0
Roles:              master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=sno-0-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
...
CreationTimestamp:  Wed, 19 May 2021 06:18:27 +0300
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  sno-0-0
  AcquireTime:     <unset>
  RenewTime:       Wed, 19 May 2021 10:51:05 +0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:18:27 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 19 May 2021 10:46:55 +0300   Wed, 19 May 2021 06:25:00 +0300   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.123.132
  Hostname:    sno-0-0
Capacity:
  cpu:                32
  ephemeral-storage:  137876460Ki
  hugepages-1Gi:      10Gi
  hugepages-2Mi:      0
  memory:             32916072Ki
  pods:               250
Allocatable:
  cpu:                14
  ephemeral-storage:  127066945326
  hugepages-1Gi:      10Gi
  hugepages-2Mi:      0
  memory:             21303912Ki
  pods:               250