Bug 1666198
| Summary: | [CRIO] crioctl inconsistent number of pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Franck Grosjean <fgrosjea> |
| Component: | Node | Assignee: | Mrunal Patel <mpatel> |
| Status: | CLOSED ERRATA | QA Contact: | weiwei jiang <wjiang> |
| Severity: | urgent | Priority: | urgent |
| Version: | 3.9.0 | Target Release: | 3.9.z |
| Hardware: | All | OS: | Linux |
| Fixed In Version: | cri-o-1.9.16-2.git858756d.el7 | Type: | Bug |
| Last Closed: | 2019-07-05 06:58:57 UTC | Bug Blocks: | 1186913 |
| CC: | aos-bugs, bleanhar, dornelas, jokerman, lsm5, mmccomas, mpatel, nagrawal, sakulkar, schoudha, smoro, umohnani, vlaad, wjiang | | |
Description
Franck Grosjean
2019-01-15 07:57:42 UTC
Hi, I am working on a fix for the grpc message size issue. I think that may be causing cascading issues. Thanks!

Hi, I see in the comments for that PR some mention of not backporting this fix to cri-o 1.9. Is that really wise? What is the rationale? CRI-O has been sold as GA since OCP 3.9, and right now we have several production clusters suffering from this issue. Regards.

Samuel, I ported this to cri-o 1.9.15.

Checked with:

```
[root@qe-wjiang39-crio-merrn-1 ~]# crictl version
Version:           0.1.0
RuntimeName:       cri-o
RuntimeVersion:    1.9.15
RuntimeApiVersion: v1alpha1
[root@qe-wjiang39-crio-merrn-1 ~]# oc version
oc v3.9.73
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-wjiang39-crio-merrn-1:8443
openshift v3.9.73
kubernetes v1.9.1+a0ce1bc657
[root@qe-wjiang39-crio-merrn-1 ~]# oc get nodes -o wide
NAME                       STATUS    ROLES            AGE   VERSION             EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION               CONTAINER-RUNTIME
qe-wjiang39-crio-merrn-1   Ready     compute,master   47m   v1.9.1+a0ce1bc657   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.10.1.el7.x86_64   cri-o://1.9.15
```

And still got this inconsistent result. CRI-O tracks 1663 pod sandboxes, while only ten containers are running and the API server lists thirteen pods on the node:

```
[root@qe-wjiang39-crio-merrn-1 ~]# crictl pods | wc -l
1663
[root@qe-wjiang39-crio-merrn-1 ~]# crictl ps
CONTAINER ID    IMAGE                                                                                                                                                 CREATED       STATE               NAME                     ATTEMPT
2e97bdf6a147e   docker-registry.default.svc:5000/install-test/nodejs-mongodb-example@sha256:73146816e4d2e02162ac45a4b4329d9317cbea0c99674b082671b1a9c708b833          2 hours ago   CONTAINER_RUNNING   nodejs-mongodb-example   0
ece2ae5a966be   brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhscl/mongodb-32-rhel7@sha256:4d5660778ccce1233145830928e289b77d9681ad9ec96b0e279747b7dbc5e5c5   3 hours ago   CONTAINER_RUNNING   mongodb                  0
e1807c726636d   registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker@sha256:6d9d08631c558be88807113e4f7d6f2b5083fa72f472c2f4a6e871486f677c43    3 hours ago   CONTAINER_RUNNING   c                        0
10f36cdc1f6f2   registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:7b7507fc3896aff506adf54e0b068debcb88b1af8efa0620f9b383d561a48478            3 hours ago   CONTAINER_RUNNING   controller-manager       0
b428b1532af8c   registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:7b7507fc3896aff506adf54e0b068debcb88b1af8efa0620f9b383d561a48478            3 hours ago   CONTAINER_RUNNING   apiserver                0
cae04244189d2   registry.reg-aws.openshift.com:443/openshift3/ose-web-console@sha256:3d65332942ad073a6c7f5a24761bcbd479651df20147cd7153331fda6cd42246                 3 hours ago   CONTAINER_RUNNING   webconsole               0
e5933071d128a   registry.reg-aws.openshift.com:443/openshift3/registry-console@sha256:af0f6f86d0e0ac546222c86666aa0de7d78c727b2aaf4847d8f99d9cf5fb24b8                3 hours ago   CONTAINER_RUNNING   registry-console         0
b63f87edfb2d0   registry.reg-aws.openshift.com:443/openshift3/ose@sha256:6ea849633fb1358ff726f57c41f89767c49632f5c60a6779db9f8dc51a5d1b86                             3 hours ago   CONTAINER_RUNNING   dockergc                 0
740bf027c8da6   registry.reg-aws.openshift.com:443/openshift3/ose-docker-registry@sha256:e1b2e8b15bd110ac0a86f0a906b32611d9430c303d01d1c2e5b83d2db674f316             3 hours ago   CONTAINER_RUNNING   registry                 0
691d111ebbcff   registry.reg-aws.openshift.com:443/openshift3/ose-haproxy-router@sha256:b91f686500222eeb9d244a561a608077c6ee513a2502ca915ae338dedbe37517              3 hours ago   CONTAINER_RUNNING   router                   0
[root@qe-wjiang39-crio-merrn-1 ~]# oc adm manage-node `hostname` --list-pods
Listing matched pods on node: qe-wjiang39-crio-merrn-1
NAMESPACE                           NAME                             READY   STATUS      RESTARTS   AGE
default                             docker-registry-1-fcrrj          1/1     Running     0          2h
default                             dockergc-dqjh6                   1/1     Running     0          2h
default                             registry-console-1-nwm8z         1/1     Running     0          2h
default                             router-1-4kpn7                   1/1     Running     0          2h
install-test                        mongodb-1-hmmhw                  1/1     Running     0          2h
install-test                        nodejs-mongodb-example-1-build   0/1     Completed   0          2h
install-test                        nodejs-mongodb-example-1-jbs6p   1/1     Running     0          2h
kube-service-catalog                apiserver-n9wpj                  1/1     Running     0          2h
kube-service-catalog                controller-manager-tz4xh         1/1     Running     0          2h
openshift-ansible-service-broker    asb-1-deploy                     0/1     Error       0          2h
openshift-ansible-service-broker    asb-etcd-1-deploy                0/1     Error       0          2h
openshift-template-service-broker   apiserver-pvmlq                  1/1     Running     0          2h
openshift-web-console               webconsole-5449c78b44-hz24z      1/1     Running     0          2h
[root@qe-wjiang39-crio-merrn-1 ~]# crictl logs $(crictl ps --label io.kubernetes.container.name=dockergc --quiet) 2>&1 | tail -n 10
I0319 04:48:50.143501       1 dockergc.go:150] gathering disk usage data
I0319 04:48:50.148970       1 dockergc.go:159] usage is under high threshold (15180MB < 36276MB)
I0319 04:49:50.149305       1 dockergc.go:150] gathering disk usage data
I0319 04:49:50.154950       1 dockergc.go:159] usage is under high threshold (15180MB < 36276MB)
I0319 04:50:50.155239       1 dockergc.go:150] gathering disk usage data
I0319 04:50:50.157362       1 dockergc.go:159] usage is under high threshold (15183MB < 36276MB)
I0319 04:51:50.157533       1 dockergc.go:150] gathering disk usage data
I0319 04:51:50.159449       1 dockergc.go:159] usage is under high threshold (15170MB < 36276MB)
I0319 04:52:50.159778       1 dockergc.go:150] gathering disk usage data
I0319 04:52:50.161582       1 dockergc.go:159] usage is under high threshold (15180MB < 36276MB)
```

So how to deal with this?

FYI, the customer tells me they have been discussing this case over here: https://access.redhat.com/support/cases/#/case/02335236. They haven't tried upgrading to cri-o 1.9.15 yet, and I probably won't be able to convince them to go further right now, considering #22. AFAIU, their large amounts of rogue pods could be somewhat related to OpenShift users having a few CronJobs. Which I can confirm on my own OKD 3.11 / crio 1.11.7-1: I have several pods related to my LDAP group sync job, all of which show as Completed or no longer appear from oc's point of view, yet are still listed by crictl pods:

```
POD ID          CREATED             STATE      NAME                                NAMESPACE    ATTEMPT
f648111d1084a   5 minutes ago       NotReady   sync-ldap-groups-1552986000-z44z5   utgb-infra   0
d5f41239b8364   About an hour ago   NotReady   sync-ldap-groups-1552982400-4qt8x   utgb-infra   0
b90619921f8f8   2 hours ago         NotReady   sync-ldap-groups-1552978800-hbrfn   utgb-infra   0
ccfe2a5ba8a09   3 hours ago         NotReady   sync-ldap-groups-1552975200-6qjbn   utgb-infra   0
d5e1870f066a4   4 hours ago         NotReady   sync-ldap-groups-1552971600-6x4qf   utgb-infra   0
```

Then again, CronJob is just one way to reproduce the issue. I can find inconsistent pods such as the following as well:

```
POD ID          CREATED        STATE      NAME                               NAMESPACE               ATTEMPT
0774b4ef6a9a4   6 hours ago    NotReady   logging-curator-1552966200-hzlc6   openshift-logging       0
b4423ddd6c52f   14 hours ago   NotReady   console-5677c7c58d-cnwvd           openshift-console       0
11510d0c64944   14 hours ago   NotReady   webconsole-7df4f9f689-cfth6        openshift-web-console   0
eadd65a873e8b   14 hours ago   NotReady   node-exporter-6wkq7                openshift-monitoring    4
1da299d964fce   14 hours ago   NotReady   console-5677c7c58d-k9kz2           openshift-console       4
1d722435acd74   14 hours ago   NotReady   logging-fluentd-9sm7w              openshift-logging       1
```

As far as I can see, the STATE column properly identifies rogue pods. Considering I have no more than 85 pods across 11 nodes (roughly 10x below the pods-per-node volume seen with Orange), I'm not convinced this has anything to do with a grpc buffer overflow. Still, any theory would be welcome at this point.
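Before chasing logs, the mismatch can be quantified per node. A minimal sketch, not from the bug's comments, assuming crictl and oc both run locally on the node and that the header-line counts match the outputs shown above (adjust the tail offsets if your client versions print different headers):

```bash
#!/usr/bin/env bash
# Compare the sandboxes CRI-O still tracks with the pods the API
# server lists for this node. The tail offsets skip the header lines
# seen in the outputs above and are an assumption for other versions.

runtime_total=$(crictl pods | tail -n +2 | wc -l)      # skip column header
runtime_stale=$(crictl pods | grep -ci 'notready')     # matches NotReady and SANDBOX_NOTREADY
api_total=$(oc adm manage-node "$(hostname)" --list-pods | tail -n +3 | wc -l)

echo "CRI-O sandboxes: ${runtime_total} (${runtime_stale} stale)"
echo "API server pods: ${api_total}"

# Print the stale sandboxes (ID and pod name) for closer inspection.
crictl pods | awk 'toupper($0) ~ /NOTREADY/ {print $1, $(NF-2)}'
```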
There are three things to check here:

1. What are the kubelet GC settings for containers: max-pods, maximum-dead-containers, maximum-dead-containers-per-container?
2. Can we enable cri-o debug logging and see whether any remove-pod-sandbox calls arrive from the kubelet for the pods piling up in NOT_READY status? If so, is any failure being logged?
3. Is the kubelet issuing remove-pod-sandbox calls to CRI-O for these pods? If so, are any failures being logged?

My suspicion is that either the kubelet isn't sending remove calls to CRI-O or CRI-O is failing to remove. Meanwhile, I am backporting another patch which could also be at play here.

1. Kubelet settings, as per the customer's openshift_node_groups definition:

```
- key: kubeletArguments.max-pods
  value: ['40']
[...]
- key: kubeletArguments.maximum-dead-containers
  value: ['20']
- key: kubeletArguments.maximum-dead-containers-per-container
  value: ['2']
```

2. I could enable debug logging on my own setup, though probably not on the customer's clusters. Setting log_level = "debug" in /etc/crio/crio.conf? Or do you need me to do something else?

3. I should point out that in the last couple of days I've been trying out a new crio package (see https://bugzilla.redhat.com/show_bug.cgi?id=1690259), which is already spamming my logs (x20 over my average /var/log/messages size). Anything you want me to grep? I've got a shitload of errors that are probably unrelated to this specific case.

We just merged https://github.com/kubernetes-sigs/cri-o/pull/2142 with some fixes and will release a new version of CRI-O with it.

Regarding crio debug logging: today I have the following pod:

```
# crictl pods
[...]
b1c11205ffd96   2 hours ago   NotReady   sync-ldap-groups-1554206400-2zrxj   utgb-infra   0
[...]
```

Grepping logs matching that ID, I find the following:

```
Apr 2 14:00:06 master1 dockerd-current: time="2019-04-02T14:00:06.014449108+02:00" level=error msg="Handler for GET /containers/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739/json returned error: No such container: b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739"
Apr 2 14:00:06 master1 systemd: Started libcontainer container b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739.
Apr 2 14:00:06 master1 crio: time="2019-04-02 14:00:06.085805036+02:00" level=debug msg="Received container pid: 84280"
Apr 2 14:00:06 master1 kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Apr 2 14:00:06 master1 kernel: IPv6: ADDRCONF(NETDEV_UP): vetheda14c4f: link is not ready
Apr 2 14:00:06 master1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vetheda14c4f: link becomes ready
Apr 2 14:00:06 master1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Apr 2 14:00:06 master1 kernel: device vetheda14c4f entered promiscuous mode
Apr 2 14:00:06 master1 kernel: type=1700 audit(1554206406.285:7568): dev=vetheda14c4f prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
Apr 2 14:00:06 master1 kernel: type=1300 audit(1554206406.285:7568): arch=c000003e syscall=46 success=yes exit=64 a0=31 a1=7ffe18da98a0 a2=0 a3=40 items=0 ppid=9434 pid=9435 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ovs-vswitchd" exe="/usr/sbin/ovs-vswitchd" subj=system_u:system_r:spc_t:s0 key=(null)
Apr 2 14:00:06 master1 kernel: type=1327 audit(1554206406.285:7568): proctitle=6F76732D767377697463686400756E69783A2F7661722F72756E2F6F70656E767377697463682F64622E736F636B002D76636F6E736F6C653A656D6572002D767379736C6F673A657272002D7666696C653A696E666F002D2D6D6C6F636B616C6C002D2D6E6F2D6368646972002D2D6C6F672D66696C653D2F7661722F6C6F67
Apr 2 14:00:06 master1 crio: time="2019-04-02 14:00:06.203241853+02:00" level=info msg="Got pod network {Name:sync-ldap-groups-1554206400-2zrxj Namespace:utgb-infra ID:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739 NetNS:/proc/84280/ns/net PortMappings:[]}"
Apr 2 14:00:06 master1 crio: time="2019-04-02 14:00:06.203342486+02:00" level=info msg="About to add CNI network cni-loopback (type=loopback)"
Apr 2 14:00:06 master1 crio: time="2019-04-02 14:00:06.214424393+02:00" level=info msg="Got pod network {Name:sync-ldap-groups-1554206400-2zrxj Namespace:utgb-infra ID:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739 NetNS:/proc/84280/ns/net PortMappings:[]}"
[...]
Apr 2 14:00:07 master1 crio: time="2019-04-02 14:00:07.365754646+02:00" level=debug msg="RunPodSandboxResponse: &RunPodSandboxResponse{PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,}"
Apr 2 14:00:07 master1 crio: time="2019-04-02 14:00:07.366745976+02:00" level=debug msg="PodSandboxStatusRequest &PodSandboxStatusRequest{PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Verbose:false,}"
Apr 2 14:00:07 master1 crio: time="2019-04-02 14:00:07.366904278+02:00" level=debug msg="PodSandboxStatusResponse: &PodSandboxStatusResponse{Status:&PodSandboxStatus{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,},State:SANDBOX_READY,CreatedAt:1554206406080244558,Network:&PodSandboxNetworkStatus{Ip:10.129.1.22,},Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,},},},Labels:map[string]string{controller-uid: da4d6e97-553e-11e9-b3a1-52540016f6b8,io.kubernetes.container.name: POD,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj,io.kubernetes.pod.namespace: utgb-infra,io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,job-name: sync-ldap-groups-1554206400,},Annotations:map[string]string{kubernetes.io/config.seen: 2019-04-02T14:00:05.301532436+02:00,kubernetes.io/config.source: api,openshift.io/scc: restricted,},},Info:map[string]string{},}"
Apr 2 14:00:07 master1 crio: time="2019-04-02 14:00:07.368479200+02:00" level=debug msg="ImageStatusRequest: &ImageStatusRequest{Image:&ImageSpec{Image:docker.io/openshift/jenkins-slave-base-centos7:v3.11,},Verbose:false,}"
[...]
Apr 2 14:00:07 master1 crio: time="2019-04-02 14:00:07.474804782+02:00" level=debug msg="CreateContainerRequest &CreateContainerRequest{PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Config:&ContainerConfig{Metadata:&ContainerMetadata{Name:sync-ldap-groups,Attempt:0,},Image:&ImageSpec{Image:38f3ae11b29463cf07734ca48a056c9794daa488003dee30c64c61fab384f452,},Command:[/bin/bash -c oc adm groups sync --sync-config=/etc/config/ldap-group-sync.yaml --confirm || : [.....]
Apr 2 14:00:07 master1 crio: time="2019-04-02 14:00:07.485898024+02:00" level=debug msg="pod container state &{State:{Version:1.0.0 ID:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739 Status:running Pid:84280 Bundle:/run/containers/storage/overlay-containers/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739/userdata Annotations:map[io.kubernetes.cri-o.KubeName:sync-ldap-groups-1554206400-2zrxj io.kubernetes.cri-o.NamespaceOptions:{"pid":1} io.kubernetes.cri-o.PrivilegedRuntime:false io.kubernetes.cri-o.ShmPath:/var/run/containers/storage/overlay-containers/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739/userdata/shm io.kubernetes.cri-o.ContainerID:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739 io.kubernetes.cri-o.ContainerName:k8s_POD_sync-ldap-groups-1554206400-2zrxj_utgb-infra_da63293d-553e-11e9-b3a1-52540016f6b8_0 io.kubernetes.cri-o.Created:2019-04-02T14:00:05.782991996+02:00 io.kubernetes.cri-o.HostnamePath:/var/run/containers/storage/overlay-containers/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739/userdata/hostname io.kubernetes.cri-o.Metadata:{"name":"sync-ldap-groups-1554206400-2zrxj","uid":"da63293d-553e-11e9-b3a1-52540016f6b8","namespace":"utgb-infra"} controller-uid:da4d6e97-553e-11e9-b3a1-52540016f6b8 io.kubernetes.container.name:POD io.kubernetes.cri-o.PortMappings:null io.kubernetes.cri-o.ResolvPath:/var/run/containers/storage/overlay-containers/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739/userdata/resolv.conf kubernetes.io/config.source:api io.kubernetes.cri-o.ContainerType:sandbox io.kubernetes.cri-o.Labels:{"controller-uid":"da4d6e97-553e-11e9-b3a1-52540016f6b8","io.kubernetes.container.name":"POD","io.kubernetes.pod.name":"sync-ldap-groups-1554206400-2zrxj","io.kubernetes.pod.namespace":"utgb-infra","io.kubernetes.pod.uid":"da63293d-553e-11e9-b3a1-52540016f6b8","job-name":"sync-ldap-groups-1554206400"} io.kubernetes.pod.name:sync-ldap-groups-1554206400-2zrxj io.kubernetes.pod.uid:da63293d-553e-11e9-b3a1-52540016f6b8 io.kubernetes.cri-o.HostNetwork:false io.kubernetes.cri-o.LogPath:/var/log/pods/da63293d-553e-11e9-b3a1-52540016f6b8/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739.log io.kubernetes.cri-o.MountPoint:/var/lib/containers/storage/overlay2/445d60551167ed6831f9611c83571be738049266634a70f1b6a404ea845597ef/merged kubernetes.io/config.seen:2019-04-02T14:00:05.301532436+02:00 io.kubernetes.cri-o.HostName:sync-ldap-groups-1554206400-2zrxj io.kubernetes.cri-o.Name:k8s_sync-ldap-groups-1554206400-2zrxj_utgb-infra_da63293d-553e-11e9-b3a1-52540016f6b8_0 io.kubernetes.cri-o.Namespace:utgb-infra io.kubernetes.cri-o.SandboxID:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739 io.kubernetes.cri-o.SeccompProfilePath: io.kubernetes.cri-o.TrustedSandbox:true job-name:sync-ldap-groups-1554206400 io.kubernetes.cri-o.Annotations:{"kubernetes.io/config.seen":"2019-04-02T14:00:05.301532436+02:00","kubernetes.io/config.source":"api","openshift.io/scc":"restricted"} io.kubernetes.cri-o.CgroupParent:kubepods-besteffort-podda63293d_553e_11e9_b3a1_52540016f6b8.slice io.kubernetes.pod.namespace:utgb-infra openshift.io/scc:restricted]} Created:2019-04-02 12:00:06.080244558 +0000 UTC Started:2019-04-02 14:00:06.115322034 +0200 CEST m=+74219.543908190 Finished:0001-01-01 00:00:00 +0000 UTC ExitCode:0 OOMKilled:false Error:}"
Apr 2 14:00:07 master1 crio: time="2019-04-02 14:00:07.610339786+02:00" level=debug msg="ListPodSandboxResponse &ListPodSandboxResponse{Items:[&PodSandbox{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,},State:SANDBOX_READY,CreatedAt:1554206406080244558,Labels:map[string]string{controller-uid: da4d6e97-553e-11e9-b3a1-52540016f6b8,io.kubernetes.container.name: POD,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj [....]
Apr 2 14:00:08 master1 crio: time="2019-04-02 14:00:08.057265025+02:00" level=debug msg="ListContainersResponse: &ListContainersResponse{Containers:[&Container{Id:85630c75b695557a44cca9e0f379f6af72e3355d0ef50017db9867430a65cd74,PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&ContainerMetadata{Name:sync-ldap-groups,Attempt:0,},Image:&ImageSpec{Image:38f3ae11b29463cf07734ca48a056c9794daa488003dee30c64c61fab384f452,},ImageRef:docker.io/openshift/jenkins-slave-base-centos7@sha256:93d974db4cc92a72d63767f8fe7d28d145bd124493b449e82177476a2fa783c9,State:CONTAINER_RUNNING,CreatedAt:1554206407937015579,Labels:map[string]string{io.kubernetes.container.name: sync-ldap-groups,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj, [...]
Apr 2 14:00:08 master1 crio: time="2019-04-02 14:00:08.077226094+02:00" level=debug msg="ListPodSandboxResponse &ListPodSandboxResponse{Items:[&PodSandbox{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,},State:SANDBOX_READY,CreatedAt:1554206406080244558,Labels:map[string]string{controller-uid: da4d6e97-553e-11e9-b3a1-52540016f6b8,io.kubernetes.container.name: POD,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj,io.kubernetes.pod.namespace: utgb-infra,io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,job-name: sync-ldap-groups-1554206400,},Annotations:map[string]string{kubernetes.io/config.seen: 2019-04-02T14:00:05.301532436+02:00,kubernetes.io/config.source: api,openshift.io/scc: restricted,},}],}"
Apr 2 14:00:08 master1 crio: time="2019-04-02 14:00:08.077979786+02:00" level=debug msg="PodSandboxStatusRequest &PodSandboxStatusRequest{PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Verbose:false,}"
Apr 2 14:00:08 master1 crio: time="2019-04-02 14:00:08.078091575+02:00" level=debug msg="PodSandboxStatusResponse: &PodSandboxStatusResponse{Status:&PodSandboxStatus{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,},State:SANDBOX_READY,CreatedAt:1554206406080244558,Network:&PodSandboxNetworkStatus{Ip:10.129.1.22,},Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,},},},Labels:map[string]string{controller-uid: da4d6e97-553e-11e9-b3a1-52540016f6b8,io.kubernetes.container.name: POD,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj,io.kubernetes.pod.namespace: utgb-infra,io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,job-name: sync-ldap-groups-1554206400,},Annotations:map[string]string{kubernetes.io/config.seen: 2019-04-02T14:00:05.301532436+02:00,kubernetes.io/config.source: api,openshift.io/scc: restricted,},},Info:map[string]string{},}"
Apr 2 14:00:08 master1 crio: time="2019-04-02 14:00:08.078635162+02:00" level=debug msg="ListContainersRequest &ListContainersRequest{Filter:&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,},},}"
Apr 2 14:00:08 master1 crio: time="2019-04-02 14:00:08.078675202+02:00" level=debug msg="no filters were applied, returning full container list"
Apr 2 14:00:08 master1 crio: time="2019-04-02 14:00:08.078776284+02:00" level=debug msg="ListContainersResponse: &ListContainersResponse{Containers:[&Container{Id:85630c75b695557a44cca9e0f379f6af72e3355d0ef50017db9867430a65cd74,PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&ContainerMetadata{Name:sync-ldap-groups,Attempt:0,},Image:&ImageSpec{Image:38f3ae11b29463cf07734ca48a056c9794daa488003dee30c64c61fab384f452,},ImageRef:docker.io/openshift/jenkins-slave-base-centos7@sha256:93d974db4cc92a72d63767f8fe7d28d145bd124493b449e82177476a2fa783c9,State:CONTAINER_RUNNING,CreatedAt:1554206407937015579,Labels:map[string]string{io.kubernetes.container.name: sync-ldap-groups,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj,io.kubernetes.pod.namespace: utgb-infra,io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,},Annotations:map[string]string{io.kubernetes.container.hash: cd18d43f,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 30,},}],}"
[...]
Apr 2 14:00:08 master1 origin-node: I0402 14:00:08.083679 9780 kubelet.go:1865] SyncLoop (PLEG): "sync-ldap-groups-1554206400-2zrxj_utgb-infra(da63293d-553e-11e9-b3a1-52540016f6b8)", event: &pleg.PodLifecycleEvent{ID:"da63293d-553e-11e9-b3a1-52540016f6b8", Type:"ContainerStarted", Data:"b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739"}
[...]
Apr 2 14:00:09 master1 crio: time="2019-04-02 14:00:09.101349968+02:00" level=debug msg="ListPodSandboxResponse &ListPodSandboxResponse{Items:[&PodSandbox{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,} [...]
[...]
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.229781423+02:00" level=info msg="Got pod network {Name:sync-ldap-groups-1554206400-2zrxj Namespace:utgb-infra ID:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739 NetNS:/proc/84280/ns/net PortMappings:[]}"
[...]
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.419417902+02:00" level=debug msg="event: "/var/run/crio/exits/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739.ZKTOZZ": RENAME"
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.419507866+02:00" level=debug msg="event: "/var/run/crio/exits/b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739": CREATE"
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.419542244+02:00" level=debug msg="container or sandbox exited: b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739"
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.419575049+02:00" level=debug msg="sandbox exited and found: b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739"
Apr 2 14:00:11 master1 origin-node: I0402 14:00:11.418929 9780 reconciler.go:301] Volume detached for volume "ldap-ca-cert" (UniqueName: "kubernetes.io/configmap/da63293d-553e-11e9-b3a1-52540016f6b8-ldap-ca-cert") on node "master1.friends.intra.example.com" DevicePath ""
Apr 2 14:00:11 master1 origin-node: I0402 14:00:11.418977 9780 reconciler.go:301] Volume detached for volume "ldap-sync-volume" (UniqueName: "kubernetes.io/configmap/da63293d-553e-11e9-b3a1-52540016f6b8-ldap-sync-volume") on node "master1.friends.intra.example.com" DevicePath ""
Apr 2 14:00:11 master1 origin-node: I0402 14:00:11.419009 9780 reconciler.go:301] Volume detached for volume "ldap-group-syncer-token-sn67w" (UniqueName: "kubernetes.io/secret/da63293d-553e-11e9-b3a1-52540016f6b8-ldap-group-syncer-token-sn67w") on node "master1.friends.intra.example.com" DevicePath ""
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.588246944+02:00" level=debug msg="ListPodSandboxRequest &ListPodSandboxRequest{Filter:&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},},}"
[...]
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.618018786+02:00" level=debug msg="unmounted container "b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739""
Apr 2 14:00:11 master1 crio: time="2019-04-02 14:00:11.618053734+02:00" level=debug msg="StopPodSandboxResponse b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739: &StopPodSandboxResponse{}"
Apr 2 14:00:11 master1 systemd: Removed slice libcontainer container kubepods-besteffort-podda63293d_553e_11e9_b3a1_52540016f6b8.slice.
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.239509223+02:00" level=debug msg="ListPodSandboxRequest &ListPodSandboxRequest{Filter:nil,}"
[...]
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.240581861+02:00" level=debug msg="ListPodSandboxResponse &ListPodSandboxResponse{Items:[&PodSandbox{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,} [...]
[...]
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.243374065+02:00" level=debug msg="ListContainersResponse: &ListContainersResponse{Containers:[&Container{Id:85630c75b695557a44cca9e0f379f6af72e3355d0ef50017db9867430a65cd74,PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&ContainerMetadata{Name:sync-ldap-groups,Attempt:0,} [...]
[...]
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.253872302+02:00" level=debug msg="ListPodSandboxResponse &ListPodSandboxResponse{Items:[&PodSandbox{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,},State:SANDBOX_NOTREADY,CreatedAt:1554206406080244558,Labels:map[string]string{controller-uid: da4d6e97-553e-11e9-b3a1-52540016f6b8,io.kubernetes.container.name: POD,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj,io.kubernetes.pod.namespace: utgb-infra,io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,job-name: sync-ldap-groups-1554206400,},Annotations:map[string]string{kubernetes.io/config.seen: 2019-04-02T14:00:05.301532436+02:00,kubernetes.io/config.source: api,openshift.io/scc: restricted,},}],}"
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.254529872+02:00" level=debug msg="PodSandboxStatusRequest &PodSandboxStatusRequest{PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Verbose:false,}"
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.254662182+02:00" level=debug msg="PodSandboxStatusResponse: &PodSandboxStatusResponse{Status:&PodSandboxStatus{Id:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&PodSandboxMetadata{Name:sync-ldap-groups-1554206400-2zrxj,Uid:da63293d-553e-11e9-b3a1-52540016f6b8,Namespace:utgb-infra,Attempt:0,},State:SANDBOX_NOTREADY,CreatedAt:1554206406080244558,Network:&PodSandboxNetworkStatus{Ip:10.129.1.22,},Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,},},},Labels:map[string]string{controller-uid: da4d6e97-553e-11e9-b3a1-52540016f6b8,io.kubernetes.container.name: POD,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj,io.kubernetes.pod.namespace: utgb-infra,io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,job-name: sync-ldap-groups-1554206400,},Annotations:map[string]string{kubernetes.io/config.seen: 2019-04-02T14:00:05.301532436+02:00,kubernetes.io/config.source: api,openshift.io/scc: restricted,},},Info:map[string]string{},}"
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.255311625+02:00" level=debug msg="ListContainersRequest &ListContainersRequest{Filter:&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,},},}"
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.255360093+02:00" level=debug msg="no filters were applied, returning full container list"
Apr 2 14:00:12 master1 crio: time="2019-04-02 14:00:12.255472103+02:00" level=debug msg="ListContainersResponse: &ListContainersResponse{Containers:[&Container{Id:85630c75b695557a44cca9e0f379f6af72e3355d0ef50017db9867430a65cd74,PodSandboxId:b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739,Metadata:&ContainerMetadata{Name:sync-ldap-groups,Attempt:0,},Image:&ImageSpec{Image:38f3ae11b29463cf07734ca48a056c9794daa488003dee30c64c61fab384f452,},ImageRef:docker.io/openshift/jenkins-slave-base-centos7@sha256:93d974db4cc92a72d63767f8fe7d28d145bd124493b449e82177476a2fa783c9,State:CONTAINER_EXITED,CreatedAt:1554206407937015579,Labels:map[string]string{io.kubernetes.container.name: sync-ldap-groups,io.kubernetes.pod.name: sync-ldap-groups-1554206400-2zrxj,io.kubernetes.pod.namespace: utgb-infra,io.kubernetes.pod.uid: da63293d-553e-11e9-b3a1-52540016f6b8,},Annotations:map[string]string{io.kubernetes.container.hash: cd18d43f,io.kubernetes.container.restartCount: 0,io.kubernetes.container.terminationMessagePath: /dev/termination-log,io.kubernetes.container.terminationMessagePolicy: File,io.kubernetes.pod.terminationGracePeriod: 30,},}],}"
[...]
Apr 2 14:00:12 master1 origin-node: I0402 14:00:12.261481 9780 kubelet.go:1865] SyncLoop (PLEG): "sync-ldap-groups-1554206400-2zrxj_utgb-infra(da63293d-553e-11e9-b3a1-52540016f6b8)", event: &pleg.PodLifecycleEvent{ID:"da63293d-553e-11e9-b3a1-52540016f6b8", Type:"ContainerDied", Data:"b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739"}
Apr 2 14:00:12 master1 origin-node: W0402 14:00:12.261632 9780 pod_container_deletor.go:75] Container "b1c11205ffd9672b31c567433afff847cb815e80943c8106958b7985d6719739" not found in pod's containers
[...]
```

Not sure there's anything relevant in there; should I grep for something specific? Regards.

(In reply to Mrunal Patel from comment #24)

Checked and still got the same result as in https://bugzilla.redhat.com/show_bug.cgi?id=1666198#c22:

```
[root@qe-wjiang39crio-master-etcd-1 ~]# oc version
oc v3.9.74
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-wjiang39crio-master-etcd-1:8443
openshift v3.9.74
kubernetes v1.9.1+a0ce1bc657
[root@qe-wjiang39crio-master-etcd-1 ~]# oc get nodes -o wide
NAME                                     STATUS    ROLES     AGE   VERSION             EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION               CONTAINER-RUNTIME
qe-wjiang39crio-master-etcd-1            Ready     master    2h    v1.9.1+a0ce1bc657   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.10.1.el7.x86_64   cri-o://1.9.16
qe-wjiang39crio-node-registry-router-1   Ready     compute   2h    v1.9.1+a0ce1bc657   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.10.1.el7.x86_64   cri-o://1.9.16
[root@qe-wjiang39crio-master-etcd-1 ~]# cat /etc/origin/node/node-config.yaml
kubeletArguments:
  max-pods:
  - '40'
  maximum-dead-containers:
  - '20'
  maximum-dead-containers-per-container:
  - '2'
```

And in my scenario, all the pods are in SANDBOX_READY status but there is actually no running container for them:

```
[root@qe-wjiang39crio-master-etcd-1 ~]# oc adm manage-node qe-wjiang39crio-master-etcd-1 --list-pods
Listing matched pods on node: qe-wjiang39crio-master-etcd-1
NAMESPACE                           NAME                          READY   STATUS    RESTARTS   AGE
default                             dockergc-wxjgd                1/1     Running   0          2h
default                             h-1-4nfnl                     1/1     Running   0          1h
default                             h-1-8cgn6                     1/1     Running   0          1h
default                             h-1-cqqwf                     1/1     Running   0          1h
default                             h-1-lhhmd                     1/1     Running   0          1h
default                             h-1-ztpbn                     1/1     Running   0          1h
default                             registry-console-1-bxp4t      1/1     Running   0          2h
kube-service-catalog                apiserver-qmwm6               1/1     Running   0          2h
kube-service-catalog                controller-manager-drl9d      1/1     Running   0          2h
openshift-ansible-service-broker    asb-etcd-1-9ctf5              1/1     Running   0          2h
openshift-template-service-broker   apiserver-qw9kr               1/1     Running   0          2h
openshift-web-console               webconsole-5d8cf98965-ptl57   1/1     Running   0          2h
[root@qe-wjiang39crio-master-etcd-1 ~]# crictl pods | grep -i -v notready | grep -i ready | wc -l
195
```

And no RemovePodSandboxRequest in the crio logs:

```
[root@qe-wjiang39crio-master-etcd-1 ~]# timeout 300 journalctl -f -u crio | grep -i -E "removepodsandboxrequest"
Terminated
```

A quick way to surface such sandboxes is sketched below.
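The scenario just described (sandboxes the runtime reports as ready with nothing running inside them) can be detected mechanically. A sketch, not from the bug's comments, assuming a crictl whose ps subcommand accepts a --pod filter; on clients without it, filtering by the io.kubernetes.pod.uid label, as seen in the logs above, is an alternative. A healthy sandbox between container restarts can also match briefly, so treat hits as candidates, not proof:

```bash
#!/usr/bin/env bash
# Flag sandboxes reported as ready that have no running container.
# STATE prints as Ready/NotReady on newer crictl builds and as
# SANDBOX_READY/SANDBOX_NOTREADY on the 1.9-era client; match both.

crictl pods | awk 'toupper($0) ~ /READY/ && toupper($0) !~ /NOTREADY/ {print $1}' |
while read -r sandbox; do
  # "crictl ps" lists running containers; the --pod filter is assumed here.
  running=$(crictl ps --quiet --pod "${sandbox}" | wc -l)
  if [ "${running}" -eq 0 ]; then
    echo "sandbox ${sandbox} is ready but has no running container"
  fi
done
```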
I am unable to reproduce this locally. Can I please get access to the cluster?

@wjiang, would it be possible to get access to the cluster to further debug the issue here? I have been unable to reproduce it on my end.

Fix is in https://github.com/cri-o/cri-o/pull/2516. Will cut a new release and create a new build once it is in.

The fix has been merged and cri-o 1.9.16 has been cut. Waiting on a new build for it now.

Checked with cri-o-1.9.16-2.git858756d.el7.x86_64; this issue is fixed.

```
[root@qe-wjiangmaster-etcd-1 ~]# rpm -qa | grep -i cri-o
cri-o-1.9.16-2.git858756d.el7.x86_64
[root@qe-wjiangmaster-etcd-1 ~]# oc get nodes -o wide
NAME                              STATUS    ROLES     AGE   VERSION             EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION               CONTAINER-RUNTIME
qe-wjiangmaster-etcd-1            Ready     master    1h    v1.9.1+a0ce1bc657   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.21.3.el7.x86_64   cri-o://1.9.17-dev
qe-wjiangnode-registry-router-1   Ready     compute   1h    v1.9.1+a0ce1bc657   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.21.3.el7.x86_64   cri-o://1.9.17-dev
```

The testing script churns pods by repeatedly scaling a deployment up and down, which is what used to leak sandboxes:

```bash
[root@qe-wjiangmaster-etcd-1 ~]# cat test.sh
#!/usr/bin/env bash
set -x
oc run h --image=openshift/hello-openshift --replicas=1
sleep 10
for i in {1..30}; do
  oc scale dc/h --replicas=100
  sleep 5
  oc scale dc/h --replicas=1
  sleep 5
done
```

After running the testing script, wait no more than 30 s.
```
[root@qe-wjiangmaster-etcd-1 ~]# crictl pods
PODSANDBOX ID   CREATED          STATE           NAME                        NAMESPACE                          ATTEMPT
8bc39b6473d39   12 minutes ago   SANDBOX_READY   h-2-86m95                   default                            0
1e7594f335269   2 hours ago      SANDBOX_READY   asb-etcd-1-s546g            openshift-ansible-service-broker   0
67ff290ead86e   2 hours ago      SANDBOX_READY   apiserver-gzm6j             openshift-template-service-broker  0
63c9648a804ed   2 hours ago      SANDBOX_READY   controller-manager-7l8rv    kube-service-catalog               0
4f7dbadff702c   2 hours ago      SANDBOX_READY   apiserver-2k4qp             kube-service-catalog               0
2309adaf5d76e   2 hours ago      SANDBOX_READY   registry-console-1-p5cqb    default                            0
699da02fa2a5c   2 hours ago      SANDBOX_READY   dockergc-6pr5j              default                            0
9c83cc70d22ac   2 hours ago      SANDBOX_READY   webconsole-7bf77b6f-26fgm   openshift-web-console              0
[root@qe-wjiangmaster-etcd-1 ~]# oc adm manage-node `hostname` --list-pods -o wide
Listing matched pods on node: qe-wjiangmaster-etcd-1
NAMESPACE                           NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE
default                             dockergc-6pr5j              1/1     Running   0          1h    10.128.0.4    qe-wjiangmaster-etcd-1
default                             h-2-86m95                   1/1     Running   0          12m   10.128.0.18   qe-wjiangmaster-etcd-1
default                             registry-console-1-p5cqb    1/1     Running   0          1h    10.128.0.5    qe-wjiangmaster-etcd-1
kube-service-catalog                apiserver-2k4qp             1/1     Running   0          1h    10.128.0.6    qe-wjiangmaster-etcd-1
kube-service-catalog                controller-manager-7l8rv    1/1     Running   0          1h    10.128.0.7    qe-wjiangmaster-etcd-1
openshift-ansible-service-broker    asb-etcd-1-s546g            1/1     Running   0          1h    10.128.0.10   qe-wjiangmaster-etcd-1
openshift-template-service-broker   apiserver-gzm6j             1/1     Running   0          1h    10.128.0.9    qe-wjiangmaster-etcd-1
openshift-web-console               webconsole-7bf77b6f-26fgm   1/1     Running   0          1h    10.128.0.2    qe-wjiangmaster-etcd-1
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1642
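A final practical note, not taken from the bug's comments: on nodes already littered with stale sandboxes, installing the fixed build stops new leaks but may not clear the existing backlog. A hedged cleanup sketch, assuming crictl's stopp and rmp subcommands are available and that force-removing every not-ready sandbox on the node is acceptable:

```bash
#!/usr/bin/env bash
# Force-remove sandboxes the kubelet has abandoned. Review the list
# before running this on a production node; every sandbox matching
# NotReady/SANDBOX_NOTREADY below will be deleted from the runtime.

crictl pods | awk 'toupper($0) ~ /NOTREADY/ {print $1}' |
while read -r sandbox; do
  echo "removing stale sandbox ${sandbox}"
  crictl stopp "${sandbox}" || true   # ignore sandboxes already stopped
  crictl rmp "${sandbox}"
done
```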