Bug 1580555

Summary: [3.9] Image garbage collection trying to delete images in use by stopped containers
Product: OpenShift Container Platform
Reporter: Seth Jennings <sjenning>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED CURRENTRELEASE
QA Contact: DeShuai Ma <dma>
Severity: unspecified
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, bleanhar, dapark, dma, dmoessne, gferrazs, jokerman, ktadimar, mmccomas, sjenning, smunilla, vlaad, wjiang
Keywords: Reopened
Target Release: 3.9.z
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text: Prevents image garbage collection from attempting to remove images in use by containers
Clone Of: 1577739
Last Closed: 2018-08-28 14:24:51 UTC
Type: Bug
Bug Depends On: 1577739
Bug Blocks: 1580551, 1580552, 1580554, 1619477
Attachments: node.log

Comment 3 weiwei jiang 2018-05-30 06:46:47 UTC
Checked with 
# oc version 
oc v3.9.30
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-3-197.ec2.internal:8443
openshift v3.9.30
kubernetes v1.9.1+a0ce1bc657

The issue cannot be reproduced.

# journalctl  -u atomic-openshift-node|grep -i "image_gc_manager"|grep -i " used"
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641943  108519 image_gc_manager.go:334] Image ID sha256:45e0e3dae5ec197a44fe104bf30f9341a6e3d29faeff1c6da30399fb925a7679 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641959  108519 image_gc_manager.go:334] Image ID sha256:adf66bf8d4cc4e7f7555378452767949b23d5608e9cadcbf0b7e97a2e47d7252 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641972  108519 image_gc_manager.go:334] Image ID sha256:4eca8aeae35d502fb560f8bd95c09d569adf7e9b907745cdac116344d659a1df is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641986  108519 image_gc_manager.go:334] Image ID sha256:a813b03690b5b20bbaaed50aae05f775d92f183af0a5a1b092f741274d24b4f8 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642003  108519 image_gc_manager.go:334] Image ID sha256:bb05bf5ecdfa35ca58f6e6d2790611869c97cc05c6cacf448c79e1deef241940 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642019  108519 image_gc_manager.go:334] Image ID sha256:41f631bcc32083027c523935b78fd2f9a3c668c09855a7848dad71d2fa584ea6 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642033  108519 image_gc_manager.go:334] Image ID sha256:75e79260a34f5da432b408f596c4179f750cc22757b96405b47fc572658cba56 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642046  108519 image_gc_manager.go:334] Image ID sha256:c9499ed94d429dbfbe1396ab71383778ed34e47b714a14f441890fe889783fa9 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642060  108519 image_gc_manager.go:334] Image ID sha256:a721a89b2b9b8078974c469b3e81957465f5b618135c6d49b951eb347cf56102 is being used
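The "is being used" messages above are the image GC manager marking images as protected because some container still references them. As a minimal, self-contained sketch of that in-use check (the image IDs below are illustrative stand-ins, not real `docker ps -a` / `docker images -q` output):

```shell
#!/bin/sh
# Sketch of the image GC "in use" check this bug is about: an image must
# be protected if ANY container references it, whether that container is
# running or stopped. The IDs are illustrative placeholders.

# Image IDs referenced by containers, running and stopped
# (would come from inspecting every container, not just running ones)
in_use='sha256:45e0e3da
sha256:adf66bf8'

# All image IDs present on the node (would come from `docker images -q`)
all='sha256:45e0e3da
sha256:adf66bf8
sha256:0badc0de'

# Deletion candidates = images matching no in-use ID
printf '%s\n' "$all" | grep -F -x -v "$in_use"
# prints: sha256:0badc0de
```

The bug described here arises when the in-use set is built from running containers only, so images still referenced by stopped containers wrongly land in the deletion-candidate list.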

Comment 4 DeShuai Ma 2018-06-05 07:34:42 UTC
Reopening the bug: in a containerized environment, image GC tries to remove the "openshift3/openvswitch" and "openshift3/node" images.

oc v3.9.30
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-stage39master-etcd-nfs-1:8443
openshift v3.9.30
kubernetes v1.9.1+a0ce1bc657

[root@qe-stage39master-etcd-nfs-1 ~]# oc describe no qe-stage39node-registry-router-1
Name:               qe-stage39node-registry-router-1
Roles:              compute
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=431ac1fb-1463-4527-b3d1-79245dd698e1
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=nova
                    kubernetes.io/hostname=qe-stage39node-registry-router-1
                    logging-infra-fluentd=true
                    node-role.kubernetes.io/compute=true
                    registry=enabled
                    role=node
                    router=enabled
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Mon, 04 Jun 2018 22:39:26 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Tue, 05 Jun 2018 03:27:02 -0400   Mon, 04 Jun 2018 22:39:19 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Tue, 05 Jun 2018 03:27:02 -0400   Mon, 04 Jun 2018 22:39:19 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 05 Jun 2018 03:27:02 -0400   Mon, 04 Jun 2018 22:39:19 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Tue, 05 Jun 2018 03:27:02 -0400   Tue, 05 Jun 2018 03:14:53 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.16.120.48
  ExternalIP:  10.8.248.170
  Hostname:    qe-stage39node-registry-router-1
Capacity:
 cpu:     4
 memory:  8009420Ki
 pods:    250
Allocatable:
 cpu:     4
 memory:  7907020Ki
 pods:    250
System Info:
 Machine ID:                         16100d3c3dae46ad8a4ff7fbc9fa554b
 System UUID:                        15694DD8-A91A-4A73-AC7F-AF23A21B7633
 Boot ID:                            7a5bf782-57ab-41c5-8dff-129e23388157
 Kernel Version:                     3.10.0-862.3.2.el7.x86_64
 OS Image:                           Red Hat Enterprise Linux Server 7.5 (Maipo)
 Operating System:                   linux
 Architecture:                       amd64
 Container Runtime Version:          docker://1.13.1
 Kubelet Version:                    v1.9.1+a0ce1bc657
 Kube-Proxy Version:                 v1.9.1+a0ce1bc657
ExternalID:                          15694dd8-a91a-4a73-ac7f-af23a21b7633
Non-terminated Pods:                 (10 in total)
  Namespace                          Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                          ----                              ------------  ----------  ---------------  -------------
  default                            docker-registry-1-94jp5           100m (2%)     0 (0%)      256Mi (3%)       0 (0%)
  default                            router-1-xfswm                    100m (2%)     0 (0%)      256Mi (3%)       0 (0%)
  hasha                              postgresql-1-mznj5                0 (0%)        0 (0%)      512Mi (6%)       512Mi (6%)
  openshift-ansible-service-broker   asb-etcd-1-ctqsh                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-infra                    heapster-h8fsc                    0 (0%)        0 (0%)      937500k (11%)    3750M (46%)
  openshift-metrics                  prometheus-node-exporter-zm62n    100m (2%)     200m (5%)   30Mi (0%)        50Mi (0%)
  openshift-template-service-broker  apiserver-hwwrr                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  wen                                django-psql-example-1-qhwkl       0 (0%)        0 (0%)      512Mi (6%)       512Mi (6%)
  wen                                frontend-1-wx6wp                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  wen                                postgresql-1-674qq                0 (0%)        0 (0%)      512Mi (6%)       512Mi (6%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests   Memory Limits
  ------------  ----------  ---------------   -------------
  300m (7%)     200m (5%)   3116440928 (38%)  5413041536 (66%)
Events:
  Type     Reason                   Age                From                                       Message
  ----     ------                   ----               ----                                       -------
  Normal   Starting                 12m                kubelet, qe-stage39node-registry-router-1  Starting kubelet.
  Normal   NodeAllocatableEnforced  12m                kubelet, qe-stage39node-registry-router-1  Updated Node Allocatable limit across pods
  Normal   NodeNotReady             12m                kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeNotReady
  Normal   NodeHasSufficientDisk    12m (x3 over 12m)  kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeHasSufficientDisk
  Normal   NodeHasSufficientMemory  12m (x3 over 12m)  kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    12m (x3 over 12m)  kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeHasNoDiskPressure
  Normal   NodeReady                12m                kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeReady
  Warning  ImageGCFailed            7m                 kubelet, qe-stage39node-registry-router-1  wanted to free 8134389760 bytes, but freed 8978280283 bytes space with errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 98871f35af21 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete a8fd5c530c44 (cannot be forced) - image has dependent child images]
  Warning  ImageGCFailed            2m                 kubelet, qe-stage39node-registry-router-1  wanted to free 4316168192 bytes, but freed 4918712134 bytes space with errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete a8fd5c530c44 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1fea394aac80 (cannot be forced) - image is being used by running container 15711776cedd, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete e42d0dccf073 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 0dbd08ad57f2 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete e37239ae2fa3 (cannot be forced) - image is being used by running container 6e6475a7a625]


# On the node
[root@qe-stage39node-registry-router-1 ~]#  docker images |grep 'a8fd5c530c44 \| 1fea394aac80 \| 0dbd08ad57f2 \| e42d0dccf073 \| e37239ae2fa3'
docker.io/centos/ruby-22-centos7                                          <none>              e42d0dccf073        3 days ago          566 MB
registry.access.stage.redhat.com/openshift3/openvswitch                   v3.9.30             e37239ae2fa3        5 days ago          1.46 GB
registry.access.stage.redhat.com/openshift3/node                          v3.9.30             1fea394aac80        5 days ago          1.46 GB
registry.access.stage.redhat.com/rhscl/python-35-rhel7                    <none>              0dbd08ad57f2        13 days ago         627 MB
registry.access.stage.redhat.com/rhscl/nodejs-4-rhel7                     <none>              a8fd5c530c44        13 days ago         533 MB
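The offending IDs can be pulled out of an ImageGCFailed event mechanically rather than by eye; a small sketch (the event text is abbreviated from the events shown above):

```shell
#!/bin/sh
# Extract the image IDs that image GC failed to delete from an
# ImageGCFailed event message (text abbreviated from the events above).
event='errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete a8fd5c530c44 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1fea394aac80 (cannot be forced) - image is being used by running container 15711776cedd]'

# Each failure names the image as "unable to delete <12-hex-char ID>"
printf '%s\n' "$event" | grep -o 'unable to delete [0-9a-f]\{12\}' | awk '{print $4}'
# prints:
# a8fd5c530c44
# 1fea394aac80
```

The extracted IDs can then be fed to `docker images | grep` as in the command above to map them back to repositories and tags.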

Comment 5 DeShuai Ma 2018-06-05 07:46:18 UTC
Created attachment 1447750 [details]
node.log

Comment 9 weiwei jiang 2018-06-13 05:42:39 UTC
Checked with v3.9.31 and the issue cannot be reproduced.

Since the containerized environment is not covered by this bug, moving to verified.