Bug 1580555 - [3.9] Image Garbage collection trying to delete images in use by stopped containers
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.9.z
Assignee: Seth Jennings
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On: 1577739
Blocks: 1580551 1580552 1580554 1619477
Reported: 2018-05-21 19:20 UTC by Seth Jennings
Modified: 2020-01-31 18:52 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Prevents image garbage collection from attempting to remove images in use by containers
Clone Of: 1577739
Environment:
Last Closed: 2018-08-28 14:24:51 UTC
Target Upstream Version:
Embargoed:


Attachments
node.log (16.80 MB, application/x-gzip)
2018-06-05 07:46 UTC, DeShuai Ma

Comment 3 weiwei jiang 2018-05-30 06:46:47 UTC
Checked with 
# oc version 
oc v3.9.30
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-3-197.ec2.internal:8443
openshift v3.9.30
kubernetes v1.9.1+a0ce1bc657

The issue cannot be reproduced.

# journalctl  -u atomic-openshift-node|grep -i "image_gc_manager"|grep -i " used"
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641943  108519 image_gc_manager.go:334] Image ID sha256:45e0e3dae5ec197a44fe104bf30f9341a6e3d29faeff1c6da30399fb925a7679 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641959  108519 image_gc_manager.go:334] Image ID sha256:adf66bf8d4cc4e7f7555378452767949b23d5608e9cadcbf0b7e97a2e47d7252 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641972  108519 image_gc_manager.go:334] Image ID sha256:4eca8aeae35d502fb560f8bd95c09d569adf7e9b907745cdac116344d659a1df is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.641986  108519 image_gc_manager.go:334] Image ID sha256:a813b03690b5b20bbaaed50aae05f775d92f183af0a5a1b092f741274d24b4f8 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642003  108519 image_gc_manager.go:334] Image ID sha256:bb05bf5ecdfa35ca58f6e6d2790611869c97cc05c6cacf448c79e1deef241940 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642019  108519 image_gc_manager.go:334] Image ID sha256:41f631bcc32083027c523935b78fd2f9a3c668c09855a7848dad71d2fa584ea6 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642033  108519 image_gc_manager.go:334] Image ID sha256:75e79260a34f5da432b408f596c4179f750cc22757b96405b47fc572658cba56 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642046  108519 image_gc_manager.go:334] Image ID sha256:c9499ed94d429dbfbe1396ab71383778ed34e47b714a14f441890fe889783fa9 is being used
May 30 02:43:06 ip-172-18-6-28.ec2.internal atomic-openshift-node[108519]: I0530 02:43:06.642060  108519 image_gc_manager.go:334] Image ID sha256:a721a89b2b9b8078974c469b3e81957465f5b618135c6d49b951eb347cf56102 is being used
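To compare the images the kubelet considers in use against what the runtime reports, the log lines above can be reduced to a de-duplicated list of image IDs. This is only a minimal sketch: the function name images_in_use is hypothetical, and it assumes the "Image ID sha256:... is being used" message format shown in this comment.

```shell
# Hypothetical helper: print the unique image IDs that image_gc_manager
# logged as in use. Reads kubelet log lines on stdin.
images_in_use() {
  grep -o 'Image ID sha256:[0-9a-f]\{64\} is being used' \
    | awk '{print $3}' \
    | sort -u
}

# Example, on a node (same unit name as in the command above):
# journalctl -u atomic-openshift-node | images_in_use
```

The output can then be checked by hand against the IMAGE ID column of `docker images --no-trunc`.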

Comment 4 DeShuai Ma 2018-06-05 07:34:42 UTC
Reopening the bug. In a containerized environment, image GC tries to remove the "openshift3/openvswitch" and "openshift3/node" images, which are in use by running containers.

oc v3.9.30
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-stage39master-etcd-nfs-1:8443
openshift v3.9.30
kubernetes v1.9.1+a0ce1bc657

[root@qe-stage39master-etcd-nfs-1 ~]# oc describe no qe-stage39node-registry-router-1
Name:               qe-stage39node-registry-router-1
Roles:              compute
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=431ac1fb-1463-4527-b3d1-79245dd698e1
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=nova
                    kubernetes.io/hostname=qe-stage39node-registry-router-1
                    logging-infra-fluentd=true
                    node-role.kubernetes.io/compute=true
                    registry=enabled
                    role=node
                    router=enabled
Annotations:        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             <none>
CreationTimestamp:  Mon, 04 Jun 2018 22:39:26 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Tue, 05 Jun 2018 03:27:02 -0400   Mon, 04 Jun 2018 22:39:19 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Tue, 05 Jun 2018 03:27:02 -0400   Mon, 04 Jun 2018 22:39:19 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 05 Jun 2018 03:27:02 -0400   Mon, 04 Jun 2018 22:39:19 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Tue, 05 Jun 2018 03:27:02 -0400   Tue, 05 Jun 2018 03:14:53 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.16.120.48
  ExternalIP:  10.8.248.170
  Hostname:    qe-stage39node-registry-router-1
Capacity:
 cpu:     4
 memory:  8009420Ki
 pods:    250
Allocatable:
 cpu:     4
 memory:  7907020Ki
 pods:    250
System Info:
 Machine ID:                         16100d3c3dae46ad8a4ff7fbc9fa554b
 System UUID:                        15694DD8-A91A-4A73-AC7F-AF23A21B7633
 Boot ID:                            7a5bf782-57ab-41c5-8dff-129e23388157
 Kernel Version:                     3.10.0-862.3.2.el7.x86_64
 OS Image:                           Red Hat Enterprise Linux Server 7.5 (Maipo)
 Operating System:                   linux
 Architecture:                       amd64
 Container Runtime Version:          docker://1.13.1
 Kubelet Version:                    v1.9.1+a0ce1bc657
 Kube-Proxy Version:                 v1.9.1+a0ce1bc657
ExternalID:                          15694dd8-a91a-4a73-ac7f-af23a21b7633
Non-terminated Pods:                 (10 in total)
  Namespace                          Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                          ----                              ------------  ----------  ---------------  -------------
  default                            docker-registry-1-94jp5           100m (2%)     0 (0%)      256Mi (3%)       0 (0%)
  default                            router-1-xfswm                    100m (2%)     0 (0%)      256Mi (3%)       0 (0%)
  hasha                              postgresql-1-mznj5                0 (0%)        0 (0%)      512Mi (6%)       512Mi (6%)
  openshift-ansible-service-broker   asb-etcd-1-ctqsh                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-infra                    heapster-h8fsc                    0 (0%)        0 (0%)      937500k (11%)    3750M (46%)
  openshift-metrics                  prometheus-node-exporter-zm62n    100m (2%)     200m (5%)   30Mi (0%)        50Mi (0%)
  openshift-template-service-broker  apiserver-hwwrr                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  wen                                django-psql-example-1-qhwkl       0 (0%)        0 (0%)      512Mi (6%)       512Mi (6%)
  wen                                frontend-1-wx6wp                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  wen                                postgresql-1-674qq                0 (0%)        0 (0%)      512Mi (6%)       512Mi (6%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests   Memory Limits
  ------------  ----------  ---------------   -------------
  300m (7%)     200m (5%)   3116440928 (38%)  5413041536 (66%)
Events:
  Type     Reason                   Age                From                                       Message
  ----     ------                   ----               ----                                       -------
  Normal   Starting                 12m                kubelet, qe-stage39node-registry-router-1  Starting kubelet.
  Normal   NodeAllocatableEnforced  12m                kubelet, qe-stage39node-registry-router-1  Updated Node Allocatable limit across pods
  Normal   NodeNotReady             12m                kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeNotReady
  Normal   NodeHasSufficientDisk    12m (x3 over 12m)  kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeHasSufficientDisk
  Normal   NodeHasSufficientMemory  12m (x3 over 12m)  kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    12m (x3 over 12m)  kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeHasNoDiskPressure
  Normal   NodeReady                12m                kubelet, qe-stage39node-registry-router-1  Node qe-stage39node-registry-router-1 status is now: NodeReady
  Warning  ImageGCFailed            7m                 kubelet, qe-stage39node-registry-router-1  wanted to free 8134389760 bytes, but freed 8978280283 bytes space with errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 98871f35af21 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete a8fd5c530c44 (cannot be forced) - image has dependent child images]
  Warning  ImageGCFailed            2m                 kubelet, qe-stage39node-registry-router-1  wanted to free 4316168192 bytes, but freed 4918712134 bytes space with errors in image deletion: [rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete a8fd5c530c44 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 1fea394aac80 (cannot be forced) - image is being used by running container 15711776cedd, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete e42d0dccf073 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete 0dbd08ad57f2 (cannot be forced) - image has dependent child images, rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete e37239ae2fa3 (cannot be forced) - image is being used by running container 6e6475a7a625]


On the node:
[root@qe-stage39node-registry-router-1 ~]#  docker images |grep 'a8fd5c530c44 \| 1fea394aac80 \| 0dbd08ad57f2 \| e42d0dccf073 \| e37239ae2fa3'
docker.io/centos/ruby-22-centos7                                          <none>              e42d0dccf073        3 days ago          566 MB
registry.access.stage.redhat.com/openshift3/openvswitch                   v3.9.30             e37239ae2fa3        5 days ago          1.46 GB
registry.access.stage.redhat.com/openshift3/node                          v3.9.30             1fea394aac80        5 days ago          1.46 GB
registry.access.stage.redhat.com/rhscl/python-35-rhel7                    <none>              0dbd08ad57f2        13 days ago         627 MB
registry.access.stage.redhat.com/rhscl/nodejs-4-rhel7                     <none>              a8fd5c530c44        13 days ago         533 MB
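The ImageGCFailed events above mix two kinds of conflict: "image has dependent child images" and "image is being used by running container". Only the second kind is relevant to this bug. As a rough sketch, the in-use conflicts can be pulled out of the event text as image/container ID pairs. The function name gc_in_use_conflicts is hypothetical, and the pattern assumes the Docker error wording shown in these events.

```shell
# Hypothetical helper: read event text on stdin and print
# "<image id> <container id>" for every deletion that failed because
# the image is in use by a running container.
gc_in_use_conflicts() {
  grep -o 'unable to delete [0-9a-f]* (cannot be forced) - image is being used by running container [0-9a-f]*' \
    | awk '{print $4, $NF}'
}

# Example, on the master (node name taken from this comment):
# oc describe no qe-stage39node-registry-router-1 | gc_in_use_conflicts
```

For the second event above this yields the pairs 1fea394aac80 / 15711776cedd and e37239ae2fa3 / 6e6475a7a625, which match the openshift3/node and openshift3/openvswitch images in the `docker images` output.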

Comment 5 DeShuai Ma 2018-06-05 07:46:18 UTC
Created attachment 1447750
node.log

Comment 9 weiwei jiang 2018-06-13 05:42:39 UTC
Checked with v3.9.31; the issue cannot be reproduced.

Since the containerized environment is not covered by this bug, moving to VERIFIED.

