Bug 1397356 - projects are not terminating - stuck in a hung state for hours / days
Summary: projects are not terminating - stuck in a hung state for hours / days
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Derek Carr
QA Contact: DeShuai Ma
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks:
 
Reported: 2016-11-22 11:15 UTC by Elvir Kuric
Modified: 2017-03-08 18:43 UTC
CC List: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-18 12:55:57 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
node logs (65.54 KB, text/plain) - 2016-11-29 21:56 UTC, Vikas Laad
pod yaml file (4.72 KB, text/plain) - 2016-11-29 22:22 UTC, Vikas Laad
pods json (31.71 KB, text/plain) - 2016-11-30 20:38 UTC, Vikas Laad


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0066 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4 RPM Release Advisory 2017-01-18 17:23:26 UTC

Description Elvir Kuric 2016-11-22 11:15:49 UTC
Description of problem:

After a project is deleted, it does not terminate and stays in a "hung" state for a long time.

Version-Release number of selected component (if applicable):

# rpm -qa | egrep "atomic|docker|kernel" 
kernel-3.10.0-511.el7.x86_64
atomic-openshift-3.4.0.27-1.git.0.dffa95c.el7.x86_64
tuned-profiles-atomic-openshift-node-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-master-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-pod-3.4.0.27-1.git.0.dffa95c.el7.x86_64
docker-rhel-push-plugin-1.12.2-5.el7.x86_64
atomic-openshift-node-3.4.0.27-1.git.0.dffa95c.el7.x86_64
kernel-tools-3.10.0-514.el7.x86_64
docker-selinux-1.10.3-57.el7.x86_64
python-docker-py-1.9.0-1.el7.noarch
kernel-tools-libs-3.10.0-514.el7.x86_64
docker-1.12.2-5.el7.x86_64
atomic-openshift-sdn-ovs-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-tests-3.4.0.27-1.git.0.dffa95c.el7.x86_64
kernel-headers-3.10.0-514.el7.x86_64
kernel-3.10.0-514.el7.x86_64
tuned-profiles-atomic-2.7.1-3.el7.noarch
docker-common-1.12.2-5.el7.x86_64
atomic-openshift-dockerregistry-3.4.0.27-1.git.0.dffa95c.el7.x86_64
kernel-3.10.0-512.el7.x86_64
atomic-openshift-clients-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-clients-redistributable-3.4.0.27-1.git.0.dffa95c.el7.x86_64



How reproducible:

Create hundreds of projects (with 100 pods per project) and then delete them (oc delete project $project).
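
A rough reproduction sketch (not the reporter's exact tooling; it assumes a single-pod manifest hello-pod.yaml similar to the attached "pod yaml file", with a NAME placeholder for the pod name):

    # create 100 projects with 100 pods each
    for i in $(seq 1 100); do
      oc new-project clusterproject$i
      for j in $(seq 0 99); do
        sed "s/NAME/hellopods$j/" hello-pod.yaml | oc create -n clusterproject$i -f -
      done
    done

    # delete them all afterwards
    for i in $(seq 1 100); do
      oc delete project clusterproject$i
    done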


Steps to Reproduce:
See above 

Actual results:

Projects stay stuck in the "Terminating" state.

Expected results:

Projects are deleted.


Additional info:

- Restarting the master and node services, and

- rebooting the node the pod is assigned to, did not clear it from the view below - pods / projects are still stuck in the "Terminating" state:

    # oc get pods --all-namespaces -o wide | grep Term
    clusterproject11    hellopods51                0/1       Terminating         0          1d        <none>           dhcp7-91.example.net
    clusterproject11    hellopods64                0/1       Terminating         0          1d        <none>           dhcp7-177.example.net
    clusterproject11    hellopods71                0/1       Terminating         0          1d        <none>           dhcp8-194.example.net
    clusterproject11    hellopods75                0/1       Terminating         0          1d        <none>           dhcp8-12.example.net
    clusterproject15    hellopods41                0/1       Terminating         0          1d        <none>           dhcp7-202.example.net
    clusterproject17    hellopods10                0/1       Terminating         0          1d        <none>           dhcp8-161.example.net
    clusterproject17    hellopods65                0/1       Terminating         0          1d        <none>           dhcp8-153.example.net
    clusterproject26    hellopods23                0/1       Terminating         0          1d        <none>           dhcp8-196.example.net
    clusterproject26    hellopods26                0/1       Terminating         0          1d        <none>           dhcp7-181.example.net
    clusterproject26    hellopods33                0/1       Terminating         0          1d        <none>           dhcp7-170.example.net
    clusterproject32    hellopods61                0/1       Terminating         0          1d        <none>           dhcp7-190.example.net
    clusterproject32    hellopods72                0/1       Terminating         0          1d        <none>           dhcp8-237.example.net
    clusterproject34    hellopods96                0/1       Terminating         0          1d        <none>           dhcp7-139.example.net
    clusterproject44    hellopods23                0/1       Terminating         0          1d        <none>           dhcp8-18.example.net
    clusterproject48    hellopods2                 0/1       Terminating         0          1d        <none>           dhcp6-183.example.net
    clusterprojecta11   hellopods37                0/1       Terminating         0          21h       <none>           dhcp8-192.example.net
    clusterprojecta11   hellopods44                0/1       Terminating         0          21h       <none>           dhcp7-241.example.net
    clusterprojecta11   hellopods80                0/1       Terminating         0          20h       <none>           dhcp8-241.example.net
    clusterprojecta11   hellopods94                0/1       Terminating         0          20h       <none>           dhcp7-169.example.net
    clusterprojecta21   hellopods10                0/1       Terminating         0          20h       <none>           dhcp7-246.example.net
    clusterprojecta21   hellopods13                0/1       Terminating         0          20h       <none>           dhcp7-197.example.net
    clusterprojecta21   hellopods14                0/1       Terminating         0          20h       <none>           dhcp7-181.example.net
    clusterprojecta21   hellopods36                0/1       Terminating         0          20h       <none>           dhcp8-153.example.net
    clusterprojecta21   hellopods37                0/1       Terminating         0          20h       <none>           dhcp7-190.example.net
    clusterprojecta21   hellopods4                 0/1       Terminating         0          20h       <none>           dhcp7-29.example.net
    clusterprojecta21   hellopods44                0/1       Terminating         0          20h       <none>           dhcp7-74.example.net
    clusterprojecta21   hellopods59                0/1       Terminating         0          20h       <none>           dhcp6-186.example.net
    clusterprojecta21   hellopods65                0/1       Terminating         0          20h       <none>           dhcp8-171.example.net
    clusterprojecta21   hellopods81                0/1       Terminating         0          20h       <none>           dhcp7-224.example.net
    clusterprojecta21   hellopods85                0/1       Terminating         0          20h       <none>           dhcp7-177.example.net
    clusterprojecta21   hellopods9                 0/1       Terminating         0          20h       <none>           dhcp7-180.example.net
    clusterprojecta3    hellopods13                0/1       Terminating         0          21h       <none>           dhcp7-16.example.net
    clusterprojecta3    hellopods15                0/1       Terminating         0          21h       <none>           dhcp8-177.example.net
    clusterprojecta3    hellopods19                0/1       Terminating         0          21h       <none>           dhcp7-89.example.net
    clusterprojecta3    hellopods39                0/1       Terminating         0          21h       <none>           dhcp7-8.example.net
    clusterprojecta3    hellopods42                0/1       Terminating         0          21h       <none>           dhcp7-166.example.net
    clusterprojecta3    hellopods44                0/1       Terminating         0          21h       <none>           dhcp7-112.example.net
    clusterprojecta3    hellopods48                0/1       Terminating         0          21h       <none>           dhcp7-50.example.net
    clusterprojecta3    hellopods61                0/1       Terminating         0          21h       <none>           dhcp7-215.example.net
    clusterprojecta3    hellopods73                0/1       Terminating         0          21h       <none>           dhcp8-228.example.net
    clusterprojecta33   hellopods87                0/1       Terminating         0          19h       <none>           dhcp8-129.example.net
    clusterprojecta37   hellopods22                0/1       Terminating         0          20h       <none>           dhcp8-221.example.net
    clusterprojecta37   hellopods36                0/1       Terminating         0          20h       <none>           dhcp8-186.example.net
    clusterprojecta37   hellopods44                0/1       Terminating         0          19h       <none>           dhcp8-253.example.net
    clusterprojecta37   hellopods45                0/1       Terminating         0          19h       <none>           dhcp7-205.example.net
    clusterprojecta37   hellopods49                0/1       Terminating         0          19h       <none>           dhcp7-130.example.net
    clusterprojecta37   hellopods57                0/1       Terminating         0          19h       <none>           dhcp8-198.example.net
    clusterprojecta37   hellopods6                 0/1       Terminating         0          20h       <none>           dhcp8-147.example.net
    clusterprojecta37   hellopods72                0/1       Terminating         0          19h       <none>           dhcp8-208.example.net
    clusterprojecta37   hellopods75                0/1       Terminating         0          19h       <none>           dhcp7-172.example.net
    clusterprojecta37   hellopods79                0/1       Terminating         0          19h       <none>           dhcp7-100.example.net
    clusterprojecta37   hellopods80                0/1       Terminating         0          19h       <none>           dhcp8-129.example.net
    clusterprojecta37   hellopods81                0/1       Terminating         0          19h       <none>           dhcp7-91.example.net
    clusterprojecta37   hellopods83                0/1       Terminating         0          19h       <none>           dhcp7-47.example.net
    clusterprojecta37   hellopods84                0/1       Terminating         0          19h       <none>           dhcp7-250.example.net
    clusterprojecta9    hellopods40                0/1       Terminating         0          21h       <none>           dhcp7-82.example.net
    clusterprojecta9    hellopods45                0/1       Terminating         0          21h       <none>           dhcp7-179.example.net
    clusterprojecta9    hellopods5                 0/1       Terminating         0          21h       <none>           dhcp7-79.example.net
    clusterprojecta9    hellopods53                0/1       Terminating         0          21h       <none>           dhcp8-153.example.net
    clusterprojecta9    hellopods62                0/1       Terminating         0          21h       <none>           dhcp6-180.example.net
    clusterprojecta9    hellopods83                0/1       Terminating         0          21h       <none>           dhcp8-157.example.net
    clusterprojecta9    hellopods86                0/1       Terminating         0          21h       <none>           dhcp7-175.example.net
    clusterprojecta9    hellopods96                0/1       Terminating         0          21h       <none>           dhcp8-189.example.net
    root@dhcp6-26: ~ #

Comment 1 Jiří Mencák 2016-11-22 11:21:46 UTC
I've hit the same issue on 

# oc version
oc v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-13-129.us-west-2.compute.internal:8443
openshift v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994

I've had 100 projects, 18 nginx pods per project.  After deleting all the projects I'm left with 2 projects that will not terminate:

# oc get projects|grep Termin
nginx-16                          Terminating
nginx-15                          Terminating

# oc get pods -n nginx-15 
NAME            READY     STATUS        RESTARTS   AGE
nginx-6-s0yei   0/1       Terminating   0          19h

# oc get pods -n nginx-16 
NAME             READY     STATUS        RESTARTS   AGE
nginx-18-xp8r2   0/1       Terminating   0          19h
nginx-6-0cc2g    0/1       Terminating   0          19h
nginx-7-k6r4s    0/1       Terminating   0          19h
nginx-8-u2tad    0/1       Terminating   0          19h

Comment 2 Michal Fojtik 2016-11-22 11:56:47 UTC
(In reply to jmencak from comment #1)

Are there any useful logs from when this happens about these projects and the reason why they are hanging in the terminating state?

> I've hit the same issue on 
> 
> # oc version
> oc v3.4.0.28+dfe3a66
> kubernetes v1.4.0+776c994
> features: Basic-Auth GSSAPI Kerberos SPNEGO
> 
> Server https://ip-172-31-13-129.us-west-2.compute.internal:8443
> openshift v3.4.0.28+dfe3a66
> kubernetes v1.4.0+776c994
> 
> I've had 100 projects, 18 nginx pods per projects.  After deletion of all
> the projects I'm left with 2 projects that will not terminate:
> 
> # oc get projects|grep Termin
> nginx-16                          Terminating
> nginx-15                          Terminating
> 
> # oc get pods -n nginx-15 
> NAME            READY     STATUS        RESTARTS   AGE
> nginx-6-s0yei   0/1       Terminating   0          19h
> 
> # oc get pods -n nginx-16 
> NAME             READY     STATUS        RESTARTS   AGE
> nginx-18-xp8r2   0/1       Terminating   0          19h
> nginx-6-0cc2g    0/1       Terminating   0          19h
> nginx-7-k6r4s    0/1       Terminating   0          19h
> nginx-8-u2tad    0/1       Terminating   0          19h

Comment 3 Elvir Kuric 2016-11-22 12:42:22 UTC
additional info. 

Project clusterprojecta37 has been in the "Terminating" state for 21h. Below are logs from that pod:

# oc logs hellopods79 -n clusterprojecta37
serving on 8080
serving on 8888


# oc get pods --all-namespaces  -o wide | grep hellopods79 
clusterprojecta37   hellopods79                0/1       Terminating   0          21h       <none>         dhcp7-100.example.net
root@dhcp6-26: ~ # oc describe pod hellopods79 -n clusterprojecta37
Name:				hellopods79
Namespace:			clusterprojecta37
Security Policy:		anyuid
Node:				dhcp7-100.example.net/172.16.7.100
Start Time:			Mon, 21 Nov 2016 10:14:54 -0500
Labels:				name=hello-openshift
Status:				Terminating (expires Mon, 21 Nov 2016 11:18:58 -0500)
Termination Grace Period:	0s
IP:				
Controllers:			<none>
Containers:
  hello-openshift:
    Container ID:	docker://15101b252c0f342778c9a366a1f2ff5d5c1e010e781d1155a1d2ea2f71bbf416
    Image:		openshift/hello-openshift:v1.0.6
    Image ID:		docker-pullable://docker.io/openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11
    Port:		8080/TCP
    State:		Terminated
      Reason:		Error
      Exit Code:	2
      Started:		Mon, 21 Nov 2016 10:15:48 -0500
      Finished:		Mon, 21 Nov 2016 11:19:00 -0500
    Ready:		False
    Restart Count:	0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xl1wq (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  default-token-xl1wq:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-xl1wq
QoS Class:	BestEffort
Tolerations:	<none>
No events.


# ssh dhcp7-100.example.net
Last login: Tue Nov 22 07:35:05 2016 from dhcp6-26.example.net
root@dhcp7-100: ~ # docker ps -a 
CONTAINER ID        IMAGE                                                     COMMAND              CREATED             STATUS                    PORTS               NAMES
15101b252c0f        openshift/hello-openshift:v1.0.6                          "/hello-openshift"   21 hours ago        Exited (2) 20 hours ago                       k8s_hello-openshift.d809ad9d_hellopods79_clusterprojecta37_3fbe5360-affd-11e6-82e7-001a4b1004c3_ee3b591b
4e8543c04f34        registry.ops.openshift.com/openshift3/ose-pod:v3.4.0.27   "/pod"               21 hours ago        Exited (0) 20 hours ago                       k8s_POD.17e5e6be_hellopods79_clusterprojecta37_3fbe5360-affd-11e6-82e7-001a4b1004c3_7210c77d
 




 ssh dhcp7-100.example.net
Last login: Tue Nov 22 07:35:39 2016 from dhcp6-26.example.net
root@dhcp7-100: ~ # systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─custom.conf
   Active: active (running) since Sun 2016-11-20 08:54:26 EST; 1 day 22h ago
     Docs: http://docs.docker.com
 Main PID: 22859 (dockerd-current)
   Memory: 71.7M
   CGroup: /system.slice/docker.service
           ├─22859 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt nati...
           └─22865 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim docker-containerd-shim --metrics-interval=0 --start-timeout 2m --state-dir /var/...

Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.420126652-05:00" level=error msg="Handler for POST /containers/0196c977c232262590b62f4bad...y stopped"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.420150507-05:00" level=error msg="Handler for POST /containers/0196c977c232262590b62f4bad...y stopped"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.433825906-05:00" level=info msg="{Action=stop, LoginUID=4294967295, PID=122522}"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.434263245-05:00" level=error msg="Handler for POST /containers/cbac17c069cef22851ea68b8b4...y stopped"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.434316705-05:00" level=error msg="Handler for POST /containers/cbac17c069cef22851ea68b8b4...y stopped"
Nov 22 06:59:40 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:40.276733032-05:00" level=info msg="{Action=remove, LoginUID=4294967295, PID=122522}"
Nov 22 06:59:40 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:40.483223665-05:00" level=info msg="{Action=remove, LoginUID=4294967295, PID=122522}"
Nov 22 06:59:43 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:43.000590367-05:00" level=error msg="Handler for GET /containers/cbac17c069cef22851ea68b8b4c...c0b16b5b7"
Nov 22 06:59:43 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:43.000627456-05:00" level=error msg="Handler for GET /containers/cbac17c069cef22851ea68b8b4c...c0b16b5b7"
Nov 22 07:40:19 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T07:40:19.332204433-05:00" level=info msg="{Action=logs, LoginUID=4294967295, PID=122522}"
Hint: Some lines were ellipsized, use -l to show in full.
root@dhcp7-100: ~ # systemctl status atomic-openshift-node
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Mon 2016-11-21 14:48:09 EST; 16h ago
     Docs: https://github.com/openshift/origin
 Main PID: 122522 (openshift)
   Memory: 44.1M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─122522 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─122778 journalctl -k -f

Nov 22 07:41:14 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:14.820855  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:14 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:14.820909  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:24 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:24.856654  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:24 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:24.856785  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:34 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:34.918107  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:34 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:34.918213  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:45 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:45.031724  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:45 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:45.031791  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:55 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:55.072071  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:55 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:55.072119  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Hint: Some lines were ellipsized, use -l to show in full.
root@dhcp7-100: ~ # docker info
Containers: 2
 Running: 0
 Paused: 0
 Stopped: 2
Images: 30
Server Version: 1.12.2-rc2
Storage Driver: devicemapper
 Pool Name: docker_vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 11.91 GB
 Data Space Total: 22.42 GB
 Data Space Available: 10.51 GB
 Metadata Space Used: 1.626 MB
 Metadata Space Total: 25.17 MB
 Metadata Space Available: 23.54 MB
 Thin Pool Minimum Free Space: 2.242 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-09-28)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: host null bridge overlay
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 7.638 GiB
Name: dhcp7-100.example.net
ID: KJ5J:FJPP:5HCT:CLM2:5V2S:TKDG:UVOE:VCAV:WG4G:IQVD:FACE:BOBN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.ops.openshift.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 registry.ops.openshift.com
 127.0.0.0/8
Registries: registry.ops.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)
root@dhcp7-100: ~ #

Comment 5 Michal Fojtik 2016-11-22 12:53:02 UTC
Jordan, this looks like:

https://github.com/kubernetes/kubernetes/issues/36891

Comment 6 Michal Fojtik 2016-11-22 12:55:37 UTC
(except that the pods hanging in the terminating state seem to have a deletionTimestamp set)
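
One way to confirm this from the API server (a sketch, not part of the original report) is to list every pod that still has a deletionTimestamp set:

    oc get pods --all-namespaces -o go-template='{{range .items}}{{if .metadata.deletionTimestamp}}{{.metadata.namespace}} {{.metadata.name}} {{.metadata.deletionTimestamp}}{{"\n"}}{{end}}{{end}}'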

Comment 7 Jiří Mencák 2016-11-22 13:02:40 UTC
Adding requested yaml for failed pods:

http://ekuric.usersys.redhat.com/jmencak/BZ1397356/oc_get_pods.yaml

Comment 8 Jordan Liggitt 2016-11-22 13:10:19 UTC
> Jordan, this looks like:
> 
> https://github.com/kubernetes/kubernetes/issues/36891

No, in that issue the remaining items have no deletionTimestamps, and the namespace is actually gone (not stuck in terminating).

Comment 9 Elvir Kuric 2016-11-22 19:20:20 UTC
It is possible to schedule pods on a node that still has pods in the "Terminating" state

eg 

# oc get pods --all-namespaces  -o wide | grep dhcp8-241.example.net
clusterprojecta11   hellopods80                0/1       Terminating         0          1d        <none>           dhcp8-241.example.net
clusterprojectc6    hellopods0                 1/1       Running             0          1m        172.20.160.58    dhcp8-241.example.net

Comment 11 Jiří Mencák 2016-11-23 07:59:08 UTC
I'm currently trying to work around this issue by first slowly (1 rc/second) scaling the pods down to 0 via their rcs within the project, and only after that deleting the project.  Worked for me in the same environment (same setup, different cluster) with 100 projects and 18 pods per project.  Not sure it works every time; I've only tried it once so far.  A rough sketch of the workaround follows.
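
A minimal sketch of that workaround (not the exact script used), shown for one of the nginx projects from comment 1:

    # scale every rc in the project down to 0, one per second, then delete the project
    for rc in $(oc get rc -n nginx-15 -o name); do
      oc scale "$rc" --replicas=0 -n nginx-15
      sleep 1
    done
    oc delete project nginx-15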

Comment 29 Vikas Laad 2016-11-28 14:40:58 UTC
I am running into this issue in scale lab, this happened when I tried deleting 200 projects.

Comment 30 Vikas Laad 2016-11-29 21:56:13 UTC
Created attachment 1226065 [details]
node logs

Attaching the logs from one of the nodes where pod is stuck in terminating.

Comment 31 Derek Carr 2016-11-29 22:05:44 UTC
Can you also add the pod YAML as well?

Comment 32 Vikas Laad 2016-11-29 22:22:25 UTC
Created attachment 1226070 [details]
pod yaml file

Comment 35 Vikas Laad 2016-11-30 20:38:45 UTC
Created attachment 1226501 [details]
pods json

oc get --raw=/api/v1/nodes/10.10.1.108/proxy/pods | python -mjson.tool >> 108_pods.json

Comment 36 Derek Carr 2016-11-30 22:11:23 UTC
After more analysis, the problem does appear to be on the API server.

For the pod in question, doing a force delete via:

$ oc delete pod <pod-name> --grace-period=0 --loglevel=8 

The pod appears to never be deleted if it:

1. already has gracePeriodSeconds=0
2. already has a deletionTimestamp set

There are actually two potential bugs here:

1. how did a pod end up in this state
2. why can't the operator recover from it by force deletion

This gives us an easier recreate scenario to debug further.
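
For reference, a sketch of checking the two fields above on one of the stuck pods (pod and namespace names taken from earlier comments), followed by the force delete that fails to recover it:

    # inspect the deletion fields the API server has recorded for the pod
    oc get pod hellopods79 -n clusterprojecta37 -o jsonpath='{.metadata.deletionGracePeriodSeconds} {.metadata.deletionTimestamp}{"\n"}'

    # attempted recovery; for pods already in this state the delete never completes
    oc delete pod hellopods79 -n clusterprojecta37 --grace-period=0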

Comment 37 Derek Carr 2016-11-30 23:04:58 UTC
Upstream PR to 1.5 to fix server-side graceful deletion error:
https://github.com/kubernetes/kubernetes/pull/37721

Comment 42 Troy Dawson 2016-12-06 20:31:33 UTC
This has been merged into OCP and is in OCP v3.4.0.33 or newer.

Comment 44 DeShuai Ma 2016-12-07 05:38:17 UTC
Could you help verify that the issue is fixed in your environment?

Comment 47 errata-xmlrpc 2017-01-18 12:55:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066

