Description of problem:
After deleting projects, they do not terminate and instead stay stuck ("hung") in the Terminating state for a long time.

Version-Release number of selected component (if applicable):
# rpm -qa | egrep "atomic|docker|kernel"
kernel-3.10.0-511.el7.x86_64
atomic-openshift-3.4.0.27-1.git.0.dffa95c.el7.x86_64
tuned-profiles-atomic-openshift-node-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-master-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-pod-3.4.0.27-1.git.0.dffa95c.el7.x86_64
docker-rhel-push-plugin-1.12.2-5.el7.x86_64
atomic-openshift-node-3.4.0.27-1.git.0.dffa95c.el7.x86_64
kernel-tools-3.10.0-514.el7.x86_64
docker-selinux-1.10.3-57.el7.x86_64
python-docker-py-1.9.0-1.el7.noarch
kernel-tools-libs-3.10.0-514.el7.x86_64
docker-1.12.2-5.el7.x86_64
atomic-openshift-sdn-ovs-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-tests-3.4.0.27-1.git.0.dffa95c.el7.x86_64
kernel-headers-3.10.0-514.el7.x86_64
kernel-3.10.0-514.el7.x86_64
tuned-profiles-atomic-2.7.1-3.el7.noarch
docker-common-1.12.2-5.el7.x86_64
atomic-openshift-dockerregistry-3.4.0.27-1.git.0.dffa95c.el7.x86_64
kernel-3.10.0-512.el7.x86_64
atomic-openshift-clients-3.4.0.27-1.git.0.dffa95c.el7.x86_64
atomic-openshift-clients-redistributable-3.4.0.27-1.git.0.dffa95c.el7.x86_64

How reproducible:
Create hundreds of projects (with 100 pods per project) and delete them (oc delete project $project).

Steps to Reproduce:
See above (a rough shell sketch is included at the end of this comment).

Actual results:
Projects stay stuck in the "Terminating" state.

Expected results:
Projects are deleted.

Additional info:
- Restarting the master and node services and rebooting the node where a pod is assigned did not clear it from the view below.
- Pods / projects are still in the "Terminating" state.

# oc get pods --all-namespaces -o wide | grep Term
clusterproject11    hellopods51   0/1   Terminating   0   1d    <none>   dhcp7-91.example.net
clusterproject11    hellopods64   0/1   Terminating   0   1d    <none>   dhcp7-177.example.net
clusterproject11    hellopods71   0/1   Terminating   0   1d    <none>   dhcp8-194.example.net
clusterproject11    hellopods75   0/1   Terminating   0   1d    <none>   dhcp8-12.example.net
clusterproject15    hellopods41   0/1   Terminating   0   1d    <none>   dhcp7-202.example.net
clusterproject17    hellopods10   0/1   Terminating   0   1d    <none>   dhcp8-161.example.net
clusterproject17    hellopods65   0/1   Terminating   0   1d    <none>   dhcp8-153.example.net
clusterproject26    hellopods23   0/1   Terminating   0   1d    <none>   dhcp8-196.example.net
clusterproject26    hellopods26   0/1   Terminating   0   1d    <none>   dhcp7-181.example.net
clusterproject26    hellopods33   0/1   Terminating   0   1d    <none>   dhcp7-170.example.net
clusterproject32    hellopods61   0/1   Terminating   0   1d    <none>   dhcp7-190.example.net
clusterproject32    hellopods72   0/1   Terminating   0   1d    <none>   dhcp8-237.example.net
clusterproject34    hellopods96   0/1   Terminating   0   1d    <none>   dhcp7-139.example.net
clusterproject44    hellopods23   0/1   Terminating   0   1d    <none>   dhcp8-18.example.net
clusterproject48    hellopods2    0/1   Terminating   0   1d    <none>   dhcp6-183.example.net
clusterprojecta11   hellopods37   0/1   Terminating   0   21h   <none>   dhcp8-192.example.net
clusterprojecta11   hellopods44   0/1   Terminating   0   21h   <none>   dhcp7-241.example.net
clusterprojecta11   hellopods80   0/1   Terminating   0   20h   <none>   dhcp8-241.example.net
clusterprojecta11   hellopods94   0/1   Terminating   0   20h   <none>   dhcp7-169.example.net
clusterprojecta21   hellopods10   0/1   Terminating   0   20h   <none>   dhcp7-246.example.net
clusterprojecta21   hellopods13   0/1   Terminating   0   20h   <none>   dhcp7-197.example.net
clusterprojecta21   hellopods14   0/1   Terminating   0   20h   <none>   dhcp7-181.example.net
clusterprojecta21   hellopods36   0/1   Terminating   0   20h   <none>   dhcp8-153.example.net
clusterprojecta21   hellopods37   0/1   Terminating   0   20h   <none>   dhcp7-190.example.net
clusterprojecta21   hellopods4    0/1   Terminating   0   20h   <none>   dhcp7-29.example.net
clusterprojecta21   hellopods44   0/1   Terminating   0   20h   <none>   dhcp7-74.example.net
clusterprojecta21   hellopods59   0/1   Terminating   0   20h   <none>   dhcp6-186.example.net
clusterprojecta21   hellopods65   0/1   Terminating   0   20h   <none>   dhcp8-171.example.net
clusterprojecta21   hellopods81   0/1   Terminating   0   20h   <none>   dhcp7-224.example.net
clusterprojecta21   hellopods85   0/1   Terminating   0   20h   <none>   dhcp7-177.example.net
clusterprojecta21   hellopods9    0/1   Terminating   0   20h   <none>   dhcp7-180.example.net
clusterprojecta3    hellopods13   0/1   Terminating   0   21h   <none>   dhcp7-16.example.net
clusterprojecta3    hellopods15   0/1   Terminating   0   21h   <none>   dhcp8-177.example.net
clusterprojecta3    hellopods19   0/1   Terminating   0   21h   <none>   dhcp7-89.example.net
clusterprojecta3    hellopods39   0/1   Terminating   0   21h   <none>   dhcp7-8.example.net
clusterprojecta3    hellopods42   0/1   Terminating   0   21h   <none>   dhcp7-166.example.net
clusterprojecta3    hellopods44   0/1   Terminating   0   21h   <none>   dhcp7-112.example.net
clusterprojecta3    hellopods48   0/1   Terminating   0   21h   <none>   dhcp7-50.example.net
clusterprojecta3    hellopods61   0/1   Terminating   0   21h   <none>   dhcp7-215.example.net
clusterprojecta3    hellopods73   0/1   Terminating   0   21h   <none>   dhcp8-228.example.net
clusterprojecta33   hellopods87   0/1   Terminating   0   19h   <none>   dhcp8-129.example.net
clusterprojecta37   hellopods22   0/1   Terminating   0   20h   <none>   dhcp8-221.example.net
clusterprojecta37   hellopods36   0/1   Terminating   0   20h   <none>   dhcp8-186.example.net
clusterprojecta37   hellopods44   0/1   Terminating   0   19h   <none>   dhcp8-253.example.net
clusterprojecta37   hellopods45   0/1   Terminating   0   19h   <none>   dhcp7-205.example.net
clusterprojecta37   hellopods49   0/1   Terminating   0   19h   <none>   dhcp7-130.example.net
clusterprojecta37   hellopods57   0/1   Terminating   0   19h   <none>   dhcp8-198.example.net
clusterprojecta37   hellopods6    0/1   Terminating   0   20h   <none>   dhcp8-147.example.net
clusterprojecta37   hellopods72   0/1   Terminating   0   19h   <none>   dhcp8-208.example.net
clusterprojecta37   hellopods75   0/1   Terminating   0   19h   <none>   dhcp7-172.example.net
clusterprojecta37   hellopods79   0/1   Terminating   0   19h   <none>   dhcp7-100.example.net
clusterprojecta37   hellopods80   0/1   Terminating   0   19h   <none>   dhcp8-129.example.net
clusterprojecta37   hellopods81   0/1   Terminating   0   19h   <none>   dhcp7-91.example.net
clusterprojecta37   hellopods83   0/1   Terminating   0   19h   <none>   dhcp7-47.example.net
clusterprojecta37   hellopods84   0/1   Terminating   0   19h   <none>   dhcp7-250.example.net
clusterprojecta9    hellopods40   0/1   Terminating   0   21h   <none>   dhcp7-82.example.net
clusterprojecta9    hellopods45   0/1   Terminating   0   21h   <none>   dhcp7-179.example.net
clusterprojecta9    hellopods5    0/1   Terminating   0   21h   <none>   dhcp7-79.example.net
clusterprojecta9    hellopods53   0/1   Terminating   0   21h   <none>   dhcp8-153.example.net
clusterprojecta9    hellopods62   0/1   Terminating   0   21h   <none>   dhcp6-180.example.net
clusterprojecta9    hellopods83   0/1   Terminating   0   21h   <none>   dhcp8-157.example.net
clusterprojecta9    hellopods86   0/1   Terminating   0   21h   <none>   dhcp7-175.example.net
clusterprojecta9    hellopods96   0/1   Terminating   0   21h   <none>   dhcp8-189.example.net
root@dhcp6-26: ~ #
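For reference, a rough shell sketch of the reproduction referred to above (the project/pod names and the exact pod-creation command are illustrative assumptions; the actual test harness may differ):

# create 100 projects with 100 bare hello-openshift pods each
for i in $(seq 1 100); do
  oc new-project "clusterproject${i}"
  for j in $(seq 0 99); do
    oc run "hellopods${j}" -n "clusterproject${i}" \
       --image=openshift/hello-openshift:v1.0.6 --restart=Never
  done
done

# once everything is Running, delete the projects and watch for stragglers
for i in $(seq 1 100); do
  oc delete project "clusterproject${i}"
done
oc get pods --all-namespaces -o wide | grep Term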
I've hit the same issue on:

# oc version
oc v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-13-129.us-west-2.compute.internal:8443
openshift v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994

I had 100 projects with 18 nginx pods per project. After deleting all the projects I'm left with 2 projects that will not terminate:

# oc get projects | grep Termin
nginx-16   Terminating
nginx-15   Terminating

# oc get pods -n nginx-15
NAME            READY   STATUS        RESTARTS   AGE
nginx-6-s0yei   0/1     Terminating   0          19h

# oc get pods -n nginx-16
NAME             READY   STATUS        RESTARTS   AGE
nginx-18-xp8r2   0/1     Terminating   0          19h
nginx-6-0cc2g    0/1     Terminating   0          19h
nginx-7-k6r4s    0/1     Terminating   0          19h
nginx-8-u2tad    0/1     Terminating   0          19h
(In reply to jmencak from comment #1)

Are there any useful logs from when this happens that show why these projects are hanging in the Terminating state?

> I've hit the same issue on:
>
> # oc version
> oc v3.4.0.28+dfe3a66
> kubernetes v1.4.0+776c994
> features: Basic-Auth GSSAPI Kerberos SPNEGO
>
> Server https://ip-172-31-13-129.us-west-2.compute.internal:8443
> openshift v3.4.0.28+dfe3a66
> kubernetes v1.4.0+776c994
>
> I had 100 projects with 18 nginx pods per project. After deleting all
> the projects I'm left with 2 projects that will not terminate:
>
> # oc get projects | grep Termin
> nginx-16   Terminating
> nginx-15   Terminating
>
> # oc get pods -n nginx-15
> NAME            READY   STATUS        RESTARTS   AGE
> nginx-6-s0yei   0/1     Terminating   0          19h
>
> # oc get pods -n nginx-16
> NAME             READY   STATUS        RESTARTS   AGE
> nginx-18-xp8r2   0/1     Terminating   0          19h
> nginx-6-0cc2g    0/1     Terminating   0          19h
> nginx-7-k6r4s    0/1     Terminating   0          19h
> nginx-8-u2tad    0/1     Terminating   0          19h
Additional info: project clusterprojecta37 has been in the "Terminating" state for 21h. Below are the logs and node-side details for one of its pods.

# oc logs hellopods79 -n clusterprojecta37
serving on 8080
serving on 8888

# oc get pods --all-namespaces -o wide | grep hellopods79
clusterprojecta37   hellopods79   0/1   Terminating   0   21h   <none>   dhcp7-100.example.net

root@dhcp6-26: ~ # oc describe pod hellopods79 -n clusterprojecta37
Name:                     hellopods79
Namespace:                clusterprojecta37
Security Policy:          anyuid
Node:                     dhcp7-100.example.net/172.16.7.100
Start Time:               Mon, 21 Nov 2016 10:14:54 -0500
Labels:                   name=hello-openshift
Status:                   Terminating (expires Mon, 21 Nov 2016 11:18:58 -0500)
Termination Grace Period: 0s
IP:
Controllers:              <none>
Containers:
  hello-openshift:
    Container ID:   docker://15101b252c0f342778c9a366a1f2ff5d5c1e010e781d1155a1d2ea2f71bbf416
    Image:          openshift/hello-openshift:v1.0.6
    Image ID:       docker-pullable://docker.io/openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11
    Port:           8080/TCP
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 21 Nov 2016 10:15:48 -0500
      Finished:     Mon, 21 Nov 2016 11:19:00 -0500
    Ready:          False
    Restart Count:  0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xl1wq (ro)
    Environment Variables:  <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  default-token-xl1wq:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-xl1wq
QoS Class:      BestEffort
Tolerations:    <none>
No events.

# ssh dhcp7-100.example.net
Last login: Tue Nov 22 07:35:05 2016 from dhcp6-26.example.net
root@dhcp7-100: ~ # docker ps -a
CONTAINER ID   IMAGE                                                     COMMAND              CREATED        STATUS                    PORTS   NAMES
15101b252c0f   openshift/hello-openshift:v1.0.6                          "/hello-openshift"   21 hours ago   Exited (2) 20 hours ago           k8s_hello-openshift.d809ad9d_hellopods79_clusterprojecta37_3fbe5360-affd-11e6-82e7-001a4b1004c3_ee3b591b
4e8543c04f34   registry.ops.openshift.com/openshift3/ose-pod:v3.4.0.27   "/pod"               21 hours ago   Exited (0) 20 hours ago           k8s_POD.17e5e6be_hellopods79_clusterprojecta37_3fbe5360-affd-11e6-82e7-001a4b1004c3_7210c77d

# ssh dhcp7-100.example.net
Last login: Tue Nov 22 07:35:39 2016 from dhcp6-26.example.net
root@dhcp7-100: ~ # systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─custom.conf
   Active: active (running) since Sun 2016-11-20 08:54:26 EST; 1 day 22h ago
     Docs: http://docs.docker.com
 Main PID: 22859 (dockerd-current)
   Memory: 71.7M
   CGroup: /system.slice/docker.service
           ├─22859 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt nati...
           └─22865 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --shim docker-containerd-shim --metrics-interval=0 --start-timeout 2m --state-dir /var/...
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.420126652-05:00" level=error msg="Handler for POST /containers/0196c977c232262590b62f4bad...y stopped"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.420150507-05:00" level=error msg="Handler for POST /containers/0196c977c232262590b62f4bad...y stopped"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.433825906-05:00" level=info msg="{Action=stop, LoginUID=4294967295, PID=122522}"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.434263245-05:00" level=error msg="Handler for POST /containers/cbac17c069cef22851ea68b8b4...y stopped"
Nov 22 06:59:15 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:15.434316705-05:00" level=error msg="Handler for POST /containers/cbac17c069cef22851ea68b8b4...y stopped"
Nov 22 06:59:40 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:40.276733032-05:00" level=info msg="{Action=remove, LoginUID=4294967295, PID=122522}"
Nov 22 06:59:40 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:40.483223665-05:00" level=info msg="{Action=remove, LoginUID=4294967295, PID=122522}"
Nov 22 06:59:43 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:43.000590367-05:00" level=error msg="Handler for GET /containers/cbac17c069cef22851ea68b8b4c...c0b16b5b7"
Nov 22 06:59:43 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T06:59:43.000627456-05:00" level=error msg="Handler for GET /containers/cbac17c069cef22851ea68b8b4c...c0b16b5b7"
Nov 22 07:40:19 dhcp7-100.example.net dockerd-current[22859]: time="2016-11-22T07:40:19.332204433-05:00" level=info msg="{Action=logs, LoginUID=4294967295, PID=122522}"
Hint: Some lines were ellipsized, use -l to show in full.

root@dhcp7-100: ~ # systemctl status atomic-openshift-node
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Mon 2016-11-21 14:48:09 EST; 16h ago
     Docs: https://github.com/openshift/origin
 Main PID: 122522 (openshift)
   Memory: 44.1M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─122522 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─122778 journalctl -k -f

Nov 22 07:41:14 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:14.820855  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:14 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:14.820909  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:24 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:24.856654  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:24 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:24.856785  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:34 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:34.918107  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:34 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:34.918213  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:45 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:45.031724  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:45 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:45.031791  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:55 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:55.072071  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Nov 22 07:41:55 dhcp7-100.example.net atomic-openshift-node[122522]: I1122 07:41:55.072119  122522 conversion.go:133] failed to handle multiple devices for container. Skipping F...stem stats
Hint: Some lines were ellipsized, use -l to show in full.

root@dhcp7-100: ~ # docker info
Containers: 2
 Running: 0
 Paused: 0
 Stopped: 2
Images: 30
Server Version: 1.12.2-rc2
Storage Driver: devicemapper
 Pool Name: docker_vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 11.91 GB
 Data Space Total: 22.42 GB
 Data Space Available: 10.51 GB
 Metadata Space Used: 1.626 MB
 Metadata Space Total: 25.17 MB
 Metadata Space Available: 23.54 MB
 Thin Pool Minimum Free Space: 2.242 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-09-28)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: host null bridge overlay
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 7.638 GiB
Name: dhcp7-100.example.net
ID: KJ5J:FJPP:5HCT:CLM2:5V2S:TKDG:UVOE:VCAV:WG4G:IQVD:FACE:BOBN
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.ops.openshift.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 registry.ops.openshift.com
 127.0.0.0/8
Registries: registry.ops.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)
root@dhcp7-100: ~ #
Jordan, this looks like: https://github.com/kubernetes/kubernetes/issues/36891
(except that the pods hanging in the Terminating state do seem to have the deletionTimestamp field set)
Adding the requested YAML for the failed pods: http://ekuric.usersys.redhat.com/jmencak/BZ1397356/oc_get_pods.yaml
> Jordan, this looks like:
>
> https://github.com/kubernetes/kubernetes/issues/36891

No. In that issue the remaining items have no deletionTimestamps, and the namespace is actually gone (not stuck in Terminating).
It is possible to schedule pods onto a node that still has pods in the "Terminating" state, e.g.:

# oc get pods --all-namespaces -o wide | grep dhcp8-241.example.net
clusterprojecta11   hellopods80   0/1   Terminating   0   1d   <none>          dhcp8-241.example.net
clusterprojectc6    hellopods0    1/1   Running       0   1m   172.20.160.58   dhcp8-241.example.net
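One way to double-check that observation is to schedule a throwaway pod onto the affected node via its hostname label. This is only a sketch: the pod name is made up, and it assumes the node's kubernetes.io/hostname label matches the node name shown above.

cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: schedcheck
spec:
  nodeSelector:
    kubernetes.io/hostname: dhcp8-241.example.net   # node that still hosts a Terminating pod
  containers:
  - name: hello
    image: openshift/hello-openshift:v1.0.6
EOF

If this pod reaches Running, the stuck Terminating pods are not blocking the scheduler for that node.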
I'm currently trying to work around this issue by first slowly (1 rc/second) scaling the pods down to 0 via their replication controllers within the project, and only after that deleting the project. This worked for me in the same environment (same setup, different cluster) with 100 projects and 18 pods per project. I'm not sure it works every time; I've tried it only once so far. A rough sketch follows.
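A minimal sketch of that workaround, assuming the projects follow the nginx-NN naming from comment #1 (the grep pattern, sleep intervals, and wait loop are illustrative, not the exact commands used):

for project in $(oc get projects -o name | sed 's|.*/||' | grep '^nginx-'); do
  # scale every replication controller in the project down to zero, roughly one per second
  for rc in $(oc get rc -n "$project" -o name); do
    oc scale "$rc" -n "$project" --replicas=0
    sleep 1
  done
  # wait for the pods to actually disappear before removing the project
  while [ "$(oc get pods -n "$project" --no-headers 2>/dev/null | wc -l)" -gt 0 ]; do
    sleep 5
  done
  oc delete project "$project"
done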
I am running into this issue in the scale lab; it happened when I tried deleting 200 projects.
Created attachment 1226065 [details]
node logs

Attaching the logs from one of the nodes where a pod is stuck in Terminating.
Can you add the pod YAML as well?
Created attachment 1226070 [details]
pod YAML file
Created attachment 1226501 [details]
pods JSON

oc get --raw=/api/v1/nodes/10.10.1.108/proxy/pods | python -mjson.tool >> 108_pods.json
After more analysis, the problem does appear to be on the API server.

For the pod in question, a force delete via:

$ oc delete pod <pod-name> --grace-period=0 --loglevel=8

never appears to actually delete the pod if it
1. already has gracePeriodSeconds=0, and
2. already has a deletionTimestamp set.

There are potentially two bugs here:
1. How did a pod end up in this state?
2. Why can the operator not recover from it by force deletion?

This gives us an easier recreate scenario to debug further.
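For anyone trying to confirm a pod is in exactly that state before attempting the force delete, one quick check (the pod and namespace are the ones from this report; field paths are the standard v1 Pod metadata fields):

# show the deletion fields of the stuck pod
oc get pod hellopods79 -n clusterprojecta37 \
   -o jsonpath='deletionTimestamp={.metadata.deletionTimestamp} deletionGracePeriodSeconds={.metadata.deletionGracePeriodSeconds}{"\n"}'

# then attempt the force delete described above
oc delete pod hellopods79 -n clusterprojecta37 --grace-period=0 --loglevel=8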
Upstream PR against 1.5 to fix the server-side graceful-deletion error: https://github.com/kubernetes/kubernetes/pull/37721
This has been merged into OCP and is in OCP v3.4.0.33 or newer.
Could you help verify that the issue is fixed in your environment?
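One possible verification pass on the fixed build (just a sketch, not the formal test plan; the project count, pod count, names, and wait time are arbitrary):

# create and populate a batch of throwaway projects
for i in $(seq 1 20); do
  oc new-project "verify-${i}"
  for j in $(seq 1 20); do
    oc run "hello-${j}" -n "verify-${i}" \
       --image=openshift/hello-openshift:v1.0.6 --restart=Never
  done
done

# once the pods are Running, delete all the projects and check for stragglers
for i in $(seq 1 20); do oc delete project "verify-${i}"; done
sleep 600
oc get pods --all-namespaces | grep Terminating || echo "no pods stuck in Terminating"
oc get projects | grep Terminating || echo "no projects stuck in Terminating"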
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066