Description of problem:
The docker-cleanup script is unable to delete containers because the serviceaccount secrets filesystem is busy:

Error response from daemon: Unable to remove filesystem for 1c2ae6b10fe6fe66af762e4177e57f748995837b0d2f61695181e0573263f37c: remove /var/lib/docker/containers/1c2ae6b10fe6fe66af762e4177e57f748995837b0d2f61695181e0573263f37c/secrets/kubernetes.io/serviceaccount: device or resource busy
Error response from daemon: Unable to remove filesystem for 7048c117f5a8887c49e561188db76a10b38dc1ba986d553be45244948bc91125: remove /var/lib/docker/containers/7048c117f5a8887c49e561188db76a10b38dc1ba986d553be45244948bc91125/secrets/kubernetes.io/serviceaccount: device or resource busy
Error response from daemon: Unable to remove filesystem for e965cc54c38e5332cde6d6e175f3e662451ab2756a10302d9daee49cc51d5234: remove /var/lib/docker/containers/e965cc54c38e5332cde6d6e175f3e662451ab2756a10302d9daee49cc51d5234/secrets/kubernetes.io/serviceaccount: device or resource busy

Tried the steps here: https://access.redhat.com/solutions/2840311
The steps in that article fail to kill the process.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
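As a triage aid, here is a minimal sketch of how one might find which processes (possibly in other mount namespaces) still hold the serviceaccount secret mount for a failing container. The container ID below is taken from the first error above; this is a generic approach and may differ from the exact steps in the linked article.

# Container ID from the "Unable to remove filesystem" error.
CID=1c2ae6b10fe6fe66af762e4177e57f748995837b0d2f61695181e0573263f37c

# List the mountinfo files (and therefore the PIDs) of processes that still
# have the container's secrets directory mounted in their namespace.
grep -l "$CID/secrets/kubernetes.io/serviceaccount" /proc/*/mountinfo 2>/dev/null

# Cross-check against the host's mount table.
findmnt -o TARGET,SOURCE | grep "$CID"

Any PID that turns up here can then be inspected further, which is what the triage below does for the PIDs reported in the support ticket.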
Some first-level triage. All the processes in the support ticket that are holding the mount points are in uninterruptible sleep, as indicated by the D in the STAT column:

# ps -f 5380 5381 15850 109932
UID        PID     PPID   C  STIME TTY STAT TIME     CMD
root       5380    1      0  Jan18 ?   Dl   0:00     java -Xmx256m -Djava.library.path=/opt/draios/lib -Dsun.rmi.transport.connectionTimeout=2000 -Dsun.rmi.transport.tc
root       5381    1      0  Jan18 ?   D    0:00     statsite -f /opt/draios/etc/statsite.ini
1000090+   15850   107147 0  Jan18 ?   D    0:00     du -sh _state indices node.lock
1000090+   109932  109905 13 2016  ?   Dl   20105:32 /bin/java -Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccu

This is why they are not dying in response to a kill -9. I'm not sure whether the customer system is still in this state, but it is likely that an NFS server or block device is not responding. "cat /proc/<pid>/stack" might shed some light on where in the kernel the processes are stuck.
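A quick sketch of those checks, assuming root access on the affected node; the PIDs are the ones from the ps output above, and on the customer system the actual stuck PIDs would be substituted.

# Find all processes currently in uninterruptible sleep (STAT contains "D"),
# including the kernel function they are waiting in.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# Dump the kernel stack of each stuck PID to see where it is blocked
# (e.g. in NFS or block-layer code).
for pid in 5380 5381 15850 109932; do
    echo "=== PID $pid ==="
    cat /proc/$pid/stack
done

If the stacks show the processes waiting on NFS or block I/O, the hang is on the storage side and the containers cannot be cleaned up until that I/O completes or the node is rebooted.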
Closing bug as customer case is resolved.