Bug 1319378

Summary: Handle docker corruption effectively
Product: Red Hat Enterprise Linux 7
Reporter: Jaspreet Kaur <jkaur>
Component: docker
Assignee: Vivek Goyal <vgoyal>
Status: CLOSED CURRENTRELEASE
QA Contact: atomic-bugs <atomic-bugs>
Severity: high
Priority: high
Version: 7.2
CC: aos-bugs, aweiteka, bvincell, dwalsh, ghelleks, jokerman, lsm5, michael.voegele, mmccomas
Target Milestone: rc
Keywords: Extras
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-18 13:16:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Jaspreet Kaur 2016-03-19 05:21:47 UTC
3. What is the nature and description of the request?
The local docker registry can get corrupt. This should not happen; it should stay stable in all situations.

4. Why does the customer need this? (List the business requirements here)
To have a stable environment. To have lower cost in operations.

5. How would the customer like to achieve this? (List the functional requirements here) 
I don't know the details of how the management of the local docker registry is implemented, but before starting operations on the local docker registry, it should be checked for available space. If there is not enough space available and no space can be freed, an error should be reported instead of the registry becoming corrupt.
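The requested behavior could be sketched as a pre-flight check (a minimal illustration of the idea, not the actual implementation; check_pool and the threshold value are hypothetical):

```shell
#!/bin/sh
# Hypothetical pre-flight guard: before a storage operation, compare the
# thin pool's data usage against a threshold and report an error instead
# of proceeding into a state that could corrupt the pool.
check_pool() {
    used=$1        # current data_percent of the pool, as an integer
    threshold=$2   # refuse operations at or above this percentage
    if [ "$used" -ge "$threshold" ]; then
        echo "ERROR: thin pool at ${used}% >= ${threshold}%, refusing operation" >&2
        return 1
    fi
    echo "OK: thin pool at ${used}%"
}

# In a real setup the usage would come from LVM, e.g. (names illustrative):
#   used=$(lvs --noheadings -o data_percent rhel/docker-pool | cut -d. -f1)
```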

6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented. 
Decrease the size of the local docker registry on a node, so it is possible to fill it quickly. Play around with some pods (new-app, delete pod). The registry should never get corrupt (e.g. pvs does not show warnings).
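The test procedure above could be exercised along these lines (an operational sketch; pool and volume group names depend on the storage setup and are illustrative):

```shell
# Watch thin pool usage while the test runs:
lvs -o lv_name,data_percent,metadata_percent

# Generate churn: repeatedly create and delete applications/pods so
# image and container layers are written to and removed from the pool.

# After filling the pool, verify that LVM reports no warnings:
pvs
lvs
```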

7. Is there already an existing RFE upstream or in Red Hat Bugzilla? 
not yet

8. Does the customer have any specific timeline dependencies and which release would they like to target?
ASAP. Every customer would like to have STABLE environments.

9. Is the sales team involved in this request and do they have any additional input? Red Hat Consultant on site, account team fully aware of the request. 

no

10. List any affected packages or components.
docker, local docker registry

11. Would the customer be able to assist in testing this functionality if implemented? 
Yes

Comment 3 Daniel Walsh 2016-03-21 14:58:51 UTC
This seems to be pointed more at the docker registry than at docker.

Comment 6 Daniel Walsh 2016-04-04 13:14:33 UTC
Couldn't you have just removed some container images, and then it would start working again?

Comment 7 Jaspreet Kaur 2016-04-04 13:34:53 UTC
Hello,

No, it doesn't help once it is corrupted. It is also not an effective approach, as it might prevent deployment of applications that need those images for an existing project.

Regards,
Jaspreet

Comment 8 Daniel Walsh 2016-04-04 14:06:32 UTC
My point is there are probably a lot of junk images that you don't even know about.

atomic image prune

should get rid of dangling images that nothing is using. It would temporarily get you out of this situation and get your containers working again. Being able to expand the disk image would presumably also help.
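On plain docker, without the atomic CLI, a comparable cleanup can be approximated as follows (a sketch; run with care on a production node, since it removes exited containers and unreferenced image layers):

```shell
# Remove exited containers so their image layers are no longer referenced:
docker ps -aq -f status=exited | xargs -r docker rm

# Remove dangling images (layers not referenced by any tag or container):
docker images -q -f dangling=true | xargs -r docker rmi
```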

Comment 9 Vivek Goyal 2016-04-11 17:45:58 UTC
Does a reboot solve the problem? If the thin pool is full, xfs can hang indefinitely; to resolve that, one needs to add more storage to the thin pool, and as of now the system needs to be rebooted to get rid of the unkillable IO thread.

I think after a reboot, one can also first try to delete some images, and hopefully that will work. If not, we first need to add more storage to the thin pool, make sure it grows successfully, and then do further docker operations.
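Adding storage to the thin pool follows the usual LVM steps (a sketch; the device, volume group, and pool names below are illustrative and depend on the host's storage layout):

```shell
# Add a new physical volume to the volume group backing the pool:
pvcreate /dev/sdb
vgextend rhel /dev/sdb

# Grow the thin pool's data LV by 20 GiB (metadata may also need growing):
lvextend -L +20G rhel/docker-pool
```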

Comment 10 Vivek Goyal 2016-04-11 17:48:56 UTC
Once you run into this situation, please attach the following:

- journalctl output
- Preferably, output from the docker daemon running in debug mode (-D)
- output of the commands "lvs" and "vgs"
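On RHEL 7, the daemon's debug flag is typically enabled via the sysconfig file, and the requested diagnostics can then be gathered like this (a sketch; the exact OPTIONS line depends on what is already configured on the host):

```shell
# /etc/sysconfig/docker: add -D to the daemon options, e.g.
#   OPTIONS='--selinux-enabled -D'
# then restart the daemon:
systemctl restart docker

# Collect the requested diagnostics for the bug report:
journalctl -u docker > docker-journal.txt
lvs > lvs.txt
vgs > vgs.txt
```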

Comment 11 Daniel Walsh 2016-04-11 18:30:56 UTC
We need to document a better way to get out of this state.

You need to reboot, then run:
atomic images prune

Now if you still need more space, list the docker images and see if there are other images that can be removed.

Longer term, we have patches for docker-1.10 that will block docker pull and docker create when the storage is 90% used.

Comment 13 Daniel Walsh 2016-04-26 13:13:19 UTC
No, I am not saying that this will not happen in docker-1.10; we are just taking steps to make it less likely, giving users 10% of disk space as headroom to figure out that they are having a problem. This will block new containers and images from being installed, but will not prevent existing containers from growing.
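The behavior described here likely corresponds to the devicemapper dm.min_free_space storage option carried in the docker-1.10 builds; a sketch of how it might be set on RHEL 7 (the file and the 10% value mirror the default headroom mentioned above and are assumptions about the local configuration):

```shell
# /etc/sysconfig/docker-storage (illustrative): with the devicemapper
# driver, "docker pull" and "docker create" fail once free space in the
# thin pool drops below this fraction, instead of filling it completely.
DOCKER_STORAGE_OPTIONS="--storage-opt dm.min_free_space=10%"
```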

Comment 14 Jaspreet Kaur 2016-04-27 05:55:49 UTC
Thanks Daniel for the information.

But if docker gets corrupted by growing containers, there should be an easy way to get it back to a ready state. A reboot will not be an option for any of the users.

The only concern is that even if they take precautions and still end up in the corrupt state, there should be a resolution for it.

Comment 15 Daniel Walsh 2016-04-27 12:01:13 UTC
xfs going wild is a kernel issue that we cannot fix. I believe there is a kernel bug on it. The only way to fix this with current kernels is to reboot.

Comment 16 Jaspreet Kaur 2016-05-18 05:44:39 UTC
Hello Daniel, 

Can you please share the kernel Bugzilla on this?

Regards,
Jaspreet

Comment 18 Daniel Walsh 2016-05-19 14:08:49 UTC
https://github.com/docker/docker/issues/20707

Comment 19 Daniel Walsh 2016-05-19 14:09:23 UTC
Vivek, I could not google up a bugzilla on the kernel for this.  Do you know of any?

Comment 20 Vivek Goyal 2016-05-19 14:45:04 UTC
Dan, the following is one of the bugs that talked about xfs being full and leading to a hang.

https://bugzilla.redhat.com/show_bug.cgi?id=1240437

Comment 26 Daniel Walsh 2016-10-18 13:16:06 UTC
Since we are now shipping docker-1.10, I am going to close this as fixed in the current release.