Bug 1205454 - Race condition with RegisterMachine and containers that exit quickly
Summary: Race condition with RegisterMachine and containers that exit quickly
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: docker
Version: 22
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Sally
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-03-24 22:42 UTC by Colin Walters
Modified: 2016-07-19 13:11 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 13:11:41 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Colin Walters 2015-03-24 22:42:15 UTC
With docker-io-1.4.1-27.git76baa35.fc22.x86_64 "docker run busybox echo hello" works fine.  It fails with: docker-1.5.0-24.git5ebfacd.fc22.x86_64

Looks like the fatal error below may be:
Failed to determine unit of process 1074 : Invalid argument
An issue with the systemd cgroup integration?

Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job wait(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="POST /v1.18/containers/179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543/wait"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job attach(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543) = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job release_interface(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543) = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job release_interface(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=warning msg="Your kernel does not support OOM notifications: open /sys/fs/cgroup/memory/system.slice/docker-179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543.scope/memory.oom_control: no such file or directory"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job start(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543) = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=error msg="Unable to RegisterMachine hopeful_hodgkin for 179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543: Failed to determine unit of process 1074 : Invalid argument"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job log(start, 179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543, busybox:latest) = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job log(start, 179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543, busybox:latest)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job allocate_interface(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543) = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job allocate_interface(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job start(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="POST /v1.18/containers/179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543/start"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job attach(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job container_inspect(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543) = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job container_inspect(179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="POST /v1.18/containers/179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543/attach?stderr=1&stdout=1&stream=1"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job create() = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="-job log(create, 179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543, busybox:latest) = OK (0)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job log(create, 179a5460785b92c923ecdeecc52ee81119a1cf85a6c1bf52c32a3321bfac9543, busybox:latest)"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="+job create()"
Mar 24 22:32:20 localhost.localdomain docker[826]: time="2015-03-24T22:32:20Z" level=info msg="POST /v1.18/containers/create"

Comment 1 Colin Walters 2015-03-26 19:45:09 UTC
Ok, I think this is a race condition between RegisterMachine and a container that exits quickly.

This happens with a simple "docker run centos true", but works fine with "docker run centos sleep 5".

Comment 2 Daniel Walsh 2015-03-26 20:24:26 UTC
Are you just seeing the error in the logs?  Or are you seeing a fatal error.

We saw this in testing, and just basically ignored the failure.

You are correct that we register with a pid, which could have disappeared before we register the machine.

Comment 3 Colin Walters 2015-03-26 21:53:53 UTC
(In reply to Daniel Walsh from comment #2)
> Are you just seeing the error in the logs?  Or are you seeing a fatal error.
>
> We saw this in testing, and just basically ignored the failure.

I have a wedged container that appears running but can't be stopped.
 
> You are correct that we register with a pid, which could have disappeared
> before we register the machine.

I think the parent process here needs to avoid calling waitpid() on it until the RegisterMachine call has returned.

Ah, looking at the code there's even a comment for this.  

Although, I'm not sure offhand why this causes an issue if we're just logging an error.

Comment 4 Daniel Walsh 2015-03-27 14:44:06 UTC
I am not sure we can get the pid of the child process and register the machine before the process exits.  I though the WaitFor

Comment 5 Daniel Walsh 2015-04-14 14:21:42 UTC
Is there a way for us to know the error is PID does not exist, in which case we can just quietly ignore the error.

Comment 6 Colin Walters 2015-04-14 15:17:58 UTC
(In reply to Daniel Walsh from comment #5)
> Is there a way for us to know the error is PID does not exist, in which case
> we can just quietly ignore the error.

That would cover up the race condition here where the pid is recycled and we end up calling RegisterMachine for someone else's process.  I'd much prefer the docker daemon avoided calling waitpid() until RegisterMachine was done.

Comment 7 Daniel Walsh 2015-04-14 18:32:19 UTC
I am trying to move it to the StartCallback, which looks like it should be called before waitpid, but I still see the problem 1 out of 5 tries.  :^(  I am asking Michael Crosby if he has any ideas.

Comment 8 Dusty Mabe 2015-05-07 16:07:27 UTC
This happens for Fedora-Cloud-Atomic-22_Beta_TC8-20150405.x86_64.qcow2:

-bash-4.3# atomic host status
  TIMESTAMP (UTC)         VERSION   ID             OSNAME            REFSPEC                                                
* 2015-04-15 12:50:37     22.39     a8d8656489     fedora-atomic     fedora-atomic:fedora-atomic/f22/x86_64/docker-host
-bash-4.3# rpm -q docker
docker-1.5.0-25.git5ebfacd.fc22.x86_64

The command I am trying to run is:

docker run -it --rm busybox true && echo "PASS"

The logs give me:

time="2015-05-07T15:54:43Z" level=error msg="Unable to RegisterMachine mad_bartik for def45f097c895cec20c407eea09bf6887caa08750903fc7624097261ce404147: Failed to determine unit of process 1150 : Invalid argument"

Comment 10 Sally 2015-07-07 18:03:53 UTC
I just looked at this (again) and I am unable to reproduce.  

Docker 1.6.2, RHEL 7

I don't get any errors with either:

docker run -it --rm busybox true && echo "PASS"

docker run busybox echo hello

Anything else/other versions/distributions I should try?

Comment 11 Sally 2015-07-07 18:14:10 UTC
and also can't reproduce with Docker 1.7 Fedora 21

Comment 12 Daniel Walsh 2015-07-08 13:24:46 UTC
You were not testing with my patches.  Make sure you use the systems docker to see this not just vanilla docker from github

Comment 13 Sally 2015-07-10 19:27:45 UTC
I submitted https://github.com/rhatdan/docker/pull/88

Comment 14 Daniel Walsh 2015-08-21 03:32:25 UTC
Now that we have this working with hooks. Sally you need to get this added to the docker package.

Comment 15 Sally 2015-09-03 08:40:23 UTC
Using Mrunal's hooks (https://github.com/mrunalp/runc/tree/feature/hooks) I have a branch (https://github.com/sallyom/runc/tree/registermachine) that implements RegisterMachine/TerminateMachine with no race condition.  Will open PR once hooks are merged.

Comment 16 Daniel Walsh 2015-09-03 10:39:09 UTC
Docker-1.8 should no longer have the registry machine patch.  As Sally states we are attempting to do this a different way in the future.

Comment 17 Lokesh Mandvekar 2016-02-16 17:16:45 UTC
Probably fixed long back. Putting this ON_QA so we can close this one out soon. We have docker 1.10.1 in testing repos.

Comment 18 Fedora End Of Life 2016-07-19 13:11:41 UTC
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.


Note You need to log in before you can comment on or make changes to this bug.