Bug 1368267
Summary: | [extras-rhel-7.2.7] docker-latest-1.12 daemon crashes unrecoverably | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Ed Santiago <santiago> |
Component: | docker-latest | Assignee: | Antonio Murdaca <amurdaca> |
Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.2 | CC: | amurdaca, dwalsh, lsm5, lsu, mpatel, vgoyal |
Target Milestone: | rc | Keywords: | Extras, ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | docker-latest-1.12.0-15.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-09-15 08:29:34 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Comment 2
Ed Santiago
2016-08-19 01:26:24 UTC
Another docker-autotest run without a daemon crash. I'll tentatively assume this is a problem with my old virt. But I'm starting another test run now.

Spoke too soon (comment 3). This last run of docker-autotest killed the daemon. Similar symptoms as in description, but FWIW it was in a much earlier test within docker-autotest; i.e. it does not seem to be correlated to any specific subtest.

Can you attach the journal logs about docker?

Ed, could you remove oci-register-machine? You might have to rm -f /usr/libexec/oci/hooks.d/oci-register-machine. We are not going to support this for RHEL7. Also make sure the docker-latest unit file is running with MountFlags=slave, that the docker-containerd.service file does not exist, and that docker-containerd is not running when you stop docker.service.

# mv /usr/libexec/oci/hooks.d/oci-register-machine{,.DELETE-ME}
# grep MountFlags /usr/lib/systemd/system/docker-latest.service
MountFlags=slave
# ls -l /usr/lib/systemd/system/*contain*
-rw-r--r--. 1 root root 791 Jun 14 09:36 /usr/lib/systemd/system/container-getty@.service
# grep -ir containerd /usr/lib/systemd/system/
# ps auxww|grep container
root 22919 0.0 0.0 112648 980 pts/2 R+ 08:59 0:00 grep --color=auto container
# systemctl restart docker-latest
...and restarting docker-autotest. Normal run takes ~40m, expect followup then (or, if things crash, earlier).

I still see oci-register-machine in that log?

I rebooted 35 minutes ago, wiped /var/lib/docker-latest, ran docker-latest-storage-setup --reset, started a new run of docker-autotest. It completed with errors but no actual docker daemon crashes. Second run killed the daemon. And even after reboot, I too am seeing:
# journalctl -b | grep oci-register|tail
...
Aug 19 10:29:56 esm-rhel7-d12-3.localdomain oci-register-machine[15436]: 2016/08/19 10:29:56 Register machine: prestart eff8cfca5e7cc2674bcc1688c6cbb148848a9eb4a0e67e710d18c9ee67303134 15425 /var/lib/docker-latest/devicemapper/mnt/36924cbac77b8cc89ebf829157c47dbddcd0bf21321213a5028d2a3aafc0421f/rootfs
# find / -xdev -name '*oci*register*'
/var/lib/yum/yumdb/o/46369ab3dbeecdced9196135840b73c6f41f86d6-oci-register-machine-0-1.7.git31bbcd2.el7-x86_64
/usr/share/doc/oci-register-machine-0
/usr/share/doc/oci-register-machine-0/oci-register-machine.1.md
/usr/share/man/man1/oci-register-machine.1.gz
/usr/share/licenses/oci-register-machine-0
/usr/libexec/oci/hooks.d/oci-register-machine.DELETE-ME
I've just tried rm -f that last one, restarted docker-latest, restarted docker-autotest.

Yes, docker will execute anything in the hooks.d directory. So get rid of it.

Three successful(*) docker-autotest runs since deleting oci-register-machine.
(*) as in no daemon crashes

Yet another reason to remove oci-register-machine.

Followup: many test runs later, on two machines, no further daemon crashes seen. And FWIW no other instances of the sporadic "docker attach" failure that was happening on (some) runs. I'm tentatively going to conclude that those were related to oci-register-machine.

Sounds good, but my question is why? Especially the crashes. I can see oci-register-machine failing because of some race condition. In docker-1.12, runc/oci-register-machine are running as separate processes from docker and docker-containerd, so how would they cause the daemon to crash?

I wish I knew! But the daemon crash hasn't happened since removing oci-register-machine. I've kept running tests through the weekend, still no daemon crash. I haven't tried reinstalling oci-register-machine but suspect that if I do the crash will come back. Is there anything I can do on my end to try to track down the connection?

Antonio and Mrunal, any ideas?
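The renamed hook (oci-register-machine.DELETE-ME) kept firing, which fits the remark that docker runs anything in hooks.d. A minimal sketch of a check for leftover hooks follows. `list_hooks` is a hypothetical helper written for this illustration, not a docker or runc command, and it assumes that any executable regular file in the directory is a candidate for execution regardless of its name.

```shell
#!/bin/bash
# Sketch only: list_hooks is a hypothetical helper, not part of docker or runc.

# Print the name of every executable regular file directly inside the given
# hooks directory. The test looks at the executable bit, not the file name,
# so a renamed copy such as "oci-register-machine.DELETE-ME" is still listed.
list_hooks() {
    local dir="$1" f
    # A missing directory simply means there is nothing to report.
    [ -d "$dir" ] || return 0
    for f in "$dir"/*; do
        if [ -f "$f" ] && [ -x "$f" ]; then
            printf '%s\n' "${f##*/}"
        fi
    done
}
```

On the machines in this thread the call would be `list_hooks /usr/libexec/oci/hooks.d`; empty output would suggest no hook is left to run.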
# yum reinstall oci-register-machine
# systemctl restart docker-latest
# ./autotest-local run docker
...
Failed within minutes.
# systemctl status docker-latest -l
● docker-latest.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker-latest.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Docs: http://docs.docker.com
Aug 22 09:37:09 esm-rhel7-d12-3.localdomain dockerd-latest[15844]: time="2016-08-22T09:37:09.311440508-04:00" level=error msg="containerd: notify OOM events" error="no init process found"
Aug 22 09:37:11 esm-rhel7-d12-3.localdomain dockerd-latest[15844]: time="2016-08-22T09:37:11.321688341-04:00" level=error msg="Create container failed with error: mkdir /var/run/docker/libcontainerd/containerd/ea859eefd708ee3b3827c50c73b7bc5368110113d4af2b541ef02e65b18f4864: file exists"
Aug 22 09:37:11 esm-rhel7-d12-3.localdomain dockerd-latest[15844]: time="2016-08-22T09:37:11.986123078-04:00" level=error msg="Handler for POST /v1.24/containers/ea859eefd708ee3b3827c50c73b7bc5368110113d4af2b541ef02e65b18f4864/start returned error: mkdir /var/run/docker/libcontainerd/containerd/ea859eefd708ee3b3827c50c73b7bc5368110113d4af2b541ef02e65b18f4864: file exists"
Aug 22 09:37:11 esm-rhel7-d12-3.localdomain dockerd-latest[15844]: time="2016-08-22T09:37:11.986836677-04:00" level=error msg="Handler for POST /v1.24/containers/ea859eefd708ee3b3827c50c73b7bc5368110113d4af2b541ef02e65b18f4864/start returned error: mkdir /var/run/docker/libcontainerd/containerd/ea859eefd708ee3b3827c50c73b7bc5368110113d4af2b541ef02e65b18f4864: file exists"
Aug 22 09:37:11 esm-rhel7-d12-3.localdomain dockerd-latest[15844]: time="2016-08-22T09:37:11.988595236-04:00" level=info msg="{Action=remove, Username=root, LoginUID=0, PID=22083}"
Aug 22 09:37:11 esm-rhel7-d12-3.localdomain dockerd-latest[15844]: time="2016-08-22T09:37:11.991408617-04:00" level=info msg="stopping containerd after receiving terminated"
Aug 22 09:37:13 esm-rhel7-d12-3.localdomain systemd[1]: Stopped Docker Application Container Engine.
Aug 22 09:49:38 esm-rhel7-d12-3.localdomain systemd[1]: [/usr/lib/systemd/system/docker-latest.service:19] Unknown lvalue 'TasksMax' in section 'Service'
Aug 22 09:49:42 esm-rhel7-d12-3.localdomain systemd[1]: [/usr/lib/systemd/system/docker-latest.service:19] Unknown lvalue 'TasksMax' in section 'Service'
Aug 22 09:58:53 esm-rhel7-d12-3.localdomain systemd[1]: [/usr/lib/systemd/system/docker-latest.service:19] Unknown lvalue 'TasksMax' in section 'Service'

Removing the dependency on oci-register-machine isn't enough, because docker and atomic both bring it in, so there will already be a /usr/libexec/oci/hooks.d/oci-register-machine when docker-latest gets installed.

Confirmed by installing and testing docker-latest-1.12.0-12.el7 on fresh virt. (By "confirmed" I mean my sentence above, i.e. the crashing bug persists, with docker daemon dying in the middle of a test run, until I manually rm -f /usr/libexec/oci/hooks.d/oci-register-machine. I cannot yum remove oci-register-machine because that uninstalls docker and atomic.)

I can remove oci-register-machine from docker 1.10.3 if Dan agrees. Also, atomic depends on docker (I think there's a separate bug to remove this dep), which then pulls in oci-register-machine, so removing it from docker should solve that.

I don't think removing the Requires from docker 1.10 is enough: oci-register-machine will not be removed on upgrade. I suspect the only real solution[*] is:
docker-1.10.spec:
- Requires: oci-register-machine >= 1:0-1.7
docker-latest-1.12.spec:
+ Conflicts: oci-register-machine
[*] until someone figures out the true cause of the problem

(In reply to Ed Santiago from comment #25)
> I don't think removing the Requires from docker 1.10 is enough:
> oci-register-machine will not be removed on upgrade. I suspect the only real
> solution[*] is:
>
> docker-1.10.spec:
> - Requires: oci-register-machine >= 1:0-1.7
>
> docker-latest-1.12.spec:
> + Conflicts: oci-register-machine
>
> [*] until someone figures out the true cause of the problem
True, I'll add this for now. Thanks Dan, please let me know if you think this shouldn't be done.

No joy with -13 build. I guess Conflicts doesn't work the way I thought it did:
# yum update
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Resolving Dependencies
--> Running transaction check
---> Package docker.x86_64 0:1.10.3-46.el7.11 will be updated
---> Package docker.x86_64 0:1.10.3-46.el7.12 will be an update
---> Package docker-common.x86_64 0:1.10.3-46.el7.11 will be updated
---> Package docker-common.x86_64 0:1.10.3-46.el7.12 will be an update
---> Package docker-latest.x86_64 0:1.12.0-12.el7 will be updated
---> Package docker-latest.x86_64 0:1.12.0-13.el7 will be an update
---> Package docker-rhel-push-plugin.x86_64 0:1.10.3-46.el7.11 will be updated
---> Package docker-rhel-push-plugin.x86_64 0:1.10.3-46.el7.12 will be an update
---> Package docker-selinux.x86_64 0:1.10.3-46.el7.11 will be updated
---> Package docker-selinux.x86_64 0:1.10.3-46.el7.12 will be an update
--> Processing Conflict: docker-latest-1.12.0-13.el7.x86_64 conflicts oci-register-machine
--> Finished Dependency Resolution
Error: docker-latest conflicts with 1:oci-register-machine-0-1.7.git31bbcd2.el7.x86_64
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
Basically: oci-register-machine is already installed, because of docker. Even though the yum-updated docker no longer requires oci-register-machine, it's installed, and yum is not removing it automatically.

Yes, oci-register-machine should be removed from all dependencies until we figure out these problems.

So oci-register-machine was shipped in a previous version of RHEL.
I think the easiest fix would be to obsolete oci-register-machine for now, which should cause it to get removed if installed. We can remove this in a future release.

(In reply to Ed Santiago from comment #27)
> No joy with -13 build. I guess Conflicts doesn't work the way I thought it did:
>
> # yum update
> Loaded plugins: product-id, search-disabled-repos, subscription-manager
> Resolving Dependencies
> --> Running transaction check
> ---> Package docker.x86_64 0:1.10.3-46.el7.11 will be updated
> ---> Package docker.x86_64 0:1.10.3-46.el7.12 will be an update
> ---> Package docker-common.x86_64 0:1.10.3-46.el7.11 will be updated
> ---> Package docker-common.x86_64 0:1.10.3-46.el7.12 will be an update
> ---> Package docker-latest.x86_64 0:1.12.0-12.el7 will be updated
> ---> Package docker-latest.x86_64 0:1.12.0-13.el7 will be an update
> ---> Package docker-rhel-push-plugin.x86_64 0:1.10.3-46.el7.11 will be updated
> ---> Package docker-rhel-push-plugin.x86_64 0:1.10.3-46.el7.12 will be an update
> ---> Package docker-selinux.x86_64 0:1.10.3-46.el7.11 will be updated
> ---> Package docker-selinux.x86_64 0:1.10.3-46.el7.12 will be an update
> --> Processing Conflict: docker-latest-1.12.0-13.el7.x86_64 conflicts oci-register-machine
> --> Finished Dependency Resolution
> Error: docker-latest conflicts with 1:oci-register-machine-0-1.7.git31bbcd2.el7.x86_64
> You could try using --skip-broken to work around the problem
> You could try running: rpm -Va --nofiles --nodigest
>
> Basically: oci-register-machine is already installed, because of docker.
> Even though the yum-updated docker no longer requires oci-register-machine,
> it's installed, and yum is not removing it automatically.
Ha yup, we should be adding Obsoletes. I thought you meant to say "let the user deal with removing oci-register-machine if they want". I'm adding Obsoletes in the next build.

-14 : not quite right yet. On a fresh virt, yum install docker docker-latest is fine because it doesn't bring in oci-register-machine. On an existing install, though, oci-register-machine is not removed:
# rpm -qa|grep oci-register
oci-register-machine-0-1.7.git31bbcd2.el7.x86_64
# rpm -q --obsoletes docker-latest-1.12.0-14.el7
oci-register-machine <= 1:0-1.7
docker-storage-setup <= 0.5-3

I want to move this back to assigned. I want to update the version of oci-register-machine to include a config file which can be used to disable the hook. Then we can remove this hacking around obsoletes and requires. Then we can turn it back on in a future release and continue debugging what is causing the problem.

SGTM. To make sure I understand correctly:
1) New build of oci-register-machine
a) V-R = 0-1.8.el7 I presume?
b) will the config file be set to disable by default? If not, will there be a prominent release note?
2) New build of docker
- Obsoletes: oci-register-machine <= 1:0-1.7
+ Requires: oci-register-machine >= 1:0-1.8
** or **
+ Obsoletes: oci-register-machine < 1:0-1.8
?
(The first keeps oci-register-machine mandatory; the second removes the non-config-file version but allows the config-file version without necessarily installing it. Assuming I've understood correctly.) Which approach do you suggest?

Yes, it should be set to disabled by default by the oci-register-machine package. Use Requires, not Obsoletes. We want to keep this package in atomic host, but disable it by default, for now.

I think there's a typo in the requires: <= should be >= :
# yum install docker-latest
...
Error: Package: docker-latest-1.12.0-15.el7.x86_64 (Internal-Extras-Repository-2)
Requires: oci-register-machine <= 1:0-1.8
Removing: 1:oci-register-machine-0-1.7.git31bbcd2.el7.x86_64 (@rhel-7-server-extras-rpms)
oci-register-machine = 1:0-1.7.git31bbcd2.el7
Updated By: 1:oci-register-machine-0-1.8.gitaf6c129.el7.x86_64 (Internal-Extras-Repository-2)
oci-register-machine = 1:0-1.8.gitaf6c129.el7
Available: oci-register-machine-1.10.3-44.el7.x86_64 (rhel-7-server-extras-rpms)
oci-register-machine = 1.10.3-44.el7

-16 installed fine on update. Now running docker-autotest.

cat /etc/oci-register-machine.conf

# cat /etc/oci-register-machine.conf
# Disable oci-register-machine by setting the disabled field to true
disabled : true
I also ran 'journalctl -f | grep oci-r' during docker-autotest. No hits. Test run successful, other than already-filed BZs.

I've run the following tests, all starting with a fresh rhel-7.2-server-x86_64-updated image:
1) Fresh install of docker-latest
# [add Internal-Extras-Repository-2 repo]
# yum install docker docker-latest
2) Update of older docker-latest
# yum install docker docker-latest
# [add Internal-Extras-Repository-2 repo]
# yum update
3) Install new docker-latest on system with older docker
# yum install docker (brings in older docker & oci-register-machine)
# [add Internal-Extras-Repository-2 repo]
# yum install docker-latest (updates docker, oci-register-machine)
In all cases, docker-autotest ran with only the known exceptions for existing docker-1.12 bugs.

And this is with oci-register-machine installed but disabled, correct?

Yes. To be precise: oci-register-machine is always installed in all cases; it's brought in by docker. In all three situations listed in comment 39, upon completion of yum install or yum update, the file /etc/oci-register-machine.conf exists on the system with the line 'disabled : true' in it.
In all cases, after completion of docker-autotest, 'journalctl | grep oci-r' produces no results, suggesting that oci-register-machine is not being run.

In truth it is being run long enough to read its config file, which says do nothing.

BTW Kick off a case with the disabled: false and see if docker is still failing.

> BTW Kick off a case with the disabled: false and see if docker is still failing.
Done. It took an hour to fail, but fail it did.
For precision: I did not set disabled: false, I commented out the 'disabled' line. I confirmed that oci-register-machine is being run via journalctl | grep oci-r (many many matches)
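In this test the hook ran again once the 'disabled' line was commented out, so the state of the hook seems to hinge on an uncommented 'disabled : true' line in the config file. A rough sketch of that check follows, assuming the file is just the key/value line shown earlier in this thread. `hook_disabled` is a hypothetical helper for illustration; the real parser in oci-register-machine may accept other spellings.

```shell
#!/bin/bash
# Sketch only: hook_disabled is a hypothetical helper, and the pattern below
# is an assumption based on the single "disabled : true" line quoted in this
# bug, not on the hook's actual config parser.

# Succeed (exit 0) only if the file has an uncommented "disabled : true" line.
# A commented-out line, a missing file, or "disabled : false" all fail.
hook_disabled() {
    local conf="$1"
    [ -r "$conf" ] || return 1
    grep -Eq '^[[:space:]]*disabled[[:space:]]*:[[:space:]]*true[[:space:]]*$' "$conf"
}
```

A typical call would be `hook_disabled /etc/oci-register-machine.conf && echo "hook disabled"`, run before starting docker-autotest so the result of the run can be tied to a known hook state.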
When it fails, it is the docker daemon that fails, correct? I.e. dockerd is no longer running; is docker-containerd running?

One last test, if you could look at it, would be to take systemd-machined.service and remove the PrivateTmp and PrivateNetwork lines from the service.
systemctl daemon-reload
systemctl restart systemd-machined.service
Now rerun your tests and see if this kills the docker daemon.

(In reply to Daniel Walsh from comment #44)
> When it fails, it is the docker daemon that fails, correct? I.e. dockerd is
> no longer running; is docker-containerd running?
After the crash, neither dockerd-latest nor docker-containerd is running.

> One last test, if you could look at it, would be to take
> systemd-machined.service and remove the PrivateTmp and PrivateNetwork lines
> from the service.
> systemctl daemon-reload
> systemctl restart systemd-machined.service
>
> Now rerun your tests and see if this kills the docker daemon.
First test run completed, with unexpected failures in tests relating to docker start or docker restart. (I'm not planning to investigate those further unless requested; I'll chalk them up to the above changes.) Second run crashed with what (at first glance) looks like the usual crash. dockerd and docker-containerd are down.

Verified as fixed in docker-latest-1.12.1-2.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2016-1829.html |
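The packaging iterations in this bug (Conflicts, then Obsoletes, then a Requires with <= where >= was meant) suggest checking which relation a build actually carries before testing it. The sketch below assumes dependency lines of the form "name op version", the same shape as the `rpm -q --obsoletes` output quoted earlier. `dep_operator` is a hypothetical helper for illustration, not an rpm or yum subcommand.

```shell
#!/bin/bash
# Sketch only: dep_operator is a hypothetical helper. It reads dependency
# lines on stdin (for example from `rpm -q --requires PACKAGE`) and prints
# the comparison operator attached to the named dependency.
dep_operator() {
    local want="$1"
    # Unversioned dependencies have a single field and are skipped.
    awk -v want="$want" '$1 == want && NF >= 3 { print $2 }'
}
```

For example, `rpm -q --requires docker-latest | dep_operator oci-register-machine` would be expected to print `>=` for a build that requires the config-file version of the hook; a `<=` there would point at the typo described in the thread.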