Bug 1456138

Summary:

devicemapper error dm_task_set_cookie failed

Product:

Red Hat Enterprise Linux 7

Reporter:

liujia <jiajliu>

Component:

docker

Assignee:

Daniel Walsh <dwalsh>

Status:

CLOSED ERRATA

QA Contact:

atomic-bugs <atomic-bugs>

Severity:

high

Docs Contact:

Priority:

high

Version:

7.3

CC:

agk, amurdaca, anli, aos-bugs, bmeng, chaoyang, ddarrah, dmoessne, dwalsh, ghuang, gpei, gtirloni, hannsj_uhl, haowang, jhonce, jhou, jiajliu, jialiu, jligon, jokerman, lsm5, lsu, lxia, mmccomas, myllynen, nhorman, rkant, sdodson, vgoyal, wehe, wmeng, xtian, yuxzhu

Target Milestone:

Keywords:

Extras, TestBlocker

Target Release:

7.3

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

docker-1.12.6-32.git88a4867.el7_3

Clones:

1463003

Environment:

Last Closed:

2017-06-28 15:39:34 UTC

Type:

Bug

Bug Blocks:

1467350

Attachments:

Description	Flags
devicemapper: Can't set cookie dm_task_set_cookie failed	none

Description liujia 2017-05-27 08:11:33 UTC

Description of problem:
Upgrading OCP 3.5 (containerized install) failed at task [Restart master] with a devicemapper error: Can't set cookie dm_task_set_cookie failed.

fatal: [openshift-109.x.x.x]: FAILED! => {
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "daemon_reload": false,
            "enabled": null,
            "masked": null,
            "name": "atomic-openshift-master",
            "state": "restarted",
            "user": false
        }
    }
}

MSG:

Unable to restart service atomic-openshift-master: Job for atomic-openshift-master.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.

# systemctl status atomic-openshift-master.service -l
● atomic-openshift-master.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master.service; enabled; vendor preset: disabled)
   Active: activating (start-post) (Result: exit-code) since Sat 2017-05-27 01:31:53 EDT; 7s ago
  Process: 20005 ExecStop=/usr/bin/docker stop atomic-openshift-master (code=exited, status=1/FAILURE)
  Process: 20020 ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin openshift3/ose:${IMAGE_VERSION} start master --config=${CONFIG_FILE} $OPTIONS (code=exited, status=125)
  Process: 20014 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master (code=exited, status=1/FAILURE)
 Main PID: 20020 (code=exited, status=125);         : 20021 (sleep)
   Memory: 92.0K
   CGroup: /system.slice/atomic-openshift-master.service
           └─control
             └─20021 /usr/bin/sleep 10

May 27 01:31:53 openshift-109.x.x.x systemd[1]: Starting atomic-openshift-master.service...
May 27 01:31:53 openshift-109.x.x.x docker[20014]: Error response from daemon: No such container: atomic-openshift-master
May 27 01:31:54 openshift-109.x.x.x docker[20020]: /usr/bin/docker-current: Error response from daemon: devmapper: Error activating devmapper device for 'ed6dd8b37d073aedcb636d597c81437c02e84c3a9593923dc5ccd8569f01abab-init': devicemapper: Can't set cookie dm_task_set_cookie failed.
May 27 01:31:54 openshift-109.x.x.x docker[20020]: See '/usr/bin/docker-current run --help'.
May 27 01:31:54 openshift-109.x.x.x systemd[1]: atomic-openshift-master.service: main process exited, code=exited, status=125/n/a


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.6.85-1.git.0.109a54e.el7.noarch
docker-1.12.6-28.git1398f24.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Containerized install of OCP 3.5 (one master/node/etcd + one NFS)
2. Upgrade OCP 3.5 to OCP 3.6
# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Starting the master service manually failed, and restarting docker also failed. After rebooting the host, the master and docker services were restored. Re-running the upgrade playbook failed again at the same task.

Comment 2 liujia 2017-05-27 09:33:05 UTC

All upgrade tests against containerized environments are now blocked. Adding the "TestBlocker" keyword.

Comment 3 Scott Dodson 2017-05-30 16:04:57 UTC

liujia,

If you `yum downgrade docker-1.12.6-16.el7` and restart prior to performing the upgrade does it work? I suspect this may be a regression in docker.

Comment 4 liujia 2017-05-31 09:56:18 UTC

(In reply to Scott Dodson from comment #3)
> liujia,
> 
> If you `yum downgrade docker-1.12.6-16.el7` and restart prior to performing
> the upgrade does it work? I suspect this may be a regression in docker.

I think you are right. After downgrading docker to 1.12.6-16, the upgrade succeeds, so this bug no longer blocks the remaining tests. Thanks.

Comment 5 Scott Dodson 2017-05-31 13:39:39 UTC

Assigning to containers as this looks to be a docker regression.

Comment 6 Wang Haoran 2017-06-01 07:08:19 UTC

This is not restricted to the upgrade scenario. I have a new cluster with the new docker installed. It worked at first, but after running some tests docker broke with the same error:
devicemapper: Can't set cookie dm_task_set_cookie failed.

Comment 7 Scott Dodson 2017-06-01 13:17:39 UTC

For the containers team: the Restart master task just calls `systemctl restart atomic-openshift-master`, and this is the unit definition.

[Unit]
After=docker.service
Requires=docker.service
PartOf=docker.service
After=etcd_container.service
Wants=etcd_container.service

[Service]
EnvironmentFile=/etc/sysconfig/atomic-openshift-master
ExecStartPre=-/usr/bin/docker rm -f atomic-openshift-master
ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin openshift3/ose:v3.6 start master --config=${CONFIG_FILE} $OPTIONS
ExecStartPost=/usr/bin/sleep 10
ExecStop=/usr/bin/docker stop atomic-openshift-master
Restart=always
RestartSec=5s

[Install]
WantedBy=docker.service

Comment 8 Alasdair Kergon 2017-06-01 15:29:00 UTC

Anyone got a pointer to the bit of source code producing that message?

Comment 9 Alasdair Kergon 2017-06-01 15:36:14 UTC

(All failure modes of the dm_task_set_cookie function issue a low-level error message, so perhaps that can be extracted - or the logging fixed if it wasn't captured.)

Comment 10 Alasdair Kergon 2017-06-01 15:40:06 UTC

(I don't know what parameters are used by this caller's source code, but there can be several dependencies here including use of semaphores and /dev/urandom.)

Comment 11 Johnny Liu 2017-06-07 03:39:05 UTC

I also encountered this issue on 3.5.

containerized install on RHEL + openshift v3.5.5.24 + docker-1.12.6-28.git1398f24.el7.x86_64, failed at restart master.

containerized install on RHEL + openshift v3.4.1.32 + docker-1.12.6-28.git1398f24.el7.x86_64, PASS.

Comment 12 Johnny Liu 2017-06-14 07:45:39 UTC

This is blocking testing with latest docker version.

Comment 13 Vivek Goyal 2017-06-14 12:19:23 UTC

I think this is probably a semaphore leak, where we have exhausted the maximum number of semaphores on the system.

https://github.com/moby/moby/issues/33603

Can you provide the output of the following commands?

- dmsetup udevcookies
- ipcs
- cat /proc/sys/kernel/sem

On the failing system, try running "dmsetup udevcomplete_all" and see if that gets you going.

I will also need an easy way to reproduce this problem to figure out why the leak is happening.

Can you attach the journal logs of the failing system? I want to see if there are any messages that point to a possible udev issue or something else.
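
The commands requested above boil down to comparing the number of System V semaphore arrays in use with the kernel limit. Below is a minimal sketch of that comparison, not what docker or dmsetup run internally. It assumes the standard Linux layouts: `/proc/sys/kernel/sem` holds SEMMSL, SEMMNS, SEMOPM and SEMMNI in that order, and `/proc/sysvipc/sem` has one header line followed by one line per array. All function names are hypothetical.

```python
from pathlib import Path


def parse_sem_limits(text):
    """Parse the four fields of /proc/sys/kernel/sem."""
    semmsl, semmns, semopm, semmni = (int(field) for field in text.split())
    return {"semmsl": semmsl, "semmns": semmns, "semopm": semopm, "semmni": semmni}


def count_sem_arrays(text):
    """Count arrays listed in /proc/sysvipc/sem (the first line is a header)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def arrays_exhausted(limits_text, sysvipc_text):
    """True when the number of arrays in use has reached SEMMNI."""
    return count_sem_arrays(sysvipc_text) >= parse_sem_limits(limits_text)["semmni"]


def check_host(proc=Path("/proc")):
    """Read the live values from a Linux host (paths assumed as described above)."""
    limits_text = (proc / "sys/kernel/sem").read_text()
    sysvipc_text = (proc / "sysvipc/sem").read_text()
    return arrays_exhausted(limits_text, sysvipc_text)
```

If `check_host()` returns True on a failing system, that would be consistent with the semaphore exhaustion suspected here. It does not show which process leaked the semaphores.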

Comment 37 Giovanni Tirloni 2017-06-19 16:48:19 UTC

I just hit this issue after stress testing a Kubernetes 1.6.5 cluster (I asked it to scale an nginx deployment to 800 replicas). I noticed that containers were failing to get created. I tried restarting Docker, but the same error ("devicemapper: Can't set cookie dm_task_set_cookie failed") continued. Providing logs in case they are useful here.

Comment 38 Giovanni Tirloni 2017-06-19 16:49:47 UTC

Created attachment 1289156
devicemapper: Can't set cookie dm_task_set_cookie failed

kernel: 3.10.0-514.21.1.el7.x86_64

container-selinux-2.12-2.gite7096ce.el7.noarch
docker-1.12.6-28.git1398f24.el7.centos.x86_64
docker-client-1.12.6-28.git1398f24.el7.centos.x86_64
docker-common-1.12.6-28.git1398f24.el7.centos.x86_64
skopeo-containers-0.1.19-1.el7.x86_64

Comment 40 Giovanni Tirloni 2017-06-19 17:25:45 UTC

Increasing the semaphore limits fixed the issue for me. Thanks for the insights.
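
The comment does not say which values were used. As a hedged sketch, the helper below builds a `kernel.sem` sysctl line that raises only SEMMNI (the maximum number of semaphore arrays) and leaves the other three fields unchanged. The function name and the value 1024 in the usage note are illustrative assumptions, not taken from this report.

```python
def raised_sem_setting(current, new_semmni):
    """Return a sysctl line with SEMMNI raised.

    `current` is the content of /proc/sys/kernel/sem:
    SEMMSL, SEMMNS, SEMOPM, SEMMNI (in that order).
    """
    semmsl, semmns, semopm, semmni = (int(field) for field in current.split())
    if new_semmni <= semmni:
        raise ValueError("new SEMMNI must be larger than the current value")
    return f"kernel.sem = {semmsl} {semmns} {semopm} {new_semmni}"
```

For example, `raised_sem_setting("250 32000 32 128", 1024)` yields `kernel.sem = 250 32000 32 1024`, which could go in a sysctl configuration file. If semaphores are still being leaked, a higher limit would presumably only delay the failure.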

Comment 42 Daniel Walsh 2017-06-19 18:38:58 UTC

This is definitely a BLOCKER bug. We need to get this fixed as soon as possible.

A pull request is open upstream:

https://github.com/moby/moby/pull/33732

Hopefully it will be merged soon. We will need it backported to projectatomic/docker.

Comment 49 Luwen Su 2017-06-20 08:02:36 UTC

In docker-1.12.6-28.git1398f24.el7.x86_64

# docker run -it --rm rhel7 bash
# unshare bash

# exit

# dmsetup udevcookies
Cookie       Semid      Value      Last semop time           Last change time
0xd4d95ea    1540096    1          Tue Jun 20 03:56:14 2017  Tue Jun 20 03:56:14 2017



With docker-1.12.6-32.git88a4867.el7.x86_64, `dmsetup udevcookies` shows nothing.
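
To repeat this check across several hosts or docker builds, the `dmsetup udevcookies` output can be parsed for leftover cookies. The sketch below assumes the column layout shown in this comment (cookie, semid, value, then timestamps). The function name is hypothetical.

```python
def parse_udevcookies(output):
    """Return one dict per cookie row in `dmsetup udevcookies` output.

    The header line and blank lines are skipped. Only the first three
    columns are parsed, because the timestamp columns contain spaces.
    """
    cookies = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].startswith("0x"):
            continue
        cookies.append(
            {"cookie": fields[0], "semid": int(fields[1]), "value": int(fields[2])}
        )
    return cookies
```

An empty list corresponds to the "shows nothing" result on the fixed build. A list that keeps growing after containers exit would match the leak seen with docker-1.12.6-28.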

Comment 50 liujia 2017-06-20 08:34:33 UTC

Version:
docker-1.12.6-32.git88a4867.el7.x86_64

Scenario 1 (pass):
1. Containerized install of OCP 3.5 on docker-1.12.6-32
2. Run new-app to trigger an sti-build
3. Restart the atomic-openshift-master, atomic-openshift-node, and docker services


Scenario 2 (pass):
1. Upgrade the OCP 3.5 install above (with docker-1.12.6-32) to OCP 3.6
2. Run new-app after the upgrade


OCP 3.5 with docker-1.12.6-32 works well.
Upgrading OCP 3.5 with docker-1.12.6-32 works well.

Comment 51 Johnny Liu 2017-06-20 08:41:10 UTC

containerized install on RHEL + openshift v3.6.116 + docker-1.12.6-32.git88a4867.el7.x86_64, PASS.

Comment 53 errata-xmlrpc 2017-06-28 15:39:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1620