Bug 1456138 - devicemapper error dm_task_set_cookie failed
Summary: devicemapper error dm_task_set_cookie failed
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker   
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.3
Assignee: Daniel Walsh
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Keywords: Extras, TestBlocker
Depends On:
Blocks: 1467350
 
Reported: 2017-05-27 08:11 UTC by liujia
Modified: 2019-03-06 00:39 UTC (History)

Fixed In Version: docker-1.12.6-32.git88a4867.el7_3
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1463003 (view as bug list)
Environment:
Last Closed: 2017-06-28 15:39:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
devicemapper: Can't set cookie dm_task_set_cookie failed (20.23 KB, text/plain)
2017-06-19 16:49 UTC, Giovanni Tirloni


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:1620 normal SHIPPED_LIVE docker bug fix and enhancement update 2017-06-28 19:33:52 UTC
Red Hat Bugzilla 1461370 None None None Never

Internal Trackers: 1461370

Description liujia 2017-05-27 08:11:33 UTC
Description of problem:
Upgrading OCP 3.5 (containerized install) failed at task [Restart master] with a devicemapper error: Can't set cookie dm_task_set_cookie failed.

fatal: [openshift-109.x.x.x]: FAILED! => {
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "daemon_reload": false,
            "enabled": null,
            "masked": null,
            "name": "atomic-openshift-master",
            "state": "restarted",
            "user": false
        }
    }
}

MSG:

Unable to restart service atomic-openshift-master: Job for atomic-openshift-master.service failed because the control process exited with error code. See "systemctl status atomic-openshift-master.service" and "journalctl -xe" for details.

# systemctl status atomic-openshift-master.service -l
● atomic-openshift-master.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master.service; enabled; vendor preset: disabled)
   Active: activating (start-post) (Result: exit-code) since Sat 2017-05-27 01:31:53 EDT; 7s ago
  Process: 20005 ExecStop=/usr/bin/docker stop atomic-openshift-master (code=exited, status=1/FAILURE)
  Process: 20020 ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin openshift3/ose:${IMAGE_VERSION} start master --config=${CONFIG_FILE} $OPTIONS (code=exited, status=125)
  Process: 20014 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master (code=exited, status=1/FAILURE)
 Main PID: 20020 (code=exited, status=125);         : 20021 (sleep)
   Memory: 92.0K
   CGroup: /system.slice/atomic-openshift-master.service
           └─control
             └─20021 /usr/bin/sleep 10

May 27 01:31:53 openshift-109.x.x.x systemd[1]: Starting atomic-openshift-master.service...
May 27 01:31:53 openshift-109.x.x.x docker[20014]: Error response from daemon: No such container: atomic-openshift-master
May 27 01:31:54 openshift-109.x.x.x docker[20020]: /usr/bin/docker-current: Error response from daemon: devmapper: Error activating devmapper device for 'ed6dd8b37d073aedcb636d597c81437c02e84c3a9593923dc5ccd8569f01abab-init': devicemapper: Can't set cookie dm_task_set_cookie failed.
May 27 01:31:54 openshift-109.x.x.x docker[20020]: See '/usr/bin/docker-current run --help'.
May 27 01:31:54 openshift-109.x.x.x systemd[1]: atomic-openshift-master.service: main process exited, code=exited, status=125/n/a


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.6.85-1.git.0.109a54e.el7.noarch
docker-1.12.6-28.git1398f24.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Containerized install of OCP 3.5 (one master/node/etcd + one NFS)
2. Upgrade OCP 3.5 to OCP 3.6
# ansible-playbook -i hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Tried to start the master service manually; it failed. Tried to restart docker; it failed. After rebooting the host, the master and docker services were restored. Re-ran the upgrade playbook; it failed again at the same task.

Comment 2 liujia 2017-05-27 09:33:05 UTC
All upgrade tests against containerized environments are now blocked. Adding the "TestBlocker" keyword.

Comment 3 Scott Dodson 2017-05-30 16:04:57 UTC
liujia,

If you `yum downgrade docker-1.12.6-16.el7` and restart prior to performing the upgrade does it work? I suspect this may be a regression in docker.

Comment 4 liujia 2017-05-31 09:56:18 UTC
(In reply to Scott Dodson from comment #3)
> liujia,
> 
> If you `yum downgrade docker-1.12.6-16.el7` and restart prior to performing
> the upgrade does it work? I suspect this may be a regression in docker.

I think you are right. After downgrading docker to 1.12.6-16, the upgrade succeeds, so this bug will not block the remaining tests. Thanks.

Comment 5 Scott Dodson 2017-05-31 13:39:39 UTC
Assigning to containers as this looks to be a docker regression.

Comment 6 Wang Haoran 2017-06-01 07:08:19 UTC
This is not restricted to the upgrade scenario. I have a new cluster with the new docker installed; it worked at first, but after running some tests, docker broke with the error:
failed at task [Restart master] for a devicemapper error: Can't set cookie dm_task_set_cookie failed.

Comment 7 Scott Dodson 2017-06-01 13:17:39 UTC
For the containers team: the Restart master task just calls `systemctl restart atomic-openshift-master`, and this is the unit definition.

[Unit]
After=docker.service
Requires=docker.service
PartOf=docker.service
After=etcd_container.service
Wants=etcd_container.service

[Service]
EnvironmentFile=/etc/sysconfig/atomic-openshift-master
ExecStartPre=-/usr/bin/docker rm -f atomic-openshift-master
ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin openshift3/ose:v3.6 start master --config=${CONFIG_FILE} $OPTIONS
ExecStartPost=/usr/bin/sleep 10
ExecStop=/usr/bin/docker stop atomic-openshift-master
Restart=always
RestartSec=5s

[Install]
WantedBy=docker.service

Comment 8 Alasdair Kergon 2017-06-01 15:29:00 UTC
Anyone got a pointer to the bit of source code producing that message?

Comment 9 Alasdair Kergon 2017-06-01 15:36:14 UTC
(All failure modes of the dm_task_set_cookie function issue a low-level error message, so perhaps that can be extracted - or the logging fixed if it wasn't captured.)

Comment 10 Alasdair Kergon 2017-06-01 15:40:06 UTC
(I don't know what parameters are used by this caller's source code, but there can be several dependencies here including use of semaphores and /dev/urandom.)

Comment 11 Johnny Liu 2017-06-07 03:39:05 UTC
I also encountered this issue on 3.5.

containerized install on RHEL + openshift v3.5.5.24 + docker-1.12.6-28.git1398f24.el7.x86_64, failed at restart master.

containerized install on RHEL + openshift v3.4.1.32 + docker-1.12.6-28.git1398f24.el7.x86_64, PASS.

Comment 12 Johnny Liu 2017-06-14 07:45:39 UTC
This is blocking testing with the latest docker version.

Comment 13 Vivek Goyal 2017-06-14 12:19:23 UTC
I think this is probably a semaphore leak, where we have exhausted the maximum number of semaphores on the system.

https://github.com/moby/moby/issues/33603

Can you provide the output of the following commands?

- dmsetup udevcookies
- ipcs
- cat /proc/sys/kernel/sem

On the failing system, try running "dmsetup udevcomplete_all" and see if that gets you going.

I will also need an easy way to reproduce this problem to figure out why the leak is happening.

Can you attach the journal logs of the failing system? I want to see if there are any messages that point to a possible udev issue or something else.
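
A minimal sketch of the semaphore check requested above, assuming the standard Linux layout in which the fourth field of /proc/sys/kernel/sem is SEMMNI (the maximum number of semaphore arrays) and /proc/sysvipc/sem lists one array per line after a header line. The function names are illustrative only and are not part of docker or dmsetup.

```shell
#!/bin/sh
# Print SEMMNI, the system-wide limit on semaphore arrays
# (4th field of kernel.sem: SEMMSL SEMMNS SEMOPM SEMMNI).
sem_array_limit() {
    awk '{ print $4 }' "${1:-/proc/sys/kernel/sem}"
}

# Count allocated semaphore arrays: every line after the header is one array.
sem_arrays_in_use() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "${1:-/proc/sysvipc/sem}"
}

# Print usage and return non-zero once the limit has been reached.
# Optional arguments override the two /proc paths (useful for testing).
sem_report() {
    limit=$(sem_array_limit "$1")
    used=$(sem_arrays_in_use "$2")
    echo "semaphore arrays: $used of $limit in use"
    [ "$used" -lt "$limit" ]
}
```

Running `sem_report` with no arguments on the failing host would show whether the array count has reached the limit, which is what the exhaustion theory predicts.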

Comment 37 Giovanni Tirloni 2017-06-19 16:48:19 UTC
I just hit this issue after stress testing a Kubernetes 1.6.5 cluster (I asked it to scale an nginx deployment to 800 replicas). I noticed containers were failing to be created. I tried restarting Docker, but the same error ("devicemapper: Can't set cookie dm_task_set_cookie failed") persisted. Providing logs in case they are useful here.

Comment 38 Giovanni Tirloni 2017-06-19 16:49 UTC
Created attachment 1289156 [details]
devicemapper: Can't set cookie dm_task_set_cookie failed

kernel: 3.10.0-514.21.1.el7.x86_64

container-selinux-2.12-2.gite7096ce.el7.noarch
docker-1.12.6-28.git1398f24.el7.centos.x86_64
docker-client-1.12.6-28.git1398f24.el7.centos.x86_64
docker-common-1.12.6-28.git1398f24.el7.centos.x86_64
skopeo-containers-0.1.19-1.el7.x86_64

Comment 40 Giovanni Tirloni 2017-06-19 17:25:45 UTC
Increasing the semaphore limits fixed the issue for me. Thanks for the insights.
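
For reference, a sketch of what such a limit increase might look like. The comment above does not say which values were used, so the target of 1024 arrays below is purely an assumption, and `with_semmni` is a hypothetical helper. If cookies are still leaking, a higher limit presumably only postpones exhaustion.

```shell
#!/bin/sh
# Build a kernel.sem value that keeps SEMMSL, SEMMNS and SEMOPM
# and replaces only SEMMNI (the maximum number of semaphore arrays).
# $1: current value, e.g. the contents of /proc/sys/kernel/sem
# $2: new SEMMNI
with_semmni() {
    echo "$1" | awk -v n="$2" '{ print $1, $2, $3, n }'
}

# Example (run as root; 1024 is an illustrative value, not taken from this bug):
#   sysctl -w kernel.sem="$(with_semmni "$(cat /proc/sys/kernel/sem)" 1024)"
```

A change made with `sysctl -w` does not survive a reboot; persisting it would need an entry under /etc/sysctl.d/.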

Comment 42 Daniel Walsh 2017-06-19 18:38:58 UTC
This is definitely a BLOCKER bug. We need to get this fixed as soon as possible.

A pull request is upstream:

https://github.com/moby/moby/pull/33732

Hopefully it is merged soon; we will need it backported to projectatomic/docker.

Comment 49 Luwen Su 2017-06-20 08:02:36 UTC
In docker-1.12.6-28.git1398f24.el7.x86_64

# docker run -it --rm rhel7 bash
# unshare bash

# exit

# dmsetup udevcookies
Cookie       Semid      Value      Last semop time           Last change time
0xd4d95ea    1540096    1          Tue Jun 20 03:56:14 2017  Tue Jun 20 03:56:14 2017



and in docker-1.12.6-32.git88a4867.el7.x86_64, `dmsetup udevcookies` shows nothing.
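
The comparison above can be turned into a repeatable check by counting the cookies left behind after a container exits. This is only a sketch: `count_udev_cookies` is an illustrative name, and it assumes the output format shown above (one header line, then one line per outstanding cookie).

```shell
#!/bin/sh
# Count outstanding udev cookies from `dmsetup udevcookies` output on stdin.
# The first line is the column header; blank lines are ignored.
count_udev_cookies() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Example (run as root after the container has exited):
#   dmsetup udevcookies | count_udev_cookies
```

A count that keeps growing as containers are started and stopped would be consistent with the leak seen in docker-1.12.6-28.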

Comment 50 liujia 2017-06-20 08:34:33 UTC
Version:
docker-1.12.6-32.git88a4867.el7.x86_64

Scenario 1 (pass):
1. Containerized install of OCP 3.5 on docker-1.12.6-32
2. New-app to trigger an sti-build
3. Restart the atomic-openshift-master/atomic-openshift-node/docker services


Scenario 2 (pass):
1. Upgrade the OCP 3.5 install above (with docker-1.12.6-32) to OCP 3.6
2. New-app after the upgrade


OCP 3.5 with docker-1.12.6-32 works well.
Upgrading OCP 3.5 with docker-1.12.6-32 works well.

Comment 51 Johnny Liu 2017-06-20 08:41:10 UTC
containerized install on RHEL + openshift v3.6.116 + docker-1.12.6-32.git88a4867.el7.x86_64, PASS.

Comment 53 errata-xmlrpc 2017-06-28 15:39:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1620

