Description of problem:
I debugged this for a whole day to find the root cause: creating the nfs-provisioner pod fails if docker is running with "Deferred Removal Enabled: false".

Version-Release number of selected component (if applicable):
oc v3.6.126.4
kubernetes v1.6.1+5115d708d7

How reproducible:
Always

Steps to Reproduce:
1. Create a project
2. Create a scc:
   $ oc create -f https://raw.githubusercontent.com/kubernetes-incubator/nfs-provisioner/master/deploy/kube-config/openshift-scc.yaml
3. Add the user to this scc:
   $ oadm policy add-scc-to-user nfs-provisioner <user>
4. Create the nfs-provisioner pod:
   $ oc create -f https://raw.githubusercontent.com/kubernetes-incubator/external-storage/master/nfs/deploy/kubernetes/pod.yaml

Actual results:
The pod fails to create:

# oc get pods
NAME                    READY     STATUS              RESTARTS   AGE
nfs-provisioner-df60e   0/1       ContainerCreating   0          17m
# oc describe pods nfs-provisioner-df60e

After this failure, even if we delete the nfs-provisioner pod, the node cannot schedule any pod; they all stay in ContainerCreating status. After I changed docker to "Deferred Removal Enabled: true", everything works.

Expected results:
The nfs-provisioner pod runs, with no bad side effects afterwards: all other pods can still be scheduled and run.
Master Log:

Jun 30 07:00:36 storage-master dockerd-current: time="2017-06-30T07:00:36.664650412Z" level=error msg="Handler for POST /v1.24/containers/create?name=k8s_POD_nfs-provisioner-df60e_qxni3_d146724d-5d61-11e7-832a-000d3a181970_0 returned error: devmapper: Error activating devmapper device for '8c62cc05b67414e3d6b33fee98ce91fcede16432ca69e82040dde327d97df061-init': devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 30 07:00:36 storage-master dockerd-current: time="2017-06-30T07:00:36.664706912Z" level=error msg="Handler for POST /v1.24/containers/create returned error: devmapper: Error activating devmapper device for '8c62cc05b67414e3d6b33fee98ce91fcede16432ca69e82040dde327d97df061-init': devicemapper: Can't set cookie dm_task_set_cookie failed"
Jun 30 07:00:36 storage-master atomic-openshift-node: E0630 07:00:36.665145 23505 remote_runtime.go:86] RunPodSandbox from runtime service failed: rpc error: code = 2 desc = failed to create a sandbox for pod "nfs-provisioner-df60e": Error response from daemon: {"message":"devmapper: Error activating devmapper device for '8c62cc05b67414e3d6b33fee98ce91fcede16432ca69e82040dde327d97df061-init': devicemapper: Can't set cookie dm_task_set_cookie failed"}
Jun 30 07:00:36 storage-master atomic-openshift-node: E0630 07:00:36.665183 23505 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "nfs-provisioner-df60e_qxni3(d146724d-5d61-11e7-832a-000d3a181970)" failed: rpc error: code = 2 desc = failed to create a sandbox for pod "nfs-provisioner-df60e": Error response from daemon: {"message":"devmapper: Error activating devmapper device for '8c62cc05b67414e3d6b33fee98ce91fcede16432ca69e82040dde327d97df061-init': devicemapper: Can't set cookie dm_task_set_cookie failed"}
Jun 30 07:00:36 storage-master atomic-openshift-node: E0630 07:00:36.665199 23505 kuberuntime_manager.go:619] createPodSandbox for pod "nfs-provisioner-df60e_qxni3(d146724d-5d61-11e7-832a-000d3a181970)" failed: rpc error: code = 2 desc = failed to create a sandbox for pod "nfs-provisioner-df60e": Error response from daemon: {"message":"devmapper: Error activating devmapper device for '8c62cc05b67414e3d6b33fee98ce91fcede16432ca69e82040dde327d97df061-init': devicemapper: Can't set cookie dm_task_set_cookie failed"}
Jun 30 07:00:36 storage-master atomic-openshift-node: E0630 07:00:36.665271 23505 pod_workers.go:182] Error syncing pod d146724d-5d61-11e7-832a-000d3a181970 ("nfs-provisioner-df60e_qxni3(d146724d-5d61-11e7-832a-000d3a181970)"), skipping: failed to "CreatePodSandbox" for "nfs-provisioner-df60e_qxni3(d146724d-5d61-11e7-832a-000d3a181970)" with CreatePodSandboxError: "CreatePodSandbox for pod \"nfs-provisioner-df60e_qxni3(d146724d-5d61-11e7-832a-000d3a181970)\" failed: rpc error: code = 2 desc = failed to create a sandbox for pod \"nfs-provisioner-df60e\": Error response from daemon: {\"message\":\"devmapper: Error activating devmapper device for '8c62cc05b67414e3d6b33fee98ce91fcede16432ca69e82040dde327d97df061-init': devicemapper: Can't set cookie dm_task_set_cookie failed\"}"

Additional info:

$ docker info | grep Deferred
 Deferred Removal Enabled: false

I then sshed to the node and tried docker run directly; it fails with the same error:

# docker run --rm -ti docker.io/aosqe/nfs-server bash
/usr/bin/docker-current: Error response from daemon: devmapper: Error activating devmapper device for 'fa63515664d65c6a3466e428db15aa1cfad264b849cfe812dbf31a37bfc3a71c-init': devicemapper: Can't set cookie dm_task_set_cookie failed.
See '/usr/bin/docker-current run --help'.
What version of docker is this? I don't think there is anything special about the container that should prevent it from starting in the first place. I could only find https://bugzilla.redhat.com/show_bug.cgi?id=1456138, which points to a recent docker issue where that cookie error message shows up.

The "solution" is to just not use the pod.yaml in test cases and use a deployment.yaml instead. The pod mounts an emptyDir and relies on the docker storage driver behaving, while the deployment mounts a hostPath. I have removed the pod.yaml from all documentation; can we modify the test cases to use a deployment.yaml?

https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/deploy/kubernetes/auth/deployment-sa.yaml mounts /srv on the host. It could be patched to mount e.g. /tmp/something.
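For reference, the suggested patch is just a change to the deployment's hostPath. A hypothetical sketch of the relevant fragment (the volume name and the /tmp/nfs-provisioner path below are illustrative assumptions, not values copied from the upstream deployment-sa.yaml):

```yaml
# Hypothetical fragment of deployment-sa.yaml; the volume name here is
# an assumption, not necessarily the upstream value.
volumes:
- name: export-volume
  hostPath:
    path: /tmp/nfs-provisioner   # patched from the default /srv
```

Any writable directory on the node would do for test purposes, since the deployment only needs a hostPath to export.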
(In reply to Matthew Wong from comment #1)
> What version of docker is this? I don't think there is anything special
> about the container that should prevent it from starting in the first
> place. I could only find this
> https://bugzilla.redhat.com/show_bug.cgi?id=1456138 which points to a recent
> issue with docker where that cookie error message shows up.
>
> The "solution" is to just not use the pod.yaml in test cases and use a
> deployment.yaml instead. The pod mounts an emptyDir and relies on docker
> storage driver behaving while the deployment mounts a hostPath. I have
> removed the pod.yaml from all documentation, can we modify the test cases to
> use a deployment.yaml?
>
> https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/deploy/kubernetes/auth/deployment-sa.yaml
> this mounts /srv on the host. It could be patched to mount e.g.
> /tmp/something.

Yeah, I noticed that bug. After debugging I found the issue comes from docker running with "Deferred Removal Enabled: false". If docker is started with "Deferred Removal Enabled: true", this does not reproduce.
# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-39.1.git6ffd653.el7.x86_64
 Go version:      go1.8.3
 Git commit:      6ffd653/1.12.6
 Built:           Wed Jun 14 14:31:17 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-39.1.git6ffd653.el7.x86_64
 Go version:      go1.8.3
 Git commit:      6ffd653/1.12.6
 Built:           Wed Jun 14 14:31:17 2017
 OS/Arch:         linux/amd64

# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-8:2-475-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 11.8 MB
 Data Space Total: 107.4 GB
 Data Space Available: 7.441 GB
 Metadata Space Used: 581.6 kB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.147 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data

I have also tried some other emptyDir cases and did not hit the issue that the nfs-provisioner pod hits. BTW, the deployment.yaml works without any issue. Thanks.
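For anyone hitting this: deferred removal is controlled by the devicemapper storage option dm.use_deferred_removal. A minimal sketch of enabling it on a RHEL-style install, assuming the storage options live in /etc/sysconfig/docker-storage (the exact file can vary between installs):

```shell
# Sketch: enable deferred removal for docker's devicemapper driver.
# Assumes a RHEL-style install where docker reads storage options from
# /etc/sysconfig/docker-storage; merge with any options already set there.
DOCKER_STORAGE_OPTIONS="--storage-opt dm.use_deferred_removal=true"

# After editing the file, restart docker and verify:
#   systemctl restart docker
#   docker info | grep 'Deferred Removal'
# which should now report: Deferred Removal Enabled: true
```

Note that deferred removal also requires libdevmapper/kernel support; if it is missing, docker logs a warning at startup and keeps the setting at false.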
This seems related to NFS. QE has an nfs-server pod whose purpose is to provide an NFS server inside the project for testing, and I found this pod can also reproduce the issue:

1. Create the scc and service account
2. Create the nfs-server pod:
   $ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml

Check the node log:

Jul 3 06:44:38 wehe-node-1 dockerd-current: time="2017-07-03T06:44:38.055266578Z" level=error msg="devmapper: Error unmounting device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db: devicemapper: Can not set cookie: dm_task_set_cookie failed"
Jul 3 06:44:38 wehe-node-1 dockerd-current: time="2017-07-03T06:44:38.055340579Z" level=error msg="Error unmounting container 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: devicemapper: Can not set cookie: dm_task_set_cookie failed"
Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112036 12994 kubelet_pods.go:1168] Generating status for "nfs-server_1q9kb(4b01b604-5fb6-11e7-b37e-000d3a1bc5aa)"
Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112087 12994 kubelet.go:1847] SyncLoop (PLEG): "nfs-server_1q9kb(4b01b604-5fb6-11e7-b37e-000d3a1bc5aa)", event: &pleg.PodLifecycleEvent{ID:"4b01b604-5fb6-11e7-b37e-000d3a1bc5aa", Type:"ContainerDied", Data:"f8f2a32fa0a04d7cc1f8afb728da7e2ad09e91e3781b2601dc5c8b08053a6540"}
Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: W0703 06:44:38.112116 12994 pod_container_deletor.go:77] Container "f8f2a32fa0a04d7cc1f8afb728da7e2ad09e91e3781b2601dc5c8b08053a6540" not found in pod's containers
Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112260 12994 status_manager.go:340] Status Manager: adding pod: "4b01b604-5fb6-11e7-b37e-000d3a1bc5aa", with status: ('\x04', {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:44:38 +0000 UTC ContainersNotReady containers with unready status: [nfs-server]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC }] 172.27.17.5 2017-07-03 06:10:20 +0000 UTC [] [{nfs-server {nil nil &ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2017-07-03 06:10:49 +0000 UTC,FinishedAt:2017-07-03 06:44:25 +0000 UTC,ContainerID:docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15,}} {nil nil nil} false 0 docker.io/aosqe/nfs-server:latest docker-pullable://docker.io/aosqe/nfs-server@sha256:3e5e1fbfc51571a007451a98a43f6f30d7abbcc4a3ce63bb0b39db253956e8b0 docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15}] BestEffort}) to podStatusChannel
Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112464 12994 status_manager.go:147] Status Manager: syncing pod: "4b01b604-5fb6-11e7-b37e-000d3a1bc5aa", with status: (4, {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:44:38 +0000 UTC ContainersNotReady containers with unready status: [nfs-server]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC }] 172.27.17.5 2017-07-03 06:10:20 +0000 UTC [] [{nfs-server {nil nil &ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2017-07-03 06:10:49 +0000 UTC,FinishedAt:2017-07-03 06:44:25 +0000 UTC,ContainerID:docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15,}} {nil nil nil} false 0 docker.io/aosqe/nfs-server:latest docker-pullable://docker.io/aosqe/nfs-server@sha256:3e5e1fbfc51571a007451a98a43f6f30d7abbcc4a3ce63bb0b39db253956e8b0 docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15}] BestEffort}) from podStatusChannel
Jul 3 06:45:07 wehe-node-1 dockerd-current: time="2017-07-03T06:45:07.149023906Z" level=error msg="Error removing mounted layer 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"
Jul 3 06:45:07 wehe-node-1 dockerd-current: time="2017-07-03T06:45:07.149112407Z" level=error msg="Handler for DELETE /v1.24/containers/8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15?v=1 returned error: Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"
Jul 3 06:45:07 wehe-node-1 dockerd-current: time="2017-07-03T06:45:07.149143207Z" level=error msg="Handler for DELETE /v1.24/containers/8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15 returned error: Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"
Jul 3 06:45:07 wehe-node-1 atomic-openshift-node: E0703 06:45:07.149642 12994 remote_runtime.go:241] RemoveContainer "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15" from runtime service failed: rpc error: code = 2 desc = failed to remove container "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"}
Jul 3 06:45:07 wehe-node-1 atomic-openshift-node: E0703 06:45:07.149696 12994 kuberuntime_gc.go:123] Failed to remove container "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15": rpc error: code = 2 desc = failed to remove container "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"}
I tried to reproduce this bug with today's docker and openshift and I failed:

docker-1.12.6-32.git88a4867.el7.x86_64
atomic-openshift-node-3.6.133-1.git.0.524e4c8.el7.x86_64

$ docker info | grep Deferred
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml

The pod gets created and running and can be deleted without any problems. I also tried docker-1.12.6-39.1.git6ffd653.el7.x86_64, still the same; it works like a charm. There must be something else wrong on the system.
(In reply to Jan Safranek from comment #4)
> I tried to reproduce this bug with today's docker and openshift and I failed
> docker-1.12.6-32.git88a4867.el7.x86_64
> atomic-openshift-node-3.6.133-1.git.0.524e4c8.el7.x86_64
>
> $ docker info | grep Deferred
> Deferred Removal Enabled: false
> Deferred Deletion Enabled: false
> Deferred Deleted Device Count: 0
>
> $ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml
>
> The pod gets created and running and can be deleted without any problems.
>
> I tried also docker-1.12.6-39.1.git6ffd653.el7.x86_64, still the same -
> works like charm. There must be something else wrong on the system.

The first time, the pod can be deployed and runs. But after we delete it, no other pod can be scheduled; they fail with the errors from my comment 3. Do you still have the env? Could you try to create another pod:

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/pods/hello-pod.json
I have tried creating only the pod https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml; on its own it does not reproduce. But if we create another pod that uses the nfs volume, the issue reproduces.

Detailed repro steps:

1. Create a project
2. Create a scc from https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/authorization/scc/scc_super_template.yaml
3. Add the service account and user to this scc:

# oc get scc super-wehe -o yaml
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: true
allowHostPID: false
allowHostPorts: true
allowPrivilegedContainer: true
allowedCapabilities: []
apiVersion: v1
defaultAddCapabilities: []
fsGroup:
  type: RunAsAny
groups:
- system:serviceaccounts:9tw98
kind: SecurityContextConstraints
metadata:
  creationTimestamp: 2017-07-05T09:51:43Z
  name: super-wehe
  resourceVersion: "63304"
  selfLink: /api/v1/securitycontextconstraints/super-wehe
  uid: 8ce63cbc-6167-11e7-b249-000d3a1a72a9
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities: []
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users:
- wehe
volumes:
- '*'

4. Create the nfs-server pod:
# oc create -f https://github.com/openshift-qe/v3-testfiles/raw/master/storage/nfs/nfs-server.yaml

5. Check the service:
$ oc get svc
NAME          CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
nfs-service   172.30.109.118   <none>        2049/TCP   9s

6. Create a nfs pv that uses this nfs-server:
# cat pvrwo.json | oc create -f -
{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": {
    "name": "wehepv"
  },
  "spec": {
    "capacity": {
      "storage": "2Gi"
    },
    "accessModes": [
      "ReadWriteOnce"
    ],
    "nfs": {
      "path": "/",
      "server": "172.30.109.118"
    },
    "persistentVolumeReclaimPolicy": "Retain"
  }
}

7. Create a pvc to bind this pv:
# cat pvc.json | oc create -f -
{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "pvc"
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "1Gi"
      }
    }
  }
}

8. Create a pod to consume this pvc:
$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/nfs/auto/web-pod.json

9. Touch a file to check read/write on the mounted dir:
$ oc get pods
NAME         READY     STATUS    RESTARTS   AGE
nfs          1/1       Running   0          45s
nfs-server   1/1       Running   0          8m
$ oc exec -it nfs sh
sh-4.2$ ls /mnt/
sh-4.2$ touch /mnt/test
sh-4.2$ exit

10. Delete all the pods and the svc:
$ oc delete pods --all
pod "nfs" deleted
pod "nfs-server" deleted
[wehe@dhcp-136-45 octest]$ oc get pods
NAME         READY     STATUS        RESTARTS   AGE
nfs          0/1       Terminating   0          2m
nfs-server   1/1       Terminating   0          9m

11. Create a pod again; it stays in ContainerCreating status:
$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/pods/hello-pod.json
$ oc get pods
NAME              READY     STATUS              RESTARTS   AGE
hello-openshift   0/1       ContainerCreating   0          10m

12. ssh to the node and run docker directly:
# docker run --rm -ti docker.io/aosqe/nfs-web bash
/usr/bin/docker-current: Error response from daemon: devmapper: Error activating devmapper device for 'f76117935a9892fefc77651e018f8e0e6da797c4cd96b439053ba31925cf9154-init': devicemapper: Can't set cookie dm_task_set_cookie failed.

I think this has the same root cause as the nfs-provisioner pod failure. Thanks.
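Not a fix, but a diagnostic sketch that may help when debugging "Can not set cookie: dm_task_set_cookie failed": libdevmapper synchronizes with udev through SysV semaphores ("udev cookies"), and one known way this call fails is when the system cannot allocate another semaphore set (for example because cookies have leaked). None of the commands below come from this bug report; they are standard util-linux and device-mapper tools:

```shell
#!/bin/sh
# Diagnostic sketch for "Can not set cookie: dm_task_set_cookie failed".
# libdevmapper allocates a SysV semaphore ("udev cookie") per device-mapper
# operation; if no semaphore set can be allocated, setting the cookie fails.

# 1. Kernel semaphore limits; the 4th field (SEMMNI) caps the number of sets.
cat /proc/sys/kernel/sem

# 2. How many semaphore sets are currently allocated? (ipcs lines for sets
#    start with a hex key like 0x...)
ipcs -s 2>/dev/null | grep -c '^0x' || true

# 3. Outstanding device-mapper udev cookies (needs root and dmsetup):
if command -v dmsetup >/dev/null 2>&1; then
    dmsetup udevcookies
    # Leaked cookies can be cleaned up with (use with care):
    #   dmsetup udevcomplete_all
fi
```

If the count from step 2 is near the SEMMNI limit from step 1, or dmsetup reports many stale cookies, that would point at semaphore exhaustion rather than anything the pod itself is doing.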
I wonder if this would happen with an "ordinary" NFS server, since you say it happens only when a pod actually *uses* the NFS volume. Could you try creating a pod that uses an NFS volume served by an ordinary NFS server rather than the nfs-provisioner/nfs-server pod, and see if anything breaks? There may be something weird going on between devicemapper and the NFS mount. Thanks.

Regardless, I'm leaning heavily towards this being a Docker problem that we have little control over. The NFS server pod is not doing anything particularly bad or different from x months ago; only Docker has changed.
I have tried with a real NFS server deployed on the node; the issue did not reproduce. Thanks