Bug 1466708
| Summary: | Fail to create nfs-provisioner pod when docker has "Deferred Removal" disabled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Wenqi He <wehe> |
| Component: | Storage | Assignee: | Matthew Wong <mawong> |
| Status: | CLOSED NOTABUG | QA Contact: | Wenqi He <wehe> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.6.0 | CC: | aos-bugs, bchilds, jsafrane, wehe |
| Target Milestone: | --- | | |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-07 19:47:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Wenqi He, 2017-06-30 09:56:05 UTC)
Comment 1 (Matthew Wong):

What version of docker is this? I don't think there is anything special about the container that should prevent it from starting in the first place. I could only find https://bugzilla.redhat.com/show_bug.cgi?id=1456138, which points to a recent issue with docker where that cookie error message shows up.

The "solution" is to just not use the pod.yaml in test cases and use a deployment.yaml instead. The pod mounts an emptyDir and relies on the docker storage driver behaving, while the deployment mounts a hostPath. I have removed the pod.yaml from all documentation; can we modify the test cases to use a deployment.yaml?

https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/deploy/kubernetes/auth/deployment-sa.yaml mounts /srv on the host. It could be patched to mount e.g. /tmp/something.

Comment 2 (Wenqi He):

(In reply to Matthew Wong from comment #1)
> What version of docker is this? I don't think there is anything special
> about the container that should prevent it from starting in the first
> place. I could only find https://bugzilla.redhat.com/show_bug.cgi?id=1456138
> which points to a recent issue with docker where that cookie error message
> shows up.
>
> The "solution" is to just not use the pod.yaml in test cases and use a
> deployment.yaml instead. The pod mounts an emptyDir and relies on the
> docker storage driver behaving, while the deployment mounts a hostPath.
> I have removed the pod.yaml from all documentation; can we modify the test
> cases to use a deployment.yaml?
>
> https://github.com/kubernetes-incubator/external-storage/blob/master/nfs/deploy/kubernetes/auth/deployment-sa.yaml
> mounts /srv on the host. It could be patched to mount e.g. /tmp/something.

Yeah, I noticed that bug. After debugging, I found the issue comes from the docker option "Deferred Removal Enabled: false". If docker is started with "Deferred Removal Enabled: true", this does not reproduce.

# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-39.1.git6ffd653.el7.x86_64
 Go version:      go1.8.3
 Git commit:      6ffd653/1.12.6
 Built:           Wed Jun 14 14:31:17 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-39.1.git6ffd653.el7.x86_64
 Go version:      go1.8.3
 Git commit:      6ffd653/1.12.6
 Built:           Wed Jun 14 14:31:17 2017
 OS/Arch:         linux/amd64

# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-8:2-475-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 11.8 MB
 Data Space Total: 107.4 GB
 Data Space Available: 7.441 GB
 Metadata Space Used: 581.6 kB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.147 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data

I have also tried some other emptyDir cases and did not hit the issue the nfs-provisioner pod hits. BTW, the deployment.yaml works without any issue. Thanks.
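For context, a minimal sketch of how deferred removal/deletion can be enabled for the devicemapper storage driver on a RHEL-style host, assuming the storage options live in /etc/sysconfig/docker-storage (on hosts managed by docker-storage-setup that file may be generated for you):

```bash
# Hedged sketch: enable deferred removal/deletion for the devicemapper
# storage driver (requires kernel and libdevmapper support).
# /etc/sysconfig/docker-storage should contain something like:
#   DOCKER_STORAGE_OPTIONS="--storage-opt dm.use_deferred_removal=true \
#                           --storage-opt dm.use_deferred_deletion=true"
grep DOCKER_STORAGE_OPTIONS /etc/sysconfig/docker-storage

sudo systemctl restart docker

# "Deferred Removal Enabled" / "Deferred Deletion Enabled" should now be true.
docker info | grep -i deferred
```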
Comment 3 (Wenqi He):

This seems related to NFS. QE has an nfs-server pod whose purpose is to create an NFS server inside the project for testing, and I found that this pod can also reproduce the issue:

1. Create the scc and service account.

2. Create the nfs-server pod:

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml

Check the node log:

153061 Jul 3 06:44:38 wehe-node-1 dockerd-current: time="2017-07-03T06:44:38.055266578Z" level=error msg="devmapper: Error unmounting device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db: devicemapper: Can not set cookie: dm_task_set_cookie failed"
153062 Jul 3 06:44:38 wehe-node-1 dockerd-current: time="2017-07-03T06:44:38.055340579Z" level=error msg="Error unmounting container 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: devicemapper: Can not set cookie: dm_task_set_cookie failed"
153074 Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112036 12994 kubelet_pods.go:1168] Generating status for "nfs-server_1q9kb(4b01b604-5fb6-11e7-b37e-000d3a1bc5aa)"
153075 Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112087 12994 kubelet.go:1847] SyncLoop (PLEG): "nfs-server_1q9kb(4b01b604-5fb6-11e7-b37e-000d3a1bc5aa)", event: &pleg.PodLifecycleEvent{ID:"4b01b604-5fb6-11e7-b37e-000d3a1bc5aa", Type:"ContainerDied", Data:"f8f2a32fa0a04d7cc1f8afb728da7e2ad09e91e3781b2601dc5c8b08053a6540"}
153076 Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: W0703 06:44:38.112116 12994 pod_container_deletor.go:77] Container "f8f2a32fa0a04d7cc1f8afb728da7e2ad09e91e3781b2601dc5c8b08053a6540" not found in pod's containers
153077 Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112260 12994 status_manager.go:340] Status Manager: adding pod: "4b01b604-5fb6-11e7-b37e-000d3a1bc5aa", with status: ('\x04', {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:44:38 +0000 UTC ContainersNotReady containers with unready status: [nfs-server]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC }] 172.27.17.5 2017-07-03 06:10:20 +0000 UTC [] [{nfs-server {nil nil &ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2017-07-03 06:10:49 +0000 UTC,FinishedAt:2017-07-03 06:44:25 +0000 UTC,ContainerID:docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15,}} {nil nil nil} false 0 docker.io/aosqe/nfs-server:latest docker-pullable://docker.io/aosqe/nfs-server@sha256:3e5e1fbfc51571a007451a98a43f6f30d7abbcc4a3ce63bb0b39db253956e8b0 docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15}] BestEffort}) to podStatusChannel
153078 Jul 3 06:44:38 wehe-node-1 atomic-openshift-node: I0703 06:44:38.112464 12994 status_manager.go:147] Status Manager: syncing pod: "4b01b604-5fb6-11e7-b37e-000d3a1bc5aa", with status: (4, {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:44:38 +0000 UTC ContainersNotReady containers with unready status: [nfs-server]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2017-07-03 06:10:20 +0000 UTC }] 172.27.17.5 2017-07-03 06:10:20 +0000 UTC [] [{nfs-server {nil nil &ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2017-07-03 06:10:49 +0000 UTC,FinishedAt:2017-07-03 06:44:25 +0000 UTC,ContainerID:docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15,}} {nil nil nil} false 0 docker.io/aosqe/nfs-server:latest docker-pullable://docker.io/aosqe/nfs-server@sha256:3e5e1fbfc51571a007451a98a43f6f30d7abbcc4a3ce63bb0b39db253956e8b0 docker://8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15}] BestEffort}) from podStatusChannel
153433 Jul 3 06:45:07 wehe-node-1 dockerd-current: time="2017-07-03T06:45:07.149023906Z" level=error msg="Error removing mounted layer 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"
153434 Jul 3 06:45:07 wehe-node-1 dockerd-current: time="2017-07-03T06:45:07.149112407Z" level=error msg="Handler for DELETE /v1.24/containers/8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15?v=1 returned error: Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"
153435 Jul 3 06:45:07 wehe-node-1 dockerd-current: time="2017-07-03T06:45:07.149143207Z" level=error msg="Handler for DELETE /v1.24/containers/8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15 returned error: Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"
153436 Jul 3 06:45:07 wehe-node-1 atomic-openshift-node: E0703 06:45:07.149642 12994 remote_runtime.go:241] RemoveContainer "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15" from runtime service failed: rpc error: code = 2 desc = failed to remove container "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"}
153437 Jul 3 06:45:07 wehe-node-1 atomic-openshift-node: E0703 06:45:07.149696 12994 kuberuntime_gc.go:123] Failed to remove container "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15": rpc error: code = 2 desc = failed to remove container "8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 8b1a0322f493d41aa1525a56b943e82116fa33bcebe27c5a0f9299f321addd15: failed to remove device 3102fa5f99655a415a3e84c755dc5101f38c4886fc1ef86082bcf5ac5f9621db:devicemapper: Can not set cookie: dm_task_set_cookie failed"}
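The two signatures worth watching for in the excerpt above are the dockerd-current "Can not set cookie: dm_task_set_cookie failed" errors and the resulting RemoveContainer failures from the kubelet. A hedged sketch for pulling just those out of a node's logs (unit and file names vary by host setup):

```bash
# Hedged sketch: surface the devicemapper cookie errors on an affected node.
journalctl -u docker --since "1 hour ago" | grep -i dm_task_set_cookie

# Or on hosts that log to syslog (as in the excerpt above):
grep -n dm_task_set_cookie /var/log/messages | tail
```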
Comment 4 (Jan Safranek):

I tried to reproduce this bug with today's docker and openshift, and I failed:

docker-1.12.6-32.git88a4867.el7.x86_64
atomic-openshift-node-3.6.133-1.git.0.524e4c8.el7.x86_64

$ docker info | grep Deferred
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml

The pod gets created, runs, and can be deleted without any problems.

I also tried docker-1.12.6-39.1.git6ffd653.el7.x86_64, still the same - works like a charm. There must be something else wrong on the system.
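For what it's worth, one known failure mode behind "dm_task_set_cookie failed" is SysV semaphore exhaustion: libdevmapper's udev-sync cookies are SysV semaphores, and if leaked cookies accumulate until the kernel limit is hit, new cookies cannot be created. That would fit a system that only breaks after pods have churned for a while, while a fresh system reproduces nothing. A hedged diagnostic sketch:

```bash
# Hedged sketch: check for leaked device-mapper udev cookies, which are
# SysV semaphores; hitting the kernel limit makes dm_task_set_cookie fail.
ipcs -s | wc -l       # roughly how many semaphore arrays exist right now
ipcs -ls              # kernel semaphore limits
sysctl kernel.sem     # SEMMSL SEMMNS SEMOPM SEMMNI

# dmsetup can list cookies left behind by interrupted operations:
dmsetup udevcookies
```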
Comment 5 (Wenqi He):

(In reply to Jan Safranek from comment #4)
> I tried to reproduce this bug with today's docker and openshift, and I
> failed:
> docker-1.12.6-32.git88a4867.el7.x86_64
> atomic-openshift-node-3.6.133-1.git.0.524e4c8.el7.x86_64
>
> $ docker info | grep Deferred
>  Deferred Removal Enabled: false
>  Deferred Deletion Enabled: false
>  Deferred Deleted Device Count: 0
>
> $ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml
>
> The pod gets created, runs, and can be deleted without any problems.
>
> I also tried docker-1.12.6-39.1.git6ffd653.el7.x86_64, still the same -
> works like a charm. There must be something else wrong on the system.

The first time, this pod deploys and runs fine. But after we delete it, no other pod can be started again; it fails with the errors from my comment 3. Do you still have the env? Could you try to create another pod:

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/pods/hello-pod.json

I have also tried creating only the pod https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/storage/nfs/nfs-server.yaml; that alone does not reproduce it. But if we create another pod that uses the nfs volume, the issue reproduces. Detailed repro steps:

1. Create a project.

2. Create the scc https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/authorization/scc/scc_super_template.yaml and add the service account and user to it:

# oc get scc super-wehe -o yaml
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: true
allowHostPID: false
allowHostPorts: true
allowPrivilegedContainer: true
allowedCapabilities: []
apiVersion: v1
defaultAddCapabilities: []
fsGroup:
  type: RunAsAny
groups:
- system:serviceaccounts:9tw98
kind: SecurityContextConstraints
metadata:
  creationTimestamp: 2017-07-05T09:51:43Z
  name: super-wehe
  resourceVersion: "63304"
  selfLink: /api/v1/securitycontextconstraints/super-wehe
  uid: 8ce63cbc-6167-11e7-b249-000d3a1a72a9
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities: []
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
users:
- wehe
volumes:
- '*'

3. Create the nfs-server pod:

# oc create -f https://github.com/openshift-qe/v3-testfiles/raw/master/storage/nfs/nfs-server.yaml

4. Check the service:

$ oc get svc
NAME          CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
nfs-service   172.30.109.118   <none>        2049/TCP   9s

5. Create an nfs pv that points at this nfs-server:

# cat pvrwo.json | oc create -f -
{
    "apiVersion": "v1",
    "kind": "PersistentVolume",
    "metadata": {
        "name": "wehepv"
    },
    "spec": {
        "capacity": {
            "storage": "2Gi"
        },
        "accessModes": [ "ReadWriteOnce" ],
        "nfs": {
            "path": "/",
            "server": "172.30.109.118"
        },
        "persistentVolumeReclaimPolicy": "Retain"
    }
}

6. Create a pvc to bind this pv:

# cat pvc.json | oc create -f -
{
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {
        "name": "pvc"
    },
    "spec": {
        "accessModes": [ "ReadWriteOnce" ],
        "resources": {
            "requests": {
                "storage": "1Gi"
            }
        }
    }
}

7. Create a pod to consume this pvc:

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/persistent-volumes/nfs/auto/web-pod.json

8. Touch a file to check read/write on the mounted dir:

$ oc get pods
NAME         READY     STATUS    RESTARTS   AGE
nfs          1/1       Running   0          45s
nfs-server   1/1       Running   0          8m
$ oc exec -it nfs sh
sh-4.2$ ls /mnt/
sh-4.2$ touch /mnt/test
sh-4.2$ exit

9. Delete all the pods and svc:

$ oc delete pods --all
pod "nfs" deleted
pod "nfs-server" deleted
[wehe@dhcp-136-45 octest]$ oc get pods
NAME         READY     STATUS        RESTARTS   AGE
nfs          0/1       Terminating   0          2m
nfs-server   1/1       Terminating   0          9m

10. Create a pod again; it stays in ContainerCreating status:

$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/pods/hello-pod.json
$ oc get pods
NAME              READY     STATUS              RESTARTS   AGE
hello-openshift   0/1       ContainerCreating   0          10m

11. ssh to the node and run docker directly (see also the diagnostic sketch after this comment):

# docker run --rm -ti docker.io/aosqe/nfs-web bash
/usr/bin/docker-current: Error response from daemon: devmapper: Error activating devmapper device for 'f76117935a9892fefc77651e018f8e0e6da797c4cd96b439053ba31925cf9154-init': devicemapper: Can't set cookie dm_task_set_cookie failed.

I think this is the same root cause as for the nfs-provisioner pod. Thanks.
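Before ssh-ing to the node, it can help to confirm that the pod from step 10 is stuck on the devicemapper error rather than, say, an image pull. A hedged sketch of the checks one might run (standard oc/docker commands, nothing specific to this bug):

```bash
# Hedged sketch: confirm why hello-openshift is stuck in ContainerCreating.
oc describe pod hello-openshift              # inspect the Events section
oc get events --sort-by='.lastTimestamp'     # recent warnings in the project

# On the node: can docker create *any* container at this point?
docker run --rm docker.io/busybox true \
  || echo "container creation is broken at the docker layer"
```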
Comment 6:

I wonder if this would happen with an "ordinary" NFS server, since you say it happens only when a pod actually *uses* the NFS volume. Could you try creating a pod that uses an NFS volume backed by something other than the nfs-provisioner/nfs-server pod and see if anything breaks? Maybe there is something weird going on between devicemapper and the NFS mount. Thanks.

Regardless, I'm leaning heavily towards this being a Docker problem that we have little control over. The NFS server pod is not doing anything particularly bad or different from x months ago; only Docker has changed.

Comment 7:

I have tried a real NFS server deployed on the node; the issue did not reproduce there. Thanks.
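Comment 7 does not record the exact configuration used for that verification. For reference, a hedged sketch of the test comment 6 asks for, with a placeholder server name and export path (nfs.example.com and /exports are assumptions, not values from this bug):

```bash
# Hedged sketch of comment 6's suggested test: point a PV at a standalone
# (non-pod) NFS server, then consume it from a pod as in comment 5.
# "nfs.example.com" and "/exports" are placeholders.
oc create -f - <<'EOF'
{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": { "name": "external-nfs-pv" },
  "spec": {
    "capacity": { "storage": "2Gi" },
    "accessModes": [ "ReadWriteOnce" ],
    "nfs": { "path": "/exports", "server": "nfs.example.com" },
    "persistentVolumeReclaimPolicy": "Retain"
  }
}
EOF
# Then repeat comment 5, steps 6-10: bind a PVC, run a pod against it,
# delete everything, and check whether new pods still start.
```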