Description of problem: After restarting the docker service on one of the OCP nodes, docker failed to come back up. Initially, we restarted the docker service on all the OCP nodes and it came up everywhere, but one gluster pod, hosted on node 10.70.46.178, was left in a failed state. We logged in to that node and checked the docker service: "systemctl status docker" showed the service as running. We then tried restarting the docker service on this node again, and the restart failed.
=============
# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─custom.conf
   Active: failed (Result: exit-code) since Tue 2018-12-04 14:41:26 IST; 54s ago
     Docs: http://docs.docker.com
  Process: 120950 ExecStart=/usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY $REGISTRIES (code=exited, status=1/FAILURE)
 Main PID: 120950 (code=exited, status=1/FAILURE)

Dec 04 14:41:24 dhcp46-178.lab.eng.blr.redhat.com systemd[1]: Starting Docker Application Container Engine...
Dec 04 14:41:25 dhcp46-178.lab.eng.blr.redhat.com dockerd-current[120950]: time="2018-12-04T14:41:25.025118184+05:30" level=warning msg="could not change group /var/run/docker.sock to docker: group...not found"
Dec 04 14:41:25 dhcp46-178.lab.eng.blr.redhat.com dockerd-current[120950]: time="2018-12-04T14:41:25.029528387+05:30" level=info msg="libcontainerd: new containerd process, pid: 120958"
Dec 04 14:41:26 dhcp46-178.lab.eng.blr.redhat.com dockerd-current[120950]: Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: dev..._run failed
Dec 04 14:41:26 dhcp46-178.lab.eng.blr.redhat.com systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Dec 04 14:41:26 dhcp46-178.lab.eng.blr.redhat.com systemd[1]: Failed to start Docker Application Container Engine.
Dec 04 14:41:26 dhcp46-178.lab.eng.blr.redhat.com systemd[1]: Unit docker.service entered failed state.
Dec 04 14:41:26 dhcp46-178.lab.eng.blr.redhat.com systemd[1]: docker.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
========================
# oc get pods -o wide
NAME                                          READY   STATUS              RESTARTS   AGE   IP             NODE                                NOMINATED NODE
glusterblock-storage-provisioner-dc-1-l7d7r   1/1     Running             1          2d    10.129.2.14    dhcp46-223.lab.eng.blr.redhat.com   <none>
glusterfs-storage-7b8cs                       1/1     Running             1          2d    10.70.46.87    dhcp46-87.lab.eng.blr.redhat.com    <none>
glusterfs-storage-hfjbd                       1/1     Running             2          2d    10.70.47.16    dhcp47-16.lab.eng.blr.redhat.com    <none>
glusterfs-storage-hnqh4                       0/1     ContainerCreating   0          1h    10.70.46.178   dhcp46-178.lab.eng.blr.redhat.com   <none>
glusterfs-storage-pjn9d                       1/1     Running             1          2d    10.70.47.26    dhcp47-26.lab.eng.blr.redhat.com    <none>
heketi-storage-1-fqh99                        1/1     Running             1          2d    10.131.2.16    dhcp47-17.lab.eng.blr.redhat.com    <none>
=====================
The gluster pod glusterfs-storage-hnqh4 failed to come up.
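Because systemctl ellipsizes long journal lines (note the "Hint: Some lines were ellipsized" above), the full graphdriver failure is easiest to read via "journalctl -u docker -l --no-pager". As a minimal sketch, the daemon-fatal line can then be filtered out of a captured excerpt; the excerpt below is quoted from this report, and the temp file path is illustrative:

```shell
# Save an excerpt of the docker unit journal (lines quoted from this report)
# and pull out the line that explains why dockerd exited with status 1.
cat > /tmp/docker-journal.txt <<'EOF'
time="2018-12-04T14:41:25.025118184+05:30" level=warning msg="could not change group /var/run/docker.sock to docker: group...not found"
time="2018-12-04T14:41:25.029528387+05:30" level=info msg="libcontainerd: new containerd process, pid: 120958"
Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: dev..._run failed
EOF

# Only the "Error starting daemon" line is fatal; the warnings above it are not.
grep '^Error starting daemon' /tmp/docker-journal.txt
```

The same filter against the live journal ("journalctl -u docker -l --no-pager | grep 'Error starting daemon'") points directly at the devicemapper base-device verification failure.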
The pod description shows:
============
Events:
  Type     Reason                  Age               From                                         Message
  ----     ------                  ----              ----                                         -------
  Warning  FailedCreatePodSandBox  1h (x8 over 1h)   kubelet, dhcp46-178.lab.eng.blr.redhat.com   Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "glusterfs-storage-hnqh4": Error response from daemon: devicemapper: Can't set task name /dev/mapper/docker--vg-docker--pool
  Warning  FailedCreatePodSandBox  1h                kubelet, dhcp46-178.lab.eng.blr.redhat.com   Failed create pod sandbox: rpc error: code = Unknown desc = failed to inspect sandbox image "registry.access.redhat.com/openshift3/ose-pod:v3.11": Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
==============

Version-Release number of selected component (if applicable):
# oc version
oc v3.11.43
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp47-89.lab.eng.blr.redhat.com:8443
openshift v3.11.43
kubernetes v1.11.0+d4cacc0

# docker --version
Docker version 1.13.1, build 07f3374/1.13.1
docker-1.13.1-84.git07f3374.el7.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Create an OCS 3.11.1 and OCP 3.11 setup with logging and metrics configured. Four gluster pods were running on this setup.
2. Create 5 postgresql app pods and 5 mongodb app pods backed by block-hosting volumes. All pods were up and running.
3. Restart the docker service on all the nodes. The service started on all the nodes.
4. Wait for the pods to come up.
5. A few pods, along with one gluster pod, failed to come up.
6. Log in to the node (10.70.46.178) hosting the gluster pod that was in the failed state.
7. Check the docker status; it showed "running".
8. Restarting the docker service on that node failed.

Actual results:
The docker service failed to restart on one of the nodes.
Expected results:
The docker service restart should not fail.

Additional info:
The docker restart failed with:
===============
# journalctl -xe
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: E1204 15:05:45.053445    5273 kuberuntime_sandbox.go:198] ListPodSandbox failed: rpc error: code = Unknown desc = Cannot connect to
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: E1204 15:05:45.053462    5273 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = Unknown desc = Cannot connect
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.107107    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.107162    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.207547    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.207612    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.308030    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.308088    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.408496    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.408556    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.508895    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.508970    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.553689    5273 kubelet_node_status.go:385] Updating node status
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.555829    5273 kubelet_node_status.go:1077] Setting node status at position 0
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.557802    5273 kubelet_node_status.go:1077] Setting node status at position 1
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: E1204 15:05:45.558709    5273 remote_runtime.go:69] Version from runtime service failed: rpc error: code = Unknown desc = failed to
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: E1204 15:05:45.558740    5273 kuberuntime_manager.go:245] Get remote runtime version failed: rpc error: code = Unknown desc = failed
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.558760    5273 kubelet_node_status.go:1077] Setting node status at position 2
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.558766    5273 kubelet_node_status.go:1077] Setting node status at position 3
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.558772    5273 kubelet_node_status.go:1077] Setting node status at position 4
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.558777    5273 kubelet_node_status.go:1077] Setting node status at position 5
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.558782    5273 kubelet_node_status.go:1077] Setting node status at position 6
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.558796    5273 kubelet_node_status.go:1077] Setting node status at position 7
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.558833    5273 kubelet_node_status.go:1077] Setting node status at position 8
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.609254    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.609330    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.709706    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.709799    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.810370    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.810471    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.910841    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:45 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:45.910901    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.011315    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.011393    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.053637    5273 generic.go:183] GenericPLEG: Relisting
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: E1204 15:05:46.054383    5273 remote_runtime.go:169] ListPodSandbox with filter nil from runtime service failed: rpc error: code = U
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: E1204 15:05:46.054434    5273 kuberuntime_sandbox.go:198] ListPodSandbox failed: rpc error: code = Unknown desc = Cannot connect to
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: E1204 15:05:46.054449    5273 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = Unknown desc = Cannot connect
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.111851    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.111926    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.212385    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.212442    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.275476    5273 worker.go:177] Probe target container not found: glusterfs-storage-hnqh4_app-storage(ee6f9418-f797-11e
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.312737    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.312795    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.413295    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.413380    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.513704    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.513767    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.614143    5273 reconciler.go:283] Starting operationExecutor.UnmountDevice for volume "pvc-1d1b6715-f725-11e8-8792-00
Dec 04 15:05:46 dhcp46-178.lab.eng.blr.redhat.com atomic-openshift-node[5273]: I1204 15:05:46.614214    5273 volume_host.go:237] using default mounter/exec for kubernetes.io/iscsi
================

LVM state on the node:
===============
# vgs
  VG                                  #PV #LV #SN Attr   VSize   VFree
  docker-vg                             1   1   0 wz--n-  <2.00t  1.20t
  rhel_dhcp47-42                        1   2   0 wz--n- <83.00g      0
  vg_7db329da0d33a79ba176484bc72ec654   1   4   0 wz--n-  <2.00t <1.90t
  vg_85f2f8099850d8396e22a54a50bd593a   1   0   0 wz--n-  <2.00t <2.00t

# pvs
  /dev/sdi: open failed: No such device or address
  PV         VG                                  Fmt  Attr PSize   PFree
  /dev/sda2  rhel_dhcp47-42                      lvm2 a--  <83.00g      0
  /dev/sdc1  docker-vg                           lvm2 a--   <2.00t  1.20t
  /dev/sdd   vg_85f2f8099850d8396e22a54a50bd593a lvm2 a--   <2.00t <2.00t
  /dev/sde   vg_7db329da0d33a79ba176484bc72ec654 lvm2 a--   <2.00t <1.90t

# lvs
  LV                                     VG                                  Attr       LSize   Pool                                Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool                            docker-vg                           twi-a-t--- 815.09g                                            0.50   0.85
  home                                   rhel_dhcp47-42                      -wi-ao---- <33.00g
  root                                   rhel_dhcp47-42                      -wi-ao----  50.00g
  brick_bd316ce03b83df64b35119e47d0271f0 vg_7db329da0d33a79ba176484bc72ec654 Vwi-a-tz-- 100.00g tp_68842b70a06ad3a567009f705a079f17        0.40
  brick_ca9b00c54f1ae37088f5d737f7890e79 vg_7db329da0d33a79ba176484bc72ec654 Vwi-a-tz--   2.00g tp_ca9b00c54f1ae37088f5d737f7890e79        0.72
  tp_68842b70a06ad3a567009f705a079f17    vg_7db329da0d33a79ba176484bc72ec654 twi-aotz-- 100.00g                                            0.40   3.17
  tp_ca9b00c54f1ae37088f5d737f7890e79    vg_7db329da0d33a79ba176484bc72ec654 twi-aotz--   2.00g                                            0.72   10.32
===============
Will attach the sosreport shortly.
I am hitting the issue described here when I try to create postgresql pods on my OCS cluster.
Just information from the lvm2 POV: running lvm2 commands inside containers is, at the moment, generally an unsupportable operation. There is no udev inside the container, so lvm2 cannot synchronize with udev, and the state of the device symlinks becomes fairly random, especially under a large stream of activation and deactivation commands. The best advice is that lvm2 commands that manipulate block devices should be executed in the proper environment, i.e. the host system, where everything matches: kernel, udev, udev rules, binaries, and libraries. Running a different version of lvm2 from a container, without access to the system's udev, works only with a lot of luck.
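One practical way to apply this advice is to check for a running udev before issuing lvm2 device commands; inside a typical container there is none. A minimal sketch (the nsenter fallback mentioned in the comment assumes a privileged container that can reach PID 1's mount namespace):

```shell
# udev exposes its control socket at /run/udev/control while udevd is running.
# If it is absent (the usual container case), lvm2 cannot synchronize with udev
# and device commands should be run on the host instead.
if [ -S /run/udev/control ]; then
    echo "udev present: lvm2 device commands can synchronize with udev here"
else
    # e.g. from a privileged container: nsenter -t 1 -m -- lvs
    echo "no udev: run lvm2 device commands on the host system"
fi
```

This is only a heuristic; the authoritative fix remains running lvm2 on the host, where kernel, udev, udev rules, and libraries all match.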
Well, overlayfs in Docker and CRI-O doesn't use LVM, so there is no chance of the tool deadlocking the daemon. What you are seeing here is Docker and the container fighting over the management of LVM. If Docker no longer does anything with devicemapper, then the conflict disappears. I'm not saying there couldn't be other problems with running lvm commands inside a container, but at least they will not interfere with the container engine.
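For reference, on RHEL 7 the engine can be moved off devicemapper by pointing docker-storage-setup at overlay2 before recreating docker's storage. A hedged config sketch only; this assumes docker storage is being set up fresh (recreating storage discards existing local images/containers) and a docker version with overlay2 support, such as the 1.13.1 build in this report:

```
# /etc/sysconfig/docker-storage-setup  (illustrative fragment)
# Select overlay2 instead of the devicemapper thin pool.
STORAGE_DRIVER=overlay2
```

With overlay2 in place, the devicemapper thin pool (docker-vg/docker-pool above) is no longer touched by the engine, which removes the contention described in this comment.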
*** Bug 1748445 has been marked as a duplicate of this bug. ***
doc text has been modified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0622