Description of problem:
OpenShift node status is NotReady, due to docker service failure.

Version-Release number of selected component (if applicable):
openshift v3.3.0.12
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

How reproducible:
50%

Steps to Reproduce:
1. Set up an openshift cluster via the Jenkins job with the rhel7.2.6 image.
2. Check the openshift cluster status: $ oc get nodes
3. Check the docker service on the NotReady node.

Actual results:
2. Not all nodes are Ready.
# oc get nodes
NAME      STATUS                     AGE
node1     Ready,SchedulingDisabled   1h
node2     NotReady                   1h
node3     Ready                      1h

3. The docker service failed.
# systemctl status docker -l
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/docker.service.d
           └─docker-sdn-ovs.conf
   Active: failed (Result: exit-code) since Mon 2016-08-01 04:51:13 EDT; 51s ago
     Docs: http://docs.docker.com
  Process: 5708 ExecStart=/usr/bin/docker-current daemon --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
 Main PID: 5708 (code=exited, status=1/FAILURE)

Expected results:
2. All nodes are Ready.
3. The docker service is running.

Additional info:
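Steps 2 and 3 can be scripted when many nodes are involved. A small sketch that filters the node listing for NotReady nodes; it is fed here with the sample output from this report, but on a live cluster the same awk can consume `oc get nodes --no-headers` directly:

```shell
# Filter a node listing for NotReady nodes; the sample data mirrors the
# `oc get nodes` output above. The pattern match (rather than an exact
# string compare) also catches combined states like NotReady,SchedulingDisabled.
printf '%s\n' \
  'node1   Ready,SchedulingDisabled   1h' \
  'node2   NotReady                   1h' \
  'node3   Ready                      1h' |
awk '$2 ~ /NotReady/ { print $1 }'
# then, on each node printed: systemctl status docker -l
```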
Created attachment 1186332 [details] /var/log/messages
# rpm -qa | grep device-mapper
device-mapper-event-libs-1.02.107-5.el7_2.5.x86_64
device-mapper-persistent-data-0.6.2-1.el7_2.x86_64
device-mapper-libs-1.02.107-5.el7_2.5.x86_64
device-mapper-event-1.02.107-5.el7_2.5.x86_64
device-mapper-1.02.107-5.el7_2.5.x86_64

# uname -r
3.10.0-327.22.2.el7.x86_64

# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: rhel-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 404.8 MB
 Data Space Total: 8.884 GB
 Data Space Available: 8.479 GB
 Metadata Space Used: 102.4 kB
 Metadata Space Total: 33.55 MB
 Metadata Space Available: 33.45 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Execution Driver: native-0.2
Logging Driver: journald
Plugins:
 Volume: local
 Network: null host bridge
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 3.702 GiB
Name: xxx.com
ID: BSPL:CCEL:CU2U:PLGZ:ICXB:F3VH:ZYTG:4HMW:W2Z7:2JXD:4QTT:DNPN
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

# cat /etc/sysconfig/docker-storage-setup
# Edit this file to override any configuration options specified in
# /usr/lib/docker-storage-setup/docker-storage-setup.
#
# For more details refer to "man docker-storage-setup"
VG=rhel
GROWPART=true

# lvs
  LV          VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-aot--- 20.26g             3.27   0.27
  root        rhel -wi-ao---- 10.00g

# vgdisplay
  --- Volume group ---
  VG Name               rhel
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  15
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               59.51 GiB
  PE Size               4.00 MiB
  Total PE              15234
  Alloc PE / Size       7779 / 30.39 GiB
  Free  PE / Size       7455 / 29.12 GiB
  VG UUID               THPZVD-CJgg-TeTk-KMfl-wbLX-iNL1-9Lp4Jl
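With GROWPART and auto-extension configured as above, lvm grows the pool once the Data% reported by `lvs` crosses the autoextend threshold. A hypothetical helper (not part of docker-storage-setup) showing that decision; the threshold of 60 below is purely illustrative, since the real value lives in an lvm profile (`thin_pool_autoextend_threshold`):

```shell
# Hypothetical helper: decide whether a thin pool's Data% from `lvs`
# has crossed the autoextend threshold. Returns success (0) when it has.
pool_needs_extend() {
  # $1 = Data% as reported by lvs, $2 = threshold percent (illustrative)
  awk -v used="$1" -v thr="$2" 'BEGIN { exit !(used >= thr) }'
}

# Against the lvs output above (docker-pool at 3.27% used):
pool_needs_extend 3.27 60 && echo "autoextend would trigger" || echo "pool ok"
```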
Reassigning to RHEL7/docker
Reproduced 100% on my 3 boxes so far.
Are ALL of these failures triggered when using OpenShift 3.3? Have ANY failures been found using OpenShift 3.2? I want proof that this is, or is not, a result of https://bugzilla.redhat.com/show_bug.cgi?id=1363776
Actually the question is not 'have any failures been found with 3.2'; the real question is 'have any failures been found WITHOUT 3.3'?
(In reply to Eric Paris from comment #35)
> Actually the question is not 'have any failures been found with 3.2' the
> real question is 'have any failures been found WITHOUT 3.3'?

According to the analysis in comment 30, this issue is caused by openshift components performing thinp snapshot operations. As far as I know, calling the dmsetup or thin_ls command to operate on the devicemapper thin pool was added in 3.3 (https://trello.com/c/5NzVnJ5P/314-13-support-getting-container-filesystem-usage-in-cadvisor-for-devicemapper), not 3.2. So 3.2 will not have this issue, and in fact QE never encountered it before. I will do real testing against 3.2 + the latest docker to prove that later.
(In reply to Johnny Liu from comment #37)
Correction: this issue is caused by openshift components performing thinp snapshot operations while the docker-pool is being automatically extended after the extension threshold is hit.
@Eric,
> Actually the question is not 'have any failures been found with 3.2' the
> real question is 'have any failures been found WITHOUT 3.3'?

I just tested openshift 3.3 and 3.2 using the same steps: 3.2 does not have this issue, 3.3 does.

1. Configure docker-storage-setup to enable auto-extension of the docker-pool:
# cat /etc/sysconfig/docker-storage-setup
VG=rhel
GROWPART=true

2. Install openshift.

3. Run the following commands:
# lvs
  LV          VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-aot---  8.27g             27.80  2.69
  root        rhel -wi-ao---- 10.00g
# docker run rhel7 dd if=/dev/zero of=/data bs=1G

For 3.2, the above command completed successfully, and the docker-pool was extended successfully:
# lvs
  LV          VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-aot--- 19.01g             63.71  4.79
  root        rhel -wi-ao---- 10.00g

For 3.3, the docker-pool crashed, which caused the docker daemon to crash. The reason is what I said in comment 37: this issue is caused by openshift components performing thinp snapshot operations while the docker-pool is being automatically extended after the extension threshold is hit. The behavior of openshift components performing thinp snapshot operations was newly introduced in 3.3.

@caiqian,
> It is interesting that you guys only saw this recently.

The above explains why we only found it recently: we just started full-functional testing of 3.3 recently.
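For reference, a sketch (not the literal cadvisor code) of the snapshot-then-thin_ls sequence that OpenShift 3.3's embedded cadvisor runs against the docker thin pool; the pool name is taken from the `docker info` output above, and the `_tmeta` device path is an assumption about lvm's naming. These are exactly the operations that race with lvm's automatic pool extension. DRY_RUN=1 (the default here) only prints the commands, since actually running them needs root and a live pool:

```shell
# Sketch of the thin-pool metadata snapshot cycle; pool name assumed from
# the `docker info` output in this report, tmeta path is an assumption.
POOL=rhel-docker--pool
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run dmsetup message "$POOL" 0 reserve_metadata_snap   # take a metadata snapshot
run thin_ls --no-headers -m -o DEV,EXCLUSIVE_BYTES "/dev/mapper/${POOL}_tmeta"   # per-device usage
run dmsetup message "$POOL" 0 release_metadata_snap   # drop the snapshot
```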
https://github.com/google/cadvisor/pull/1411, once merged, fixes cadvisor so it won't reserve/release metadata snapshots or invoke thin_ls unless a minimum kernel version containing the thin pool corruption fixes is in use.
Starting in v3.3.0.15, we won't try to reserve metadata snapshots or invoke thin_ls unless the kernel is new enough (>= 3.10.0-166) to contain the fixes for the corruption issues.
The proposed kernel version is 3.10.0-366, not 3.10.0-166.
Fixed.

openshift v3.3.0.17
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

openshift-105 atomic-openshift-node:
E0808 23:06:16.676806   15757 factory.go:291] devicemapper filesystem stats will not be reported: RHEL/Centos 7.x kernel version 3.10.0-366 or later is required to use thin_ls - you have "3.10.0-327.22.2.el7.x86_64"
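The kernel gate shown in that log line can be approximated in shell with a version-order comparison; a minimal sketch, assuming `sort -V` (GNU coreutils) and the 3.10.0-366 minimum from the message above:

```shell
# Approximate the v3.3.0.15+ gate: thin_ls is only used when the running
# kernel is >= 3.10.0-366 (the build carrying the thin-pool corruption fixes).
MIN=3.10.0-366
kernel_ok() {
  # the lowest of {running kernel, minimum} sorts first under version order;
  # if that is the minimum, the running kernel is new enough
  [ "$(printf '%s\n' "$1" "$MIN" | sort -V | head -n1)" = "$MIN" ]
}

kernel_ok "3.10.0-327.22.2.el7.x86_64" && echo "thin_ls enabled" || echo "thin_ls disabled"
kernel_ok "3.10.0-366.el7.x86_64" && echo "thin_ls enabled" || echo "thin_ls disabled"
```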
*** Bug 1364176 has been marked as a duplicate of this bug. ***
Would it be possible to fix this bug in the case where Rancher is starting docker containers? This is my kernel version: 3.10.0-327.28.3.el7.x86_64

Sep 01 11:18:38 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:38.792047399+02:00" level=info msg="libcontainerd: new containerd process, pid: 2158"
Sep 01 11:18:39 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:39.950465041+02:00" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."
Sep 01 11:18:40 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:40.043373535+02:00" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devmapper: Failed to find uuid for device /dev/mapper/docker-253:0-17051550-base:exit status 2"
Sep 01 11:18:40 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:40.043649955+02:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devmapper: Failed to find uuid for device /dev/mapper/docker-253:0-17051550-base:exit status 2"
Sep 01 11:18:40 uxtest.rs systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Sep 01 11:18:40 uxtest.rs systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has failed.
--
-- The result is failed.
Sep 01 11:18:40 uxtest.rs systemd[1]: Unit docker.service entered failed state.
Sep 01 11:18:40 uxtest.rs systemd[1]: docker.service failed.
Sep 01 11:18:40 uxtest.rs polkitd[622]: Unregistered Authentication Agent for unix-process:2137:696636 (system bus name :1.19, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
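Note that the log above shows docker running on loopback devicemapper (the setup the daemon's own warning discourages), which is a different failure mode from the lvm thin-pool corruption tracked in this bug. If switching to an lvm thin pool is an option, a docker-storage-setup fragment like the following is the usual route; `rancher` is a hypothetical VG name, and the existing loopback state under /var/lib/docker has to be removed first, which destroys all existing images and containers:

```shell
# /etc/sysconfig/docker-storage-setup -- hypothetical example; the VG
# "rancher" must already exist with free space. Apply by stopping docker,
# removing /var/lib/docker (DESTRUCTIVE), then running `docker-storage-setup`
# and restarting docker.
VG=rancher
```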