Bug 1362109
| Summary: | OpenShift node status NotReady, due to docker service failure | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Weihua Meng <wmeng> | ||||
| Component: | Node | Assignee: | Andy Goldstein <agoldste> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | DeShuai Ma <dma> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 3.3.0 | CC: | agoldste, anli, aos-bugs, bbreard, branka_kostadinovic, dma, dwalsh, eparis, ghuang, haowang, imcleod, jhonce, jialiu, jokerman, lsm5, lsu, mmccomas, mmcgrath, pruan, qcai, sgraf, smahajan, tdawson, tstclair, vlaad, wsun, xiazhao, xtian, yanpzhan | ||||
| Target Milestone: | --- | Keywords: | Extras, TestBlocker | ||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | atomic-openshift-3.3.0.15-1.git.0.4a954d5.el7 | Doc Type: | If docs needed, set a value | ||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-09-19 14:32:09 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1290911, 1363776, 1364462 | ||||||
| Bug Blocks: | 1353885 | ||||||
| Attachments: |
Created attachment 1186332 [details]
/var/log/messages
# rpm -qa |grep device-mapper
device-mapper-event-libs-1.02.107-5.el7_2.5.x86_64
device-mapper-persistent-data-0.6.2-1.el7_2.x86_64
device-mapper-libs-1.02.107-5.el7_2.5.x86_64
device-mapper-event-1.02.107-5.el7_2.5.x86_64
device-mapper-1.02.107-5.el7_2.5.x86_64
# uname -r
3.10.0-327.22.2.el7.x86_64
# docker info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 2
Server Version: 1.10.3
Storage Driver: devicemapper
Pool Name: rhel-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 404.8 MB
Data Space Total: 8.884 GB
Data Space Available: 8.479 GB
Metadata Space Used: 102.4 kB
Metadata Space Total: 33.55 MB
Metadata Space Available: 33.45 MB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.107-RHEL7 (2016-06-09)
Execution Driver: native-0.2
Logging Driver: journald
Plugins:
Volume: local
Network: null host bridge
Authorization: rhel-push-plugin
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 3.702 GiB
Name: xxx.com
ID: BSPL:CCEL:CU2U:PLGZ:ICXB:F3VH:ZYTG:4HMW:W2Z7:2JXD:4QTT:DNPN
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
# cat /etc/sysconfig/docker-storage-setup
# Edit this file to override any configuration options specified in
# /usr/lib/docker-storage-setup/docker-storage-setup.
#
# For more details refer to "man docker-storage-setup"
VG=rhel
GROWPART=true
# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
docker-pool rhel twi-aot--- 20.26g 3.27 0.27
root rhel -wi-ao---- 10.00g
# vgdisplay
--- Volume group ---
VG Name rhel
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 15
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 2
Open LV 2
Max PV 0
Cur PV 1
Act PV 1
VG Size 59.51 GiB
PE Size 4.00 MiB
Total PE 15234
Alloc PE / Size 7779 / 30.39 GiB
Free PE / Size 7455 / 29.12 GiB
VG UUID THPZVD-CJgg-TeTk-KMfl-wbLX-iNL1-9Lp4Jl
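The `lvs` output above reports thin pool usage in the Data% and Meta% columns. As a rough sketch for spotting a pool that is approaching its auto-extend threshold, the helper below filters `lvs`-style lines by a usage limit. The function name `pool_usage_warnings` and the limit of 60 are illustrative assumptions, not part of docker-storage-setup or of this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool mentioned in this report).
# Reads lines of "<lv_name> <data%> <metadata%>" on stdin and prints one
# line for each LV whose data or metadata usage is at or above the limit.
# Lines with fewer than three fields (LVs that are not thin pools) are skipped.
pool_usage_warnings() {
    limit=$1
    awk -v limit="$limit" 'NF >= 3 && ($2 + 0 >= limit || $3 + 0 >= limit) {
        printf "%s: data %s%%, metadata %s%%\n", $1, $2, $3
    }'
}

# Possible use on a host with LVM (needs root). The VG name "rhel" is taken
# from the report; the limit 60 is an arbitrary example value:
#   lvs --noheadings -o lv_name,data_percent,metadata_percent rhel | pool_usage_warnings 60
```

Keeping the parsing separate from the `lvs` call means the filter can be checked against captured output without touching a live volume group.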
Reassigning to RHEL7/docker

Reproduced 100% on my 3 boxes so far.

Are ALL of these failures triggered when using OpenShift 3.3? Have ANY failures been found using OpenShift 3.2? I want any proof that this is or is not a result of https://bugzilla.redhat.com/show_bug.cgi?id=1363776

Actually the question is not 'have any failures been found with 3.2' the real question is 'have any failures been found WITHOUT 3.3'?

(In reply to Eric Paris from comment #35)
> Actually the question is not 'have any failures been found with 3.2' the
> real question is 'have any failures been found WITHOUT 3.3'?
According to the analysis in comment 30, this issue is caused by openshift components performed thinp snapshot operations. As far as I know, calling the dmsetup or thin_ls command to operate on the devicemapper thin_pool was added in 3.3 (https://trello.com/c/5NzVnJ5P/314-13-support-getting-container-filesystem-usage-in-cadvisor-for-devicemapper), not 3.2. So 3.2 will not have this issue, and in fact QE has never encountered it before. I will run a real test against 3.2 + the latest docker to prove that later.

(In reply to Johnny Liu from comment #37)
s/this issue is caused by openshift components performed thinp snapshot operations/this issue is caused by openshift components is performing thinp snapshot operations at meantime the docker-pool is being extended automatically after extension threshold is hit/

@Eric,
> Actually the question is not 'have any failures been found with 3.2' the
> real question is 'have any failures been found WITHOUT 3.3'?
I just tested this against openshift 3.3 and 3.2 using the same steps. 3.2 does not have the issue; 3.3 does.
1. Configure docker-storage-setup to enable auto-extension of docker-pool.
# cat /etc/sysconfig/docker-storage-setup
VG=rhel
GROWPART=true
2. Install openshift.
3. Run the following commands.
# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
docker-pool rhel twi-aot--- 8.27g 27.80 2.69
root rhel -wi-ao---- 10.00g
# docker run rhel7 dd if=/dev/zero of=/data bs=1G
For 3.2, the above command completes successfully and docker-pool is extended successfully.
# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
docker-pool rhel twi-aot--- 19.01g 63.71 4.79
root rhel -wi-ao---- 10.00g
For 3.3, docker-pool crashed and caused the docker daemon to crash. The reason is what I said in comment 37: this issue is caused by openshift components performing thinp snapshot operations while the docker-pool is being extended automatically after the extension threshold is hit. The behavior of openshift components performing thinp snapshot operations is newly introduced in 3.3.

@caiqian,
> It is interesting that you guys only saw this recently.
The above statement could explain why we found it recently. We only started 3.3 full-functional testing recently.

https://github.com/google/cadvisor/pull/1411, once merged, fixes cadvisor so it won't reserve/release metadata snapshots or invoke thin_ls unless a minimum kernel version containing the thin pool corruption fixes is in use.

Starting in v3.3.0.15, we won't try to reserve metadata snapshots or invoke thin_ls unless the kernel is new enough (>= 3.10.0-166) to contain the fixes for the corruption issues.

The proposed kernel version is 3.10.0-366, not 3.10.0-166.

Fixed.

openshift v3.3.0.17
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

openshift-105 atomic-openshift-node: E0808 23:06:16.676806 15757 factory.go:291] devicemapper filesystem stats will not be reported: RHEL/Centos 7.x kernel version 3.10.0-366 or later is required to use thin_ls - you have "3.10.0-327.22.2.el7.x86_64"

*** Bug 1364176 has been marked as a duplicate of this bug. ***

Would it be possible to fix this bug in case Rancher is starting docker containers?
This is my kernel version 3.10.0-327.28.3.el7.x86_64

devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."
Sep 01 11:18:38 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:38.792047399+02:00" level=info msg="libcontainerd: new containerd process, pid: 2158"
Sep 01 11:18:39 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:39.950465041+02:00" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."
Sep 01 11:18:40 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:40.043373535+02:00" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devmapper: Failed to find uuid for device /dev/mapper/docker-253:0-17051550-base:exit status 2"
Sep 01 11:18:40 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:40.043649955+02:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devmapper: Failed to find uuid for device /dev/mapper/docker-253:0-17051550-base:exit status 2"
Sep 01 11:18:40 uxtest.rs systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Sep 01 11:18:40 uxtest.rs systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has failed.
--
-- The result is failed.
Sep 01 11:18:40 uxtest.rs systemd[1]: Unit docker.service entered failed state.
Sep 01 11:18:40 uxtest.rs systemd[1]: docker.service failed.
Sep 01 11:18:40 uxtest.rs polkitd[622]: Unregistered Authentication Agent for unix-process:2137:696636 (system bus name :1.19, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus) /org/freedesktop/PolicyKit1
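The comments above describe gating thin_ls and metadata snapshots on a minimum kernel of 3.10.0-366. The actual check lives in cadvisor (Go); what follows is only a shell sketch of the same idea for checking a host by hand. The function name `kernel_at_least` is hypothetical, and comparing just the first four numeric fields of the release string is a simplifying assumption.

```shell
#!/bin/sh
# Hypothetical helper, not cadvisor's implementation.
# Succeeds (exit 0) if kernel release $1 is at least minimum $2, comparing
# the first four numeric fields, e.g. 3 10 0 327 from 3.10.0-327.22.2.el7.x86_64.
kernel_at_least() {
    printf '%s %s\n' "$1" "$2" | awk '{
        # Split both versions on "." and "-"
        split($1, have, /[.-]/)
        split($2, want, /[.-]/)
        for (i = 1; i <= 4; i++) {
            if (have[i] + 0 > want[i] + 0) exit 0
            if (have[i] + 0 < want[i] + 0) exit 1
        }
        exit 0   # all four fields are equal
    }'
}

# Check the running kernel against the minimum named in this report.
if kernel_at_least "$(uname -r)" "3.10.0-366"; then
    echo "kernel meets the 3.10.0-366 minimum"
else
    echo "kernel is older than 3.10.0-366"
fi
```

On the reporter's kernel, 3.10.0-327.22.2.el7.x86_64, the fourth field (327) is below 366, so the check fails, which matches the "kernel version 3.10.0-366 or later is required" message logged by atomic-openshift-node.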
Description of problem:
OpenShift node status NotReady, due to docker service failure

Version-Release number of selected component (if applicable):
openshift v3.3.0.12
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

How reproducible:
50%

Steps to Reproduce:
1. Set up an openshift cluster via a Jenkins job with the rhel7.2.6 image.
2. Check openshift cluster status.
$ oc get nodes
3. Check the docker service on the NotReady node.

Actual results:
2. Not all nodes are ready.
# oc get nodes
NAME STATUS AGE
node1 Ready,SchedulingDisabled 1h
node2 NotReady 1h
node3 Ready 1h
3. The docker service failed.
# systemctl status docker -l
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/docker.service.d
└─docker-sdn-ovs.conf
Active: failed (Result: exit-code) since Mon 2016-08-01 04:51:13 EDT; 51s ago
Docs: http://docs.docker.com
Process: 5708 ExecStart=/usr/bin/docker-current daemon --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
Main PID: 5708 (code=exited, status=1/FAILURE)

Expected results:
2. All nodes are Ready
3. The docker service is running

Additional info:
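Steps 2 and 3 of the reproduction can be partly scripted. The sketch below pulls the NotReady node names out of `oc get nodes` output so the docker service can then be checked on each one. The function name `not_ready_nodes` is hypothetical, and it assumes the three-column NAME STATUS AGE layout shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get nodes` output on stdin and prints the
# name of every node whose STATUS column contains the NotReady condition.
# Assumes the header line comes first and STATUS is the second column.
not_ready_nodes() {
    awk 'NR > 1 && $2 ~ /(^|,)NotReady(,|$)/ { print $1 }'
}

# Possible use on a cluster master:
#   oc get nodes | not_ready_nodes
# then, on each node printed:
#   systemctl status docker -l
```

Matching on comma boundaries keeps a status such as `Ready,SchedulingDisabled` from being reported, while still catching `NotReady` when it is combined with another condition.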