Bug 1362109 - OpenShift node status NotReady, due to docker service failure
Summary: OpenShift node status NotReady, due to docker service failure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: ---
Assignee: Andy Goldstein
QA Contact: DeShuai Ma
URL:
Whiteboard:
: 1364176 (view as bug list)
Depends On: 1290911 1363776 1364462
Blocks: 1353885
TreeView+ depends on / blocked
 
Reported: 2016-08-01 10:07 UTC by Weihua Meng
Modified: 2020-02-14 17:51 UTC (History)
29 users (show)

Fixed In Version: atomic-openshift-3.3.0.15-1.git.0.4a954d5.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-19 14:32:09 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
/var/log/messages (3.62 MB, text/plain)
2016-08-01 10:14 UTC, Weihua Meng
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1933 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3 Release Advisory 2016-09-27 13:24:36 UTC

Description Weihua Meng 2016-08-01 10:07:30 UTC
Description of problem:
OpenShift node status NotReady, due to docker service failure

Version-Release number of selected component (if applicable):
openshift v3.3.0.12
kubernetes v1.3.0+57fb9ac
etcd 2.3.0+git

How reproducible:
50%

Steps to Reproduce:
1. set up openshift cluster by Jenkins job with rhel7.2.6 image.
2. check openshift cluster status
$ oc get nodes
3. check docker service on the NotReady node.


Actual results:
2. not all nodes are ready.
# oc get nodes 
NAME    STATUS                     AGE
node1   Ready,SchedulingDisabled   1h
node2   NotReady                   1h
node3   Ready                      1h

3. docker service failed.
# systemctl status docker -l
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/docker.service.d
           └─docker-sdn-ovs.conf
   Active: failed (Result: exit-code) since 一 2016-08-01 04:51:13 EDT; 51s ago
     Docs: http://docs.docker.com
  Process: 5708 ExecStart=/usr/bin/docker-current daemon --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $ADD_REGISTRY $BLOCK_REGISTRY $INSECURE_REGISTRY (code=exited, status=1/FAILURE)
 Main PID: 5708 (code=exited, status=1/FAILURE)
Expected results:
2. All nodes are Ready
3. docker service are running
Additional info:

Comment 1 Weihua Meng 2016-08-01 10:14:04 UTC
Created attachment 1186332 [details]
/var/log/messages

Comment 2 Gan Huang 2016-08-01 10:23:22 UTC
# rpm -qa |grep device-mapper
device-mapper-event-libs-1.02.107-5.el7_2.5.x86_64
device-mapper-persistent-data-0.6.2-1.el7_2.x86_64
device-mapper-libs-1.02.107-5.el7_2.5.x86_64
device-mapper-event-1.02.107-5.el7_2.5.x86_64
device-mapper-1.02.107-5.el7_2.5.x86_64

# uname -r
3.10.0-327.22.2.el7.x86_64

# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: rhel-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 404.8 MB
 Data Space Total: 8.884 GB
 Data Space Available: 8.479 GB
 Metadata Space Used: 102.4 kB
 Metadata Space Total: 33.55 MB
 Metadata Space Available: 33.45 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Execution Driver: native-0.2
Logging Driver: journald
Plugins: 
 Volume: local
 Network: null host bridge
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-327.22.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 3.702 GiB
Name: xxx.com
ID: BSPL:CCEL:CU2U:PLGZ:ICXB:F3VH:ZYTG:4HMW:W2Z7:2JXD:4QTT:DNPN
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

# cat /etc/sysconfig/docker-storage-setup 
# Edit this file to override any configuration options specified in
# /usr/lib/docker-storage-setup/docker-storage-setup.
#
# For more details refer to "man docker-storage-setup"
VG=rhel
GROWPART=true

# lvs
  LV          VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-aot--- 20.26g             3.27   0.27                            
  root        rhel -wi-ao---- 10.00g  
                                                  
# vgdisplay
  --- Volume group ---
  VG Name               rhel
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  15
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               59.51 GiB
  PE Size               4.00 MiB
  Total PE              15234
  Alloc PE / Size       7779 / 30.39 GiB
  Free  PE / Size       7455 / 29.12 GiB
  VG UUID               THPZVD-CJgg-TeTk-KMfl-wbLX-iNL1-9Lp4Jl

Comment 4 Andy Goldstein 2016-08-01 13:51:02 UTC
Reassigning to RHEL7/docker

Comment 12 Xia Zhao 2016-08-02 02:38:01 UTC
Reproduced 100% to my 3 boxes so far.

Comment 34 Eric Paris 2016-08-03 15:48:15 UTC
Are ALL of these failures triggered when using OpenShift 3.3? Have ANY failures been found using OpenShift 3.2? I want any proof that this is or is not a result of https://bugzilla.redhat.com/show_bug.cgi?id=1363776

Comment 35 Eric Paris 2016-08-03 15:56:57 UTC
Actually the question is not 'have any failures been found with 3.2' the real question is 'have any failures been found WITHOUT 3.3'?

Comment 37 Johnny Liu 2016-08-04 03:26:15 UTC
(In reply to Eric Paris from comment #35)
> Actually the question is not 'have any failures been found with 3.2' the
> real question is 'have any failures been found WITHOUT 3.3'?

According to analysis - comment 30, this issue is caused by openshift components performed thinp snapshot operations. As far as I know, calling dmsetup or thin_ls command to operate devicemapper thin_pool is added in 3.3 (https://trello.com/c/5NzVnJ5P/314-13-support-getting-container-filesystem-usage-in-cadvisor-for-devicemapper), not 3.2. So 3.2 will not have such issue, and in facts QE never encounter such issue before.  I will do a real testing against 3.2 + the latest docker to prove that later.

Comment 38 Johnny Liu 2016-08-04 05:21:07 UTC
(In reply to Johnny Liu from comment #37)
s/this issue is caused by openshift components performed thinp snapshot operations/this issue is caused by openshift components is performing thinp snapshot operations at meantime the docker-pool is being extended automatically after extension threshold is hit/

Comment 39 Johnny Liu 2016-08-04 06:10:39 UTC
@Eric,
> Actually the question is not 'have any failures been found with 3.2' the
> real question is 'have any failures been found WITHOUT 3.3'?

Just test it against openshift 3.3 and 3.2 using the same steps, 3.2 does not have such issue, 3.3 does have.

1. Configure docker-storage-setup to enable auto extend docker-pool.
# cat /etc/sysconfig/docker-storage-setup 
VG=rhel
GROWPART=true
2. install openshift
3. run the following command
# lvs
  LV          VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-aot---  8.27g             27.80  2.69                            
  root        rhel -wi-ao---- 10.00g
# docker run rhel7 dd if=/dev/zero of=/data bs=1G


For 3.2, the above command is completed successfully, docker-pool is extended successfully.
# lvs
  LV          VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-aot--- 19.01g             63.71  4.79                            
  root        rhel -wi-ao---- 10.00g                

For 3.3, docker-pool crashed and cause docker daemon crashed. The reason is what I said in comment 37.  This issue is caused by openshift components is performing thinp snapshot operations at meantime the docker-pool is being extended automatically after extension threshold is hit.

The behavior of openshift components performing thinp snapshot operations is newly introduced in 3.3.

@caiqian, 
> It is interesting that you guys only saw this recently. 
The above statement could explain why we found it recently. We just start 3.3 full-functional testing recently.

Comment 40 Andy Goldstein 2016-08-04 16:04:51 UTC
https://github.com/google/cadvisor/pull/1411, once merged, fixes cadvisor so it won't reserve/release metadata snapshots or invoke thin_ls unless a minimum kernel version containing the thin pool corruption fixes is in use.

Comment 43 Andy Goldstein 2016-08-11 18:33:12 UTC
Starting in v3.3.0.15, we won't try to reserve metadata snapshots or invoke thin_ls unless the kernel is new enough (>= 3.10.0-166) to contain the fixes for the corruption issues.

Comment 44 Weihua Meng 2016-08-11 23:09:30 UTC
The proposed kernel version is 3.10.0-366, not 3.10.0-166.

Comment 45 Weihua Meng 2016-08-11 23:13:21 UTC
Fixed.

openshift v3.3.0.17
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git


 openshift-105 atomic-openshift-node: E0808 23:06:16.676806   15757 factory.go:291] devicemapper filesystem stats will not be reported: RHEL/Centos 7.x kernel version 3.10.0-366 or later is required to use thin_ls - you have "3.10.0-327.22.2.el7.x86_64"

Comment 46 Derek Carr 2016-08-12 15:25:17 UTC
*** Bug 1364176 has been marked as a duplicate of this bug. ***

Comment 47 branka_kostadinovic 2016-09-01 09:29:34 UTC
Would it be possible to fix this bug in case Rancher is starting docker containers?
This is my kernel version 3.10.0-327.28.3.el7.x86_64

devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."

Sep 01 11:18:38 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:38.792047399+02:00" level=info msg="libcontainerd: new containerd process, pid: 2158"
Sep 01 11:18:39 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:39.950465041+02:00" level=warning msg="devmapper: Usage of loopback devices is strongly discouraged for production use. Please use `--storage-opt dm.thinpooldev` or use `man docker` to refer to dm.thinpooldev section."

Sep 01 11:18:40 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:40.043373535+02:00" level=error msg="[graphdriver] prior storage driver \"devicemapper\" failed: devmapper: Base Device UUID and Filesystem verification failed: devmapper: Failed to find uuid for device /dev/mapper/docker-253:0-17051550-base:exit status 2"
Sep 01 11:18:40 uxtest.rs dockerd[2152]: time="2016-09-01T11:18:40.043649955+02:00" level=fatal msg="Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devmapper: Failed to find uuid for device /dev/mapper/docker-253:0-17051550-base:exit status 2"
Sep 01 11:18:40 uxtest.rs systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Sep 01 11:18:40 uxtest.rs systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has failed.
--
-- The result is failed.
Sep 01 11:18:40 uxtest.rs systemd[1]: Unit docker.service entered failed state.
Sep 01 11:18:40 uxtest.rs systemd[1]: docker.service failed.
Sep 01 11:18:40 uxtest.rs polkitd[622]: Unregistered Authentication Agent for unix-process:2137:696636 (system bus name :1.19, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus) /org/freedesktop/PolicyKit1


Note You need to log in before you can comment on or make changes to this bug.