Bug 1752713 - heketidbstorage bricks go down during PVC creation
Summary: heketidbstorage bricks go down during PVC creation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.5.0
Assignee: Mohit Agrawal
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks: 1696809 1755900
 
Reported: 2019-09-17 04:39 UTC by Rachael
Modified: 2019-11-25 12:39 UTC
CC List: 12 users

Fixed In Version: glusterfs-6.0-16
Doc Type: No Doc Update
Doc Text:
Clone Of:
Cloned To: 1755900
Environment:
Last Closed: 2019-10-30 12:23:00 UTC
Embargoed:




Links:
Red Hat Product Errata RHEA-2019:3249 - Last Updated: 2019-10-30 12:23:29 UTC

Description Rachael 2019-09-17 04:39:06 UTC
Description of problem:

On an OCS 3.11.4 setup, volume creation failed after the 650th PVC was created. The heketi pod had gone into a CrashLoopBackOff state. Further debugging showed that heketi was crash-looping because the heketidbstorage mount on the node had failed, which in turn appears to be caused by two of the three bricks of the volume being down. No cores were found in the glusterfs pods.
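
For quick reference, the debugging below boils down to the following checks (pod, node and path names are specific to this setup and will differ elsewhere):

# Heketi pod state (CrashLoopBackOff here)
oc get pods heketi-storage-2-tbmnv -o wide

# On the node running the heketi pod: the heketidbstorage FUSE mount is still
# listed but is unusable ("Transport endpoint is not connected")
mount | grep heketidbstorage
ls -l /var/lib/origin/openshift.local.volumes/pods/<pod-uid>/volumes/kubernetes.io~glusterfs/db

# From any glusterfs pod: check which bricks of heketidbstorage are online
oc rsh glusterfs-storage-88z7f gluster volume status heketidbstorage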

Debugging log follows
------------------------------

  Warning  Failed   1h (x41 over 4h)    kubelet, dhcp47-132.lab.eng.blr.redhat.com  Error: failed to start container "heketi": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "container init exited prematurely"
  Normal   Created  1h (x411 over 1d)   kubelet, dhcp47-132.lab.eng.blr.redhat.com  Created container
  Normal   Pulled   46m (x414 over 1d)  kubelet, dhcp47-132.lab.eng.blr.redhat.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7:3.11.4-6" already present on machine
  Warning  BackOff  1m (x5043 over 1d)  kubelet, dhcp47-132.lab.eng.blr.redhat.com  Back-off restarting failed container



[root@dhcp46-199 ~]# oc get pods heketi-storage-2-tbmnv -owide
NAME                     READY     STATUS             RESTARTS   AGE       IP            NODE                                NOMINATED NODE
heketi-storage-2-tbmnv   0/1       CrashLoopBackOff   421        1d        10.129.2.91   dhcp47-132.lab.eng.blr.redhat.com   <none>


[root@dhcp47-132 ~]# mount | grep heketidbstorage
10.70.46.75:heketidbstorage on /var/lib/origin/openshift.local.volumes/pods/0d1c4eb2-d71b-11e9-8d3e-005056b29828/volumes/kubernetes.io~glusterfs/db type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@dhcp47-132 ~]# ls -l /var/lib/origin/openshift.local.volumes/pods/0d1c4eb2-d71b-11e9-8d3e-005056b29828/volumes/kubernetes.io~glusterfs/db
ls: cannot access /var/lib/origin/openshift.local.volumes/pods/0d1c4eb2-d71b-11e9-8d3e-005056b29828/volumes/kubernetes.io~glusterfs/db: Transport endpoint is not connected



[root@dhcp46-199 ~]# oc rsh glusterfs-storage-88z7f

sh-4.2# gluster volume status heketidbstorage
Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.73:/var/lib/heketi/mounts/vg
_e12f51fe3b32bc2bd4fe31aa8661b8d9/brick_808
04e25b98690e5109d7de267758eba/brick         49152     0          Y       380  
Brick 10.70.46.75:/var/lib/heketi/mounts/vg
_d2cc7570593dce8e2ba0c6810648c3cd/brick_25b
f51b28d28e3cc2b031760a37628fa/brick         N/A       N/A        N       N/A  
Brick 10.70.46.182:/var/lib/heketi/mounts/v
g_c9713e51abf8ca746e9f3ec3440071a1/brick_07
57760ac00016cf2a2270b3b61daa3d/brick        N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       89188
Self-heal Daemon on 10.70.46.75             N/A       N/A        Y       111812
Self-heal Daemon on 10.70.46.73             N/A       N/A        Y       34373

Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks


Version-Release number of selected component (if applicable):

rhgs3/rhgs-volmanager-rhel7:3.11.4-6
rhgs3/rhgs-gluster-block-prov-rhel7:3.11.4-6
rhgs3/rhgs-server-rhel7:3.11.4-11

glusterfs-api-6.0-13.el7rhgs.x86_64
python2-gluster-6.0-13.el7rhgs.x86_64
glusterfs-server-6.0-13.el7rhgs.x86_64
glusterfs-libs-6.0-13.el7rhgs.x86_64
glusterfs-6.0-13.el7rhgs.x86_64
glusterfs-client-xlators-6.0-13.el7rhgs.x86_64
glusterfs-cli-6.0-13.el7rhgs.x86_64
glusterfs-fuse-6.0-13.el7rhgs.x86_64
glusterfs-geo-replication-6.0-13.el7rhgs.x86_64
gluster-block-0.2.1-34.el7rhgs.x86_64

How reproducible: 1/1


Steps to Reproduce:
1. Run a loop to create 1000 file-based PVCs (a minimal sketch follows this list)
2. Check whether the PVCs are getting bound
3. After the 650th PVC, PVC creation fails
4. Check heketi pod status
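
A minimal sketch of the PVC-creation loop from step 1, assuming a GlusterFS file StorageClass named "glusterfs-storage" (the StorageClass name and PVC size are placeholders; adjust to the setup):

#!/bin/bash
# Create 1000 GlusterFS-backed file PVCs in a loop; "glusterfs-storage" is an
# assumed StorageClass name and 1Gi an arbitrary request size.
for i in $(seq 1 1000); do
  oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-${i}
spec:
  storageClassName: glusterfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
done

# Step 2: count how many PVCs have reached the Bound state
oc get pvc | grep -c Bound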


Actual results:
The heketi pod is in a CrashLoopBackOff state, and two of the three bricks of heketidbstorage are down.

Expected results:
PVC creation should succeed and the heketi pod should be up and running.

Comment 31 errata-xmlrpc 2019-10-30 12:23:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249

