Description of problem:
On an OCS 3.11.4 setup, volume creation started failing after the 650th PVC was created. The heketi pod was observed to be in CrashLoopBackOff state. Further debugging showed that heketi was crash-looping because of a failed heketidbstorage mount on the node, which in turn appears to be due to two of the three bricks of the volume being down. No cores were found in the glusterfs pods.

Debugging log follows
------------------------------
  Warning  Failed   1h (x41 over 4h)    kubelet, dhcp47-132.lab.eng.blr.redhat.com  Error: failed to start container "heketi": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "container init exited prematurely"
  Normal   Created  1h (x411 over 1d)   kubelet, dhcp47-132.lab.eng.blr.redhat.com  Created container
  Normal   Pulled   46m (x414 over 1d)  kubelet, dhcp47-132.lab.eng.blr.redhat.com  Container image "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7:3.11.4-6" already present on machine
  Warning  BackOff  1m (x5043 over 1d)  kubelet, dhcp47-132.lab.eng.blr.redhat.com  Back-off restarting failed container

[root@dhcp46-199 ~]# oc get pods heketi-storage-2-tbmnv -owide
NAME                     READY     STATUS             RESTARTS   AGE       IP            NODE                                NOMINATED NODE
heketi-storage-2-tbmnv   0/1       CrashLoopBackOff   421        1d        10.129.2.91   dhcp47-132.lab.eng.blr.redhat.com   <none>

[root@dhcp47-132 ~]# mount | grep heketidbstorage
10.70.46.75:heketidbstorage on /var/lib/origin/openshift.local.volumes/pods/0d1c4eb2-d71b-11e9-8d3e-005056b29828/volumes/kubernetes.io~glusterfs/db type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@dhcp47-132 ~]# ls -l /var/lib/origin/openshift.local.volumes/pods/0d1c4eb2-d71b-11e9-8d3e-005056b29828/volumes/kubernetes.io~glusterfs/db
ls: cannot access /var/lib/origin/openshift.local.volumes/pods/0d1c4eb2-d71b-11e9-8d3e-005056b29828/volumes/kubernetes.io~glusterfs/db: Transport endpoint is not connected

[root@dhcp46-199 ~]# oc rsh glusterfs-storage-88z7f
sh-4.2# gluster volume status heketidbstorage
Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.73:/var/lib/heketi/mounts/vg
_e12f51fe3b32bc2bd4fe31aa8661b8d9/brick_808
04e25b98690e5109d7de267758eba/brick         49152     0          Y       380
Brick 10.70.46.75:/var/lib/heketi/mounts/vg
_d2cc7570593dce8e2ba0c6810648c3cd/brick_25b
f51b28d28e3cc2b031760a37628fa/brick         N/A       N/A        N       N/A
Brick 10.70.46.182:/var/lib/heketi/mounts/v
g_c9713e51abf8ca746e9f3ec3440071a1/brick_07
57760ac00016cf2a2270b3b61daa3d/brick        N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       89188
Self-heal Daemon on 10.70.46.75             N/A       N/A        Y       111812
Self-heal Daemon on 10.70.46.73             N/A       N/A        Y       34373

Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks

Version-Release number of selected component (if applicable):
rhgs3/rhgs-volmanager-rhel7:3.11.4-6
rhgs3/rhgs-gluster-block-prov-rhel7:3.11.4-6
rhgs3/rhgs-server-rhel7:3.11.4-11

glusterfs-api-6.0-13.el7rhgs.x86_64
python2-gluster-6.0-13.el7rhgs.x86_64
glusterfs-server-6.0-13.el7rhgs.x86_64
glusterfs-libs-6.0-13.el7rhgs.x86_64
glusterfs-6.0-13.el7rhgs.x86_64
glusterfs-client-xlators-6.0-13.el7rhgs.x86_64
glusterfs-cli-6.0-13.el7rhgs.x86_64
glusterfs-fuse-6.0-13.el7rhgs.x86_64
glusterfs-geo-replication-6.0-13.el7rhgs.x86_64
gluster-block-0.2.1-34.el7rhgs.x86_64

How reproducible:
1/1
Steps to Reproduce:
1. Run a loop to create 1000 file PVCs (a minimal sketch of such a loop follows the Expected results below).
2. Check whether the PVCs are getting bound.
3. After the 650th PVC, PVC creation fails.
4. Check the heketi pod status.

Actual results:
The heketi pod is in CrashLoopBackOff state, and two of the three bricks of heketidbstorage are down.

Expected results:
PVC creation should succeed and the heketi pod should be up and running.
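The exact loop used in step 1 was not captured in this report; the following is a minimal sketch of how such a run could be driven. The PVC name prefix, requested size, and storage class name (glusterfs-storage) are assumptions for illustration, not values taken from this setup.

# Create 1000 file PVCs in a loop (hypothetical names, size, and storage class)
for i in $(seq 1 1000); do
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-${i}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: glusterfs-storage
EOF
done

# Count how many PVCs have reached Bound state (step 2)
oc get pvc --no-headers | grep -c ' Bound '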
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3249