Bug 1466979 - [UPSHIFT] Over 1,500+ volumes exist on gluster nodes in an OpenShift 3.5 CNS environment [NEEDINFO]
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: heketi
Version: 3.1
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Assigned To: Raghavendra Talur
QA Contact: Anoop
Whiteboard: aos-scalability-35
Reported: 2017-06-30 22:00 EDT by Peter Portante
Modified: 2017-11-15 07:32 EST
CC List: 14 users

Type: Bug
Flags: rtalur: needinfo? (pportant)


Attachments: None
Description Peter Portante 2017-06-30 22:00:11 EDT
We are struggling with a situation in our OpenShift 3.5 bare-metal deployed GlusterFS environment.

We have 12 OpenShift nodes (64 GB RAM, 2-socket boxes, each with six 300 GB disks), deployed as two GlusterFS clusters containerized with Heketi.

We had a problem on the boxes where some of them had their system disks fill up due to external factors.  Once cleaned up and the nodes rebooted, the GlusterFS clusters did not recover quickly, and on inspection we found 1,500+ GlusterFS volumes on one of the clusters.
Comment 2 Raghavendra Talur 2017-07-04 11:49:41 EDT
(In reply to Peter Portante from comment #0)
> We had a problem on the boxes where some of the had their system disks
> filled up due to external factors.  Once cleaned up and the nodes rebooted,
> the GlusterFS clusters did not restore quickly, and on inspection we have
> 1,500+ GlusterFS volumes on one of the clusters.

It is possible that, when /var/lib/glusterd is full, glusterd is not able to sync its updates to disk. When the nodes were rebooted, it read the stale state from disk, which said 1,500 volumes exist.

There must be a corresponding bug in heketi for this to happen. Is it possible that so many volumes were created and deleted using heketi-cli, but glusterd did not save the changes successfully in the backend?
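
One way to sanity-check that hypothesis (not part of the original report; the paths and commands below are standard GlusterFS defaults, and the script itself is only a sketch to adapt for a containerized deployment) is to compare free space on the filesystem holding /var/lib/glusterd, and the volume definitions persisted under /var/lib/glusterd/vols, against what the running glusterd reports:

#!/usr/bin/env python3
import os
import shutil
import subprocess

GLUSTERD_DIR = "/var/lib/glusterd"

# Free space on the filesystem that holds glusterd's persistent state;
# a full filesystem here is the condition suspected above.
usage = shutil.disk_usage(GLUSTERD_DIR)
print(f"{GLUSTERD_DIR}: {usage.free / 2**20:.1f} MiB free of {usage.total / 2**20:.1f} MiB")

# Volume definitions glusterd has persisted on disk.
on_disk = set(os.listdir(os.path.join(GLUSTERD_DIR, "vols")))

# Volumes the running glusterd reports.
out = subprocess.run(["gluster", "volume", "list"],
                     capture_output=True, text=True, check=True)
reported = {v.strip() for v in out.stdout.splitlines() if v.strip()}

print(f"volumes persisted on disk: {len(on_disk)}")
print(f"volumes reported by CLI:   {len(reported)}")
print("only on disk (first 10):", sorted(on_disk - reported)[:10])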
Comment 3 Mohamed Ashiq 2017-07-20 16:03:18 EDT
Thanks for the setup, Peter.

I was able to debug far enough to find out why heketi had fewer volumes while gluster actually had many more.

heketi had 53 volumes for that particular cluster.

whereas gluster had 1,551 volumes.
Most of them were not started (I couldn't determine the exact number yet); I will give exact numbers soon.
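
For reference, a rough sketch of that comparison (the commands exist, but the exact output parsing and the <cluster-id> placeholder are assumptions, and heketi-cli is expected to pick up its connection settings from the usual environment variables):

#!/usr/bin/env python3
import subprocess

CLUSTER_ID = "<cluster-id>"  # placeholder for the affected cluster's heketi id

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# heketi's view: one line per volume; each line is expected to carry the
# owning cluster id (assumed "Cluster:<id>" output format).
heketi_lines = run(["heketi-cli", "volume", "list"]).splitlines()
heketi_count = sum(1 for line in heketi_lines if f"Cluster:{CLUSTER_ID}" in line)

# gluster's view: total volumes plus how many are not in the Started state.
gluster_volumes = [v for v in run(["gluster", "volume", "list"]).splitlines() if v.strip()]
info = run(["gluster", "volume", "info"])
not_started = sum(1 for line in info.splitlines()
                  if line.strip().startswith("Status:") and "Started" not in line)

print(f"heketi volumes for this cluster: {heketi_count}")
print(f"gluster volumes total:           {len(gluster_volumes)}")
print(f"gluster volumes not started:     {not_started}")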

Elvir faced the same issue, and his cluster hit a "no space" error. This looks to be a problem with too many parallel volume requests.

Conclusion:
Heketi receiving a large number of volume create requests caused this issue. I will try to reproduce it in my setup and will let you know about the progress.
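
A minimal sketch of one way to attempt that reproduction (this is not the procedure used here; the request count, concurrency, volume size, and heketi-cli connection settings are all assumptions to tune for the test environment):

#!/usr/bin/env python3
import subprocess
from concurrent.futures import ThreadPoolExecutor

REQUESTS = 50       # how many create requests to send in total
CONCURRENCY = 10    # how many run in parallel
SIZE_GB = 1         # keep the test volumes small

def create_volume(i):
    # Relies on heketi-cli picking up server/user/key from its environment.
    result = subprocess.run(
        ["heketi-cli", "volume", "create", f"--size={SIZE_GB}"],
        capture_output=True, text=True)
    return i, result.returncode, result.stderr.strip()

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(create_volume, range(REQUESTS)))

failed = [(i, err) for i, rc, err in results if rc != 0]
print(f"{len(results) - len(failed)}/{len(results)} creates succeeded")
for i, err in failed[:10]:
    print(f"request {i} failed: {err}")

# Afterwards, compare `heketi-cli volume list` with `gluster volume list`
# on the nodes to see whether the two views have drifted apart.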

The old heketi logs are missing, so we cannot pinpoint heketi's behavior at the time the issue occurred.
