Red Hat Bugzilla – Bug 1466979
[UPSHIFT] Over 1,500+ volumes exist on gluster nodes in an OpenShift 3.5 CNS environment
Last modified: 2018-02-26 13:08:32 EST
We are struggling with a situation in our OpenShift 3.5 bare-metal deployed GlusterFS environment.
We have 12 OpenShift nodes (64 GB, 2-socket boxes, each with six 300 GB disks), deployed as two containerized GlusterFS clusters managed by Heketi.
We had a problem on the boxes where some of them had their system disks filled up due to external factors. Once the disks were cleaned up and the nodes rebooted, the GlusterFS clusters did not recover quickly, and on inspection we found 1,500+ GlusterFS volumes on one of the clusters.
(In reply to Peter Portante from comment #0)
> We had a problem on the boxes where some of them had their system disks
> filled up due to external factors. Once cleaned up and the nodes rebooted,
> the GlusterFS clusters did not restore quickly, and on inspection we have
> 1,500+ GlusterFS volumes on one of the clusters.
It is possible that when /var/lib/glusterd is full, glusterd is not able to sync its updates to disk. When the nodes were rebooted, it would then have read the old state from disk, which said that 1,500 volumes exist.
There must be a corresponding bug in heketi for this to happen. Is it possible that this many volumes were created and deleted using heketi-cli, but glusterd did not successfully save the changes in the backend?
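A quick way to test the disk-full theory on a node is to check how much free space is left on the filesystem that holds /var/lib/glusterd. The following is only a minimal sketch using the Python standard library; the function names and the 5% threshold are illustrative assumptions, not anything glusterd or heketi provides.

```python
import shutil


def free_fraction(path="/var/lib/glusterd"):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report a zero size; treat them as having no free space.
        return 0.0
    return usage.free / usage.total


def is_nearly_full(path="/var/lib/glusterd", min_free=0.05):
    """True if less than `min_free` of the filesystem is free.

    The 5% default is an arbitrary illustrative threshold.
    """
    return free_fraction(path) < min_free
```

Running is_nearly_full() on each gluster node (or inside each gluster pod) would show whether the volume state directory is at risk of not being written out.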
Thanks for the setup, Peter.
I was able to debug far enough to confirm that heketi knows about far fewer volumes than gluster actually has.
Heketi had 53 volumes for that particular cluster, whereas gluster had 1551 volumes.
Most of them were not started (I couldn't work out the exact number, as my bash scripting ability is down for lack of sleep). I will give exact numbers soon.
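The counts above could be gathered with something like the sketch below, instead of ad hoc bash. It assumes the text comes from `gluster volume info`, where each volume is reported with a "Volume Name:" line followed by a "Status:" line, and that the names heketi knows about have been collected separately (for example from `heketi-cli volume list`). The function names are hypothetical.

```python
from collections import Counter


def parse_volume_status(info_text):
    """Map volume name -> status, parsed from `gluster volume info` text."""
    statuses = {}
    current = None
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Volume Name":
            current = value
            statuses[current] = None
        elif key == "Status" and current is not None:
            statuses[current] = value
    return statuses


def summarize(info_text, heketi_names):
    """Return (count of volumes per status, sorted volumes unknown to heketi)."""
    statuses = parse_volume_status(info_text)
    counts = Counter(statuses.values())
    unknown = sorted(set(statuses) - set(heketi_names))
    return counts, unknown
```

Feeding it the saved output of `gluster volume info` and the set of heketi volume names gives the number of volumes per status and the list of volumes that exist in gluster but not in heketi.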
Elvir faced the same issue, and his cluster did not hit any out-of-space error. This looks like a problem with too many parallel volume requests.
Heketi receiving a large number of volume create requests appears to have caused this issue. I will try to reproduce it in my setup and will let you know about the progress.
The old heketi logs are missing, so we cannot pinpoint how heketi behaved when the issue occurred.