Created attachment 1645849 [details]
Gluster vol info and status, df -hT, heal info, logs of glfsheal and all related bricks

Description of problem:
Setup: 3-node VMware cluster (2 storage nodes and 1 arbiter node), distribute-replicate 2 volume with 1 arbiter brick per replica tuple (see the attached file for the detailed configuration).

Version-Release number of selected component (if applicable):
GlusterFS v5.10

How reproducible:

Steps to Reproduce:
1. Mount the volume from a dedicated client machine.
2. Disable the network of node 2.
3. Write to node 1 in the volume until it is full. The storage.reserve limit of the local bricks should take effect, so the bricks end up with roughly 1% free space.
4. Disable the network of node 1.
5. Enable the network of node 2.
6. Write to node 2 in the same volume, but write the data into a different subfolder or use completely different data; otherwise you would get a split-brain error, which is not the issue here. Again write data until the bricks reach the storage.reserve limit.
7. The volume is now filled with twice the amount of data.
8. Enable the network of node 1.

Actual results:
storage.reserve was ignored and all bricks were 100% full within a few seconds. All brick processes died. The volume is not mountable and a heal cannot be triggered.

Expected results:
The self-heal process should be blocked by storage.reserve, the brick processes should keep running, and the volume should remain accessible.

Additional info:
See the attached file. The above scenario was not only reproduced on a VM cluster; we also observed it on a real hardware cluster.
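A minimal arithmetic model of the failure described in the steps above. This is purely illustrative: the brick size is an assumption, and the 1% reserve stands in for the configured storage.reserve value; it only shows why healing after the split must overflow both bricks.

```python
# Illustrative model of the split-fill-then-heal scenario.
# BRICK_SIZE is a hypothetical capacity; RESERVE_PCT mirrors a 1%
# storage.reserve setting (assumptions, not values from the report).
BRICK_SIZE = 100.0        # GB per data brick (assumed)
RESERVE_PCT = 1.0         # storage.reserve percentage (assumed)

reserve = BRICK_SIZE * RESERVE_PCT / 100.0
fill_limit = BRICK_SIZE - reserve    # external writes stop here

# While the nodes are partitioned, each side accepts ~99 GB of
# *different* data (steps 3 and 6 of the reproduction).
node1_used = fill_limit
node2_used = fill_limit

# Once the partition heals (step 8), self-heal tries to copy each
# side's unique data to the other side. If internal clients bypass
# storage.reserve, each brick now needs roughly double its capacity.
node1_after_heal = node1_used + node2_used
node2_after_heal = node2_used + node1_used

# Heal demand exceeds the physical brick size on both sides, so the
# bricks fill to 100% instead of stopping at the reserve limit.
assert node1_after_heal > BRICK_SIZE
assert node2_after_heal > BRICK_SIZE
```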
Questions for the assigned maintainer/developer: (1) can this be reproduced in a newer release? (2) Was this a known issue in the release as reported? Please also review (2) in terms of whether a recovery sequence can be made available so as to not cause this space-exhaustion issue.
I think this behaviour is peculiar to arbiter volumes (as opposed to replica 3), since the arbiter does not store data. If it had been a normal replica 3 volume, step 6 in the description would have failed because node 3 would already have been full. Mohit, what is your take on the bug?
The storage.reserve restriction check applies only to external clients, not to internal clients. I think it is the internal client's responsibility to check the available disk space before writing the data.
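A hedged sketch of the per-brick decision being discussed here. In glusterfs, internal daemons such as self-heal and rebalance are identified by a negative client PID; the function name and structure below are illustrative, not the actual posix-xlator code.

```python
# Hypothetical model of the storage.reserve check (not real glusterfs
# code). Internal clients (heal, rebalance) carry a negative pid and
# bypass the check, which is how self-heal can fill a brick past the
# reserve limit in this bug.
ENOSPC = 28  # errno for "No space left on device"

def reserve_check(free_pct, reserve_pct, client_pid):
    """Return an errno for a write fop, or 0 to allow it."""
    if client_pid < 0:
        # Internal client: check is skipped today, so heal/rebalance
        # writes proceed even when the brick is below the reserve.
        return 0
    if free_pct <= reserve_pct:
        return ENOSPC    # external client write rejected
    return 0

# An external write below the reserve is rejected, while an identical
# internal write (negative pid) is allowed through:
assert reserve_check(0.5, 1.0, client_pid=1234) == ENOSPC
assert reserve_check(0.5, 1.0, client_pid=-6) == 0
```

The discussion below is about whether the first branch should exist at all, or should at least exempt only the rebalance daemon rather than every internal client.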
That would be a leaky abstraction for an option that is per-brick specific. It looks like you added the check for internal clients via BZ 1506083, but I can't find any specific problem described in that BZ. One problem is that if we subject writes from self-heal to the same check, then in the case described in this bug, heals would never be able to complete. But that is no different from the case where this option is *not* enabled and I/O was pumped until the disk was full. So maybe we should allow internal clients as well?
(In reply to Ravishankar N from comment #4)
> So maybe we should allow internal clients as well?

Sorry, I meant we should *not* allow internal clients either.
We can't fail fops from internal clients; otherwise there would have been no reason to implement this feature at all. We exempted internal clients from the check because the feature was primarily implemented for the rebalance daemon: at the time of adding a brick, the rebalance daemon needs some space on the backend to rebalance the data, so we added a check that exempts internal clients.
This bug has been moved to https://github.com/gluster/glusterfs/issues/869 and will be tracked there from now on. Visit the GitHub issue URL for further details.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days