Cluster hangs when one of the nodes in a replicated cluster is overloaded. Usually all nodes in the cluster carry the same level of load. Occasionally, though, a node hangs or freezes while still being reachable, for instance when there are problems with the filesystem or HDDs. In these cases the whole cluster hangs. We can set ping-timeout, but it doesn't help here. Is it possible to set thresholds based on a node's performance and mark nodes as failed or unreachable when they hang or respond very slowly? Thanks.
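To make the request concrete, here is a minimal sketch of the kind of performance threshold described above. It is not a GlusterFS feature or API. The `LatencyMonitor` class, its parameters, and the idea of feeding it per-request latencies are all hypothetical and only illustrate the behaviour being asked for: a node that still answers, but too slowly, is flagged.

```python
from collections import deque
from statistics import median


class LatencyMonitor:
    """Hypothetical helper: flags a node as slow when the median of its
    most recent response latencies exceeds a threshold (in seconds)."""

    def __init__(self, threshold_s, window=5):
        self.threshold_s = threshold_s
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def record(self, latency_s):
        """Store the latency of one completed request."""
        self.samples.append(latency_s)

    def is_slow(self):
        """Return True once a full window of samples has a median above
        the threshold. Requiring a full window keeps a single slow reply
        from marking the node as failed."""
        if len(self.samples) < self.samples.maxlen:
            return False
        return median(self.samples) > self.threshold_s
```

A caller would record one latency per request to a node and treat the node as failed once `is_slow()` returns True. Using the median rather than the mean is a design choice in this sketch so that a single outlier does not dominate the decision.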
This is how we want to handle it in container storage. The challenge is the data migration involved. We can 'migrate' the process to a new node, but migrating the data spikes the load again. The recommended approach is to restrict CPU for glusterfs using cgroups, and to focus on fixing the lock contention issues that are being identified.
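As a rough illustration of the cgroups suggestion, the sketch below writes a CPU limit using the cgroup v2 `cpu.max` interface, which takes a quota and a period in microseconds. The helper names and the example cgroup directory are assumptions made for illustration, not part of GlusterFS. The glusterfs process would also have to be placed in that cgroup (for example by writing its PID to `cgroup.procs`), which is not shown.

```python
from pathlib import Path


def cpu_max_value(cpu_fraction, period_us=100_000):
    """Build the string for cgroup v2 'cpu.max': '<quota_us> <period_us>'.
    cpu_fraction=0.5 means half of one CPU; 1.5 means one and a half CPUs."""
    if cpu_fraction <= 0:
        raise ValueError("cpu_fraction must be positive")
    quota_us = round(cpu_fraction * period_us)
    return f"{quota_us} {period_us}"


def limit_cgroup_cpu(cgroup_dir, cpu_fraction):
    """Write the CPU limit into <cgroup_dir>/cpu.max and return the value.
    cgroup_dir is a hypothetical path such as /sys/fs/cgroup/glusterfs;
    writing to a real cgroup normally requires root privileges."""
    value = cpu_max_value(cpu_fraction)
    (Path(cgroup_dir) / "cpu.max").write_text(value + "\n")
    return value
```

For example, `limit_cgroup_cpu("/sys/fs/cgroup/glusterfs", 0.5)` would cap the processes in that (assumed) cgroup at half of one CPU per 100 ms period.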
This bug has been moved to https://github.com/gluster/glusterfs/issues/1109 and will be tracked there from now on. Visit the GitHub issue URL for further details.