Description of problem:
The volume becomes unresponsive when a server in a 4-server distributed-replicate 14x2 cluster is brought back into the cluster after having been down for 2 days. This is probably caused by the self-heal system.

Version-Release number of selected component (if applicable):
3.3git-v3.3.2qa2-3-g3490689

Actual results:
Volume is unresponsive.

Expected results:
A working volume with only slightly higher access times.

Additional info:
Before stor2 was brought back online:
stor1: gluster filehandles: 550, load: 0.87 0.76 0.64 1/385 9521
stor3: gluster filehandles: 649, load: 0.40 0.94 1.21 1/494 16570
stor4: gluster filehandles: 573, load: 0.58 0.55 0.51 1/439 27743

First 5 minutes after stor2 was brought back online:
stor1: gluster filehandles: 596, load: 0.52 0.73 0.72 1/385 10320
stor2: gluster filehandles: 759, load: 28.09 18.32 8.09 25/294 2455
stor3: gluster filehandles: 774, load: 11.76 7.44 3.92 1/499 17568
stor4: gluster filehandles: 683, load: 4.74 3.55 1.86 1/439 28438

After 5 minutes I shut down stor2 to make the volume available again.
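For reference, the per-server figures above can be reproduced with a small monitoring snippet along these lines (hypothetical, not part of GlusterFS; Linux-only, counting open file descriptors of gluster* processes via /proc and reading /proc/loadavg):

```python
# Approximate the "gluster filehandles" and "load" figures quoted above.
# Hypothetical helper; assumes a Linux /proc filesystem.
import os

def gluster_fd_count():
    """Sum open file descriptors across processes whose name starts with 'gluster'."""
    total = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if not f.read().startswith("gluster"):
                    continue
            total += len(os.listdir(f"/proc/{pid}/fd"))
        except OSError:
            continue  # process exited, or fd listing needs more privileges
    return total

def load_averages():
    """Return the 1/5/15-minute load averages as floats."""
    with open("/proc/loadavg") as f:
        return [float(x) for x in f.read().split()[:3]]

print("gluster filehandles:", gluster_fd_count(), ", load:", load_averages())
```

Run once per server (e.g. via ssh in a loop) to get the per-host lines shown above.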
In addition: this issue is not about data-transfer saturation but about IOPS saturation on the individual hard disks, so moving to faster Ethernet or InfiniBand won't help. We see gluster management operations saturating brick IOPS in several cases: replace-brick, rebalance, and now self-heal. To prevent this behaviour, gluster management traffic must be throttled to, say, 100 IOPS per brick. Ideally the limit would be configurable, either as a fixed IOPS value or as a percentage of the brick's capacity.
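GlusterFS 3.3 has no such throttle; as a sketch of the idea being requested, a per-brick token bucket could gate each management I/O (heal/rebalance read or write). Names, rates, and the `acquire()` call site below are assumptions for illustration only:

```python
# Hypothetical per-brick IOPS throttle for management traffic (token bucket).
# max_iops could be a hardcoded limit (e.g. 100) or capacity * percentage.
import time

class IopsThrottle:
    def __init__(self, max_iops):
        self.max_iops = float(max_iops)   # refill rate, tokens per second
        self.tokens = float(max_iops)     # bucket starts full
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n I/O operations may proceed under the configured rate."""
        while True:
            now = time.monotonic()
            # Refill tokens for the elapsed interval, capped at bucket size.
            self.tokens = min(self.max_iops,
                              self.tokens + (now - self.last) * self.max_iops)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.max_iops)

# One throttle per brick; every self-heal/rebalance I/O would call acquire()
# before touching the disk, leaving the remaining IOPS budget for clients.
throttle = IopsThrottle(max_iops=100)
```

The point of the percentage option is that bricks differ in spindle speed, so a fixed 100 IOPS may be too aggressive on slow disks and needlessly conservative on fast ones.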
In addition: moving the stor2 server to a new DNS name (stor5) and IP address and force-replacing a single brick from stor2 to stor5 still DOSes the entire volume. In addition: disabling the self-heal daemon on the volume does NOT help; bringing up the new server with the single brick still DOSes the entire volume.
The "pre-release" version is ambiguous and about to be removed as a choice. If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.