Description of problem:

There is a use case in a CNS environment with a large number of small volumes. In such a case, the bricks corresponding to each volume end up landing on one or two drives rather than on all the backend drives. In our 3-node CNS environment, when we scale from 100 to 1000 5GB volumes, the corresponding 100 to 1000 bricks created on each CNS node do not distribute across all 12 drives in the backend. In the 100- and 200-volume cases only 2 out of 12 HDDs were utilized, whereas in the 500- and 1000-volume tests only 6 out of 12 HDDs were utilized. The fewer HDDs used in the backend, the lower the performance, so it is imperative to change the Heketi code to make sure bricks are distributed homogeneously.

The following link shows the utilization of the HDDs for the 100, 200, 500 and 1000 volume cases:
http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/disk_utilization/

The images in the link above show how many drives are actually doing work while write IO is performed on them, i.e. how many drives are actually hosting bricks.

Version-Release number of selected component (if applicable):

8 servers were used to set up OpenShift, one of them being the master server. The master node was schedulable.
3 of the 8 servers were dedicated to the CNS deployment. These 3 servers were non-schedulable, i.e. hosting only storage pods and no application pods.
All 8 servers had 48 GB RAM and 2 CPU sockets with 6 cores each, 12 processors in total.
Each of the 3 CNS nodes had 12 7200 RPM hard drives of 930 GB capacity.
All drives were part of the CNS topology, giving a total capacity of ~11 TB (replica 3 setup).

kernel: 3.10.0-693.el7.x86_64
OpenShift version: v3.6.173.0.7
Kubernetes: v1.6.1+5115d708d7
Docker: 1.12.6, docker-1.12.6-31.1.git97ba2c0.el7.x86_64
rhgs server image: rhgs3/rhgs-server-rhel7:3.3.0-24
volmanager: rhgs3/rhgs-volmanager-rhel7:3.3.0-27
heketi: 5.0.0-11.el7rhgs.x86_64 and heketi-client-5.0.0-11.el7rhgs.x86_64
cns-deploy: cns-deploy-5.0.0-41.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Scale to 100 small volumes in a 3-node CNS environment (make sure there is a significant number of HDDs in the configuration).
2. Check where the bricks land.

Actual results:
Bricks land on 1-2 HDDs out of 12.

Expected results:
Bricks should be homogeneously distributed across all the HDDs in the backend.

Additional info:
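For step 2, one way to check where the bricks land is to count bricks per device from the JSON that `heketi-cli topology info --json` prints. The sketch below parses only a minimal subset of that output; the field names (`clusters`, `nodes`, `devices`, `bricks`, `hostnames.manage`) are my recollection of the topology format, and the embedded sample JSON is purely hypothetical, so treat this as a sketch rather than a tested tool:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// topology models only the fields we need from `heketi-cli topology info --json`;
// field names assumed from the heketi topology format, not verified here.
type topology struct {
	Clusters []struct {
		Nodes []struct {
			Hostnames struct {
				Manage []string `json:"manage"`
			} `json:"hostnames"`
			Devices []struct {
				Name   string `json:"name"`
				Bricks []struct {
					Path string `json:"path"`
				} `json:"bricks"`
			} `json:"devices"`
		} `json:"nodes"`
	} `json:"clusters"`
}

// bricksPerDevice returns a map of "host:device" to the number of bricks
// currently placed on that device.
func bricksPerDevice(raw []byte) (map[string]int, error) {
	var t topology
	if err := json.Unmarshal(raw, &t); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, c := range t.Clusters {
		for _, n := range c.Nodes {
			host := ""
			if len(n.Hostnames.Manage) > 0 {
				host = n.Hostnames.Manage[0] + ":"
			}
			for _, d := range n.Devices {
				counts[host+d.Name] = len(d.Bricks)
			}
		}
	}
	return counts, nil
}

func main() {
	// Hypothetical sample: one node, two devices, with all three
	// bricks piled onto /dev/sdb, as in the reported behaviour.
	sample := []byte(`{"clusters":[{"nodes":[{
		"hostnames":{"manage":["node1"]},
		"devices":[
			{"name":"/dev/sdb","bricks":[{"path":"/b1"},{"path":"/b2"},{"path":"/b3"}]},
			{"name":"/dev/sdc","bricks":[]}
		]}]}]}`)
	counts, err := bricksPerDevice(sample)
	if err != nil {
		panic(err)
	}
	for dev, n := range counts {
		fmt.Printf("%s %d bricks\n", dev, n)
	}
}
```

A heavily skewed histogram from such a count (most devices at zero) is exactly the symptom described above.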
Thanks for reporting! We will look into improving this soon. Besides the case of multiple disks per node on a 3-node cluster, there is also the case of clusters with more than three nodes, where inhomogeneity could occur as well.
John, IIRC you re-modelled the brick allocation logic. Do you see a chance to improve the distribution so that the devices host the same number of bricks? I am not sure how the algorithm is implemented now, but it may be rather random, and a round-robin approach could result in a more 'equal' distribution.
It can be done, and certainly should be done eventually. However, it's not a small job IMO. Recent refactoring should make this easier but all current "placers" still rely on the same basic code that can produce these uneven layouts.
Moving this out to cns-3.11.0, might become a glusterd2 enhancement later.