Bug 1575529

Summary: [RFE] load balancing the IO connections (active paths) across HA nodes
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Prasanna Kumar Kalever <prasanna.kalever>
Component: gluster-block Assignee: Prasanna Kumar Kalever <prasanna.kalever>
Status: CLOSED ERRATA QA Contact: Neha Berry <nberry>
Severity: high Docs Contact:
Priority: unspecified    
Version: cns-3.10 CC: akrishna, asriram, bgoyal, hchiramm, kramdoss, pkarampu, pprakash, prasanna.kalever, rhs-bugs, sankarshan, vbellur, xiubli
Target Milestone: --- Keywords: FutureFeature
Target Release: CNS 3.10   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: gluster-block-0.2.1-19.el7rhgs Doc Type: Enhancement
Doc Text:
Previously, the multipath priority configuration was set to constant for all HA paths, so the load might not be distributed uniformly across the available HA nodes. With this update, priority-based load balancing is introduced in gluster-block. The gluster-block management daemon reads the load-balancing information from the volume metadata and, when a block device is requested for creation, sets high priority on the path whose node is least used. When logging in to the device, the initiator-side multipath tools pick the high-priority path and mark it as active. This way the load is distributed across the nodes.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-12 09:25:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1585197    
Bug Blocks: 1568860    

Description Prasanna Kumar Kalever 2018-05-07 08:14:17 UTC
Description of problem:

Currently we recommend a constant priority ('prio const') in the multipath configuration.

# cat /etc/multipath.conf
[...]
# LIO iSCSI
devices {
        device {
                vendor "LIO-ORG"
                user_friendly_names "yes" # names like mpatha
                path_grouping_policy "failover" # one path per group
                path_selector "round-robin 0"
                failback immediate
                path_checker "tur"
                prio "const"
                no_path_retry 120
                rr_weight "uniform"
        }
}

Setting 'prio const' assigns priority 1 to all paths. Hence, if we create blocks with NODE1, NODE2 and NODE3 (HA 3), the nodes are assigned to the portals of tpg1, tpg2 and tpg3 respectively.

Since the path priority is 1 for all paths, tpg1 is almost always picked, and hence the active path ends up on the target at NODE1 for all the blocks.

To distribute the IO load more or less equally, we need a way to set the priorities of the paths on the target side.

Currently the layout looks like this:

[root@localhost ~]# targetcli ls                                                    
o- / ......................................................................... [...]
  o- backstores .............................................................. [...]
  | o- block .................................................. [Storage Objects: 0]
  | o- fileio ................................................. [Storage Objects: 0]
  | o- pscsi .................................................. [Storage Objects: 0]
  | o- ramdisk ................................................ [Storage Objects: 0]
  | o- user:glfs .............................................. [Storage Objects: 1]
  |   o- tcmublock  [sample.124.162/block-store/d87e2981-51b2-4f82-acaa-0dc8
2037854e (1.0GiB) activated]                                                        
  |     o- alua ................................................... [ALUA Groups: 2]
  |       o- default_tg_pt_gp ....................... [ALUA state: Active/optimized]
  |       o- glfs_tg_pt_gp .......................... [ALUA state: Active/optimized]
  o- iscsi ............................................................ [Targets: 1]
  | o- iqn.2016-12.org.gluster-block:d87e2981-51b2-4f82-acaa-0dc82037854e  [TPGs: 3]
  |   o- tpg1 .................................................. [gen-acls, no-auth]
  |   | o- acls .......................................................... [ACLs: 0]
  |   | o- luns .......................................................... [LUNs: 1]
  |   | | o- lun0 ................................. [user/tcmublock (glfs_tg_pt_gp)]
  |   | o- portals .................................................... [Portals: 1]
  |   |   o- 192.168.124.162:3260 ............................................. [OK]
  |   o- tpg2 ........................................................... [disabled]
  |   | o- acls .......................................................... [ACLs: 0]
  |   | o- luns .......................................................... [LUNs: 1]
  |   | | o- lun0 ................................. [user/tcmublock (glfs_tg_pt_gp)]
  |   | o- portals .................................................... [Portals: 1]
  |   |   o- 192.168.124.149:3260 ............................................. [OK]
  |   o- tpg3 ........................................................... [disabled]
  |     o- acls .......................................................... [ACLs: 0]
  |     o- luns .......................................................... [LUNs: 1]
  |     | o- lun0 ................................. [user/tcmublock (glfs_tg_pt_gp)]
  |     o- portals .................................................... [Portals: 1]
  |       o- 192.168.124.184:3260 ............................................. [OK]
  o- loopback ......................................................... [Targets: 0]
As part of specifying priorities, we will have to create two ALUA target port groups per storage object, i.e. glfs_tg_pt_gp_ao and glfs_tg_pt_gp_ano: active/optimized (AO) and active/non-optimized (ANO). AO will have priority 50 and ANO will have priority 10.

[root@localhost ~]# targetcli ls                                                    
o- / ......................................................................... [...]
  o- backstores .............................................................. [...]
  | o- block .................................................. [Storage Objects: 0]
  | o- fileio ................................................. [Storage Objects: 0]
  | o- pscsi .................................................. [Storage Objects: 0]
  | o- ramdisk ................................................ [Storage Objects: 0]
  | o- user:glfs .............................................. [Storage Objects: 1]
  |   o- tcmublock  [sample.124.162/block-store/d87e2981-51b2-4f82-acaa-0dc8
2037854e (1.0GiB) activated]                                                        
  |     o- alua ................................................... [ALUA Groups: 3]
  |       o- default_tg_pt_gp ....................... [ALUA state: Active/optimized]
  |       o- glfs_tg_pt_gp_ano .................. [ALUA state: Active/non-optimized]
  |       o- glfs_tg_pt_gp_ao ....................... [ALUA state: Active/optimized]
  o- iscsi ............................................................ [Targets: 1]
  | o- iqn.2016-12.org.gluster-block:d87e2981-51b2-4f82-acaa-0dc82037854e  [TPGs: 3]
  |   o- tpg1 .................................................. [gen-acls, no-auth]
  |   | o- acls .......................................................... [ACLs: 0]
  |   | o- luns .......................................................... [LUNs: 1]
  |   | | o- lun0 .............................. [user/tcmublock (glfs_tg_pt_gp_ao)]
  |   | o- portals .................................................... [Portals: 1]
  |   |   o- 192.168.124.162:3260 ............................................. [OK]
  |   o- tpg2 ........................................................... [disabled]
  |   | o- acls .......................................................... [ACLs: 0]
  |   | o- luns .......................................................... [LUNs: 1]
  |   | | o- lun0 ............................. [user/tcmublock (glfs_tg_pt_gp_ano)]
  |   | o- portals .................................................... [Portals: 1]
  |   |   o- 192.168.124.149:3260 ............................................. [OK]
  |   o- tpg3 ........................................................... [disabled]
  |     o- acls .......................................................... [ACLs: 0]
  |     o- luns .......................................................... [LUNs: 1]
  |     | o- lun0 ............................. [user/tcmublock (glfs_tg_pt_gp_ano)]
  |     o- portals .................................................... [Portals: 1]
  |       o- 192.168.124.184:3260 ............................................. [OK]
  o- loopback ......................................................... [Targets: 0]
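
For reference, here is a minimal sketch of how the two groups in the listing above could be created and bound to the per-TPG LUNs with rtslib-fb (see the alua.py link under "useful links" below). This is only an illustration of the approach, not the gluster-block implementation; the storage object name, IQN and the choice of tpg1 as the AO group are taken from the example listing, and it assumes rtslib-fb's ALUATargetPortGroup class and the LUN alua_tg_pt_gp_name property.

#!/usr/bin/python
# Illustration only: create one AO and one ANO ALUA target port group for the
# storage object shown above, then point tpg1's LUN at the AO group and the
# other TPGs' LUNs at the ANO group. The multipath 'alua' prioritizer on the
# initiator then reports prio 50 for the AO path and prio 10 for the ANO paths.
from rtslib_fb import FabricModule, RTSRoot
from rtslib_fb.alua import ALUATargetPortGroup

SO_NAME = "tcmublock"    # storage object name from the example listing
IQN = "iqn.2016-12.org.gluster-block:d87e2981-51b2-4f82-acaa-0dc82037854e"
AO_TPG_TAG = 1           # tpg1 carries the Active/optimized path in this example

so = next(s for s in RTSRoot().storage_objects if s.name == SO_NAME)

# Group tags 1 and 2 are arbitrary non-default IDs for the two new groups.
ao = ALUATargetPortGroup(so, "glfs_tg_pt_gp_ao", 1)
ao.alua_access_state = 0     # 0 == Active/optimized
ano = ALUATargetPortGroup(so, "glfs_tg_pt_gp_ano", 2)
ano.alua_access_state = 1    # 1 == Active/non-optimized

target = next(t for t in FabricModule("iscsi").targets if t.wwn == IQN)
for tpg in target.tpgs:
    for lun in tpg.luns:
        lun.alua_tg_pt_gp_name = (
            "glfs_tg_pt_gp_ao" if tpg.tag == AO_TPG_TAG else "glfs_tg_pt_gp_ano")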

To take advantage of this, on the initiator side we will have to set 'prio alua':

# cat /etc/multipath.conf
[...]
# LIO iSCSI
devices {
        device {
                vendor "LIO-ORG"
                user_friendly_names "yes" # names like mpatha
                path_grouping_policy "failover" # one path per group
                path_selector "round-robin 0"
                failback immediate
                path_checker "tur"
                prio "alua"
                no_path_retry 120
                rr_weight "uniform"
        }
}

Unfortunately, there is no way to get the active devices on a given target node from the target side.
We may have to build logic to distribute the AO tpgs equally.
The idea is to maintain a per-node AO counter in every volume.

Say we have IP1, IP2 and IP3 (HA 3).
We will have a counter file in every volume, where we maintain the AO count:
$ cat .counter
IP1: 70
IP2: 58
IP3: 63

For example, the next block create command will pick AO on IP2, since it has the least load. Any better solutions are welcome.
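
A minimal sketch of this least-loaded selection (illustrative only, not the actual gluster-block code, which is written in C; the counter values are the ones from the example above):

# Pick the node that currently serves the fewest Active/optimized paths.
def pick_ao_node(ao_counts):
    # ao_counts: dict mapping node IP -> number of blocks whose AO path is on it
    node = min(ao_counts, key=ao_counts.get)
    ao_counts[node] += 1     # the new block's AO (high-priority) path lands here
    return node

counts = {"IP1": 70, "IP2": 58, "IP3": 63}
print(pick_ao_node(counts))  # -> IP2, since it has the lowest count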

Changes needed:
gluster-block
tcmu-runner (dummy lock implementation in the glfs handler)

useful links:
https://github.com/open-iscsi/rtslib-fb/blob/master/rtslib/alua.py#L312



Upstream discussion:
https://github.com/gluster/gluster-block/issues/82

Comment 6 Neha Berry 2018-07-04 11:23:20 UTC
The changes suggested by this RFE are in place since gluster-block version gluster-block-0.2.1-19.el7rhgs.


Versions used for verification
=============


#  for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep targetcli ; done
glusterfs-storage-krlwr
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-pngnm
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
glusterfs-storage-v8z6s
+++++++++++++++++++++++
targetcli-2.1.fb46-6.el7_5.noarch
#
# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa|grep gluster-block ; done
glusterfs-storage-krlwr
+++++++++++++++++++++++
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-storage-pngnm
+++++++++++++++++++++++
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-storage-v8z6s
+++++++++++++++++++++++
gluster-block-0.2.1-20.el7rhgs.x86_64
#
# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa|grep tcmu-runner ; done
glusterfs-storage-krlwr
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-pngnm
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
glusterfs-storage-v8z6s
+++++++++++++++++++++++
tcmu-runner-1.2.0-20.el7rhgs.x86_64
#
# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep python-configshell ; done
glusterfs-storage-krlwr
+++++++++++++++++++++++
python-configshell-1.1.fb23-4.el7_5.noarch
glusterfs-storage-pngnm
+++++++++++++++++++++++
python-configshell-1.1.fb23-4.el7_5.noarch
glusterfs-storage-v8z6s
+++++++++++++++++++++++
python-configshell-1.1.fb23-4.el7_5.noarch
#
# for i in `oc get pods -o wide| grep glusterfs|cut -d " " -f1` ; do echo $i; echo +++++++++++++++++++++++; oc exec $i -- rpm -qa | grep python-rtslib ; done
glusterfs-storage-krlwr
+++++++++++++++++++++++
python-rtslib-2.1.fb63-12.el7_5.noarch
glusterfs-storage-pngnm
+++++++++++++++++++++++
python-rtslib-2.1.fb63-12.el7_5.noarch
glusterfs-storage-v8z6s
+++++++++++++++++++++++
python-rtslib-2.1.fb63-12.el7_5.noarch


Verified the following:
+++++++++++++++++++++++++++++++

1. created multiple block volumes mounted on app pods
2. Confirmed in targetcli ls that we now have two ALUA target port groups per storage object, i.e. glfs_tg_pt_gp_ao and glfs_tg_pt_gp_ano: active/optimized (AO) and active/non-optimized (ANO). AO has priority 50 and ANO has priority 10.

3. Load balancing works as expected once /etc/multipath.conf is changed to use prio "alua".

A snippet from the setup for one block volume
+++++++++++++++++++++++++++++

multipath -ll

mpathd (3600140589bba72ef49445bf9501b7d9e) dm-34 LIO-ORG ,TCMU device     
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 47:0:0:0 sdo  8:224  active ready running
|-+- policy='round-robin 0' prio=10 status=enabled
| `- 49:0:0:0 sdq  65:0   active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 48:0:0:0 sdp  8:240  active ready running


#ll /dev/disk/by-path/ip*|grep sdo 
lrwxrwxrwx. 1 root root  9 Jul  4 10:04 /dev/disk/by-path/ip-10.70.43.230:3260-iscsi-iqn.2016-12.org.gluster-block:89bba72e-f494-45bf-9501-b7d9ec328213-lun-0 -> ../../sdo


| o- iqn.2016-12.org.gluster-block:89bba72e-f494-45bf-9501-b7d9ec328213 ................................................ [TPGs: 3]
  | | o- tpg1 ........................................................................................................... [disabled]
  | | | o- acls .......................................................................................................... [ACLs: 0]
  | | | o- luns .......................................................................................................... [LUNs: 1]
  | | | | o- lun0 ........................ [user/test-vol_glusterfs_claim14_7d6cfdf9-7f43-11e8-a6bb-0a580a800209 (glfs_tg_pt_gp_ao)]
  | | | o- portals .................................................................................................... [Portals: 1]
  | | |   o- 10.70.43.230:3260 ................................................................................................ [OK]
  | | o- tpg2 ..................................................................................... [gen-acls, tpg-auth, 1-way auth]
  | | | o- acls .......................................................................................................... [ACLs: 0]
  | | | o- luns .......................................................................................................... [LUNs: 1]
  | | | | o- lun0 ....................... [user/test-vol_glusterfs_claim14_7d6cfdf9-7f43-11e8-a6bb-0a580a800209 (glfs_tg_pt_gp_ano)]
  | | | o- portals .................................................................................................... [Portals: 1]
  | | |   o- 10.70.43.19:3260 ................................................................................................. [OK]
  | | o- tpg3 ........................................................................................................... [disabled]
  | |   o- acls .......................................................................................................... [ACLs: 0]
  | |   o- luns .......................................................................................................... [LUNs: 1]
  | |   | o- lun0 ....................... [user/test-vol_glusterfs_claim14_7d6cfdf9-7f43-11e8-a6bb-0a580a800209 (glfs_tg_pt_gp_ano)]
  | |   o- portals .................................................................................................... [Portals: 1]
  | |     o- 10.70.43.53:3260 ................................................................................................. [OK]


4. The /etc/multipath.conf file

# cat /etc/multipath.conf
# LIO iSCSI
# TODO: Add env variables for tweaking
devices {
        device {
                vendor "LIO-ORG"
                user_friendly_names "yes" 
                path_grouping_policy "failover"
                path_selector "round-robin 0"
                failback immediate
                path_checker "tur"
                prio "alua"
                no_path_retry 120
                rr_weight "uniform"
        }
}
defaults {
	user_friendly_names yes
	find_multipaths yes
}


blacklist {
}


---------------------------

5. We have around 35 block devices created, and confirmed that a per-node counter is maintained to keep the AO tpgs distributed equally:

# attr -l prio.info 
Attribute "selinux" has a 30 byte value for prio.info
Attribute "block.10.70.43.53" has a 1024 byte value for prio.info
Attribute "block.10.70.43.19" has a 1024 byte value for prio.info
Attribute "block.10.70.43.230" has a 1024 byte value for prio.info
[root@dhcp43-29 block-meta]# for i in `attr -l prio.info|grep "block.10"|cut -d "\"" -f2`; do attr -g $i prio.info; done
Attribute "block.10.70.43.53" had a 1024 byte value for prio.info:
4
Attribute "block.10.70.43.19" had a 1024 byte value for prio.info:
5
Attribute "block.10.70.43.230" had a 1024 byte value for prio.info:
5


# cd block-meta
[root@dhcp43-29 block-meta]# attr -l prio.info 
Attribute "selinux" has a 30 byte value for prio.info
Attribute "block.10.70.43.53" has a 1024 byte value for prio.info
Attribute "block.10.70.43.19" has a 1024 byte value for prio.info
Attribute "block.10.70.43.230" has a 1024 byte value for prio.info
[root@dhcp43-29 block-meta]# for i in `attr -l prio.info|grep "block.10"|cut -d "\"" -f2`; do attr -g $i prio.info; done
Attribute "block.10.70.43.53" had a 1024 byte value for prio.info:
7
Attribute "block.10.70.43.19" had a 1024 byte value for prio.info:
7
Attribute "block.10.70.43.230" had a 1024 byte value for prio.info:
8

Attaching the multipath and targetcli output from the setup for further confirmation.


This bug is now moved to Verified.

Comment 9 Anjana KD 2018-09-07 10:08:44 UTC
Have updated the doc text field, kindly review.

Comment 11 Anjana KD 2018-09-07 13:05:34 UTC
Made the required changes.

Comment 13 Anjana KD 2018-09-07 14:16:21 UTC
Updated that, thank you.

Comment 15 errata-xmlrpc 2018-09-12 09:25:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2691