Bug 1523258 - [GSS] Rebalance slow on large cluster
Summary: [GSS] Rebalance slow on large cluster
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-07 14:35 UTC by Cal Calhoun
Modified: 2021-06-10 13:51 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-22 06:32:19 UTC
Embargoed:



Description Cal Calhoun 2017-12-07 14:35:32 UTC
Description of problem:
  A 4.7 PB volume (16 nodes x 5 bricks each x 60 disks in RAID 6) has been running a volume rebalance for 744 hours; the rebalance status command shows the following:

[root@gs1 ~]# gluster volume rebalance <volume> status
Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
localhost          40893        25.3TB         89076             1         28945          in progress      774:54:21
node15              4772         4.0TB         47724             1         10244          in progress      774:54:20
node6              45146        22.7TB        100728             1         24980          in progress      774:54:20
node4              57755        35.4TB         97283             1         15143          in progress      774:54:20
node12             54851        34.0TB         96159             1         14335          in progress      774:54:20
node3              36124        21.6TB         88480            12         31814          in progress      774:54:20
node11             57502        31.6TB         97654             1         13150          in progress      774:54:20
node10             44509        22.9TB        102635             1         26957          in progress      774:54:20
node9              33127        20.7TB        126724             1         59430          in progress      774:54:20
node8              58226        36.1TB        103005             1         11446          in progress      774:54:20
node2              35714        19.7TB         85995             1         24227          in progress      774:54:20
node5              41277        26.3TB        104782             1         35168          in progress      774:54:20
node14             12916        10.2TB         72550             1         24339          in progress      774:54:20
node13              8940         6.5TB         89816             1         40115          in progress      774:54:20
node16              1816         1.5TB         39147             1          9720          in progress      774:54:20
node7              58457        33.9TB        100447             1         11820          in progress      774:54:20
Estimated time left for rebalance to complete :     6595:53:14
 
Version-Release number of selected component (if applicable):
  SERVER VERSIONS:
    OS:      RHEL 6.9
    Kernel:  kernel-2.6.32-696.13.2.el6.x86_64
    Gluster: glusterfs-server-3.8.4-44.el6rhs.x86_64

How reproducible:
  Ongoing

Steps to Reproduce:
  Just kick off rebalance on the volume
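
  For reference, a rebalance of this kind is normally started with the standard command (volume name elided here, as in the status output above):

  [root@gs1 ~]# gluster volume rebalance <volume> start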

Actual results:
  744 hours elapsed so far, with an estimated 6595:53:14 remaining to complete.

Expected results:
  Customer expects better performance.

Additional info:
  SOSREPORTS: http://collab-shell.usersys.redhat.com/01985137/

Comment 14 Cal Calhoun 2017-12-14 13:46:03 UTC
@Nithya, I have compiled the information regarding the proposed solution, presented it to the customer, and am waiting for their response.  I had also asked them on 12/8 about their preference regarding the rebalance estimate output but have not received a reply.

Comment 17 Luis Rico 2018-01-03 16:49:19 UTC
Sorry for the delay in getting back to you.
I spoke with the customer directly regarding this case and am relaying their concerns, as they are less comfortable explaining all of this in English (I'll update the case publicly with an excerpt of the following):
- First, the customer is very thankful for your help in providing a customized approach, suited to their environment, to improve the rebalance process so that it can complete in a more reasonable timeframe.
- Although the customer has worked with Gluster for the last few years, they do not feel confident enough to implement these steps themselves, as the technical level required to understand what the steps do and what their impact is goes beyond their comfort zone.
- The explanation provided in the case journal is quite difficult to follow. @Cal, my understanding is that you first copied the manual steps for each mount and then added the script that automates those same steps; if so, that is redundant: they do not have to execute the manual steps first and then the script, but can go directly to executing the script.
- They would also like to understand the potential impact of executing the steps manually on the nodes, both to be confident that it will not cause any issues and to understand the performance impact.
- Could the script be hardened as much as possible to reduce the risk of executing it incorrectly, please?
- They plan to expand the cluster again in 2018 with another 4 nodes, as their growth rate is 700-900 TB net per year. So next year they will have to run another rebalance over roughly 4 PB of data.
- In this sense, any improvement to the script would be very helpful.
- If, despite all efforts, the customer is still not comfortable, we will propose an engagement with Red Hat Consulting so that our engineers can help directly with this process (Dani, our cloud & storage architect, is in CC).
- Regarding the question about the estimated-time message from rebalance status: in their opinion it is better to get a real estimate, even if it is very long, than a generic message such as "> 2 months". Please update the BZ to change our official documentation accordingly.
- In any case, due to the Christmas campaign, any action on the IT systems has to be postponed until the 8th of January. Please put the ticket on standby or close it, and we will reopen it later.

Many many thanks to all of you.

Comment 18 Luis Rico 2018-01-03 16:53:58 UTC
On 23 Dec 2017, at 04:19, sankarshan <sankarshan> wrote:

Thank you for the detail in your response.

|On 23-Dec-2017 03:05, "Luis Rico" <lricomor> wrote:

|- They also would like to understand the potential impact of executing the steps manually on the nodes, so they can be confident it will not cause any issues, and to understand the performance impact
What forms of impact to performance are they highlighting?

|- Could it be possible to enhance the script as much as possible to reduce the risks of executing it incorrectly, please?
|- They plan to expand the cluster again in 2018 with another 4 nodes, as their growth rate is 700-900 TB net per year. So next year they will have to run another rebalance over roughly 4 PB of data.
|- In this sense, any improvement to the script would be very helpful.
|- If, despite all efforts, the customer is still not comfortable, we'll propose an engagement with Red Hat Consulting to make sure our engineers help directly with this process (Dani, our cloud & storage architect, in CC)

The script and documentation will help. However, if there is indeed a manner and form in which Consulting can be engaged, it would make more sense to deliver the customer experience through this means. It enables Red Hat to deliver expertise in a planned, structured and repeatable manner that can allay the concerns raised. Especially, since they have raised the flag of not feeling comfortable and confident of executing the steps by themselves.

Comment 19 Luis Rico 2018-01-03 16:56:13 UTC
On 26 Dec 2017, at 04:57, Nithya Balachandran <nbalacha> wrote:

One potential impact is that ops from these scripts will have the same priority as those from any other client. The rebalance processes are treated as internal daemons and any ops from them have lower priority on the bricks. This may impact client operations on the same volume. How often is this volume accessed?

The other impact is:
- The rebalance status will not be available via the CLI. You will need to check for any errors using the log files (see the sketch after this list).
- The log messages will be written to the log file for the mount point used to run the script instead of the regular rebalance log files.
- The customer needs to ensure that no add-brick/remove-brick operations are performed while the script is executing; if such an operation is needed, they will have to stop the script and start it again afterwards. This, however, lets them control when the script runs and which directory it processes, unlike a regular rebalance operation.
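
A minimal sketch of that log check, assuming a hypothetical dedicated mount at /mnt/rebalance-run (FUSE mount logs normally land under /var/log/glusterfs/ with the mount path encoded in the file name):

# show the most recent error-level entries from the mount's log file
[root@gs1 ~]# grep " E \[" /var/log/glusterfs/mnt-rebalance-run.log | tail -n 20
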
|- Could it be possible to enhance the script as much as possible to reduce the risks of executing it incorrectly, please?
I can try to do this. Is there anything in particular they would like the script to check for?
|- They plan to expand the cluster again in 2018 with another 4 nodes, as their growth rate is 700-900 TB net per year. So next year they will have to run another rebalance over roughly 4 PB of data.
Would this be at the beginning of the year or a later date?

Comment 20 Luis Rico 2018-01-03 16:59:21 UTC
On 26 Dec 2017, at 05:46, Raghavendra Gowdappa <rgowdapp> wrote:

|One potential impact is that ops from these scripts will have the same
priority as that from any other client. The rebalance processes are
treated as internal daemons and any ops from them will have lower priority
on the bricks. This *may* cause an impact to client operations on the same
volume.

This can be addressed by passing the rebalance client-pid value to the "client-pid" option when creating the mount on which the scripts are run. Note that the mount.glusterfs script doesn't support this option at the moment, but fixing that is trivial. Also note that this should be a dedicated mount (used only for running the script) and shouldn't be used by applications.
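
A hedged sketch of what such a dedicated mount could look like, assuming the glusterfs client binary is invoked directly (since mount.glusterfs does not expose the option, as noted above) and assuming -3 is the client-pid value used by the rebalance daemons; both assumptions should be verified against the installed gluster version:

# Hypothetical dedicated mount whose ops the bricks treat as internal
# rebalance traffic rather than regular client I/O.
mkdir -p /mnt/rebalance-run
glusterfs --volfile-server=gs1p --volfile-id=mediaset \
          --client-pid=-3 /mnt/rebalance-run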

Comment 21 Luis Rico 2018-01-03 17:03:32 UTC
(In reply to Luis Rico from comment #18)
> On 23 Dec 2017, at 04:19, sankarshan <sankarshan> wrote:
> 
> Thank you for the detail in your response.
> 
> |On 23-Dec-2017 03:05, "Luis Rico" <lricomor> wrote:
> 
> |- They also would like to understand the potential impact of executing the
> steps manually on the nodes, so they can be confident it will not cause any
> issues, and to understand the performance impact
> What forms of impact to performance are they highlighting?

I'm referring to the potential performance impact of running the script in a production environment. We just need to clarify that.

> 
> |- Could it be possible to enhance the script as much as possible to reduce
> the risks of executing it incorrectly, please?
> |- They plan to expand the cluster again in 2018 with another 4 nodes, as
> their growth rate is 700-900 TB net per year. So next year they will have to
> run another rebalance over roughly 4 PB of data.
> |- In this sense, any improvement to the script would be very helpful.
> |- If, despite all efforts, the customer is still not comfortable, we'll
> propose an engagement with Red Hat Consulting to make sure our engineers help
> directly with this process (Dani, our cloud & storage architect, in CC)
> 
> The script and documentation will help. However, if there is indeed a manner
> and form in which Consulting can be engaged, it would make more sense to
> deliver the customer experience through this means. It enables Red Hat to
> deliver expertise in a planned, structured and repeatable manner that can
> allay the concerns raised. Especially, since they have raised the flag of
> not feeling comfortable and confident of executing the steps by themselves.

Agree.

Comment 22 Luis Rico 2018-01-03 17:13:03 UTC
(In reply to Luis Rico from comment #19)
> On 26 Dec 2017, at 04:57, Nithya Balachandran <nbalacha> wrote:
> 
> One potential impact is that ops from these scripts will have the same
> priority as that from any other client. The rebalance processes are
> treated as internal daemons and any ops from them will have lower priority
> on the bricks. This may cause an impact to client operations to the same
> volume. 
That is exactly the kind of information we should provide to the customer for their awareness.

>How often is this volume accessed?
The volume is accessed at any time for reading archived videos, and at certain times of the day DIVA creates new folders with new videos and metadata, controlled by the admin.

> 
> The other impact is:
> - The rebalance status will not be available via the cli. You will need to
> check for any errors using the log files.
> - The log messages will be written to the log file for the mount point used
> to run the script instead of the regular rebalance log files.
> - The customer needs to ensure that no add-brick/remove-brick operations are
> performed while the script is executing. They will need to stop the script
> and start it again. This however allows them to control when the script is
> running and which directory to process unlike with a regular rebalance
> operation.

This is excellent information to provide to the customer. That is exactly what they are requesting.

> |- Could it be possible to enhance the script as much as possible to reduce
> the risks of executing it incorrectly, please?
> I can try to do this. Is there anything in particular they would like the
> script to check for?
No idea on this, as we don't know what could go wrong if it's not executed as expected.

> |- They plan to expand the cluster again in 2018 with another 4 nodes, as
> their growth rate is 700-900 TB net per year. So next year they will have to
> run another rebalance over roughly 4 PB of data.
> Would this be at the beginning of the year or a later date?

Later date.

Comment 61 Cal Calhoun 2018-03-14 17:04:58 UTC
@Nithya, No, I didn't realize that Damian had already requested the information.  We're still waiting for the customer to respond.
-Cal

Comment 66 Cal Calhoun 2018-03-22 16:01:49 UTC
Nithya,

  Here's the mediaset info:
  
  http://collab-shell.usersys.redhat.com/01985137/var_lib_glusterd_vols_mediaset.tar/

  I didn't see a new gluster volume info so I'll ask for it again.

Comment 67 Cal Calhoun 2018-03-22 16:04:25 UTC
Nithya, 

  Nevermind.  It was posted to a comment:

[root@gs5 ~]# gluster volume info

Volume Name: ctdbmeta
Type: Distributed-Replicate
Volume ID: 359d578b-6f27-47c4-b7da-785085f1fa6a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gs1p:/datos/ctdb/data
Brick2: gs2p:/datos/ctdb/data
Brick3: gs3p:/datos/ctdb/data
Brick4: gs4p:/datos/ctdb/data
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
auto-delete: disable

Volume Name: mediaset
Type: Distribute
Volume ID: b02d5777-74c5-496d-a2b5-766cf5d439db
Status: Started
Snapshot Count: 0
Number of Bricks: 80
Transport-type: tcp
Bricks:
Brick1: gs1p:/datos/brick1/data
Brick2: gs1p:/datos/brick2/data
Brick3: gs1p:/datos/brick3/data
Brick4: gs1p:/datos/brick4/data
Brick5: gs1p:/datos/brick5/data
Brick6: gs2p:/datos/brick1/data
Brick7: gs2p:/datos/brick2/data
Brick8: gs2p:/datos/brick3/data
Brick9: gs2p:/datos/brick4/data
Brick10: gs2p:/datos/brick5/data
Brick11: gs3p:/datos/brick1/data
Brick12: gs3p:/datos/brick2/data
Brick13: gs3p:/datos/brick3/data
Brick14: gs3p:/datos/brick4/data
Brick15: gs3p:/datos/brick5/data
Brick16: gs4p:/datos/brick1/data
Brick17: gs4p:/datos/brick2/data
Brick18: gs4p:/datos/brick3/data
Brick19: gs4p:/datos/brick4/data
Brick20: gs4p:/datos/brick5/data
Brick21: gs5p:/datos/brick1/data
Brick22: gs5p:/datos/brick2/data
Brick23: gs5p:/datos/brick3/data
Brick24: gs5p:/datos/brick4/data
Brick25: gs5p:/datos/brick5/data
Brick26: gs6p:/datos/brick1/data
Brick27: gs6p:/datos/brick2/data
Brick28: gs6p:/datos/brick3/data
Brick29: gs6p:/datos/brick4/data
Brick30: gs6p:/datos/brick5/data
Brick31: gs7p:/datos/brick1/data
Brick32: gs7p:/datos/brick2/data
Brick33: gs7p:/datos/brick3/data
Brick34: gs7p:/datos/brick4/data
Brick35: gs7p:/datos/brick5/data
Brick36: gs8p:/datos/brick1/data
Brick37: gs8p:/datos/brick2/data
Brick38: gs8p:/datos/brick3/data
Brick39: gs8p:/datos/brick4/data
Brick40: gs8p:/datos/brick5/data
Brick41: gs9p:/datos/brick1/data
Brick42: gs9p:/datos/brick2/data
Brick43: gs9p:/datos/brick3/data
Brick44: gs9p:/datos/brick4/data
Brick45: gs9p:/datos/brick5/data
Brick46: gs10p:/datos/brick1/data
Brick47: gs10p:/datos/brick2/data
Brick48: gs10p:/datos/brick3/data
Brick49: gs10p:/datos/brick4/data
Brick50: gs10p:/datos/brick5/data
Brick51: gs11p:/datos/brick1/data
Brick52: gs11p:/datos/brick2/data
Brick53: gs11p:/datos/brick3/data
Brick54: gs11p:/datos/brick4/data
Brick55: gs11p:/datos/brick5/data
Brick56: gs12p:/datos/brick1/data
Brick57: gs12p:/datos/brick2/data
Brick58: gs12p:/datos/brick3/data
Brick59: gs12p:/datos/brick4/data
Brick60: gs12p:/datos/brick5/data
Brick61: gs13p:/datos/brick1/data
Brick62: gs13p:/datos/brick2/data
Brick63: gs13p:/datos/brick3/data
Brick64: gs13p:/datos/brick4/data
Brick65: gs13p:/datos/brick5/data
Brick66: gs14p:/datos/brick1/data
Brick67: gs14p:/datos/brick2/data
Brick68: gs14p:/datos/brick3/data
Brick69: gs14p:/datos/brick4/data
Brick70: gs14p:/datos/brick5/data
Brick71: gs15p:/datos/brick1/data
Brick72: gs15p:/datos/brick2/data
Brick73: gs15p:/datos/brick3/data
Brick74: gs15p:/datos/brick4/data
Brick75: gs15p:/datos/brick5/data
Brick76: gs16p:/datos/brick1/data
Brick77: gs16p:/datos/brick2/data
Brick78: gs16p:/datos/brick3/data
Brick79: gs16p:/datos/brick4/data
Brick80: gs16p:/datos/brick5/data
Options Reconfigured:
cluster.rebal-throttle: aggressive
performance.rda-cache-limit: 300Mb
server.allow-insecure: on
performance.readdir-ahead: on
performance.force-readdirp: on
storage.batch-fsync-delay-usec: 0
cluster.min-free-disk: 120GB
nfs.disable: on
performance.stat-prefetch: off
cluster.weighted-rebalance: on
transport.address-family: inet
cluster.lookup-optimize: off
performance.parallel-readdir: on
auto-delete: disable
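
For reference, volume options like those listed above are applied with "gluster volume set"; for example, the rebalance throttle shown here would have been set with something like:

[root@gs5 ~]# gluster volume set mediaset cluster.rebal-throttle aggressive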

Comment 82 Cal Calhoun 2018-04-21 00:03:23 UTC
@Nithya, After being given the recommended settings, the customer closed the case with no further comments, so I'm fine with the BZ being closed unless there are other cases to link to it.
-Cal

Comment 84 Nithya Balachandran 2018-04-22 06:32:19 UTC
(In reply to Cal Calhoun from comment #82)
> @Nithya, After being given the recommended settings, the customer closed the
> case with no further comments so I'm fine with the BZ being closed unless
> there are other cases to link to it.
> -Cal

Thank you Cal.

I am closing this BZ with WontFix. Let me know if this is alright.

Comment 85 Cal Calhoun 2018-04-23 16:32:56 UTC
@Nithya, Ack.  Thank you for your help with this. -Cal

Comment 86 Luis Rico 2018-04-23 17:46:32 UTC
Thank you ALL for your help, improving and customizing a complex rebalance operation!

