Description of problem:
A 4.7 PB volume (16 nodes x 5 bricks each x 60 disks in RAID 6) has had a volume rebalance running for 744 hours. The rebalance status command shows:

[root@gs1 ~]# gluster volume rebalance <volume> status
Node         Rebalanced-files    size      scanned    failures   skipped    status         run time in h:m:s
---------    ----------------    ------    -------    --------   -------    -----------    -----------------
localhost    40893               25.3TB    89076      1          28945      in progress    774:54:21
node15       4772                4.0TB     47724      1          10244      in progress    774:54:20
node6        45146               22.7TB    100728     1          24980      in progress    774:54:20
node4        57755               35.4TB    97283      1          15143      in progress    774:54:20
node12       54851               34.0TB    96159      1          14335      in progress    774:54:20
node3        36124               21.6TB    88480      12         31814      in progress    774:54:20
node11       57502               31.6TB    97654      1          13150      in progress    774:54:20
node10       44509               22.9TB    102635     1          26957      in progress    774:54:20
node9        33127               20.7TB    126724     1          59430      in progress    774:54:20
node8        58226               36.1TB    103005     1          11446      in progress    774:54:20
node2        35714               19.7TB    85995      1          24227      in progress    774:54:20
node5        41277               26.3TB    104782     1          35168      in progress    774:54:20
node14       12916               10.2TB    72550      1          24339      in progress    774:54:20
node13       8940                6.5TB     89816      1          40115      in progress    774:54:20
node16       1816                1.5TB     39147      1          9720       in progress    774:54:20
node7        58457               33.9TB    100447     1          11820      in progress    774:54:20

Estimated time left for rebalance to complete : 6595:53:14

Version-Release number of selected component (if applicable):
SERVER VERSIONS:
OS: RHEL 6.9
Kernel: kernel-2.6.32-696.13.2.el6.x86_64
Gluster: glusterfs-server-3.8.4-44.el6rhs.x86_64

How reproducible:
Ongoing

Steps to Reproduce:
Just kick off a rebalance on the volume.

Actual results:
744 hours of rebalance so far, with an estimated 6595:53:14 remaining.

Expected results:
The customer expects better performance.

Additional info:
SOSREPORTS: http://collab-shell.usersys.redhat.com/01985137/
@Nithya, I have compiled the information regarding the proposed solution, presented it to the customer and am waiting for their response. I had also asked them about their preference re: rebalance estimate output on 12/8 but had not received a reply.
Sorry for the delay in coming back to you. I spoke with the customer directly regarding this case and am relaying their concerns, as they are less comfortable explaining all of this in English (I'll update the case publicly with an excerpt from the following):

- First, the customer is very thankful for your help in providing a customized approach, suited to their environment, to improve the rebalance process so that it can complete in a more reasonable timeframe.
- Although the customer has been working with Gluster for the last few years, they do not feel confident enough to implement these steps themselves; they feel they lack the technical depth to fully understand what the steps do and what their impact would be.
- The explanation provided in the case journal is quite difficult to follow. @Cal, my understanding is that you first copied the manual steps for each mount and then attached the script that automates those same steps, so the two are redundant: they do not have to execute the steps first and then the script, but can go directly to executing the script.
- They would also like to understand the potential impact of executing this manually on the nodes, so they can be confident it will not cause any issues, and to understand the performance impact.
- Could the script be hardened as much as possible to reduce the risk of executing it incorrectly, please?
- They plan to expand the cluster again in 2018 with another 4 nodes, as their growth rate is 700-900 TB net per year. So next year they will have to run another rebalance over a potential 4 PB of data.
- In that sense, any improvement to the script would be very helpful.
- If, despite all efforts, the customer is still not comfortable, we will propose an engagement with Red Hat Consulting to make sure our engineers help directly with this process (Dani, our cloud & storage architect, in CC).
- Regarding the question about the estimated-time message shown by rebalance status: their opinion is that a real estimate, even a very long one, is better than a generic message such as "> 2 months". Please update the BZ so our official documentation reflects this.
- In any case, due to the Christmas campaign, any action on the IT systems has to be postponed until the 8th of January. Please put the ticket on standby or close it, and we will reopen it later.

Many, many thanks to all of you.
On 23 Dec 2017, at 04:19, sankarshan <sankarshan> wrote:

Thank you for the detail in your response.

|On 23-Dec-2017 03:05, "Luis Rico" <lricomor> wrote:
|- They would also like to understand the potential impact of executing this manually on the nodes, so they can be confident it will not cause any issues, and to understand the performance impact.

What forms of impact to performance are they highlighting?

|- Could the script be hardened as much as possible to reduce the risk of executing it incorrectly, please?
|- They plan to expand the cluster again in 2018 with another 4 nodes, as their growth rate is 700-900 TB net per year. So next year they will have to run another rebalance over a potential 4 PB of data.
|- In that sense, any improvement to the script would be very helpful.
|- If, despite all efforts, the customer is still not comfortable, we will propose an engagement with Red Hat Consulting to make sure our engineers help directly with this process (Dani, our cloud & storage architect, in CC).

The script and documentation will help. However, if there is indeed a manner and form in which Consulting can be engaged, it would make more sense to deliver the customer experience through that means. It enables Red Hat to deliver expertise in a planned, structured and repeatable manner that can allay the concerns raised, especially since they have raised the flag of not feeling comfortable and confident executing the steps by themselves.
On 26 Dec 2017, at 04:57, Nithya Balachandran <nbalacha> wrote:

One potential impact is that ops from these scripts will have the same priority as those from any other client. The rebalance processes are treated as internal daemons, and any ops from them have lower priority on the bricks. This may cause an impact to client operations on the same volume. How often is this volume accessed?

The other impacts are:
- The rebalance status will not be available via the CLI. You will need to check for any errors using the log files.
- The log messages will be written to the log file for the mount point used to run the script instead of the regular rebalance log files.
- The customer needs to ensure that no add-brick/remove-brick operations are performed while the script is executing; they would need to stop the script and start it again. This, however, allows them to control when the script runs and which directory it processes, unlike a regular rebalance operation.

|- Could the script be hardened as much as possible to reduce the risk of executing it incorrectly, please?

I can try to do this. Is there anything in particular they would like the script to check for?

|- They plan to expand the cluster again in 2018 with another 4 nodes, as their growth rate is 700-900 TB net per year. So next year they will have to run another rebalance over a potential 4 PB of data.

Would this be at the beginning of the year or a later date?
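For context, here is a minimal sketch of what such a script-driven, per-directory rebalance might look like when run from a dedicated client mount. It assumes the DHT virtual xattrs distribute.fix.layout and trusted.distribute.migrate-data are what drives layout fixing and file migration; the mount point, log path and error handling are placeholders, and the script actually attached to the case remains the authoritative version.

#!/bin/bash
# Rough sketch only, under the assumptions stated above: walk a directory
# tree from a dedicated client mount and ask DHT to fix layouts and migrate
# files via its virtual xattrs. Verify the xattr names and paths against the
# script attached to the case before any production use.

MOUNT=/mnt/mediaset-rebal        # dedicated mount, not used by applications (placeholder)
TARGET="$1"                      # directory, relative to the mount, to process
LOG=/var/log/manual-rebalance.log

process_dir() {
    local dir="$1"

    # Recalculate the layout of this directory so the new bricks get hash ranges.
    setfattr -n distribute.fix.layout -v "yes" "$dir" \
        || echo "$(date '+%F %T') fix-layout failed: $dir" >> "$LOG"

    # Ask DHT to move each regular file in this directory to its hashed subvolume.
    find "$dir" -maxdepth 1 -type f | while read -r f; do
        setfattr -n trusted.distribute.migrate-data -v "force" "$f" \
            || echo "$(date '+%F %T') migrate failed: $f" >> "$LOG"
    done

    # Then descend into subdirectories one at a time.
    find "$dir" -maxdepth 1 -mindepth 1 -type d | while read -r sub; do
        process_dir "$sub"
    done
}

process_dir "$MOUNT/$TARGET"

Because each directory is handled explicitly, the customer could run it one top-level folder at a time and pause between folders, which matches the control they asked for.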
On 26 Dec 2017, at 05:46, Raghavendra Gowdappa <rgowdapp> wrote:

|One potential impact is that ops from these scripts will have the same priority as those from any other client. The rebalance processes are treated as internal daemons, and any ops from them have lower priority on the bricks. This *may* cause an impact to client operations on the same volume.

This can be fixed by passing the $rebalance-pid to the "client-pid" option while mounting the mount on which the scripts are run. Note that the mount.glusterfs script doesn't support this option today, but fixing that is trivial. Also note that this should be a special mount (used specifically for running the script) which shouldn't be used by applications.
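As a hedged illustration of this suggestion, the dedicated mount could be created by invoking the glusterfs client directly, since mount.glusterfs does not expose the option. The defrag pid value is an assumption here and must be taken from the GF_CLIENT_PID_DEFRAG definition in the installed glusterfs-3.8.4 sources:

# Hypothetical dedicated mount for the manual-rebalance script.
# -2 is an ASSUMED value for GF_CLIENT_PID_DEFRAG; confirm it against the
# installed glusterfs sources before use.
glusterfs --volfile-server=gs1p \
          --volfile-id=mediaset \
          --client-pid=-2 \
          /mnt/mediaset-rebal

With that option set, the bricks should treat I/O from this mount the same way they treat the internal rebalance daemon, so the script's operations would not compete with regular clients at normal priority.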
(In reply to Luis Rico from comment #18)
> On 23 Dec 2017, at 04:19, sankarshan <sankarshan> wrote:
>
> Thank you for the detail in your response.
>
> |On 23-Dec-2017 03:05, "Luis Rico" <lricomor> wrote:
> |- They would also like to understand the potential impact of executing this
> manually on the nodes, so they can be confident it will not cause any issues,
> and to understand the performance impact.
>
> What forms of impact to performance are they highlighting?

I'm referring to the potential performance impact of running the script in a production environment. We just need to clarify that.

> |- Could the script be hardened as much as possible to reduce the risk of
> executing it incorrectly, please?
> |- They plan to expand the cluster again in 2018 with another 4 nodes, as
> their growth rate is 700-900 TB net per year. So next year they will have to
> run another rebalance over a potential 4 PB of data.
> |- In that sense, any improvement to the script would be very helpful.
> |- If, despite all efforts, the customer is still not comfortable, we will
> propose an engagement with Red Hat Consulting to make sure our engineers
> help directly with this process (Dani, our cloud & storage architect, in CC).
>
> The script and documentation will help. However, if there is indeed a manner
> and form in which Consulting can be engaged, it would make more sense to
> deliver the customer experience through that means. It enables Red Hat to
> deliver expertise in a planned, structured and repeatable manner that can
> allay the concerns raised, especially since they have raised the flag of
> not feeling comfortable and confident executing the steps by themselves.

Agree.
(In reply to Luis Rico from comment #19)
> On 26 Dec 2017, at 04:57, Nithya Balachandran <nbalacha> wrote:
>
> One potential impact is that ops from these scripts will have the same
> priority as those from any other client. The rebalance processes are treated
> as internal daemons, and any ops from them have lower priority on the bricks.
> This may cause an impact to client operations on the same volume.

That is exactly the kind of information we should provide to the customer for their awareness.

> How often is this volume accessed?

At any time for reading archived videos, and at certain times of the day DIVA creates new folders with new videos plus metadata, controlled by the admin.

> The other impacts are:
> - The rebalance status will not be available via the CLI. You will need to
> check for any errors using the log files.
> - The log messages will be written to the log file for the mount point used
> to run the script instead of the regular rebalance log files.
> - The customer needs to ensure that no add-brick/remove-brick operations are
> performed while the script is executing; they would need to stop the script
> and start it again. This, however, allows them to control when the script
> runs and which directory it processes, unlike a regular rebalance operation.

This is excellent information to provide to the customer. That is exactly what they are requesting.

> |- Could the script be hardened as much as possible to reduce the risk of
> executing it incorrectly, please?
>
> I can try to do this. Is there anything in particular they would like the
> script to check for?

No idea on this, as we have no idea what could go wrong if it is not executed as expected.

> |- They plan to expand the cluster again in 2018 with another 4 nodes, as
> their growth rate is 700-900 TB net per year. So next year they will have to
> run another rebalance over a potential 4 PB of data.
>
> Would this be at the beginning of the year or a later date?

Later date.
@Nithya, No, I didn't realize that Damian had already requested the information. We're still waiting for the customer to respond. -Cal
Nithya,
Here's the mediaset info: http://collab-shell.usersys.redhat.com/01985137/var_lib_glusterd_vols_mediaset.tar/
I didn't see a recent gluster volume info output, so I'll ask for it again.
Nithya,
Nevermind. It was posted to a comment:

[root@gs5 ~]# gluster volume info

Volume Name: ctdbmeta
Type: Distributed-Replicate
Volume ID: 359d578b-6f27-47c4-b7da-785085f1fa6a
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gs1p:/datos/ctdb/data
Brick2: gs2p:/datos/ctdb/data
Brick3: gs3p:/datos/ctdb/data
Brick4: gs4p:/datos/ctdb/data
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
auto-delete: disable

Volume Name: mediaset
Type: Distribute
Volume ID: b02d5777-74c5-496d-a2b5-766cf5d439db
Status: Started
Snapshot Count: 0
Number of Bricks: 80
Transport-type: tcp
Bricks:
Brick1: gs1p:/datos/brick1/data
Brick2: gs1p:/datos/brick2/data
Brick3: gs1p:/datos/brick3/data
Brick4: gs1p:/datos/brick4/data
Brick5: gs1p:/datos/brick5/data
Brick6: gs2p:/datos/brick1/data
Brick7: gs2p:/datos/brick2/data
Brick8: gs2p:/datos/brick3/data
Brick9: gs2p:/datos/brick4/data
Brick10: gs2p:/datos/brick5/data
Brick11: gs3p:/datos/brick1/data
Brick12: gs3p:/datos/brick2/data
Brick13: gs3p:/datos/brick3/data
Brick14: gs3p:/datos/brick4/data
Brick15: gs3p:/datos/brick5/data
Brick16: gs4p:/datos/brick1/data
Brick17: gs4p:/datos/brick2/data
Brick18: gs4p:/datos/brick3/data
Brick19: gs4p:/datos/brick4/data
Brick20: gs4p:/datos/brick5/data
Brick21: gs5p:/datos/brick1/data
Brick22: gs5p:/datos/brick2/data
Brick23: gs5p:/datos/brick3/data
Brick24: gs5p:/datos/brick4/data
Brick25: gs5p:/datos/brick5/data
Brick26: gs6p:/datos/brick1/data
Brick27: gs6p:/datos/brick2/data
Brick28: gs6p:/datos/brick3/data
Brick29: gs6p:/datos/brick4/data
Brick30: gs6p:/datos/brick5/data
Brick31: gs7p:/datos/brick1/data
Brick32: gs7p:/datos/brick2/data
Brick33: gs7p:/datos/brick3/data
Brick34: gs7p:/datos/brick4/data
Brick35: gs7p:/datos/brick5/data
Brick36: gs8p:/datos/brick1/data
Brick37: gs8p:/datos/brick2/data
Brick38: gs8p:/datos/brick3/data
Brick39: gs8p:/datos/brick4/data
Brick40: gs8p:/datos/brick5/data
Brick41: gs9p:/datos/brick1/data
Brick42: gs9p:/datos/brick2/data
Brick43: gs9p:/datos/brick3/data
Brick44: gs9p:/datos/brick4/data
Brick45: gs9p:/datos/brick5/data
Brick46: gs10p:/datos/brick1/data
Brick47: gs10p:/datos/brick2/data
Brick48: gs10p:/datos/brick3/data
Brick49: gs10p:/datos/brick4/data
Brick50: gs10p:/datos/brick5/data
Brick51: gs11p:/datos/brick1/data
Brick52: gs11p:/datos/brick2/data
Brick53: gs11p:/datos/brick3/data
Brick54: gs11p:/datos/brick4/data
Brick55: gs11p:/datos/brick5/data
Brick56: gs12p:/datos/brick1/data
Brick57: gs12p:/datos/brick2/data
Brick58: gs12p:/datos/brick3/data
Brick59: gs12p:/datos/brick4/data
Brick60: gs12p:/datos/brick5/data
Brick61: gs13p:/datos/brick1/data
Brick62: gs13p:/datos/brick2/data
Brick63: gs13p:/datos/brick3/data
Brick64: gs13p:/datos/brick4/data
Brick65: gs13p:/datos/brick5/data
Brick66: gs14p:/datos/brick1/data
Brick67: gs14p:/datos/brick2/data
Brick68: gs14p:/datos/brick3/data
Brick69: gs14p:/datos/brick4/data
Brick70: gs14p:/datos/brick5/data
Brick71: gs15p:/datos/brick1/data
Brick72: gs15p:/datos/brick2/data
Brick73: gs15p:/datos/brick3/data
Brick74: gs15p:/datos/brick4/data
Brick75: gs15p:/datos/brick5/data
Brick76: gs16p:/datos/brick1/data
Brick77: gs16p:/datos/brick2/data
Brick78: gs16p:/datos/brick3/data
Brick79: gs16p:/datos/brick4/data
Brick80: gs16p:/datos/brick5/data
Options Reconfigured:
cluster.rebal-throttle: aggressive
performance.rda-cache-limit: 300Mb
server.allow-insecure: on
performance.readdir-ahead: on
performance.force-readdirp: on
storage.batch-fsync-delay-usec: 0
cluster.min-free-disk: 120GB
nfs.disable: on
performance.stat-prefetch: off
cluster.weighted-rebalance: on
transport.address-family: inet
cluster.lookup-optimize: off
performance.parallel-readdir: on
auto-delete: disable
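For reference, the rebalance-related tunables shown above can be inspected and adjusted with the standard gluster CLI. The values below are only examples of the syntax, not the settings that were recommended in the case:

# Check the current rebalance throttle on the volume.
gluster volume get mediaset cluster.rebal-throttle
# Dial it back if rebalance I/O competes too much with client traffic
# (accepted values are believed to be lazy, normal and aggressive).
gluster volume set mediaset cluster.rebal-throttle normal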
@Nithya, After being given the recommended settings, the customer closed the case with no further comments so I'm fine with the BZ being closed unless there are other cases to link to it. -Cal
(In reply to Cal Calhoun from comment #82)
> @Nithya, After being given the recommended settings, the customer closed the
> case with no further comments so I'm fine with the BZ being closed unless
> there are other cases to link to it.
> -Cal

Thank you, Cal. I am closing this BZ as WontFix. Let me know if this is alright.
@Nithya, Ack. thank you for your help with this. -Cal
Thank you ALL for your help improving and customizing a complex rebalance operation!