Bug 1024643 - quota+rebalance: bricks are down and rebalance results in stopped status
Summary: quota+rebalance: bricks are down and rebalance results in stopped status
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: quota
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Vijaikumar Mallikarjuna
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-10-30 07:13 UTC by Saurabh
Modified: 2023-09-14 01:52 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-17 08:51:28 UTC
Embargoed:



Description Saurabh 2013-10-30 07:13:05 UTC
Description of problem:

In this case I had a volume with quota enabled and limits set on the root of the volume and on the directories underneath it, with some I/O going on.
 
I/O was happening over an NFS mount, in two different directories.
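
For illustration, a minimal sketch of that client-side I/O, assuming the volume is mounted over Gluster NFS (NFSv3); the mount point, file names, sizes and counts are placeholders, while the directory names come from the quota list further down:

# mount the dist-rep volume over Gluster NFS (NFSv3); server address and mount point assumed
mount -t nfs -o vers=3 10.70.35.188:/dist-rep /mnt/dist-rep

# two writers, each creating files in its own directory, running in parallel
for i in $(seq 1 10000); do dd if=/dev/zero of=/mnt/dist-rep/qa1/dir1/f_$i bs=1M count=10; done &
for i in $(seq 1 10000); do dd if=/dev/zero of=/mnt/dist-rep/qa2/dir1/f_$i bs=1M count=10; done &
wait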

Invoked an add-brick and a rebalance following it.

Since the amount of data inside the volume was quite high, roughly 2.4 TB, the rebalance kept running for a long time.

But the rebalance ended with status "stopped" on all nodes of the cluster.

The cause of the rebalance stopping was the bricks going down.
But I could not find the reason why the bricks went down; this remains an open question!

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.36rhs

How reproducible:
Seen once, on one RHS cluster of 4 nodes.

Steps to Reproduce:
Rather than exact steps, I am providing the scenario in which I saw this issue:
1. Quota-enabled volume with a limit set on "/" and on the directories underneath.
2. Data already present inside the directories.
3. In two more directories, keep creating data via two different script executions, with both executions happening in parallel.

4. add-brick + rebalance (see the sketch below).
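
A minimal sketch of that sequence, assuming the volume name, brick paths and limit values shown in the outputs below; everything else is a placeholder:

# quota setup (a subset of the limits shown in the quota list below; the root limit was ~2.9TB)
gluster volume quota dist-rep enable
gluster volume quota dist-rep limit-usage /qa1 512GB
gluster volume quota dist-rep limit-usage /qa3 100GB

# while the parallel writers are still running, expand the volume and rebalance
gluster volume add-brick dist-rep 10.70.35.188:/rhs/brick1/d1r1-add 10.70.35.108:/rhs/brick1/d1r2-add
gluster volume rebalance dist-rep start
gluster volume rebalance dist-rep status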

Actual results:
[root@quota5 ~]# gluster volume rebalance dist-rep status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost           149234         1.6GB        477247             5           534        stopped         62895.00
                            10.70.35.191            87293       977.8MB        735462            20        184122        stopped         62894.00
                            10.70.35.108              447         8.7MB        712639             5           235        stopped         62894.00
                            10.70.35.144                0        0Bytes        712494             5             0        stopped         62892.00
       rhsauto004.lab.eng.blr.redhat.com                0        0Bytes        713601             5             0        stopped         62893.00

This was because bricks went down:
[root@quota6 ~]# gluster volume status
Status of volume: dist-rep
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.35.188:/rhs/brick1/d1r1			N/A	N	9376
Brick 10.70.35.108:/rhs/brick1/d1r2			N/A	N	9157
Brick 10.70.35.191:/rhs/brick1/d2r1			49152	Y	9151
Brick 10.70.35.144:/rhs/brick1/d2r2			49152	Y	9148
Brick 10.70.35.188:/rhs/brick1/d3r1			49153	Y	9387
Brick 10.70.35.108:/rhs/brick1/d3r2			49153	Y	9168
Brick 10.70.35.191:/rhs/brick1/d4r1			49153	Y	9162
Brick 10.70.35.144:/rhs/brick1/d4r2			49153	Y	9159
Brick 10.70.35.188:/rhs/brick1/d5r1			N/A	N	9398
Brick 10.70.35.108:/rhs/brick1/d5r2			49154	Y	9179
Brick 10.70.35.191:/rhs/brick1/d6r1			49154	Y	9173
Brick 10.70.35.144:/rhs/brick1/d6r2			49154	Y	9170
Brick 10.70.35.188:/rhs/brick1/d1r1-add			49155	Y	11217
Brick 10.70.35.108:/rhs/brick1/d1r2-add			49155	Y	10092
NFS Server on localhost					2049	Y	10104
Self-heal Daemon on localhost				N/A	Y	10111
Quota Daemon on localhost				N/A	Y	10118
NFS Server on 10.70.35.191				2049	Y	10191
Self-heal Daemon on 10.70.35.191			N/A	Y	10200
Quota Daemon on 10.70.35.191				N/A	Y	10205
NFS Server on 10.70.35.144				2049	Y	10086
Self-heal Daemon on 10.70.35.144			N/A	Y	10095
Quota Daemon on 10.70.35.144				N/A	Y	10100
NFS Server on rhsauto004.lab.eng.blr.redhat.com		2049	Y	15138
Self-heal Daemon on rhsauto004.lab.eng.blr.redhat.com	N/A	Y	15145
Quota Daemon on rhsauto004.lab.eng.blr.redhat.com	N/A	Y	15153
NFS Server on 10.70.35.188				2049	Y	11236
Self-heal Daemon on 10.70.35.188			N/A	Y	11243
Quota Daemon on 10.70.35.188				N/A	Y	11250
 
           Task                                      ID         Status
           ----                                      --         ------
      Rebalance    719dd624-b733-47c9-a487-296ec18544c7              2


But from the logs I could not make out why the bricks went down.
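
For reference, the default glusterfs log locations to check (the exact file names depend on the brick paths and volume name; the core-dump location depends on the system's core_pattern):

# brick logs on each node; check the last messages before the brick went offline
less /var/log/glusterfs/bricks/rhs-brick1-d1r1.log

# rebalance log for the volume on each node
less /var/log/glusterfs/dist-rep-rebalance.log

# look for core dumps from crashed brick processes
ls -l /core.* 2>/dev/null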



Expected results:
Bricks should not go down, and rebalance should finish with a healthy status rather than "stopped". Is this happening because the quota limit had already been reached on some of the directories?

Additional info:

[root@quota6 ~]# gluster volume quota dist-rep list 
                  Path                   Hard-limit Soft-limit   Used  Available
--------------------------------------------------------------------------------
/                                          2.9TB       80%       1.3TB   1.6TB
/qa1                                     512.0GB       80%     421.5GB  90.5GB
/qa2                                     512.0GB       80%     399.7GB 112.3GB
/qa3                                     100.0GB       80%      83.4GB  16.6GB
/qa4                                     100.0GB       80%      83.3GB  16.7GB
/qa1/dir1                                500.0GB       80%     337.8GB 162.2GB
/qa2/dir1                                500.0GB       80%     316.4GB 183.6GB
/qa5                                     500.0GB       80%     361.7GB 138.3GB

Comment 5 Vijaikumar Mallikarjuna 2015-11-17 08:51:28 UTC
Please file a new bug if this issue is still seen in 3.1.x.

Comment 7 Red Hat Bugzilla 2023-09-14 01:52:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

