Bug 862618

Summary: Mismatch in failure counts between rebalance logs and status
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED WORKSFORME
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: low
Priority: low
Version: unspecified
CC: grajaiya, nbalacha, rgowdapp, rhs-bugs, rwheeler, smohan, vbellur
Target Milestone: ---
Target Release: ---
Keywords: ZStream
Hardware: x86_64
OS: Linux
Whiteboard: triaged, dht-rebalance-usability
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2017-08-29 06:08:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Attachments:
  rebalance fail counts (flags: none)
  glusterd logs (flags: none)

Description shylesh 2012-10-03 11:53:40 UTC
Created attachment 620777
rebalance fail counts

Description of problem:
There is a mismatch between the failure count reported by the rebalance status command and the count recorded in the rebalance log.

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv4 glusterfs]# rpm -qa | grep gluster
glusterfs-fuse-3.3.0rhsvirt1-6.el6rhs.x86_64
vdsm-gluster-4.9.6-14.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-server-3.3.0rhsvirt1-6.el6rhs.x86_64
glusterfs-rdma-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0rhsvirt1-6.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch


How reproducible:


Steps to Reproduce:
1. Created a single-brick distribute volume.
2. Stored some VM images on the volume.
3. Added a new brick and started rebalance.
4. While rebalance was running, restarted glusterd on one of the nodes.
5. On that node, the rebalance status command shows a failure count of 1 (see the command sketch after this list).
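
A minimal sketch of the reproduction commands, assuming the host and brick names from this report (the mount point and image paths are illustrative):

# 1. Create and start a single-brick distribute volume
gluster volume create rebal rhs-gp-srv4.lab.eng.blr.redhat.com:/rebal
gluster volume start rebal

# 2. Place some VM images on a FUSE mount of the volume
mkdir -p /mnt/rebal
mount -t glusterfs rhs-gp-srv4.lab.eng.blr.redhat.com:/rebal /mnt/rebal
cp /var/lib/images/*.img /mnt/rebal/

# 3. Add a second brick and start rebalance
gluster volume add-brick rebal rhs-gp-srv11.lab.eng.blr.redhat.com:/rebal
gluster volume rebalance rebal start

# 4. While rebalance is running, restart glusterd on one of the nodes
service glusterd restart

# 5. Check the failure count reported on that node
gluster volume rebalance rebal status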

Actual results:

The status output reports a failure count of 1, but the rebalance log on the same node reports 0 failures.

Additional info:
Volume Name: rebal
Type: Distribute
Volume ID: 0952e193-a12c-420a-b752-a77c54b3bf98
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv4.lab.eng.blr.redhat.com:/rebal
Brick2: rhs-gp-srv11.lab.eng.blr.redhat.com:/rebal
Options Reconfigured:
cluster.eager-lock: enable
storage.linux-aio: off
performance.read-ahead: disable
performance.stat-prefetch: disable
performance.io-cache: disable
performance.quick-read: disable
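
For reference, the configuration above is the volume-info output; it can be re-checked on any peer with:

gluster volume info rebal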



[root@rhs-gp-srv4 glusterfs]# gluster v rebalance rebal status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost               11 128849050259           42            1      completed
     rhs-gp-srv12.lab.eng.blr.redhat.com                0            0           32            0      completed
     rhs-gp-srv11.lab.eng.blr.redhat.com                0            0           32            0      completed
     rhs-gp-srv15.lab.eng.blr.redhat.com                0            0           32            0      completed




whereas the log on the peer where status reports the failure says:
====================================
[2012-10-03 07:02:36.639436] I [dht-rebalance.c:1063:gf_defrag_migrate_data] 0-rebal-dht: migrate data called on /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/master/vms/59a94ec0-bdf2-4df5-ade0-0812e1ec6ecd
[2012-10-03 07:02:36.642296] I [dht-rebalance.c:647:dht_migrate_file] 0-rebal-dht: /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/master/vms/59a94ec0-bdf2-4df5-ade0-0812e1ec6ecd/59a94ec0-bdf2-4df5-ade0-0812e1ec6ecd.ovf: attempting to move from rebal-client-0 to rebal-client-1
[2012-10-03 07:02:36.647204] I [dht-rebalance.c:856:dht_migrate_file] 0-rebal-dht: completed migration of /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/master/vms/59a94ec0-bdf2-4df5-ade0-0812e1ec6ecd/59a94ec0-bdf2-4df5-ade0-0812e1ec6ecd.ovf from subvolume rebal-client-0 to rebal-client-1
[2012-10-03 07:02:36.652056] I [dht-common.c:2337:dht_setxattr] 0-rebal-dht: fixing the layout of /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/master/tasks
[2012-10-03 07:02:36.652578] I [dht-rebalance.c:1063:gf_defrag_migrate_data] 0-rebal-dht: migrate data called on /89d20fdd-e22f-4ee5-92a5-2e6540cbcae5/master/tasks
[2012-10-03 07:02:36.657795] I [dht-rebalance.c:1619:gf_defrag_status_get] 0-glusterfs: Rebalance is completed
[2012-10-03 07:02:36.657823] I [dht-rebalance.c:1622:gf_defrag_status_get] 0-glusterfs: Files migrated: 11, size: 128849050259, lookups: 42, failures: 0
[2012-10-03 07:02:36.658403] W [glusterfsd.c:906:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3910ae5ccd] (-->/lib64/libpthread.so.0() [0x39112077f1] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405d2d]))) 0-: received signum (15), shutting down
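
A quick way to put the two counters side by side (a sketch; the log path assumes the usual /var/log/glusterfs/<volname>-rebalance.log naming, which may differ per installation):

# Failure count as recorded by the rebalance process in its log
grep 'failures:' /var/log/glusterfs/rebal-rebalance.log | tail -1

# Failure count as aggregated and shown by the CLI
gluster volume rebalance rebal status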

Comment 2 shishir gowda 2012-10-04 05:54:21 UTC
Not able to reproduce the issue.
Please update the bug if you hit the issue again.
Also, attach the glusterd logs along with the CLI logs.

Comment 3 shylesh 2012-10-04 09:36:49 UTC
Created attachment 621494
glusterd logs

Comment 4 shishir gowda 2012-10-05 07:24:51 UTC
Not able to reproduce the issue, and the bug concerns an incorrect statistic reported for rebalance. Reducing the severity and priority.