Bug 1024638

Summary: quota+rebalance: C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-dist-rep11-client-6: cannot lookup the saved frame corresponding to xid
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Saurabh <saujain>
Component: distribute
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED DEFERRED
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: medium
Version: 2.1
CC: mzywusko, spalai, vagarwal, vbellur, vmallika
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1286197 1286205 (view as bug list)
Environment:
Last Closed: 2015-11-27 12:25:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1286197, 1286205    

Description Saurabh 2013-10-30 06:55:51 UTC
Description of problem:

On a 6x2 distributed-replicate volume, an add-brick operation was performed, followed by a rebalance.
The volume already contained around 100GB of data inside a directory.
There was no I/O running; only the rebalance was in progress.

Even so, the rebalance status reports "stopped" for one node.

Correspondingly, the rebalance log on that node shows:

[2013-10-29 14:36:51.255142] W [dht-common.c:5131:dht_notify] 0-dist-rep11-dht: Received CHILD_DOWN. Exiting
[2013-10-29 14:36:51.255150] I [dht-rebalance.c:1788:gf_defrag_stop] 0-: Received stop command on rebalance
However, gluster volume status shows all the bricks as up.

The log messages above are preceded in rebalance.log by this "critical" message:

C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-dist-rep11-client-6: cannot lookup the saved frame corresponding to xid (849056)


Version-Release number of selected component (if applicable):
glusterfs-3.4.0.36rhs

How reproducible:
seen on one setup

Steps to Reproduce:
1. Start with a volume that has the following settings:
Volume Name: dist-rep11
Type: Distributed-Replicate
Volume ID: 57a3fe8c-974c-4a8c-a5ec-6112e00ae9de
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/bricks/d1r1-n11
Brick2: 10.70.37.196:/rhs/bricks/d1r2-n11
Brick3: 10.70.37.138:/rhs/bricks/d2r1-n11
Brick4: 10.70.37.186:/rhs/bricks/d2r2-n11
Brick5: 10.70.37.58:/rhs/bricks/d3r1-n11
Brick6: 10.70.37.196:/rhs/bricks/d3r2-n11
Brick7: 10.70.37.138:/rhs/bricks/d4r1-n11
Brick8: 10.70.37.186:/rhs/bricks/d4r2-n11
Brick9: 10.70.37.58:/rhs/bricks/d5r1-n11
Brick10: 10.70.37.196:/rhs/bricks/d5r2-n11
Brick11: 10.70.37.138:/rhs/bricks/d6r1-n11
Brick12: 10.70.37.186:/rhs/bricks/d6r2-n11
Brick13: 10.70.37.58:/rhs/bricks/d1r1-n11-add
Brick14: 10.70.37.196:/rhs/bricks/d1r2-n11-add
Options Reconfigured:
features.quota-deem-statfs: on
features.quota: on
[root@nfs1 ~]# gluster volume quota dist-rep11 list
                  Path                   Hard-limit Soft-limit   Used  Available
--------------------------------------------------------------------------------
/                                        300.0GB       80%     106.6GB 193.4GB
/dir1                                    100.0GB       80%     100.0GB  0Bytes


2. Perform an add-brick and start a rebalance (see the command sketch after this list).
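
A minimal sketch of the likely reproduction commands, assuming the volume name, quota limits, and added brick paths shown in the volume info above; the exact invocations were not captured in the original report:

# quota configuration matching the "quota list" output above
gluster volume quota dist-rep11 enable
gluster volume quota dist-rep11 limit-usage / 300GB
gluster volume quota dist-rep11 limit-usage /dir1 100GB
gluster volume set dist-rep11 features.quota-deem-statfs on

# add one replica pair (bricks 13 and 14 in the volume info), then rebalance
gluster volume add-brick dist-rep11 10.70.37.58:/rhs/bricks/d1r1-n11-add 10.70.37.196:/rhs/bricks/d1r2-n11-add
gluster volume rebalance dist-rep11 start
gluster volume rebalance dist-rep11 status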


Actual results:
Current rebalance status:
[root@nfs1 ~]# gluster volume rebalance dist-rep11 status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost            52501       603.2MB        324071             0         59384    in progress         64108.00
                            10.70.37.186               96       742.3KB        459200             0            33    in progress         64107.00
                            10.70.37.138            46183       517.3MB        216020             0         26566    in progress         64107.00
                            10.70.37.196                0        0Bytes        302926             5             0        stopped         11098.00
volume rebalance: dist-rep11: success: 

Rebalance log from 10.70.37.196:

[2013-10-29 14:36:51.255045] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-dist-rep11-client-6: cannot lookup the saved frame corresponding to xid (849056)
[2013-10-29 14:36:51.255085] W [socket.c:1804:__socket_read_reply] 0-dist-rep11-client-6: notify for event MAP_XID failed
[2013-10-29 14:36:51.255121] I [client.c:2103:client_rpc_notify] 0-dist-rep11-client-6: disconnected from 10.70.37.138:49159. Client process will keep trying to connect to glusterd until brick's port is available. 
[2013-10-29 14:36:51.255134] E [afr-common.c:3919:afr_notify] 0-dist-rep11-replicate-3: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-10-29 14:36:51.255142] W [dht-common.c:5131:dht_notify] 0-dist-rep11-dht: Received CHILD_DOWN. Exiting
[2013-10-29 14:36:51.255150] I [dht-rebalance.c:1788:gf_defrag_stop] 0-: Received stop command on rebalance
[2013-10-29 14:36:51.315190] C [rpc-clnt.c:460:rpc_clnt_fill_request_info] 0-dist-rep11-client-10: cannot lookup the saved frame corresponding to xid (843570)
[2013-10-29 14:36:51.315233] W [socket.c:1804:__socket_read_reply] 0-dist-rep11-client-10: notify for event MAP_XID failed
[2013-10-29 14:36:51.315298] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x3720e0e0f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x3720e0dc33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x3720e0db4e]))) 0-dist-rep11-client-10: forced unwinding frame type(GlusterFS 3.3) op(GETXATTR(18)) called at 2013-10-29 14:34:53.989042 (xid=0x843658x)
[2013-10-29 14:36:51.315314] W [client-rpc-fops.c:1103:client3_3_getxattr_cbk] 0-dist-rep11-client-10: remote operation failed: Transport endpoint is not connected. Path: /linux_untar1382971802/linux-2.6.31.1/arch/sparc/mm/tlb.c (b8884399-daf9-422e-a462-ae88469ba227). Key: trusted.glusterfs.node-uuid
[2013-10-29 14:36:51.315339] W [afr-inode-read.c:843:afr_getxattr_node_uuid_cbk] 0-dist-rep11-replicate-5: op_ret (-1): Re-querying afr-child (1/2)
[2013-10-29 14:36:51.315425] I [client.c:2103:client_rpc_notify] 0-dist-rep11-client-10: disconnected from 10.70.37.138:49160. Client process will keep trying to connect to glusterd until brick's port is available. 
[2013-10-29 14:36:51.316421] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1/arch/sparc/mm
[2013-10-29 14:36:51.316636] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1/arch/sparc
[2013-10-29 14:36:51.316867] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1/arch
[2013-10-29 14:36:51.317071] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802/linux-2.6.31.1
[2013-10-29 14:36:51.317263] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-dist-rep11-dht: Fix layout failed for /linux_untar1382971802
[2013-10-29 14:36:51.317524] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 11098.00 secs
[2013-10-29 14:36:51.317538] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 302926, failures: 5, skipped: 0
[2013-10-29 14:36:51.317967] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3a2f6e894d] (-->/lib64/libpthread.so.0() [0x3a2fe07851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down
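
Note that dist-rep11-client-6 and dist-rep11-client-10 report disconnects from 10.70.37.138:49159 and 10.70.37.138:49160, yet the volume status in the additional info below lists the bricks on those ports (d4r1-n11 and d6r1-n11) as online. A minimal sketch of that cross-check, assuming the default rebalance log location for this volume:

# pull the critical/disconnect messages from the rebalance log on 10.70.37.196
grep -E "cannot lookup the saved frame|disconnected from" /var/log/glusterfs/dist-rep11-rebalance.log

# confirm the bricks the rebalance client dropped are still reported online
gluster volume status dist-rep11 | grep 10.70.37.138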


Expected results:
It is not clear why the rebalance is displayed as "stopped" on this node. If the rebalance has finished, the status should read "completed".

Additional info:

Status of volume: dist-rep11
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.37.58:/rhs/bricks/d1r1-n11			49155	Y	15848
Brick 10.70.37.196:/rhs/bricks/d1r2-n11			49155	Y	9080
Brick 10.70.37.138:/rhs/bricks/d2r1-n11			49158	Y	2589
Brick 10.70.37.186:/rhs/bricks/d2r2-n11			49155	Y	9085
Brick 10.70.37.58:/rhs/bricks/d3r1-n11			49156	Y	15859
Brick 10.70.37.196:/rhs/bricks/d3r2-n11			49156	Y	9091
Brick 10.70.37.138:/rhs/bricks/d4r1-n11			49159	Y	2596
Brick 10.70.37.186:/rhs/bricks/d4r2-n11			49156	Y	9096
Brick 10.70.37.58:/rhs/bricks/d5r1-n11			49157	Y	15870
Brick 10.70.37.196:/rhs/bricks/d5r2-n11			49157	Y	9102
Brick 10.70.37.138:/rhs/bricks/d6r1-n11			49160	Y	2601
Brick 10.70.37.186:/rhs/bricks/d6r2-n11			49157	Y	9107
Brick 10.70.37.58:/rhs/bricks/d1r1-n11-add		49158	Y	2238
Brick 10.70.37.196:/rhs/bricks/d1r2-n11-add		49158	Y	27544
NFS Server on localhost					2049	Y	20286
Self-heal Daemon on localhost				N/A	Y	20293
Quota Daemon on localhost				N/A	Y	20300
NFS Server on 10.70.37.58				2049	Y	2250
Self-heal Daemon on 10.70.37.58				N/A	Y	2257
Quota Daemon on 10.70.37.58				N/A	Y	2277
NFS Server on 10.70.37.196				2049	Y	27557
Self-heal Daemon on 10.70.37.196			N/A	Y	27564
Quota Daemon on 10.70.37.196				N/A	Y	27579
NFS Server on 10.70.37.186				2049	Y	27304
Self-heal Daemon on 10.70.37.186			N/A	Y	27311
Quota Daemon on 10.70.37.186				N/A	Y	27318
 
           Task                                      ID         Status
           ----                                      --         ------
      Rebalance    08fcf21d-e0e2-4bc5-a511-f7e1601c9b90              1

Comment 3 Vijaikumar Mallikarjuna 2015-01-09 10:37:56 UTC
This seems to be an issue in rebalance. Changing the component to DHT.