Bug 1331628

Summary: [Tiering]: detach tier operation fails
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: krishnaram Karthick <kramdoss>
Component: tier Assignee: hari gowtham <hgowtham>
Status: CLOSED WORKSFORME QA Contact: krishnaram Karthick <kramdoss>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: amukherj, dlambrig, hgowtham, kramdoss, nbalacha, rcyriac, rhinduja, rhs-bugs, rkavunga, sankarshan, storage-qa-internal
Target Milestone: --- Keywords: Reopened, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-07-16 03:11:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1332957    

Description krishnaram Karthick 2016-04-29 05:38:55 UTC
Description of problem:
The detach tier operation on a tiered volume failed. Although a ganesha mount was used for the testing, it shouldn't be the cause, as no I/O was performed during detach tier other than tier migrations.

Volume Name: ganesha-tier
Type: Tier
Volume ID: 5dc054c0-b15c-49dc-9494-38bd04d05819
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick1: 10.70.47.156:/bricks/brick1/l1
Brick2: 10.70.47.156:/bricks/brick0/l1
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick3: 10.70.47.192:/bricks/brick0/l1
Brick4: 10.70.47.178:/bricks/brick0/l1
Brick5: 10.70.47.160:/bricks/brick0/l1
Brick6: 10.70.47.192:/bricks/brick1/l1
Brick7: 10.70.47.178:/bricks/brick1/l1
Brick8: 10.70.47.160:/bricks/brick1/l1
Options Reconfigured:
cluster.watermark-hi: 10
cluster.watermark-low: 5
cluster.tier-mode: cache
features.ctr-enabled: on
features.inode-quota: off
features.quota: off
ganesha.enable: on
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable



[root@dhcp47-156 gluster]# gluster v status ganesha-tier
Status of volume: ganesha-tier
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.47.156:/bricks/brick1/l1        49153     0          Y       19944
Brick 10.70.47.156:/bricks/brick0/l1        49152     0          Y       19924
Cold Bricks:
Brick 10.70.47.192:/bricks/brick0/l1        49153     0          Y       8824 
Brick 10.70.47.178:/bricks/brick0/l1        49152     0          Y       4509 
Brick 10.70.47.160:/bricks/brick0/l1        49152     0          Y       692  
Brick 10.70.47.192:/bricks/brick1/l1        49154     0          Y       8869 
Brick 10.70.47.178:/bricks/brick1/l1        49153     0          Y       4528 
Brick 10.70.47.160:/bricks/brick1/l1        49153     0          Y       766  
Self-heal Daemon on localhost               N/A       N/A        Y       20019
Self-heal Daemon on 10.70.47.178            N/A       N/A        Y       16055
Self-heal Daemon on 10.70.47.192            N/A       N/A        Y       8247 
Self-heal Daemon on 10.70.47.160            N/A       N/A        Y       14582
 
Task Status of Volume ganesha-tier
------------------------------------------------------------------------------
Task                 : Detach tier         
ID                   : 91018093-6969-4406-b494-9007a661d167
Status               : failed              
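
The failed task can also be picked out of captured `gluster v status` output by a script. This is a minimal sketch, assuming the "Task Status of Volume" section has the layout shown above; `parse_task_status` is a hypothetical helper written for this report, not part of any gluster tooling.

```python
import re

# Sample copied from the `gluster v status ganesha-tier` output above.
STATUS_OUTPUT = """\
Task Status of Volume ganesha-tier
------------------------------------------------------------------------------
Task                 : Detach tier
ID                   : 91018093-6969-4406-b494-9007a661d167
Status               : failed
"""

# Matches "Task : ...", "ID : ..." and "Status : ..." rows, ignoring padding.
FIELD_RE = re.compile(r"^(Task|ID|Status)\s*:\s*(.*?)\s*$")


def parse_task_status(text):
    """Return one dict per task found after the 'Task Status' heading."""
    tasks = []
    in_task_section = False
    for line in text.splitlines():
        if line.startswith("Task Status of Volume"):
            in_task_section = True
            continue
        if not in_task_section:
            continue
        match = FIELD_RE.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "Task":
            tasks.append({})  # a "Task" row opens a new task block
        if tasks:
            tasks[-1][key] = value
    return tasks


tasks = parse_task_status(STATUS_OUTPUT)
failed = [t for t in tasks if t.get("Status") == "failed"]
for task in failed:
    print(f"{task['Task']} ({task['ID']}) failed")
```

Rows before the "Task Status of Volume" heading are skipped, so the brick and self-heal daemon table does not interfere with the match.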


The following messages are seen in the tier logs:

[2016-04-29 10:01:32.676177] W [glusterfsd.c:1251:cleanup_and_exit] (-->/lib64/libglusterfs.so.0(synctask_wrap+0x12) [0x7f35a8be0fe2] -->/usr/sbin/glusterfs(glusterfs_handle_terminate+0x15) [0x7f35a9075415] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f35a9072739] ) 0-: received signum (15), shutting down
[2016-04-29 10:01:34.772142] I [timer.c:48:gf_timer_call_after] (-->/lib64/libglusterfs.so.0(gf_timer_proc+0x11b) [0x7f35a8bbe98b] -->/lib64/libgfrpc.so.0(+0xff83) [0x7f35a896cf83] -->/lib64/libglusterfs.so.0(gf_timer_call_after+0x166) [0x7f35a8bbe6c6] ) 0-timer: ctx cleanup started
[2016-04-29 10:01:34.772196] W [rpc-clnt.c:170:call_bail] 0-glusterfs: Cannot create bailout timer for 127.0.0.1:24007 	


[2016-04-29 10:01:44.314059] E [MSGID: 109037] [tier.c:694:tier_migrate_using_query_file] 0-ganesha-tier-tier-dht: Failed to lookup file omap2420-n8x0-common.dtsi
 [Invalid argument]
[2016-04-29 10:00:23.626146] E [MSGID: 109037] [tier.c:694:tier_migrate_using_query_file] 0-ganesha-tier-tier-dht: Failed to lookup file snvs-pwrkey.txt
 [Invalid argument]
[2016-04-29 10:00:23.634539] E [MSGID: 109037] [tier.c:694:tier_migrate_using_query_file] 0-ganesha-tier-tier-dht: Failed to lookup file map.h
 [Invalid argument]
[2016-04-29 10:00:23.784007] E [MSGID: 109037] [tier.c:694:tier_migrate_using_query_file] 0-ganesha-tier-tier-dht: Failed to lookup file gpio.txt
 [Invalid argument]
[2016-04-29 10:00:23.920081] E [MSGID: 109037] [tier.c:694:tier_migrate_using_query_file] 0-ganesha-tier-tier-dht: Failed to lookup file file-957
 [Invalid argument]
[2016-04-29 10:01:48.799826] W [glusterfsd.c:1251:cleanup_and_exit] (-->/lib64/libglusterfs.so.0(synctask_wrap+0x12) [0x7fb760552fe2] -->/usr/sbin/glusterfs(glusterfs_handle_terminate+0x15) [0x7fb7609e7415] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7fb7609e4739] ) 0-: received signum (15), shutting down
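
To summarize a longer tier log, the entries can be parsed and the failed lookups collected. This is a rough sketch, assuming the line layout seen in the excerpt above (timestamp, level, optional MSGID, source location, message) and treating a line such as ` [Invalid argument]` as a continuation of the previous entry; `parse_log` and the regular expression are illustrative, not an official log format definition.

```python
import re
from collections import Counter

# A few lines copied from the tier log excerpt above.
LOG_SAMPLE = """\
[2016-04-29 10:01:44.314059] E [MSGID: 109037] [tier.c:694:tier_migrate_using_query_file] 0-ganesha-tier-tier-dht: Failed to lookup file omap2420-n8x0-common.dtsi
 [Invalid argument]
[2016-04-29 10:00:23.626146] E [MSGID: 109037] [tier.c:694:tier_migrate_using_query_file] 0-ganesha-tier-tier-dht: Failed to lookup file snvs-pwrkey.txt
 [Invalid argument]
[2016-04-29 10:01:34.772196] W [rpc-clnt.c:170:call_bail] 0-glusterfs: Cannot create bailout timer for 127.0.0.1:24007
"""

ENTRY_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"          # MSGID is absent on some lines
    r"\[(?P<src>[^:\]]+):(?P<lineno>\d+):(?P<func>[^\]]+)\] "
    r"(?P<msg>.*)$"
)
LOOKUP_RE = re.compile(r"Failed to lookup file (\S+)")


def parse_log(text):
    """Parse log text into dicts; fold continuation lines into the prior entry."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries and line.strip():
            entries[-1]["msg"] += " " + line.strip()
    return entries


entries = parse_log(LOG_SAMPLE)

# Count error-level entries per MSGID.
errors_by_msgid = Counter(e["msgid"] for e in entries if e["level"] == "E")

# Collect the file names that failed lookup during migration.
failed_lookups = []
for entry in entries:
    found = LOOKUP_RE.search(entry["msg"])
    if found:
        failed_lookups.append(found.group(1))

print(errors_by_msgid)
print(failed_lookups)
```

Entries are kept in file order rather than sorted by timestamp, since the excerpt itself is not in chronological order.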


Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-2.el7rhgs.x86_64

How reproducible:
frequently

Steps to Reproduce:
1. Create a dispersed volume.
2. Create a bunch of files and directories, and untar a kernel source tree.
3. While step 2 is in progress, attach a tier.
4. Allow files to be promoted and new files to be written to the hot tier.
5. Reduce the watermark levels so that the high watermark is hit.
6. Detach the tier.
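
The gluster side of these steps can be sketched as a dry run that only prints the commands. This is an approximation, not the exact commands used in the test run: the volume name, brick paths, and watermark values are taken from the volume info above, the 4+2 layout is expressed as `disperse 6 redundancy 2`, and the low watermark is set before the high one on the assumption that gluster validates low < hi. With several bricks on the same host, gluster may ask for extra confirmation that is not shown here. The file workload (steps 2 and 4) is left out, and `repro_commands` is a hypothetical helper.

```python
import shlex

VOLUME = "ganesha-tier"
# Brick lists copied from the volume info in this report.
COLD_BRICKS = [
    "10.70.47.192:/bricks/brick0/l1",
    "10.70.47.178:/bricks/brick0/l1",
    "10.70.47.160:/bricks/brick0/l1",
    "10.70.47.192:/bricks/brick1/l1",
    "10.70.47.178:/bricks/brick1/l1",
    "10.70.47.160:/bricks/brick1/l1",
]
HOT_BRICKS = [
    "10.70.47.156:/bricks/brick1/l1",
    "10.70.47.156:/bricks/brick0/l1",
]


def repro_commands(volume, cold_bricks, hot_bricks):
    """Return the gluster CLI calls for steps 1, 3, 5 and 6 as argv lists."""
    return [
        # Step 1: create and start a 1 x (4 + 2) dispersed volume.
        ["gluster", "volume", "create", volume,
         "disperse", "6", "redundancy", "2", *cold_bricks],
        ["gluster", "volume", "start", volume],
        # Step 3: attach a replica-2 hot tier while the workload is running.
        ["gluster", "volume", "tier", volume, "attach",
         "replica", "2", *hot_bricks],
        # Step 5: lower the watermarks (low first, then high).
        ["gluster", "volume", "set", volume, "cluster.watermark-low", "5"],
        ["gluster", "volume", "set", volume, "cluster.watermark-hi", "10"],
        # Step 6: start the detach and poll its status.
        ["gluster", "volume", "tier", volume, "detach", "start"],
        ["gluster", "volume", "tier", volume, "detach", "status"],
    ]


commands = repro_commands(VOLUME, COLD_BRICKS, HOT_BRICKS)
for argv in commands:
    # Dry run only: print each command instead of executing it.
    print(shlex.join(argv))
```

Nothing is executed; the printed lines are meant to be reviewed and adapted to the cluster at hand before being run on a node.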

Actual results:
Detach tier starts but eventually fails.

Expected results:
detach tier should succeed.

Additional info:
sosreports will be attached shortly.

Comment 2 krishnaram Karthick 2016-04-29 06:06:29 UTC
In the test run, detach tier was executed after fix layout was complete.

Comment 6 Joseph Elwin Fernandes 2016-05-06 13:13:10 UTC

*** This bug has been marked as a duplicate of bug 1332957 ***

Comment 7 Joseph Elwin Fernandes 2016-05-06 13:14:41 UTC
*** Bug 1333804 has been marked as a duplicate of this bug. ***

Comment 8 hari gowtham 2016-05-09 09:54:20 UTC
Partial RCA: a GFID mismatch was found during the detach operation.

Comment 9 hari gowtham 2016-05-09 11:41:44 UTC
The error messages are:
[2016-05-01 08:07:31.938737] W [MSGID: 122019] [ec-helpers.c:361:ec_loc_gfid_check] 0-ganesha-tier-disperse-0: Mismatching GFID's in loc
[2016-05-01 08:07:31.938886] E [MSGID: 109023] [dht-rebalance.c:2353:gf_defrag_get_entry] 0-ganesha-tier-tier-dht: Migrate file failed:/linux-kernel/linux-4.5.2/Kbuild lookup failed
[2016-05-01 08:07:31.938945] I [dht-rebalance.c:2672:gf_defrag_process_dir] 0-DHT: Found critical error from gf_defrag_get_entry
[2016-05-01 08:07:31.939203] E [MSGID: 109111] [dht-rebalance.c:2943:gf_defrag_fix_layout] 0-ganesha-tier-tier-dht: gf_defrag_process_dir failed for directory: /linux-kernel/linux-4.5.2
[2016-05-01 08:07:31.939245] E [MSGID: 109016] [dht-rebalance.c:3120:gf_defrag_fix_layout] 0-ganesha-tier-tier-dht: Fix layout failed for /linux-kernel/linux-4.5.2
[2016-05-01 08:07:31.939264] E [MSGID: 109016] [dht-rebalance.c:3120:gf_defrag_fix_layout] 0-ganesha-tier-tier-dht: Fix layout failed for /linux-kernel

Comment 10 Dan Lambright 2016-05-16 15:32:39 UTC
Discussed in scrum; QE is working on steps to make this reproducible.

Comment 14 Dan Lambright 2016-07-16 03:11:01 UTC
Closing per discussion with QE.

Nithya: We could not reproduce this. Karthick, can we close this as WorksForMe and
reopen if seen again?

Karthick: Yes, this issue wasn't seen in later stages of 3.1.3.