Bug 1276245 - [Tier]: Stopping and Starting tier volume triggers fixing layout which fails on local host
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: tier
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.2
Assignee: Mohammed Rafi KC
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Duplicates: 1229270
Depends On: 1288051 1294594 1315659 1318498
Blocks: 1260783 1260923 1284372 1285335
 
Reported: 2015-10-29 08:48 UTC by Rahul Hinduja
Modified: 2018-01-25 00:37 UTC

Fixed In Version: glusterfs-3.7.5-8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1284372
Environment:
Last Closed: 2016-03-01 05:47:55 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2016:0193
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Gluster Storage 3.1 update 2
Last Updated: 2016-03-01 10:20:36 UTC

Description Rahul Hinduja 2015-10-29 08:48:08 UTC
Description of problem:
=======================

The longevity setup consists of a hot tier {6x2} and a cold tier {2x(4+2)}. Stopping the volume and starting it again triggers a fix-layout, which eventually fails on the local host.

Node 1, which was used to stop and start the volume:

Tier logs:
==========

[2015-10-29 08:17:06.988839] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /
[2015-10-29 08:17:06.988865] W [MSGID: 109016] [dht-selfheal.c:1487:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: Layout fix failed: 1 subvolume(s) are down. Skipping fix layout.
[2015-10-29 08:17:06.989085] E [MSGID: 109026] [dht-rebalance.c:2992:gf_defrag_start_crawl] 0-tiervolume-tier-dht: fix layout on / failed
[2015-10-29 08:17:06.989127] I [MSGID: 109028] [dht-rebalance.c:3327:gf_defrag_status_get] 0-tiervolume-tier-dht: Rebalance is failed. Time taken is 0.00 secs


All the other nodes log the following:
======================================

[2015-10-29 08:41:58.665874] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d7312%%DIIOR3J5QX lookup failed
[2015-10-29 08:41:58.679069] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d72ec%%J3OIWCRM3T lookup failed
[2015-10-29 08:41:58.687659] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d72ed%%NO1PDGAEIH lookup failed
[2015-10-29 08:41:58.698332] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d7315%%ZPOFELKK0S lookup failed
[2015-10-29 08:41:58.706011] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d732a%%JZAJN3YPML lookup failed
[2015-10-29 08:41:58.716345] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d72e9%%NBAS57MF0H lookup failed
[2015-10-29 08:41:58.724244] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d72ed%%NL8UAU9PYM lookup failed
[2015-10-29 08:41:58.735774] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d733a%%RJS5J2ETDI lookup failed
[2015-10-29 08:41:58.743622] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d72eb%%TLH672U96B lookup failed
[2015-10-29 08:41:58.749406] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d7322%%04BKVRVH3U lookup failed
[2015-10-29 08:41:58.760800] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d732c%%TV3DZHKDER lookup failed
[2015-10-29 08:41:58.764740] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread0/level01/562d733b%%1F737UYTWG lookup failed
[2015-10-29 08:41:58.789417] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread0/level02
[2015-10-29 08:41:58.789460] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:41:58.789472] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:41:58.841486] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread0/level02/level12
[2015-10-29 08:41:58.841531] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:41:58.841542] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:41:58.930999] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread0/level02/level12/level22
[2015-10-29 08:41:58.931045] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:41:58.931056] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:41:58.968563] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread0/level02/level12/level22/level32
[2015-10-29 08:41:58.968608] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:41:58.968627] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:41:59.009282] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread0/level02/level12/level22/level32/level42
[2015-10-29 08:41:59.009340] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:41:59.009371] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:41:59.089329] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread0/level02/level12/level22/level32/level42/level52


Number of times these messages were logged:

[root@dhcp37-133 glusterfs]# grep "dht-common.c:3810:dht_setxattr" tiervolume-tier.log  | wc -l 
16545
[root@dhcp37-133 glusterfs]# grep "gf_fix_layout_tier_attach_lookup" tiervolume-tier.log  | wc -l 
24178
[root@dhcp37-133 glusterfs]# 
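
For a per-function breakdown of the tier log in a single pass, rather than one grep per pattern, something like the following could be used. This is only a minimal Python sketch: it assumes every log line carries a [file.c:line:function] field as in the excerpts above, and the helper name count_by_function is hypothetical (it is not part of glusterfs).

```python
import re
from collections import Counter

# Matches the "[file.c:line:function]" field of a glusterfs log line,
# e.g. "[dht-common.c:3810:dht_setxattr]".
LOCATION = re.compile(r"\[([\w.-]+\.c):(\d+):(\w+)\]")


def count_by_function(lines):
    """Return a Counter mapping function name -> number of log lines.

    Hypothetical helper for illustration; lines without a location
    field are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LOCATION.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts


def count_log_file(path):
    """Count messages per function in the log file at the given path."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_by_function(handle)
```

Calling count_log_file on tiervolume-tier.log would give the dht_setxattr and gf_fix_layout_tier_attach_lookup totals together, assuming the log is readable from the current directory as in the session above.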

Even after 15 minutes, the same lookup and fix-layout messages keep being logged.

[2015-10-29 08:44:09.023724] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:44:09.093113] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread2/level07/level17/level27/level37/level47/level57/level67/level77/level87/level97
[2015-10-29 08:44:09.093166] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:44:09.093179] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:44:09.197870] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread2/level07/level17/level27/level37/level47/level57/level67/level77/level87/level97/562d85cb%%NU7OTMO8OQ lookup failed
[2015-10-29 08:44:09.231579] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread2/level07/level17/level27/level37/level47/level57/level67/level77/level87/level97/562d85d4%%I0FZDK9ZMY lookup failed
[2015-10-29 08:44:09.243277] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread2/level07/level17/level27/level37/level47/level57/level67/level77/level87/level97/562d85e5%%5DMZ9FFO9H lookup failed
[2015-10-29 08:44:09.249661] E [MSGID: 109037] [dht-rebalance.c:2666:gf_fix_layout_tier_attach_lookup] 0-tiervolume-tier-dht: /thread2/level07/level17/level27/level37/level47/level57/level67/level77/level87/level97/562d85ed%%NYTHVBACBH lookup failed
[2015-10-29 08:44:10.654213] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread2/level08
[2015-10-29 08:44:10.654266] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:44:10.654284] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:44:10.701865] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread2/level08/level18
[2015-10-29 08:44:10.701917] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 0 (tiervolume-cold-dht): 1222023 chunks
[2015-10-29 08:44:10.701937] I [MSGID: 109045] [dht-selfheal.c:1509:dht_fix_layout_of_directory] 0-tiervolume-tier-dht: subvolume 1 (tiervolume-hot-dht): 916518 chunks
[2015-10-29 08:44:10.749581] I [MSGID: 109081] [dht-common.c:3810:dht_setxattr] 0-tiervolume-tier-dht: fixing the layout of /thread2/level08/level18/level28


Rebalance failed on the local node:
===================================

[root@dhcp37-165 glusterfs]# gluster v rebal tiervolume status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             0             1             0               failed               0.00
                            10.70.37.133                0        0Bytes             0             0             0          in progress            1672.00
                            10.70.37.160                0        0Bytes             0             0             0          in progress            1672.00
                            10.70.37.158                0        0Bytes             0             0             0          in progress            1672.00
                            10.70.37.110                0        0Bytes             0             0             0          in progress            1672.00
                            10.70.37.155                0        0Bytes             0             0             0          in progress            1672.00
                             10.70.37.99                0        0Bytes             0             0             0          in progress            1672.00
                             10.70.37.88                0        0Bytes             0             0             0          in progress            1672.00
                            10.70.37.112                0        0Bytes             0             0             0          in progress            1672.00
                            10.70.37.199                0        0Bytes             0             0             0          in progress            1672.00
                            10.70.37.162                0        0Bytes             0             0             0          in progress            1672.00
                             10.70.37.87                0        0Bytes             0             0             0          in progress            1672.00
volume rebalance: tiervolume: success: 
[root@dhcp37-165 glusterfs]# 
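
To pick the failed nodes out of a status table like the one above without reading it by eye, the output could be parsed along these lines. This is a rough Python sketch under stated assumptions: the column order is the one shown in this report, the size column is a single token such as 0Bytes, and the names parse_rebalance_status and failed_nodes are hypothetical helpers, not part of the gluster CLI.

```python
def parse_rebalance_status(output):
    """Parse the text of a rebalance status table into a list of dicts.

    Assumes the column layout shown in this report: node, rebalanced
    files, size, scanned, failures, skipped, status, run time in secs.
    Header, separator, prompt and summary lines are skipped.
    """
    rows = []
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 8:
            continue
        try:
            row = {
                "node": tokens[0],
                "rebalanced_files": int(tokens[1]),
                "size": tokens[2],
                "scanned": int(tokens[3]),
                "failures": int(tokens[4]),
                "skipped": int(tokens[5]),
                # The status may span two tokens, e.g. "in progress".
                "status": " ".join(tokens[6:-1]),
                "run_time_secs": float(tokens[-1]),
            }
        except ValueError:
            # Not a data row (for example the header or the separator).
            continue
        rows.append(row)
    return rows


def failed_nodes(output):
    """Return the nodes whose status column reads 'failed'."""
    return [r["node"] for r in parse_rebalance_status(output)
            if r["status"] == "failed"]
```

Applied to the table above, failed_nodes would report only localhost, since every other node is still "in progress".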


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.5-0.3.el7rhgs.x86_64


Steps carried out:
==================
1. 12-node cluster
2. Hot tier {6x2}, cold tier {2x(4+2)}
3. Mounted the volume on 7.2, 7.1 and 6.7 clients
4. Created a large data set on the volume {148GB}
5. Stopped the volume {no data creation or I/O was in progress at this time}
6. Started the volume

Comment 6 Mohammed Rafi KC 2015-11-24 06:13:54 UTC
*** Bug 1229270 has been marked as a duplicate of this bug. ***

Comment 7 Mohammed Rafi KC 2015-11-26 08:44:17 UTC
master : http://review.gluster.org/#/c/12718/
release-3.7 : http://review.gluster.org/#/c/12749/

Comment 9 Rahul Hinduja 2015-12-04 06:16:34 UTC
During verification of this bug I hit another issue, described in bz 1288051. Since tierd goes faulty, verification of this bug depends on the closure of bz 1288051. Marking it as dependent.

Comment 11 Rahul Hinduja 2016-01-07 11:00:45 UTC
Verified with build: glusterfs-3.7.5-14.el7rhgs.x86_64

Restarting a volume still triggers a fix-layout, which is known to the tier team, but the fix-layout failure described in this bug is no longer seen. Moving this bug to the verified state.

[root@dhcp37-165 glusterfs]# grep -i "gf_defrag_start_crawl" tiervolume-tier.log  | grep -i "failed"
[root@dhcp37-165 glusterfs]# 


[root@dhcp37-165 glusterfs]# grep -i "failed" tiervolume-tier.log | grep -i "fix"
[root@dhcp37-165 glusterfs]#
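
The two grep checks above can also be combined into one scripted check, for example when verifying several nodes. The following is just a Python sketch that mirrors those case-insensitive greps; the function name fix_layout_failures is hypothetical.

```python
def fix_layout_failures(lines):
    """Return log lines that either grep pipeline above would print.

    A line is kept when it contains "failed" together with either
    "gf_defrag_start_crawl" or "fix", compared case-insensitively.
    """
    hits = []
    for line in lines:
        lowered = line.lower()
        if "failed" not in lowered:
            continue
        if "gf_defrag_start_crawl" in lowered or "fix" in lowered:
            hits.append(line.rstrip("\n"))
    return hits
```

An empty result corresponds to the empty grep output shown above; on the failing build, the "Layout fix failed" and "fix layout on / failed" lines from the description would be returned.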

Comment 13 errata-xmlrpc 2016-03-01 05:47:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html

