Bug 1002069

Summary: AFR : change-logs becoming "0xffffffff0000000000000000" when a brick goes offline and comes back online
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: spandura
Component: glusterfs
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: spandura
Severity: urgent
Priority: high
Version: 2.1
CC: amarts, grajaiya, rhs-bugs, shaines, surs, tkatarki, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.4.0.28rhs-1
Doc Type: Bug Fix
Clones: 1002698 (view as bug list)
Last Closed: 2013-09-23 22:36:06 UTC
Type: Bug
Bug Blocks: 1002698

Description spandura 2013-08-28 12:37:47 UTC
Description of problem:
=======================
In a replicate volume, when dd is in progress on 2 mount points (FUSE mounts) and a brick goes offline and comes back online, after some time the change-logs are set to "0xffffffff0000000000000000".

Even after dd completes, the change-logs are not cleared.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.24rhs built on Aug 27 2013 08:08:42

How reproducible:
================
Often

Steps to Reproduce:
====================
1. Create a replicate volume (1 x 2). Start the volume.

2. Create 2 FUSE mounts.

3. From both mount points, execute:
dd if=/dev/urandom of=./test_file1 bs=128k count=16000

4. While the dd is in progress, kill brick-1: kill -KILL <pid_of_brick1_process>

5. After some time, while the dd is still in progress, bring brick-1 back online.

6. Keep checking the extended attributes of the file on both bricks.

7. Even after the dd is complete, the change-logs are not cleared.

Actual results:
===================
++++++++++++++++++++++++++++++++++++++++++
Extended attributes of the file on brick-0
++++++++++++++++++++++++++++++++++++++++++
root.42.194[Aug-28-2013-12:20:19] >ls -l /proc/`cat /var/lib/glusterd/vols/xcube/run/10.70.42.194-data-gluster-brick1-brick.pid`/fd ; getfattr -d -e hex -m . /data/gluster/brick1/brick/test_file1 ; ls -lh /data/gluster/brick1/brick/test_file1
total 0
lr-x------ 1 root root 64 Aug 28 09:41 0 -> /dev/null
l-wx------ 1 root root 64 Aug 28 09:41 1 -> /dev/null
lrwx------ 1 root root 64 Aug 28 09:41 10 -> socket:[77624]
lr-x------ 1 root root 64 Aug 28 09:41 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug 28 09:41 12 -> /data/gluster/brick1/brick
lrwx------ 1 root root 64 Aug 28 09:41 13 -> socket:[77760]
lrwx------ 1 root root 64 Aug 28 09:41 14 -> socket:[78633]
lrwx------ 1 root root 64 Aug 28 09:41 15 -> socket:[78649]
lrwx------ 1 root root 64 Aug 28 09:41 16 -> socket:[118621]
lrwx------ 1 root root 64 Aug 28 09:41 17 -> socket:[118631]
l-wx------ 1 root root 64 Aug 28 09:41 18 -> /data/gluster/brick1/brick/test_file1
l-wx------ 1 root root 64 Aug 28 09:41 19 -> /data/gluster/brick1/brick/test_file1
l-wx------ 1 root root 64 Aug 28 09:41 2 -> /dev/null
lrwx------ 1 root root 64 Aug 28 09:45 21 -> /data/gluster/brick1/brick/test_file1
lrwx------ 1 root root 64 Aug 28 09:41 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug 28 09:41 4 -> /var/log/glusterfs/bricks/data-gluster-brick1-brick.log
lrwx------ 1 root root 64 Aug 28 09:41 5 -> /var/lib/glusterd/vols/xcube/run/10.70.42.194-data-gluster-brick1-brick.pid
lrwx------ 1 root root 64 Aug 28 09:41 6 -> socket:[77475]
lrwx------ 1 root root 64 Aug 28 09:41 7 -> socket:[77701]
lrwx------ 1 root root 64 Aug 28 09:41 8 -> socket:[77482]
lrwx------ 1 root root 64 Aug 28 09:41 9 -> socket:[77710]
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/brick1/brick/test_file1
trusted.afr.xcube-client-0=0x000000830000000000000000
trusted.afr.xcube-client-1=0x0000008c0000000000000000
trusted.gfid=0x33936c7be4b446a7b81409dc8293c573

root.42.194[Aug-28-2013-12:20:25] >ls -l /proc/`cat /var/lib/glusterd/vols/xcube/run/10.70.42.194-data-gluster-brick1-brick.pid`/fd ; getfattr -d -e hex -m . /data/gluster/brick1/brick/test_file1 ; ls -lh /data/gluster/brick1/brick/test_file1
total 0
lr-x------ 1 root root 64 Aug 28 09:41 0 -> /dev/null
l-wx------ 1 root root 64 Aug 28 09:41 1 -> /dev/null
lrwx------ 1 root root 64 Aug 28 09:41 10 -> socket:[77624]
lr-x------ 1 root root 64 Aug 28 09:41 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug 28 09:41 12 -> /data/gluster/brick1/brick
lrwx------ 1 root root 64 Aug 28 09:41 13 -> socket:[77760]
lrwx------ 1 root root 64 Aug 28 09:41 14 -> socket:[78633]
lrwx------ 1 root root 64 Aug 28 09:41 15 -> socket:[78649]
lrwx------ 1 root root 64 Aug 28 09:41 16 -> socket:[118621]
lrwx------ 1 root root 64 Aug 28 09:41 17 -> socket:[118631]
l-wx------ 1 root root 64 Aug 28 09:41 18 -> /data/gluster/brick1/brick/test_file1
l-wx------ 1 root root 64 Aug 28 09:41 19 -> /data/gluster/brick1/brick/test_file1
l-wx------ 1 root root 64 Aug 28 09:41 2 -> /dev/null
lrwx------ 1 root root 64 Aug 28 09:41 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug 28 09:41 4 -> /var/log/glusterfs/bricks/data-gluster-brick1-brick.log
lrwx------ 1 root root 64 Aug 28 09:41 5 -> /var/lib/glusterd/vols/xcube/run/10.70.42.194-data-gluster-brick1-brick.pid
lrwx------ 1 root root 64 Aug 28 09:41 6 -> socket:[77475]
lrwx------ 1 root root 64 Aug 28 09:41 7 -> socket:[77701]
lrwx------ 1 root root 64 Aug 28 09:41 8 -> socket:[77482]
lrwx------ 1 root root 64 Aug 28 09:41 9 -> socket:[77710]
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/brick1/brick/test_file1
trusted.afr.xcube-client-0=0xffffffff0000000000000000
trusted.afr.xcube-client-1=0xffffffff0000000000000000
trusted.gfid=0x33936c7be4b446a7b81409dc8293c573


++++++++++++++++++++++++++++++++++++++++++++++
Extended attributes of the file on brick-1
++++++++++++++++++++++++++++++++++++++++++++++
root.42.211[Aug-28-2013-12:20:19] >ls -l /proc/`cat /var/lib/glusterd/vols/xcube/run/10.70.42.211-data-gluster-brick1-brick.pid`/fd ; getfattr -d -e hex -m . /data/gluster/brick1/brick/test_file1 ; ls -lh /data/gluster/brick1/brick/test_file1
total 0
lr-x------ 1 root root 64 Aug 28 12:18 0 -> /dev/null
l-wx------ 1 root root 64 Aug 28 12:18 1 -> /dev/null
lrwx------ 1 root root 64 Aug 28 12:18 10 -> socket:[120548]
lr-x------ 1 root root 64 Aug 28 12:18 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug 28 12:18 12 -> /data/gluster/brick1/brick
lrwx------ 1 root root 64 Aug 28 12:18 14 -> socket:[120614]
lrwx------ 1 root root 64 Aug 28 12:18 15 -> socket:[120616]
lrwx------ 1 root root 64 Aug 28 12:18 16 -> socket:[120618]
l-wx------ 1 root root 64 Aug 28 12:18 17 -> /data/gluster/brick1/brick/.glusterfs/33/93/33936c7b-e4b4-46a7-b814-09dc8293c573
lrwx------ 1 root root 64 Aug 28 12:18 18 -> socket:[120624]
lrwx------ 1 root root 64 Aug 28 12:18 19 -> socket:[120630]
l-wx------ 1 root root 64 Aug 28 12:18 2 -> /dev/null
lrwx------ 1 root root 64 Aug 28 12:18 20 -> /data/gluster/brick1/brick/.glusterfs/33/93/33936c7b-e4b4-46a7-b814-09dc8293c573
lrwx------ 1 root root 64 Aug 28 12:18 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug 28 12:18 4 -> /var/log/glusterfs/bricks/data-gluster-brick1-brick.log
lrwx------ 1 root root 64 Aug 28 12:18 5 -> /var/lib/glusterd/vols/xcube/run/10.70.42.211-data-gluster-brick1-brick.pid
lrwx------ 1 root root 64 Aug 28 12:18 6 -> socket:[120535]
lrwx------ 1 root root 64 Aug 28 12:18 7 -> socket:[120577]
lrwx------ 1 root root 64 Aug 28 12:18 8 -> socket:[120541]
l-wx------ 1 root root 64 Aug 28 12:18 9 -> /data/gluster/brick1/brick/.glusterfs/33/93/33936c7b-e4b4-46a7-b814-09dc8293c573
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/brick1/brick/test_file1
trusted.afr.xcube-client-0=0x000000840000000000000000
trusted.afr.xcube-client-1=0x000000840000000000000000
trusted.gfid=0x33936c7be4b446a7b81409dc8293c573

root.42.211[Aug-28-2013-12:20:25] >ls -l /proc/`cat /var/lib/glusterd/vols/xcube/run/10.70.42.211-data-gluster-brick1-brick.pid`/fd ; getfattr -d -e hex -m . /data/gluster/brick1/brick/test_file1 ; ls -lh /data/gluster/brick1/brick/test_file1
total 0
lr-x------ 1 root root 64 Aug 28 12:18 0 -> /dev/null
l-wx------ 1 root root 64 Aug 28 12:18 1 -> /dev/null
lrwx------ 1 root root 64 Aug 28 12:18 10 -> socket:[120548]
lr-x------ 1 root root 64 Aug 28 12:18 11 -> /dev/urandom
lr-x------ 1 root root 64 Aug 28 12:18 12 -> /data/gluster/brick1/brick
lrwx------ 1 root root 64 Aug 28 12:18 14 -> socket:[120614]
lrwx------ 1 root root 64 Aug 28 12:18 15 -> socket:[120616]
lrwx------ 1 root root 64 Aug 28 12:18 16 -> socket:[120618]
l-wx------ 1 root root 64 Aug 28 12:18 17 -> /data/gluster/brick1/brick/.glusterfs/33/93/33936c7b-e4b4-46a7-b814-09dc8293c573
lrwx------ 1 root root 64 Aug 28 12:18 18 -> socket:[120624]
lrwx------ 1 root root 64 Aug 28 12:18 19 -> socket:[120630]
l-wx------ 1 root root 64 Aug 28 12:18 2 -> /dev/null
lrwx------ 1 root root 64 Aug 28 12:18 3 -> anon_inode:[eventpoll]
l-wx------ 1 root root 64 Aug 28 12:18 4 -> /var/log/glusterfs/bricks/data-gluster-brick1-brick.log
lrwx------ 1 root root 64 Aug 28 12:18 5 -> /var/lib/glusterd/vols/xcube/run/10.70.42.211-data-gluster-brick1-brick.pid
lrwx------ 1 root root 64 Aug 28 12:18 6 -> socket:[120535]
lrwx------ 1 root root 64 Aug 28 12:18 7 -> socket:[120577]
lrwx------ 1 root root 64 Aug 28 12:18 8 -> socket:[120541]
l-wx------ 1 root root 64 Aug 28 12:18 9 -> /data/gluster/brick1/brick/.glusterfs/33/93/33936c7b-e4b4-46a7-b814-09dc8293c573
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/brick1/brick/test_file1
trusted.afr.xcube-client-0=0xffffffff0000000000000000
trusted.afr.xcube-client-1=0xffffffff0000000000000000
trusted.gfid=0x33936c7be4b446a7b81409dc8293c573

-rw-r--r-- 2 root root 1.9G Aug 28 12:20 /data/gluster/brick1/brick/test_file1
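
The trusted.afr.* value packs three big-endian 32-bit pending-operation counters (data, metadata, entry). Interpreting them as signed integers shows why the corrupted value is really an underflowed data counter. A minimal decoding sketch (the helper name is mine, not a glusterfs tool; the field layout is the standard AFR changelog format):

```python
import struct

def decode_afr(xattr_hex):
    """Decode a trusted.afr.* value: three big-endian 32-bit counters
    (data, metadata, entry pending-operation counts)."""
    raw = bytes.fromhex(xattr_hex.removeprefix("0x"))
    # Unpack as signed ints so an underflowed counter shows up as negative.
    data, meta, entry = struct.unpack(">iii", raw)
    return {"data": data, "metadata": meta, "entry": entry}

# A healthy in-flight count, as seen on brick-0 at 12:20:19:
print(decode_afr("0x000000830000000000000000"))  # data=131 pending ops
# The corrupted value seen at 12:20:25: the data counter is -1:
print(decode_afr("0xffffffff0000000000000000"))  # data=-1
```

So "0xffffffff..." is not a huge pending count but a negative change-log, consistent with the double post-op decrement described in comment 5.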


Expected results:
==================
Change-logs should be cleared. 

Additional info:
===================
root.42.211[Aug-28-2013-12:36:24] >gluster v info
 
Volume Name: xcube
Type: Replicate
Volume ID: df372f5e-8dc4-4b39-b975-0232a250079e
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.42.194:/data/gluster/brick1/brick
Brick2: 10.70.42.211:/data/gluster/brick1/brick
Options Reconfigured:
cluster.self-heal-daemon: off

Comment 3 Gowrishankar Rajaiyan 2013-08-29 13:13:36 UTC
I hit this in RHS-RHOS testing as well, while healing a cinder volume. I have kept the setup in its current state; let me know if you need any info.


[root@localhost ~]# gluster vol heal cinder-volume info healed
Gathering list of healed entries on volume cinder-volume has been successful 

Brick 10.70.42.160:/rhs/bricks/cinder
Number of entries: 0

Brick 10.70.42.179:/rhs/bricks/cinder
Number of entries: 0

Brick 10.70.42.175:/rhs/bricks/cinder
Number of entries: 1
at                    path on brick
-----------------------------------
2013-08-29 12:22:32 /volume-2495f6ae-0472-4a31-8a81-48c733fd64f6

Brick 10.70.43.112:/rhs/bricks/cinder
Number of entries: 2
at                    path on brick
-----------------------------------
2013-08-29 11:42:50 /volume-73af0e71-ceeb-4d72-8120-f86b31aed26c
2013-08-29 12:13:34 /volume-2495f6ae-0472-4a31-8a81-48c733fd64f6

Brick 10.70.43.148:/rhs/bricks/cinder
Number of entries: 2
at                    path on brick
-----------------------------------
2013-08-29 11:30:05 /volume-30464de5-cbe7-46ca-a074-e1822aa1dd5f
2013-08-29 11:35:07 /volume-d073aab7-820d-4797-88d2-e73eccca5a80

Brick 10.70.42.165:/rhs/bricks/cinder
Number of entries: 0
[root@localhost ~]# 


On machine:
===========

10.70.42.175
=============

[root@localhost ~]# getfattr -d -e hex -m . /rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
trusted.afr.cinder-volume-client-2=0xffffffff0000000000000000
trusted.afr.cinder-volume-client-3=0xffffffff0000000000000000
trusted.gfid=0x5e7dbe0b5a164638b61809225b62c85d

[root@localhost ~]# ls -l /rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
-rw-rw-rw- 2 qemu qemu 10737418240 Aug 29 18:08 /rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
[root@localhost ~]# 



10.70.43.112
=============

[root@localhost ~]# getfattr -d -e hex -m . /rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
trusted.afr.cinder-volume-client-2=0xffffffff0000000000000000
trusted.afr.cinder-volume-client-3=0xffffffff0000000000000000
trusted.gfid=0x5e7dbe0b5a164638b61809225b62c85d

[root@localhost ~]# ls -l /rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
-rw-rw-rw- 2 qemu qemu 10737418240 Aug 29 18:10 /rhs/bricks/cinder/volume-2495f6ae-0472-4a31-8a81-48c733fd64f6
[root@localhost ~]# 



[root@localhost ~]# gluster volume info
 
Volume Name: cinder-volume
Type: Distributed-Replicate
Volume ID: da6458cb-9034-45ee-a721-4f906cbca2db
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.42.160:/rhs/bricks/cinder
Brick2: 10.70.42.179:/rhs/bricks/cinder
Brick3: 10.70.42.175:/rhs/bricks/cinder
Brick4: 10.70.43.112:/rhs/bricks/cinder
Brick5: 10.70.43.148:/rhs/bricks/cinder
Brick6: 10.70.42.165:/rhs/bricks/cinder
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 165
storage.owner-gid: 165
diagnostics.client-log-level: DEBUG
diagnostics.brick-log-level: DEBUG
[root@localhost ~]#

Comment 5 Pranith Kumar K 2013-08-29 18:04:24 UTC
The bug appears because two variables are not being reset. Because of the stale variable values, the code thinks it already holds the locks and modifies the change-log values incorrectly, leading to a negative change-log.

Detailed reasoning:

    Problem:
    internal_lock->lk_attempted_count keeps track of the number of blocking
    locks attempted. lk_expected_count keeps track of the number of locks expected.
    Here is the sequence of steps that leads to the illusion that
    a full file lock was achieved, even without attempting any lock.

    Two mounts are doing dd on the same file. Both of them witness a brick going
    down and coming back up again. Both mounts issue self-heal.
    1) Both mount-1 and mount-2 attempt full file locks in the self-heal domain.
    Let's say mount-1 got the lock and mount-2 attempts a blocking lock.

    2) mount-1 attempts a full file lock in the data domain. It goes into blocking
    mode because some other writes are in progress. Eventually it gets the lock,
    but this leaves lk_attempted_count at 2, and it is not reset. mount-1
    completes syncing the data.

    3) Before unlocking the final small-range lock, mount-1 attempts a full file
    lock in the data domain to figure out the source/sink. This is put into blocking
    mode again because some other writes are in progress. But this time, seeing the
    stale value of lk_attempted_count equal to lk_expected_count, the blocking-lock
    phase thinks it completed locking without acquiring a single lock :-O.

    4) mount-1 reads the xattrs without any lock, but since it does not modify
    them, this phase does no harm. It then tries to unlock, and the unlocks fail
    because the locks were never taken in the data domain. mount-1 also unlocks
    the self-heal domain locks.

    Our beloved mount-2 now gets the chance to cause horror :-(.

    5) mount-2 gets the full-range blocking lock in the self-heal domain.
    Please note that this sets lk_attempted_count to 2.

    6) mount-2 attempts a full-range lock in the data domain; since writes are
    still in progress, it switches to blocking mode. But since lk_attempted_count
    is 2, the same as lk_expected_count, the blocking phase thinks it actually got
    the full-range locks even though not a single lock request went out on the wire.

    7) mount-2 reads the change-log xattrs, which give the number of operations
    in progress (call this 'X'). It does the syncing and, at the end of the sync,
    decrements the change-log by 'X'. But since that 'X' was introduced by 'X'
    transactions that are still in progress, those transactions also decrement the
    change-log by 'X'. Effectively, pre-ops are done for 'X' operations but
    post-ops are done 2 times 'X', resulting in negative change-log numbers.

    Fix:
    Reset lk_attempted_count and the inode-lock array that is used to remember
    the locks that were granted.
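
The arithmetic in step 7 can be sketched as a toy model (my own illustration of the accounting, not glusterfs code): X pre-ops raise the counter to X, the bogus "lock" lets self-heal subtract the X it read, and then the same X in-flight transactions subtract their own post-ops, underflowing the 32-bit counter.

```python
def changelog_after_double_postop(in_flight_writes):
    """Model the data change-log counter under the buggy double decrement."""
    counter = 0
    counter += in_flight_writes   # X pre-ops, one per in-flight write
    counter -= in_flight_writes   # self-heal decrements by the X it read
    counter -= in_flight_writes   # the X transactions' own post-ops land too
    # The xattr stores the counter as an unsigned 32-bit field:
    return counter & 0xFFFFFFFF

print(hex(changelog_after_double_postop(1)))  # 0xffffffff, i.e. -1
```

With even a single in-flight write this yields 0xffffffff, matching the corrupted trusted.afr.* values in the report.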

Comment 7 spandura 2013-09-02 06:50:43 UTC
Verified the fix on the build:
================================
glusterfs 3.4.0.30rhs built on Aug 30 2013 08:15:37

Bug is fixed. Moving it to verified state.

Comment 8 Scott Haines 2013-09-23 22:36:06 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html