Bug 999569 - [RHS-RHOS] VM damaged beyond repair if the cinder volume attached to it is rebalanced post add-brick. [NEEDINFO]
[RHS-RHOS] VM damaged beyond repair if the cinder volume attached to it is re...
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd (Show other bugs)
2.1
Unspecified Unspecified
high Severity high
: ---
: RHGS 2.1.2
Assigned To: crisbud@redhat.com
shilpa
: ZStream
Depends On:
Blocks: 1013456
  Show dependency treegraph
 
Reported: 2013-08-21 11:07 EDT by Gowrishankar Rajaiyan
Modified: 2015-05-15 14:17 EDT (History)
13 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously when you performed an add-brick operation followed by a Rebalanace operation, the Cinder volumes created on top of the Red Hat Storage volumes were rendered unusable. This issue was also observed with VMs managed with the Red Hat Enterprise Virtualization Manager. With this fix, the cinder volumes work as expected in such a scenario.
Story Points: ---
Clone Of:
: 1013456 (view as bug list)
Environment:
virt rhos cinder rhs integration
Last Closed: 2014-02-25 02:35:08 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
psriniva: needinfo? (pkarampu)


Attachments (Terms of Use)
cinder-vol-rebalance.log (186.12 KB, text/plain)
2013-08-21 11:07 EDT, Gowrishankar Rajaiyan
no flags Details
cinder-vol-rebalance logs from all the nodes (5.37 MB, application/x-zip)
2013-08-22 00:54 EDT, Gowrishankar Rajaiyan
no flags Details
/var/log/glusterfs/bricks/rhs-bricks-cinder1.log logs (5.60 MB, application/x-zip)
2013-08-22 01:19 EDT, Gowrishankar Rajaiyan
no flags Details
glustershd logs from all nodes (1.59 MB, application/x-zip)
2013-08-22 01:26 EDT, Gowrishankar Rajaiyan
no flags Details
Client log (3.50 MB, text/plain)
2013-08-22 02:49 EDT, Gowrishankar Rajaiyan
no flags Details
var-lib-cinder-volumes-01cbb41254aca5f79b543409c40afa79.log (321.88 KB, application/x-zip)
2013-08-27 06:53 EDT, Gowrishankar Rajaiyan
no flags Details
Screenshot of the instance (94.17 KB, image/png)
2013-08-27 06:59 EDT, Gowrishankar Rajaiyan
no flags Details
sosreport-1-to-9-rhs-nodes.tar.bz (14.30 MB, application/x-zip)
2013-08-28 11:15 EDT, Gowrishankar Rajaiyan
no flags Details
sosreport-10-to-18-rhs-nodes.tar.bz (16.48 MB, application/x-zip)
2013-08-28 11:17 EDT, Gowrishankar Rajaiyan
no flags Details

  None (edit)
Description Gowrishankar Rajaiyan 2013-08-21 11:07:10 EDT
Created attachment 788916 [details]
cinder-vol-rebalance.log

Description of problem: Looks similar to bug 993119. Openstack cinder image files are not migrated after an add-brick operation. DHT hex is not reset. If we happen to use this volume from within an instance, the vm is damaged beyond repair and the only option that works after this stage is "Terminate Instance".


Version-Release number of selected component (if applicable):
glusterfs-3.4.0.21rhs-1.el6_4.x86_64


How reproducible: Tested once.


Steps to Reproduce: (detail steps can be provided upon request)
1. Create a 6X2 distribute-replicate volume.
2. Setup openstack cinder to automatically mount the volume.
3. Create glance images and cinder volumes
4. Launch instances and attach cinder volumes to them.
5. Perform add-brick and then rebalance.
6. Ensure that rebalance status is "completed" for all the nodes.

Actual results: Empty linkto files still exist on the bricks.


Expected results: Files should be migrated successfully.


Additional info:

[root@rhsqa-virt06 ~]# ssh 10.70.43.66 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
---------T 2 qemu qemu 0 Aug 21 18:24 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e

[root@rhsqa-virt06 ~]# ssh 10.70.43.47 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
-rw-rw-rw- 2 qemu qemu 10737418240 Aug 21 11:03 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e

[root@rhsqa-virt06 ~]# ssh 10.70.43.53 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
-rw-rw-rw- 2 qemu qemu 10737418240 Aug 21 11:03 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e

[root@rhsqa-virt06 ~]# ssh 10.70.43.67 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
---------T 2 qemu qemu 0 Aug 21 18:24 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
[root@rhsqa-virt06 ~]# 



[root@localhost ~]# gluster vol info cinder-vol
Volume Name: cinder-vol
Type: Distributed-Replicate
Volume ID: 4d0b2e57-2eb3-44c0-9f9c-5e2a1835b81a
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: 10.70.43.66:/rhs/bricks/cinder1
Brick2: 10.70.43.67:/rhs/bricks/cinder1
Brick3: 10.70.43.68:/rhs/bricks/cinder1
Brick4: 10.70.43.69:/rhs/bricks/cinder1
Brick5: 10.70.43.70:/rhs/bricks/cinder1
Brick6: 10.70.43.61:/rhs/bricks/cinder1
Brick7: 10.70.43.56:/rhs/bricks/cinder1
Brick8: 10.70.43.57:/rhs/bricks/cinder1
Brick9: 10.70.43.53:/rhs/bricks/cinder1
Brick10: 10.70.43.47:/rhs/bricks/cinder1
Brick11: 10.70.43.44:/rhs/bricks/cinder1
Brick12: 10.70.43.42:/rhs/bricks/cinder1
Brick13: 10.70.43.40:/rhs/bricks/cinder1
Brick14: 10.70.43.35:/rhs/bricks/cinder1
Brick15: 10.70.43.36:/rhs/bricks/cinder1
Brick16: 10.70.43.37:/rhs/bricks/cinder1
Options Reconfigured:
diagnostics.client-log-level: DEBUG
diagnostics.brick-log-level: DEBUG
storage.owner-gid: 165
storage.owner-uid: 165
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
[root@localhost ~]# 


[root@localhost ~]# gluster volume rebalance cinder-vol status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.66                5        41.0GB            27             0             0      completed           431.00
                             10.70.43.67                0        0Bytes            25             0             0      completed             3.00
                             10.70.43.47                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.69                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.70                3        30.0GB            35             0             2      completed          1813.00
                             10.70.43.72                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.57                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.61                0        0Bytes            25             0             0      completed             3.00
                             10.70.43.56                6        46.0GB            39             0             0      completed          2758.00
                             10.70.43.68                5        50.0GB            32             0             0      completed           512.00
                             10.70.43.68                5        50.0GB            32             0             0      completed           512.00
                             10.70.43.53                1        10.0GB            30             0             0      completed           652.00
                             10.70.43.40                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.35                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.36                0        0Bytes            25             0             0      completed             2.00
                             10.70.43.37                0        0Bytes            25             0             0      completed             3.00
volume rebalance: cinder-vol: success: 
[root@localhost ~]# 


[root@rhsqa-virt06 ~]# for i in 10.70.43.66 10.70.43.47 10.70.43.53 10.70.43.67; do ssh $i "(echo $i:; getfattr -d -m . /rhs/bricks/cinder1; echo)"; done
10.70.43.66:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAAH////g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==


10.70.43.47:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAB////8n///+g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==


10.70.43.53:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAB////8n///+g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==


10.70.43.67:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAAH////g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==


[root@rhsqa-virt06 ~]#
Comment 2 Gowrishankar Rajaiyan 2013-08-22 00:17:28 EDT
[root@rhsqa-virt06 ~]# for i in 10.70.43.66 10.70.43.47 10.70.43.53 10.70.43.67; do ssh $i "(echo $i:; getfattr -d -m . /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e; echo)"; done
10.70.43.66:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==
trusted.glusterfs.dht.linkto="cinder-vol-replicate-4"


10.70.43.47:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.afr.cinder-vol-client-8=0sAAAAAAAAAAAAAAAA
trusted.afr.cinder-vol-client-9=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==


10.70.43.53:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.afr.cinder-vol-client-8=0sAAAAAAAAAAAAAAAA
trusted.afr.cinder-vol-client-9=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==


10.70.43.67:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==
trusted.glusterfs.dht.linkto="cinder-vol-replicate-4"


[root@rhsqa-virt06 ~]#
Comment 3 Gowrishankar Rajaiyan 2013-08-22 00:54:43 EDT
Created attachment 789038 [details]
cinder-vol-rebalance logs from all the nodes

$ for i in 10.70.43.66 10.70.43.67 10.70.43.68 10.70.43.69 10.70.43.70 10.70.43.61 10.70.43.56 10.70.43.57 10.70.43.53 10.70.43.47 10.70.43.44 10.70.43.42 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do scp root@$i:/var/log/glusterfs/cinder-vol-rebalance.log /tmp/cinder-vol-rebalance.log.$i; done

$ tar -cjvf cinder-vol-rebalance.log.tar.bz2 /tmp/cinder-vol-rebalance.log*
Comment 4 Gowrishankar Rajaiyan 2013-08-22 01:07:23 EDT
[root@rhsqa-virt06 ~]# for i in 10.70.43.66 10.70.43.47 10.70.43.53 10.70.43.57 10.70.43.67 10.70.43.56 10.70.43.61 10.70.43.44 10.70.43.68 10.70.43.69 10.70.43.72 10.70.43.70 10.70.43.42 10.70.43.76 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do ssh $i "(echo $i; getfattr -m - -d -e hex /rhs/bricks/cinder1; echo)"; done
10.70.43.66
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001ffffffe
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.47
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffc9ffffffa
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.53
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffc9ffffffa
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.57
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005ffffffd7ffffffb
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.67
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001ffffffe
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.56
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005ffffffd7ffffffb
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.61
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000003ffffffe5ffffffc
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.44
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.afr.cinder-vol-client-10=0x000000000000000000000000
trusted.afr.cinder-vol-client-11=0x000000000000000000000002
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000009ffffffbbffffff9
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.68
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001fffffff3ffffffd
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.69
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001fffffff3ffffffd
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

getfattr: Removing leading '/' from absolute path names

10.70.43.72
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.70
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000003ffffffe5ffffffc
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


ssh: connect to host 10.70.43.42 port 22: No route to host
10.70.43.76
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.40
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000bffffffadffffff8
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.35
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000bffffffadffffff8
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.36
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000dffffff9ffffffff
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


10.70.43.37
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000dffffff9ffffffff
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a


[root@rhsqa-virt06 ~]#
Comment 5 Gowrishankar Rajaiyan 2013-08-22 01:19:20 EDT
Created attachment 789039 [details]
/var/log/glusterfs/bricks/rhs-bricks-cinder1.log logs

#  for i in 10.70.43.66 10.70.43.67 10.70.43.68 10.70.43.69 10.70.43.70 10.70.43.61 10.70.43.56 10.70.43.57 10.70.43.53 10.70.43.47 10.70.43.44 10.70.43.42 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do scp root@$i:/var/log/glusterfs/bricks/rhs-bricks-cinder1.log /tmp/rhs-bricks-cinder1.log.$i; done


# tar -jcvf rhs-bricks-cinder1.logs.tar.bz rhs-bricks-cinder1.log*
Comment 6 Gowrishankar Rajaiyan 2013-08-22 01:26:50 EDT
Created attachment 789040 [details]
glustershd logs from all nodes

#  for i in 10.70.43.66 10.70.43.67 10.70.43.68 10.70.43.69 10.70.43.70 10.70.43.61 10.70.43.56 10.70.43.57 10.70.43.53 10.70.43.47 10.70.43.44 10.70.43.42 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do scp root@$i:/var/log/glusterfs/glustershd.log /tmp/glustershd.log.$i; done

# tar -jcvf glustershd.logs.tar.bz glustershd.log*


Note: that 10.70.43.42 node was down under mysterious circumstances during this rebalance operation. So you should see alot of the following messages. Don't think it is relevant since its replica pair 10.70.43.44 was alive and healthy but still mentioning it here to help in diagnosis.

<snip>
[2013-08-21 15:01:24.595182] D [name.c:155:client_fill_address_family] 0-cinder-vol-client-11: address-family not specified, guessing it to be inet from (remote-host: 10.70.43.42)
[2013-08-21 15:01:24.601564] D [common-utils.c:248:gf_resolve_ip6] 0-resolver: returning ip-10.70.43.42 (port-24007) for hostname: 10.70.43.42 and port: 24007
[2013-08-21 15:01:24.601850] D [name.c:155:client_fill_address_family] 0-glance-vol-client-11: address-family not specified, guessing it to be inet from (remote-host: 10.70.43.42)
[2013-08-21 15:01:24.607525] D [common-utils.c:248:gf_resolve_ip6] 0-resolver: returning ip-10.70.43.42 (port-24007) for hostname: 10.70.43.42 and port: 24007
[2013-08-21 15:01:24.699835] D [socket.c:615:__socket_shutdown] 0-cinder-vol-client-11: shutdown() returned -1. Transport endpoint is not connected
[2013-08-21 15:01:24.699902] D [socket.c:492:__socket_rwv] 0-cinder-vol-client-11: EOF on socket
[2013-08-21 15:01:24.699923] D [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
</snip>
Comment 7 Gowrishankar Rajaiyan 2013-08-22 02:49:23 EDT
Created attachment 789051 [details]
Client log

Client:
[root@rhs-hpc-srv1 glusterfs(keystone_admin)]# df -hT
Filesystem    Type    Size  Used Avail Use% Mounted on
/dev/sdb1     ext4    1.1T  1.5G  1.1T   1% /
tmpfs        tmpfs     24G     0   24G   0% /dev/shm
/dev/sdf1     ext4     30G  211M   28G   1% /boot
/dev/sdc1     ext4    1.1T   28G 1013G   3% /var
/srv/loopback-device/device1
              ext4    960M   34M  876M   4% /srv/node/device1
10.70.43.44:glance-vol
    fuse.glusterfs    300G   29G  272G  10% /var/lib/glance/images
10.70.43.44:cinder-vol
    fuse.glusterfs    400G   29G  372G   8% /var/lib/cinder/volumes/1d12e17a168a458a2db39ca37ee302fd
10.70.43.44:cinder-vol
    fuse.glusterfs    400G   29G  372G   8% /var/lib/nova/mnt/1d12e17a168a458a2db39ca37ee302fd
[root@rhs-hpc-srv1 glusterfs(keystone_admin)]# 


Log file attached is /var/log/glusterfs/var-lib-cinder-volumes-1d12e17a168a458a2db39ca37ee302fd.log
Comment 8 shishir gowda 2013-08-22 03:54:07 EDT
Other than the 2 (see below)errors, there are no errors reported.
Linkfile still exist, as migration of few files have been skipped due to space constraints. Those should not lead to any failures as afr is involved too.

[2013-08-21 01:26:54.073212] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x165) [0x7f056f71cf55] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f056f71ca93] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f056f71c9ae]))) 1-cinder-vol-client-11: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2013-08-21 01:26:11.904694 (xid=0x443x)
[2013-08-21 01:26:54.083631] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x165) [0x7f056f71cf55] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f056f71ca93] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f056f71c9ae]))) 1-cinder-vol-client-11: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2013-08-21 01:26:11.904704 (xid=0x444x)
Comment 10 Gowrishankar Rajaiyan 2013-08-26 01:56:08 EDT
I shall re-test this issue. But I am not clear in which log level you asking me to set. DEBUG or TRACE. I believe TRACE (since I am trying to reproduce it), please correct me if otherwise.
Comment 11 shishir gowda 2013-08-26 06:15:09 EDT
Please re-run the tests with log level set to DEBUG. With TRACE, there is a possibility of it being masked due to excessive logging(If the issue is hit again, and DEBUG does not help, we may need to re-run it with TRACE.)
Comment 12 Gowrishankar Rajaiyan 2013-08-27 06:35:49 EDT
I could reproduce this. Add brick followed by rebalance + mkfs.ext4 (in this case) on the /attached/device causes this. Attaching logs and other info shortly.
Comment 13 Gowrishankar Rajaiyan 2013-08-27 06:39:38 EDT
Fixing summary since per dev:

----T files still will be created if moving a files leads to volume imbalance wrt space consumption, even if stale -----T are left in the back-end, dht can and will clean them up later on during readdirp/access operations (which hit dht xlator)

----T are not an issue here.. if they are missing or stale, dht can clean them up as needed
Comment 14 Gowrishankar Rajaiyan 2013-08-27 06:42:55 EDT
[root@rhsqa-virt06 ~]# for i in 10.70.43.178 10.70.43.73 10.70.42.250 10.70.42.185 10.70.42.172 10.70.42.178 10.70.42.255 10.70.42.157 10.70.42.164 10.70.43.95 10.70.42.169 10.70.42.159 10.70.42.156 10.70.43.92 10.70.42.163 10.70.42.180 10.70.42.154 10.70.42.161 10.70.42.173 10.70.42.170; do ssh $i "(echo $i; getfattr -m - -d -e hex /rhs/bricks/cinder-volume; echo)"; done
10.70.43.178
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000e38e38e0ffffffff
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.43.73
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000e38e38e0ffffffff
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.250
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005555555471c71c6f
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.185
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005555555471c71c6f
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.172
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000071c71c708e38e38b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.178
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000071c71c708e38e38b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76


getfattr: Removing leading '/' from absolute path names
10.70.42.255
getfattr: /rhs/bricks/cinder-volume: No such file or directory

10.70.42.157
getfattr: /rhs/bricks/cinder-volume: No such file or directory

10.70.42.164
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000038e38e3855555553
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.43.95
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000038e38e3855555553
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.169
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001c71c71c38e38e37
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.159
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001c71c71c38e38e37
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.156
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001c71c71b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.43.92
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001c71c71b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

getfattr: Removing leading '/' from absolute path names

10.70.42.163
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000008e38e38caaaaaaa7
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76


10.70.42.180
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000008e38e38caaaaaaa7
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76


10.70.42.154
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8c71c71c3
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76


10.70.42.161
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8c71c71c3
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76


10.70.42.173
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000c71c71c4e38e38df
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76


10.70.42.170
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000c71c71c4e38e38df
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76


[root@rhsqa-virt06 ~]#
Comment 15 Gowrishankar Rajaiyan 2013-08-27 06:44:08 EDT
[root@localhost ~]# gluster vol info cinder-volume
 
Volume Name: cinder-volume
Type: Distributed-Replicate
Volume ID: ced7fae7-c90c-4df6-953f-7148ee2ada76
Status: Started
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: 10.70.43.178:/rhs/bricks/cinder-volume
Brick2: 10.70.43.73:/rhs/bricks/cinder-volume
Brick3: 10.70.42.250:/rhs/bricks/cinder-volume
Brick4: 10.70.42.185:/rhs/bricks/cinder-volume
Brick5: 10.70.42.172:/rhs/bricks/cinder-volume
Brick6: 10.70.42.178:/rhs/bricks/cinder-volume
Brick7: 10.70.42.164:/rhs/bricks/cinder-volume
Brick8: 10.70.43.95:/rhs/bricks/cinder-volume
Brick9: 10.70.42.156:/rhs/bricks/cinder-volume
Brick10: 10.70.43.92:/rhs/bricks/cinder-volume
Brick11: 10.70.42.169:/rhs/bricks/cinder-volume
Brick12: 10.70.42.159:/rhs/bricks/cinder-volume
Brick13: 10.70.42.163:/rhs/bricks/cinder-volume
Brick14: 10.70.42.180:/rhs/bricks/cinder-volume
Brick15: 10.70.42.154:/rhs/bricks/cinder-volume
Brick16: 10.70.42.161:/rhs/bricks/cinder-volume
Brick17: 10.70.42.173:/rhs/bricks/cinder-volume
Brick18: 10.70.42.170:/rhs/bricks/cinder-volume
Options Reconfigured:
diagnostics.brick-log-level: DEBUG
diagnostics.client-log-level: DEBUG
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 165
storage.owner-gid: 165
[root@localhost ~]# 

[root@localhost ~]# gluster volume rebalance cinder-volume status
                                    Node Rebalanced-files          size       scanned      failures       skipped         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                4         4.0GB            78             0             3      completed            47.00
                             10.70.43.73                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.250               10        10.0GB            88             0             0      completed           101.00
                            10.70.42.185                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.172                8         8.0GB            93             0             0      completed            83.00
                            10.70.42.178                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.255                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.157                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.164                4         4.0GB            88             0             0      completed           243.00
                             10.70.43.95                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.169                8         8.0GB            94             0             0      completed            84.00
                            10.70.42.159                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.156                0        0Bytes            74             0             2      completed             2.00
                             10.70.43.92                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.163                0        0Bytes            74             0             8      completed             2.00
                            10.70.42.180                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.154                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.161                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.173                0        0Bytes            74             0             0      completed             2.00
                            10.70.42.170                0        0Bytes            74             0             0      completed             2.00
volume rebalance: cinder-volume: success: 
[root@localhost ~]#
Comment 16 Gowrishankar Rajaiyan 2013-08-27 06:53:44 EDT
Created attachment 790895 [details]
var-lib-cinder-volumes-01cbb41254aca5f79b543409c40afa79.log

Client info and logs attached.

[root@rhs-hpc-srv1 glusterfs(keystone_admin)]# df -hT
Filesystem    Type    Size  Used Avail Use% Mounted on
/dev/sda1     ext4     30G   10G   18G  36% /
tmpfs        tmpfs     24G     0   24G   0% /dev/shm
/dev/sdf      ext4    1.1T  318M  1.1T   1% /var/log
10.70.43.178:/glance-volume
    fuse.glusterfs    300G   22G  279G   8% /var/lib/glance
/srv/loopback-device/device1
              ext4    960M   34M  876M   4% /srv/node/device1
10.70.43.178:/cinder-volume
    fuse.glusterfs    450G   22G  429G   5% /var/lib/cinder/volumes/01cbb41254aca5f79b543409c40afa79
10.70.43.178:/cinder-volume
    fuse.glusterfs    450G   22G  429G   5% /var/lib/nova/mnt/01cbb41254aca5f79b543409c40afa79
[root@rhs-hpc-srv1 glusterfs(keystone_admin)]#
Comment 17 Gowrishankar Rajaiyan 2013-08-27 06:59:49 EDT
Created attachment 790896 [details]
Screenshot of the instance
Comment 18 shishir gowda 2013-08-27 07:24:20 EDT
On the latest graph of the client, see these network disconnection log messages:

[2013-08-27 04:23:18.693638] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:18.693694] D [dht-common.c:2885:dht_fd_cbk] 2-cinder-volume-dht: subvolume cinder-volume-replicate-7 returned -1 (Transport endpoint is not connected)
[2013-08-27 04:23:18.693710] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
[2013-08-27 04:23:18.693721] D [dht-common.c:2885:dht_fd_cbk] 2-cinder-volume-dht: subvolume cinder-volume-replicate-8 returned -1 (Transport endpoint is not connected)
....
[2013-08-27 04:23:18.698527] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:18.698565] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
[2013-08-27 04:23:21.693287] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:21.693342] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
[2013-08-27 04:23:23.573663] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:23.573720] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
Comment 19 shishir gowda 2013-08-28 01:59:19 EDT
More disconnection logs:

[2013-08-27 04:23:18.442073] I [dht-layout.c:633:dht_layout_normalize] 2-cinder-volume-dht: found anomalies in /. holes=1 overlaps=1 missing=0 down=2 misc=0
[2013-08-27 04:23:18.443150] D [dht-layout.c:649:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/protocol/client.so(client3_3_lookup_cbk+0x626) [0x7f5c2aae71f6] (-->/usr/lib64/glusterfs/3.4.0.2
2rhs/xlator/cluster/replicate.so(afr_lookup_cbk+0x520) [0x7f5c2a8abe80] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x484) [0x7f5c2a633b64]))) 2-cinder-volume-dht: path=/
 err=Transport endpoint is not connected on subvol=cinder-volume-replicate-8
[2013-08-27 04:23:18.443181] D [dht-layout.c:649:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/protocol/client.so(client3_3_lookup_cbk+0x626) [0x7f5c2aae71f6] (-->/usr/lib64/glusterfs/3.4.0.2
2rhs/xlator/cluster/replicate.so(afr_lookup_cbk+0x520) [0x7f5c2a8abe80] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x484) [0x7f5c2a633b64]))) 2-cinder-volume-dht: path=/
 err=Transport endpoint is not connected on subvol=cinder-volume-replicate-7
[2013-08-27 04:23:18.443187] D [dht-common.c:499:dht_lookup_dir_cbk] 2-cinder-volume-dht: fixing assignment on /
[2013-08-27 04:23:18.443204] W [dht-selfheal.c:909:dht_selfheal_directory] 2-cinder-volume-dht: 2 subvolumes down -- not fixing
Comment 20 shishir gowda 2013-08-28 02:12:35 EDT
The subvolume seems to have come back up few seconds later:

[2013-08-27 04:23:33.237097] I [afr-common.c:3795:afr_notify] 2-cinder-volume-replicate-8: Subvolume 'cinder-volume-client-17' came back up; going online.
[2013-08-27 04:23:37.315468] I [afr-common.c:3795:afr_notify] 2-cinder-volume-replicate-7: Subvolume 'cinder-volume-client-14' came back up; going online.
Comment 21 Amar Tumballi 2013-08-28 05:46:32 EDT
Looks like the graph switch happened with 'CHILD_CONNECTING' event, and the CHILD_UP came little late. Due to this, we are done with graph switch, but the new graph is not totally active.
Comment 22 Gowrishankar Rajaiyan 2013-08-28 11:15:00 EDT
Created attachment 791431 [details]
sosreport-1-to-9-rhs-nodes.tar.bz

# tar -jcvf sosreport-1-to-9-rhs-nodes.tar.bz sosreport-10.70.42.154-20130828203046-eac2.tar.xz sosreport-10.70.42.156-20130828202709-4c1b.tar.xz sosreport-10.70.42.159-20130828202853-9a92.tar.xz sosreport-10.70.42.161-20130828203137-d166.tar.xz sosreport-10.70.42.163-20130828202932-d450.tar.xz sosreport-10.70.42.164-20130828200908-6e11.tar.xz sosreport-10.70.42.169-20130828202816-adac.tar.xz sosreport-10.70.42.170-20130828203245-f0b1.tar.xz sosreport-10.70.42.172-20130828200505-c1d1.tar.xz
Comment 23 Gowrishankar Rajaiyan 2013-08-28 11:17:15 EDT
Created attachment 791432 [details]
sosreport-10-to-18-rhs-nodes.tar.bz

# tar -jcvf sosreport-10-to-18-rhs-nodes.tar.bz sosreport-10.70.42.173-20130828203211-12a7.tar.xz sosreport-10.70.42.178-20130828200552-8ff6.tar.xz sosreport-10.70.42.180-20130828203008-670e.tar.xz sosreport-10.70.42.185-20130828200344-0b69.tar.xz sosreport-10.70.42.250-20130828200235-ec2a.tar.xz sosreport-10.70.43.178-20130828200124-b8a2.tar.xz sosreport-10.70.43.73-20130828200201-2c23.tar.xz sosreport-10.70.43.92-20130828202742-cbb3.tar.xz sosreport-10.70.43.95-20130828201036-de7b.tar.xz
Comment 24 Amar Tumballi 2013-08-30 02:08:38 EDT
Looks like we now know the issue with rebalance and self-heal.

There is a patch in the works, but as proposed by the development, the fix is not 'non-intrusive'. Hence would like little more soaking time.

Proposed it in Big Bend Readout meeting, and got approval from people to push it to Update1. Patch will be posted and reviewed, tested in upstream in meantime.
Comment 25 Gowrishankar Rajaiyan 2013-08-30 03:01:02 EDT
Are we documenting this as known issue for Big Bend ?
Comment 26 Sudhir D 2013-09-05 02:08:35 EDT
<snip>
 Internal Whiteboard: Big Bend → Big Bend U1
 Flags: rhs-2.1.0+ → rhs-2.1.z?
</snip>

Removing the blocker flag as it seem to be targetted for Update1 now.
Comment 27 Divya 2013-09-10 08:05:33 EDT
Shishir,

This bug has been identified as a known issue for Big Bend release. Please provide CCFR information in the Doc Text field.
Comment 28 Divya 2013-09-13 07:46:56 EDT
Shishir, I have edited the Doc Text field. Kindly review and sign off.
Comment 29 Amar Tumballi 2013-09-16 08:30:00 EDT
Possible fix for this issue is posted upstream @ http://review.gluster.org/#/c/5891/
Comment 31 Amar Tumballi 2013-11-20 04:11:51 EST
upstream patch accepted, now posted to downstream @

https://code.engineering.redhat.com/gerrit/#/c/16038/
https://code.engineering.redhat.com/gerrit/#/c/16039/
Comment 33 Gowrishankar Rajaiyan 2013-12-13 13:46:37 EST
Per integration bug triage 12/13, request blocker? flag for Corbett if this failedQA.
Comment 34 shilpa 2013-12-26 02:21:41 EST
Tested and verified on glusterfs-3.4.0.44rhs-1.el6rhs.x86_64. Files migrate successfully without leaving T-files behind.
Comment 35 Pavithra 2014-01-13 05:59:06 EST
Pranith, 
Can you please review the doc text for technical accuracy?
Comment 37 errata-xmlrpc 2014-02-25 02:35:08 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

Note You need to log in before you can comment on or make changes to this bug.