Bug 881420

Summary: [RHEV-RHS] All storage domains go "Inactive" after adding a brick and then removing a brick from the volume using RHEVM.
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterfs
Version: 2.0
Status: CLOSED NOTABUG
Severity: unspecified
Priority: medium
Reporter: Gowrishankar Rajaiyan <grajaiya>
Assignee: shishir gowda <sgowda>
QA Contact: Gowrishankar Rajaiyan <grajaiya>
CC: amarts, ashetty, grajaiya, nsathyan, pprakash, rhs-bugs, shaines, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-01-31 04:41:22 UTC
Attachments: Screen shot of the domains in inactive state.

Description Gowrishankar Rajaiyan 2012-11-28 21:17:27 UTC
Description of problem:
All storage domains go "Inactive" while I perform an add-brick followed by a remove-brick. This looks similar to bug 875076; however, it may be worth noting that I have only 1 VM with no disk attached to it, and no rebalance command was executed manually.

Version-Release number of selected component (if applicable):
RHEVM: SI24.5
RHS: glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a gluster enabled cluster.
2. Using RHEVM, create, start, and optimize 2 volumes (distribute and replicate).
3. Add these volumes to a vdsm-enabled cluster - which in my case is from a different (hypervisor) data center.
4. Ensure that both the storage domains are Up.
5. Using RHEVM, add a brick to distribute volume.
6. And then remove a brick from distribute volume.
7. Go to "Storage" node of hypervisor data center.
8. Observe your storage domains.
  
Actual results: All domains are in "Inactive" state.


Expected results: Domain status should remain unchanged.


Additional info:

Comment 1 Gowrishankar Rajaiyan 2012-11-28 21:19:56 UTC
Version-Release number of selected component (if applicable):
RHEVM: SI24.5
RHS: glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
ISO: BETA - RHS-2.0-20121110.0-RHS-x86_64-DVD1.iso

Comment 5 Gowrishankar Rajaiyan 2012-11-28 21:35:40 UTC
Created attachment 653764 [details]
Screen shot of the domains in inactive state.

Do observe the event messages as well.

Comment 6 Bala.FA 2012-11-30 09:01:55 UTC
I see the vdsm traceback below on the hypervisor node. It looks like the storage domain disappeared suddenly! No idea why this happened. As this traceback is on the hypervisor node, this doesn't fall under RHS/vdsm.


Thread-1707::WARNING::2012-11-28 18:22:42,508::persistentDict::248::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-1707::ERROR::2012-11-28 18:22:42,508::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `4a8b100c-c42f-4bf7-97e8-1ca6865b6d71`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
  File "/usr/share/vdsm/storage/nfsSD.py", line 155, in findDomain
  File "/usr/share/vdsm/storage/fileSD.py", line 130, in __init__
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
  File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
KeyError: 'SDUUID'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' (0 active users)
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' is free, finding out if anyone is waiting for it.
Thread-1707::DEBUG::2012-11-28 18:22:42,510::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517', Clearing records.
Thread-1707::ERROR::2012-11-28 18:22:42,510::task::853::TaskManager.Task::(_setError) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=f96db8c9-aad3-4cb0-83a5-d183442b8517, msdUUID=4a8b100c-c42f-4bf7-97e8-1ca6865b6d71'
Thread-1707::DEBUG::2012-11-28 18:22:42,511::task::872::TaskManager.Task::(_run) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Task._run: 5be59ae0-f95b-430d-902a-69cf36757536 ('f96db8c9-aad3-4cb0-83a5-d183442b8517', 1, 'f96db8c9-aad3-4cb0-83a5-d183442b8517', '4a8b100c-c42f-4bf7-97e8-1ca6865b6d71', 1) {} failed - stopping task


I think the glusterfs volume mounted on this hypervisor could have some problem.

Changing component to glusterfs to do some analysis of glusterfs logs.

Comment 7 Amar Tumballi 2012-12-03 06:41:53 UTC
As mentioned in bug 875076, were the volumes started with just 1 brick (or 1 pair of bricks for replicate)? If that is the case, it should be well covered under known issues.

If the volumes had more than 1 brick to start with, then I think it would be a 'serious' issue. Shanks, can you confirm?

Comment 8 Gowrishankar Rajaiyan 2012-12-03 10:40:02 UTC
I can confirm that: the distribute volume had 2 bricks, and I added one brick and then removed another.

Comment 9 shishir gowda 2012-12-04 11:13:16 UTC
Looks like network issues. Log messages from the client on the hypervisor:

[2012-11-28 14:33:14.059323] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-1: server 10.70.36.9:24010 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.060838] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-0: server 10.70.36.9:24009 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.061130] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209947 (xid=0x4975x)
[2012-11-28 14:33:14.061169] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061204] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745217 (xid=0x4976x)
[2012-11-28 14:33:14.061219] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-1: timer must have expired
[2012-11-28 14:33:14.061231] I [client.c:2090:client_rpc_notify] 0-Distribute-client-1: disconnected
[2012-11-28 14:33:14.061274] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209923 (xid=0x4579x)
[2012-11-28 14:33:14.061292] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061325] W [fuse-bridge.c:513:fuse_attr_cbk] 0-glusterfs-fuse: 6595: LOOKUP() / => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061372] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745224 (xid=0x4580x)
[2012-11-28 14:33:14.061387] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-0: timer must have expired
[2012-11-28 14:33:14.061410] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2012-11-28 14:32:35.595184 (xid=0x4581x)
[2012-11-28 14:33:14.061433] W [client3_1-fops.c:2720:client3_1_readv_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected
[2012-11-28 14:33:14.061448] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6596: READ => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061469] I [client.c:2090:client_rpc_notify] 0-Distribute-client-0: disconnected
[2012-11-28 14:33:14.561795] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0:  (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:14.561824] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6597: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.062059] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0:  (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.062090] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6598: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.562339] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0:  (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.562370] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6599: READ => -1 (File descriptor in bad state)


The above messages are from after the volume was started with 2 bricks. Disconnection messages appear even after add-brick and remove-brick. In addition, the sos report does not contain the rebalance (remove-brick) logs. Was remove-brick issued with the 'start' option?
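For context, whether remove-brick migrates data (and therefore produces rebalance logs on the servers) depends on which form of the command was issued. A sketch of the two forms on the gluster CLI, where "Distribute" is the volume from this report but the host and brick path are hypothetical:

```shell
# Data-migrating form: starts a rebalance-style migration off the brick
# and writes remove-brick/rebalance logs on the server nodes.
gluster volume remove-brick Distribute server2:/rhs/brick2 start
gluster volume remove-brick Distribute server2:/rhs/brick2 status   # poll until completed
gluster volume remove-brick Distribute server2:/rhs/brick2 commit

# Forced form: drops the brick immediately, migrates no data,
# and leaves no rebalance logs behind.
gluster volume remove-brick Distribute server2:/rhs/brick2 force
```

If only the forced form was ever run, the absence of rebalance logs in the sos report would be expected.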

Comment 11 shishir gowda 2013-01-24 05:45:44 UTC
The client error logs are due to network disconnections. But the interesting point is that rebalance/remove-brick never seems to have been started, though it was issued from the console.

Comment 12 Amar Tumballi 2013-01-31 04:41:22 UTC
RHEV-M doesn't have support for async tasks as of now, so 'remove-brick' results in 'remove-brick force' behavior, which doesn't migrate any data. So there is no fix possible in GlusterFS. Once asynchronous task support lands in RHEV-M, this can be reopened. Till then, WONTFIX/NOTABUG.

Comment 13 Gowrishankar Rajaiyan 2013-01-31 15:02:46 UTC
(In reply to comment #12)
> RHEV-M doesn't have support for async tasks as of now, so 'remove-brick'
> results in 'remove-brick force' behavior, which doesn't migrate any data.
> So there is no fix possible in GlusterFS. Once asynchronous task support
> lands in RHEV-M, this can be reopened. Till then, WONTFIX/NOTABUG.


In that case, shouldn't this be targeted to rhs-future instead of being closed as WONTFIX/NOTABUG?