Description of problem:
All storage domains go "Inactive" while I perform an add-brick followed by a remove-brick. This looks similar to bug 875076; however, it may be worth noting that I have only 1 VM with no disk attached to it, and no rebalance command was executed manually.

Version-Release number of selected component (if applicable):
RHEVM: SI24.5
RHS: glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a gluster-enabled cluster.
2. Using RHEVM, create, start and optimize 2 volumes (distribute and replicate).
3. Add these volumes as storage domains to a vdsm-enabled cluster - which in my case is in a different (hypervisor) data center.
4. Ensure that both storage domains are Up.
5. Using RHEVM, add a brick to the distribute volume.
6. Then remove a brick from the distribute volume.
7. Go to the "Storage" node of the hypervisor data center.
8. Observe your storage domains.

Actual results:
All domains are in the "Inactive" state.

Expected results:
Domain status should remain unchanged.

Additional info:
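For reference, the brick operations in steps 5-6 correspond roughly to the following gluster CLI commands. This is a sketch only; the volume name matches the logs below, but the server/brick paths are placeholders, not the actual ones from this setup:

```shell
# Step 5: add a brick to the 2-brick distribute volume
# (hostnames and brick paths below are hypothetical).
gluster volume add-brick Distribute server3:/rhs/brick1

# Step 6: remove a brick. Note that without the 'start' option,
# remove-brick takes effect immediately and does not migrate
# data off the brick first.
gluster volume remove-brick Distribute server1:/rhs/brick1
```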
Version-Release number of selected component (if applicable):
RHEVM: SI24.5
RHS: glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
ISO: BETA - RHS-2.0-20121110.0-RHS-x86_64-DVD1.iso
Created attachment 653764 [details]
Screenshot of the domains in the Inactive state. Do observe the event messages as well.
I see the vdsm traceback below on the hypervisor node. It looks like the storage domain disappeared suddenly; no idea why this happened. As this traceback is on the hypervisor node, this doesn't fall under RHS/vdsm.

Thread-1707::WARNING::2012-11-28 18:22:42,508::persistentDict::248::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-1707::ERROR::2012-11-28 18:22:42,508::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `4a8b100c-c42f-4bf7-97e8-1ca6865b6d71`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
  File "/usr/share/vdsm/storage/nfsSD.py", line 155, in findDomain
  File "/usr/share/vdsm/storage/fileSD.py", line 130, in __init__
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
  File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
KeyError: 'SDUUID'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' (0 active users)
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' is free, finding out if anyone is waiting for it.
Thread-1707::DEBUG::2012-11-28 18:22:42,510::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517', Clearing records.
Thread-1707::ERROR::2012-11-28 18:22:42,510::task::853::TaskManager.Task::(_setError) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=f96db8c9-aad3-4cb0-83a5-d183442b8517, msdUUID=4a8b100c-c42f-4bf7-97e8-1ca6865b6d71'
Thread-1707::DEBUG::2012-11-28 18:22:42,511::task::872::TaskManager.Task::(_run) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Task._run: 5be59ae0-f95b-430d-902a-69cf36757536 ('f96db8c9-aad3-4cb0-83a5-d183442b8517', 1, 'f96db8c9-aad3-4cb0-83a5-d183442b8517', '4a8b100c-c42f-4bf7-97e8-1ca6865b6d71', 1) {} failed - stopping task

I think the glusterfs volume mounted on this hypervisor could have some problem. Changing component to glusterfs for some analysis of the glusterfs logs.
As mentioned in bug 875076, were the volumes started with just 1 brick (or 1 pair of bricks for replicate)? If that is the case, it should be well covered under known issues. If the volumes had more than 1 brick to start with, then I think it would be a serious issue. Shanks, can you confirm?
I can confirm that: the distribute volume had 2 bricks to start with, and I added one brick and then removed another.
Looks like network issues. Log messages from the client on the hypervisor:

[2012-11-28 14:33:14.059323] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-1: server 10.70.36.9:24010 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.060838] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-0: server 10.70.36.9:24009 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.061130] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209947 (xid=0x4975x)
[2012-11-28 14:33:14.061169] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061204] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745217 (xid=0x4976x)
[2012-11-28 14:33:14.061219] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-1: timer must have expired
[2012-11-28 14:33:14.061231] I [client.c:2090:client_rpc_notify] 0-Distribute-client-1: disconnected
[2012-11-28 14:33:14.061274] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209923 (xid=0x4579x)
[2012-11-28 14:33:14.061292] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061325] W [fuse-bridge.c:513:fuse_attr_cbk] 0-glusterfs-fuse: 6595: LOOKUP() / => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061372] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745224 (xid=0x4580x)
[2012-11-28 14:33:14.061387] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-0: timer must have expired
[2012-11-28 14:33:14.061410] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2012-11-28 14:32:35.595184 (xid=0x4581x)
[2012-11-28 14:33:14.061433] W [client3_1-fops.c:2720:client3_1_readv_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected
[2012-11-28 14:33:14.061448] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6596: READ => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061469] I [client.c:2090:client_rpc_notify] 0-Distribute-client-0: disconnected
[2012-11-28 14:33:14.561795] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:14.561824] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6597: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.062059] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.062090] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6598: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.562339] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.562370] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6599: READ => -1 (File descriptor in bad state)

The above messages are from after the volume was started with 2 bricks. Disconnection messages appear even after the add-brick and remove-brick. In addition, the sos report does not have rebalance (remove-brick) logs. Was remove-brick issued with the 'start' option?
The client error logs are due to network disconnections. The interesting point, however, is that the rebalance/remove-brick never seems to have started, even though it was issued from the console.
RHEV-M doesn't have support for asynchronous tasks as of now, so 'remove-brick' results in 'remove-brick force' behavior, which doesn't migrate any data. So there is no fix possible in GlusterFS. Once asynchronous task support lands in RHEV-M, this can be reopened. Till then, WONTFIX/NOTABUG.
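For the record, when driven from the gluster CLI directly rather than from RHEV-M, remove-brick can be run in its deferred, data-migrating form. A sketch with the volume name from this report but hypothetical server/brick paths:

```shell
# Deferred removal: migrate data off the brick before removing it
# (brick path below is a placeholder, not the one from this setup).
gluster volume remove-brick Distribute server1:/rhs/brick1 start

# Poll until the migration is reported as completed.
gluster volume remove-brick Distribute server1:/rhs/brick1 status

# Commit only once the status shows migration has finished.
gluster volume remove-brick Distribute server1:/rhs/brick1 commit

# By contrast, 'force' (the behavior RHEV-M currently triggers)
# removes the brick immediately without migrating any data:
gluster volume remove-brick Distribute server1:/rhs/brick1 force
```

This is why the sos report contains no rebalance/remove-brick logs: the forced form never starts a migration, so there is nothing for it to log.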
(In reply to comment #12)
> RHEV-M doesn't have support for Async Tasks as of now, so the 'remove-brick'
> will result in 'remove-brick force' behavior, which doesn't migrate any
> data. So, there is no fixes possible with GlusterFS. Once the asynchronous
> task support in RHEV-M this can be reopened. Till then, WONTFIX/NOTABUG.

In that case, shouldn't this be targeted to rhs-future instead of being closed as WONTFIX/NOTABUG?