Description of problem:
All storage domains go "Inactive" while I perform an add-brick followed by a remove-brick. This looks similar to bug 875076; however, it may be worth noting that I have only 1 VM with no disk attached to it, and no rebalance command was executed manually.

Version-Release number of selected component (if applicable):
RHEVM: SI24.5
RHS: glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a gluster-enabled cluster.
2. Using RHEVM, create, start and optimize 2 volumes (distribute and replicate).
3. Add these volumes as storage domains to a vdsm-enabled cluster - which in my case is in a different (hypervisor) data center.
4. Ensure that both storage domains are Up.
5. Using RHEVM, add a brick to the distribute volume.
6. Then remove a brick from the distribute volume.
7. Go to the "Storage" node of the hypervisor data center.
8. Observe your storage domains.

Actual results:
All domains are in the "Inactive" state.

Expected results:
Domain status should remain unchanged.

Additional info:
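For reference, the brick operations in steps 5-6 correspond roughly to the following gluster CLI commands. This is a sketch only; the volume name matches the logs below, but the server/brick paths are placeholders, not the actual ones from this setup:

```shell
# Step 5: add a brick to the 2-brick distribute volume
# (hostnames and brick paths below are hypothetical).
gluster volume add-brick Distribute server3:/rhs/brick1

# Step 6: remove a brick. Note that without the 'start' option,
# remove-brick takes effect immediately and does not migrate
# data off the brick first.
gluster volume remove-brick Distribute server1:/rhs/brick1
```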
Version-Release number of selected component (if applicable):
RHEVM: SI24.5
RHS: glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
ISO: BETA - RHS-2.0-20121110.0-RHS-x86_64-DVD1.iso
Created attachment 653764 [details]
Screenshot of the domains in the Inactive state. Do observe the event messages as well.
I see the vdsm traceback below on the hypervisor node. It looks like the storage domain disappeared suddenly; no idea why this happened. As this traceback is on the hypervisor node, this doesn't fall under RHS/vdsm.

Thread-1707::WARNING::2012-11-28 18:22:42,508::persistentDict::248::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-1707::ERROR::2012-11-28 18:22:42,508::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `4a8b100c-c42f-4bf7-97e8-1ca6865b6d71`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
  File "/usr/share/vdsm/storage/nfsSD.py", line 155, in findDomain
  File "/usr/share/vdsm/storage/fileSD.py", line 130, in __init__
  File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
  File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
KeyError: 'SDUUID'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' (0 active users)
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' is free, finding out if anyone is waiting for it.
Thread-1707::DEBUG::2012-11-28 18:22:42,510::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517', Clearing records.
Thread-1707::ERROR::2012-11-28 18:22:42,510::task::853::TaskManager.Task::(_setError) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
  File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
  File "/usr/share/vdsm/storage/sp.py", line 648, in connect
  File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
  File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=f96db8c9-aad3-4cb0-83a5-d183442b8517, msdUUID=4a8b100c-c42f-4bf7-97e8-1ca6865b6d71'
Thread-1707::DEBUG::2012-11-28 18:22:42,511::task::872::TaskManager.Task::(_run) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Task._run: 5be59ae0-f95b-430d-902a-69cf36757536 ('f96db8c9-aad3-4cb0-83a5-d183442b8517', 1, 'f96db8c9-aad3-4cb0-83a5-d183442b8517', '4a8b100c-c42f-4bf7-97e8-1ca6865b6d71', 1) {} failed - stopping task

I think the glusterfs volume mounted on this hypervisor could have some problem. Changing component to glusterfs for some analysis of the glusterfs logs.
As mentioned in bug 875076, were the volumes started with just 1 brick (or 1 pair of bricks for replicate)? If that is the case, it should be well covered under known issues. If the volumes had more than 1 brick to start with, then I think it would be a serious issue. Shanks, can you confirm?
I can confirm that: the distribute volume had 2 bricks to start with, and I added one brick and then removed another.
Looks like network issues. Log messages from the client on the hypervisor:

[2012-11-28 14:33:14.059323] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-1: server 10.70.36.9:24010 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.060838] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-0: server 10.70.36.9:24009 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.061130] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209947 (xid=0x4975x)
[2012-11-28 14:33:14.061169] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061204] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745217 (xid=0x4976x)
[2012-11-28 14:33:14.061219] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-1: timer must have expired
[2012-11-28 14:33:14.061231] I [client.c:2090:client_rpc_notify] 0-Distribute-client-1: disconnected
[2012-11-28 14:33:14.061274] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209923 (xid=0x4579x)
[2012-11-28 14:33:14.061292] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061325] W [fuse-bridge.c:513:fuse_attr_cbk] 0-glusterfs-fuse: 6595: LOOKUP() / => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061372] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745224 (xid=0x4580x)
[2012-11-28 14:33:14.061387] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-0: timer must have expired
[2012-11-28 14:33:14.061410] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2012-11-28 14:32:35.595184 (xid=0x4581x)
[2012-11-28 14:33:14.061433] W [client3_1-fops.c:2720:client3_1_readv_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected
[2012-11-28 14:33:14.061448] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6596: READ => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061469] I [client.c:2090:client_rpc_notify] 0-Distribute-client-0: disconnected
[2012-11-28 14:33:14.561795] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:14.561824] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6597: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.062059] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.062090] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6598: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.562339] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.562370] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6599: READ => -1 (File descriptor in bad state)

The above messages are from after the volume was started with 2 bricks. Disconnection messages appear even after the add-brick and remove-brick. In addition, the sos report does not have rebalance (remove-brick) logs. Was remove-brick issued with the 'start' option?
The client error logs are due to network disconnections. The interesting point, however, is that the rebalance/remove-brick never seems to have started, even though it was issued from the console.
RHEV-M doesn't have support for asynchronous tasks as of now, so 'remove-brick' results in 'remove-brick force' behavior, which doesn't migrate any data. So there is no fix possible in GlusterFS. Once asynchronous task support lands in RHEV-M, this can be reopened. Till then, WONTFIX/NOTABUG.
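For the record, when driven from the gluster CLI directly rather than from RHEV-M, remove-brick can be run in its deferred, data-migrating form. A sketch with the volume name from this report but hypothetical server/brick paths:

```shell
# Deferred removal: migrate data off the brick before removing it
# (brick path below is a placeholder, not the one from this setup).
gluster volume remove-brick Distribute server1:/rhs/brick1 start

# Poll until the migration is reported as completed.
gluster volume remove-brick Distribute server1:/rhs/brick1 status

# Commit only once the status shows migration has finished.
gluster volume remove-brick Distribute server1:/rhs/brick1 commit

# By contrast, 'force' (the behavior RHEV-M currently triggers)
# removes the brick immediately without migrating any data:
gluster volume remove-brick Distribute server1:/rhs/brick1 force
```

This is why the sos report contains no rebalance/remove-brick logs: the forced form never starts a migration, so there is nothing for it to log.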
(In reply to comment #12)
> RHEV-M doesn't have support for Async Tasks as of now, so the 'remove-brick'
> will result in 'remove-brick force' behavior, which doesn't migrate any
> data. So, there is no fixes possible with GlusterFS. Once the asynchronous
> task support in RHEV-M this can be reopened. Till then, WONTFIX/NOTABUG.

In that case, shouldn't this be targeted to rhs-future instead of being closed as WONTFIX/NOTABUG?