Bug 881420
| Summary: | [RHEV-RHS] All storage domains go "Inactive" while you add brick and then remove brick from the volume using RHEVM. |
|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage |
| Component: | glusterfs |
| Version: | 2.0 |
| Status: | CLOSED NOTABUG |
| Severity: | unspecified |
| Priority: | medium |
| Reporter: | Gowrishankar Rajaiyan <grajaiya> |
| Assignee: | shishir gowda <sgowda> |
| QA Contact: | Gowrishankar Rajaiyan <grajaiya> |
| CC: | amarts, ashetty, grajaiya, nsathyan, pprakash, rhs-bugs, shaines, vbellur |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Last Closed: | 2013-01-31 04:41:22 UTC |
Description
Gowrishankar Rajaiyan 2012-11-28 21:17:27 UTC
Created attachment 653764 [details]
Screen shot of the domains in inactive state. Do observe the event messages as well.

Version-Release number of selected component (if applicable):
RHEVM: SI24.5
RHS: glusterfs-server-3.3.0rhsvirt1-8.el6rhs.x86_64
ISO: BETA - RHS-2.0-20121110.0-RHS-x86_64-DVD1.iso
I see the vdsm traceback below on the hypervisor node. It looks like the storage domain disappeared suddenly; I have no idea why this happened. As this traceback is on the hypervisor node, it doesn't fall under RHS/vdsm.
Thread-1707::WARNING::2012-11-28 18:22:42,508::persistentDict::248::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-1707::ERROR::2012-11-28 18:22:42,508::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `4a8b100c-c42f-4bf7-97e8-1ca6865b6d71`
Traceback (most recent call last):
File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
File "/usr/share/vdsm/storage/nfsSD.py", line 155, in findDomain
File "/usr/share/vdsm/storage/fileSD.py", line 130, in __init__
File "/usr/share/vdsm/storage/persistentDict.py", line 85, in __getitem__
File "/usr/share/vdsm/storage/persistentDict.py", line 195, in __getitem__
KeyError: 'SDUUID'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517'
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' (0 active users)
Thread-1707::DEBUG::2012-11-28 18:22:42,509::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517' is free, finding out if anyone is waiting for it.
Thread-1707::DEBUG::2012-11-28 18:22:42,510::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.f96db8c9-aad3-4cb0-83a5-d183442b8517', Clearing records.
Thread-1707::ERROR::2012-11-28 18:22:42,510::task::853::TaskManager.Task::(_setError) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 861, in _run
File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
File "/usr/share/vdsm/storage/hsm.py", line 862, in connectStoragePool
File "/usr/share/vdsm/storage/hsm.py", line 904, in _connectStoragePool
File "/usr/share/vdsm/storage/sp.py", line 648, in connect
File "/usr/share/vdsm/storage/sp.py", line 1178, in __rebuild
File "/usr/share/vdsm/storage/sp.py", line 1522, in getMasterDomain
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=f96db8c9-aad3-4cb0-83a5-d183442b8517, msdUUID=4a8b100c-c42f-4bf7-97e8-1ca6865b6d71'
Thread-1707::DEBUG::2012-11-28 18:22:42,511::task::872::TaskManager.Task::(_run) Task=`5be59ae0-f95b-430d-902a-69cf36757536`::Task._run: 5be59ae0-f95b-430d-902a-69cf36757536 ('f96db8c9-aad3-4cb0-83a5-d183442b8517', 1, 'f96db8c9-aad3-4cb0-83a5-d183442b8517', '4a8b100c-c42f-4bf7-97e8-1ca6865b6d71', 1) {} failed - stopping task
I think the mounted glusterfs volume on this hypervisor could have some problem.
Changing component to glusterfs to do some analysis of the glusterfs logs.
As mentioned in bug 875076, was the volume started with just 1 brick (or 1 pair of bricks for replicate)? If that is the case, it should be well covered under known issues. If the volume had more than 1 brick to start with, then I think it would be a 'serious' issue. Shanks, can you confirm?

I can confirm that: I had 2 bricks in the distribute volume, and I added one and removed the other.

Looks like network issues. Log messages from the client on the hypervisor:

[2012-11-28 14:33:14.059323] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-1: server 10.70.36.9:24010 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.060838] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-Distribute-client-0: server 10.70.36.9:24009 has not responded in the last 42 seconds, disconnecting.
[2012-11-28 14:33:14.061130] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209947 (xid=0x4975x)
[2012-11-28 14:33:14.061169] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061204] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-1: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745217 (xid=0x4976x)
[2012-11-28 14:33:14.061219] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-1: timer must have expired
[2012-11-28 14:33:14.061231] I [client.c:2090:client_rpc_notify] 0-Distribute-client-1: disconnected
[2012-11-28 14:33:14.061274] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2012-11-28 14:32:29.209923 (xid=0x4579x)
[2012-11-28 14:33:14.061292] W [client3_1-fops.c:2650:client3_1_lookup_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2012-11-28 14:33:14.061325] W [fuse-bridge.c:513:fuse_attr_cbk] 0-glusterfs-fuse: 6595: LOOKUP() / => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061372] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2012-11-28 14:32:31.745224 (xid=0x4580x)
[2012-11-28 14:33:14.061387] W [client-handshake.c:275:client_ping_cbk] 0-Distribute-client-0: timer must have expired
[2012-11-28 14:33:14.061410] E [rpc-clnt.c:373:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x78) [0x7f1543d77818] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xb0) [0x7f1543d774d0] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1543d76f3e]))) 0-Distribute-client-0: forced unwinding frame type(GlusterFS 3.1) op(READ(12)) called at 2012-11-28 14:32:35.595184 (xid=0x4581x)
[2012-11-28 14:33:14.061433] W [client3_1-fops.c:2720:client3_1_readv_cbk] 0-Distribute-client-0: remote operation failed: Transport endpoint is not connected
[2012-11-28 14:33:14.061448] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6596: READ => -1 (Transport endpoint is not connected)
[2012-11-28 14:33:14.061469] I [client.c:2090:client_rpc_notify] 0-Distribute-client-0: disconnected
[2012-11-28 14:33:14.561795] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:14.561824] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6597: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.062059] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.062090] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6598: READ => -1 (File descriptor in bad state)
[2012-11-28 14:33:15.562339] W [client3_1-fops.c:3922:client3_1_readv] 0-Distribute-client-0: (51b822be-34fa-4817-8b7b-f04c584dadff) remote_fd is -1. EBADFD
[2012-11-28 14:33:15.562370] W [fuse-bridge.c:1948:fuse_readv_cbk] 0-glusterfs-fuse: 6599: READ => -1 (File descriptor in bad state)

The above messages are from after the volume was started with 2 bricks. Disconnection messages appear even after add-brick and remove-brick. In addition, the sos report does not have rebalance (remove-brick) logs. Was remove-brick issued with the 'start' option?

The client error logs are due to network disconnections. But the interesting point is that rebalance/remove-brick never seems to have been started, even though it was issued from the console.

RHEV-M doesn't have support for async tasks as of now, so 'remove-brick' will result in 'remove-brick force' behavior, which doesn't migrate any data. So there is no fix possible in GlusterFS. Once asynchronous task support lands in RHEV-M, this can be reopened. Till then, WONTFIX/NOTABUG.

(In reply to comment #12)
> RHEV-M doesn't have support for async tasks as of now, so 'remove-brick'
> will result in 'remove-brick force' behavior, which doesn't migrate any
> data. So there is no fix possible in GlusterFS. Once asynchronous
> task support lands in RHEV-M, this can be reopened. Till then, WONTFIX/NOTABUG.

In that case, shouldn't this be targeted to rhs-future instead of being closed as WONTFIX/NOTABUG?
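For reference, the distinction drawn in comment #12 is between a 'start'-based remove-brick, which migrates data off the brick before it is dropped, and the plain/'force' removal RHEV-M effectively issued, which drops the brick without migrating anything. A minimal sketch of the start-based sequence follows; the volume and brick names are placeholders (not from this bug), and the helper only prints each command rather than executing it, since no live gluster cluster is assumed:

```shell
#!/bin/sh
# Sketch of the asynchronous 'start' remove-brick workflow, as opposed to
# the force-style removal that skips data migration. VOL and BRICK are
# hypothetical placeholders.
VOL=Distribute
BRICK=host1:/rhs/brick2

# Dry-run helper: print each command instead of running it.
run() { echo "+ $*"; }

# 1. Kick off data migration away from the brick (runs asynchronously).
run gluster volume remove-brick "$VOL" "$BRICK" start
# 2. Poll until the migration status reports completion.
run gluster volume remove-brick "$VOL" "$BRICK" status
# 3. Only then commit, which actually drops the brick from the volume.
run gluster volume remove-brick "$VOL" "$BRICK" commit
```

A force-style removal skips steps 1 and 2 entirely, which is consistent with the observation above that the sos report contained no rebalance (remove-brick) logs.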