Bug 1078907 - Reconstruct master is not re-attempted after a timeout failure
Summary: Reconstruct master is not re-attempted after a timeout failure
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Liron Aravot
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: rhev3.5beta 1156165
 
Reported: 2014-03-20 14:13 UTC by Elad
Modified: 2016-02-10 19:23 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-17 12:25:16 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
amureini: Triaged+


Attachments
engine, vdsm, libvirt and sanlock (2.04 MB, application/x-gzip), 2014-03-20 14:13 UTC, Elad
1 (connectivity loss) (1.85 MB, application/x-gzip), 2014-03-27 10:25 UTC, Elad
2 (after restoring connectivity) (1006.94 KB, application/x-gzip), 2014-03-27 10:34 UTC, Elad
logs (18.5.14) (1.24 MB, application/x-gzip), 2014-05-18 11:06 UTC, Elad

Description Elad 2014-03-20 14:13:10 UTC
Created attachment 876873 [details]
engine, vdsm, libvirt and sanlock

Description of problem:
On a DC with 1 host, I blocked connectivity from the SPM host to the storage server where the master storage domain is located. The engine tries to perform a reconstruct and another domain takes over as master, but the DC remains stuck between the 'Contending'/'Non-Responsive' states. Even if connectivity to the storage server is restored, the DC remains stuck, and manual intervention (such as manually activating the master domain) does not help.

Version-Release number of selected component (if applicable):
rhevm-3.4.0-0.5.master.el6ev.noarch
vdsm-4.14.2-0.4.el6ev.x86_64
libvirt-0.10.2-29.el6_5.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.6.x86_64
sanlock-2.8-1.el6.x86_64


How reproducible:
Always

Steps to Reproduce:
On a shared DC with one host in the cluster and more than one storage domain, located on different storage servers (storage type doesn't matter):
1. Block connectivity from the SPM host to the storage server where the master domain is located


Actual results:
The DC becomes non-responsive. Reconstruct takes place and another domain takes over as master:

2014-03-20 15:31:05,211 INFO  [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-4-thread-15) [6e53e84e] Running command: ReconstructMasterDomainCommand internal: true. Entities affected :  ID: 9dfe8b53-c098-4211-a677-a0d5269baf54 Type: Storage


Even though the reconstruct succeeded, storage domain activation fails because the SPM fails to start. In vdsm.log I see that connectStorageServer fails with the following error:

 Thread-23::INFO::2014-03-20 16:05:10,019::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': '10.35.160.108:
/RHEV/elad/1', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '865f8d5c-1917-4c93-bfd8-117b9e3e2262', 'port': ''}], options=None)
Thread-23::DEBUG::2014-03-20 16:05:10,021::hsm::2317::Storage.HSM::(__prefetchDomains) nfs local path: /rhev/data-center/mnt/10.35.160.108:_RHEV_elad_1
Thread-13::DEBUG::2014-03-20 16:05:15,981::BindingXMLRPC::251::vds::(wrapper) client [10.35.161.52]
Thread-13::INFO::2014-03-20 16:05:15,981::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='aa648e96-ee6b-4d64-beb9-c538f3012a85', options=None)
Thread-13::INFO::2014-03-20 16:05:15,996::logUtils::47::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}
Thread-13::DEBUG::2014-03-20 16:05:16,001::BindingXMLRPC::251::vds::(wrapper) client [10.35.161.52]
Thread-13::INFO::2014-03-20 16:05:16,001::logUtils::44::dispatcher::(wrapper) Run and protect: getAllTasksStatuses(spUUID=None, options=None)
Thread-13::ERROR::2014-03-20 16:05:16,002::dispatcher::68::Storage.Dispatcher.Protect::(run) Secured object is not in safe state
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 60, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1176, in prepare
    raise self.error
SecureError: Secured object is not in safe state


=================
Even after I resume connectivity to the storage server, the data center fails to initialize. The host is stuck in the Contending state.

Manual activation of the new master fails too.

Expected results:
If connectivity to the storage server where the master domain is located is lost, reconstruct and the subsequent storage domain activation are supposed to succeed.

Additional info: engine, vdsm, libvirt and sanlock

Comment 2 Allon Mureinik 2014-03-26 09:34:24 UTC
Liron, isn't this a dup of another BZ we're working on?

Comment 3 Liron Aravot 2014-03-26 12:23:59 UTC
Allon, the SPM failing to start with a blocked domain should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1072900

Elad, adding needinfo -
1. Please clarify/separate the logs of the second part of the scenario (when connectivity is restored).

2. You attached logs from the wrong host (vdsb instead of vdsc); please attach the correct logs.

Comment 4 Elad 2014-03-27 10:25:29 UTC
Created attachment 879393 [details]
1 (connectivity loss)

(In reply to Liron Aravot from comment #3)
> Allon, the SPM failing to start with a blocked domain should be a duplicate
> of https://bugzilla.redhat.com/show_bug.cgi?id=1072900
> 
> Elad, adding needinfo -
> 1. Please clarify/separate the logs of the second part of the scenario (when
> connectivity is restored).
> 
> 2. You attached logs from the wrong host (vdsb instead of vdsc); please
> attach the correct logs.

Uploading 2 files:

1) While the host is disconnected from the storage server where the master domain is located
2) After restoring connectivity to this storage server

Connectivity was restored at 2014-03-27 ~12:22

Comment 5 Elad 2014-03-27 10:34:50 UTC
Created attachment 879394 [details]
2 (after restoring connectivity)

Liron, I'm testing this with the new downstream build (av5). While the host was disconnected from the storage server where the master domain is located, storage domain activation failed, just as it did with the build on which I originally encountered the bug.
It seems the second issue, the SPM failing to start even after connectivity was restored, is now fixed.

I'm attaching the logs. As mentioned in comment #4, connectivity was restored at 2014-03-27 ~12:22



Liron, if you'd like, I can reproduce it with the old build.

Comment 6 Liron Aravot 2014-03-27 13:23:26 UTC
Elad, a reproduction would be great so that I can analyze why you couldn't recover in that scenario and verify that we don't have another hidden issue.

Comment 7 Elad 2014-04-01 10:50:18 UTC
I did not manage to reproduce this with av3 or with the latest build (av5).

Closing as WORKSFORME

Comment 8 Elad 2014-04-01 11:09:51 UTC

*** This bug has been marked as a duplicate of bug 1072900 ***

Comment 9 Elad 2014-05-18 11:05:34 UTC
I managed to reproduce the issue.
After a failure in reconstruct master, reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1098769, I restored connectivity from the SPM host to the storage server where the master domain is located. Storage domain activation failed with a failure in connectStorageServer, even though connectivity to the storage server had been restored.

From engine.log:

2014-05-18 13:40:18,584 INFO  [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (org.ovirt.thread.pool-4-thread-37) [285cb901] ActivateStorage Domain. Before Connect all hosts to pool. Time:5/18/14 1:40 PM
2014-05-18 13:40:18,593 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-4-thread-44) [285cb901] START, ConnectStorageServerVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storageType = ISCSI, connectionList = [{ id: 455150d2-3bfd-4d51-a903-5782faf17979, connection: 10.35.160.107, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 9ea801e1-bdce-415b-b838-8e7141d83a71, connection: 10.35.160.105, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: b7fd305f-c95c-4081-90ee-be0f17b2ac73, connection: 10.35.160.106, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: d9eb9732-ef14-427a-9c45-206365dadb64, connection: 10.35.160.104, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 6c14ea26
2014-05-18 13:41:00,329 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-4-thread-18) [68837cc8] Command ConnectStorageServerVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storageType = ISCSI, connectionList = [{ id: 455150d2-3bfd-4d51-a903-5782faf17979, connection: 10.35.160.107, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 9ea801e1-bdce-415b-b838-8e7141d83a71, connection: 10.35.160.105, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: b7fd305f-c95c-4081-90ee-be0f17b2ac73, connection: 10.35.160.106, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: d9eb9732-ef14-427a-9c45-206365dadb64, connection: 10.35.160.104, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-18 13:41:00,329 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-4-thread-18) [68837cc8] FINISH, ConnectStorageServerVDSCommand, log id: 55b9904b
2014-05-18 13:41:00,330 ERROR [org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation] (org.ovirt.thread.pool-4-thread-18) [68837cc8] Failed to connect host green-vdsa to storage domain (name: 2-4-iscsi, id: 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14). Exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)



In vdsm.log, the connectStorageServer thread ends with:

Thread-88::DEBUG::2014-05-18 13:40:57,012::lvm::373::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' is holding the operation mutex, waiting...
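
To make this failure mode easier to follow, here is a minimal, self-contained Java sketch (purely illustrative, not engine or VDSM code) of why the caller still observes a timeout after connectivity is restored: the remote operation is blocked on a lock, so the caller's bounded wait expires regardless of the network state.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: the submitted task stands in for a connectStorageServer
// call that is stuck waiting on the LVM operation mutex; the bounded get()
// stands in for the engine's wait for the VDSM response.
public class StuckCallTimeoutSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<String> call = executor.submit(() -> {
            Thread.sleep(60_000); // simulate the operation blocked on a lock
            return "done";
        });

        try {
            // The caller waits only a bounded amount of time for the response.
            System.out.println(call.get(2, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("call timed out; the caller cannot tell why the remote side is stuck");
        } finally {
            executor.shutdownNow();
        }
    }
}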



This issue reproduces only when the scenario reported in https://bugzilla.redhat.com/show_bug.cgi?id=1098769 occurs.


Re-opening.

Attaching the logs from engine and vdsm

Comment 10 Elad 2014-05-18 11:06:37 UTC
Created attachment 896766 [details]
logs (18.5.14)

Comment 11 Allon Mureinik 2014-05-19 12:06:13 UTC
(In reply to Elad from comment #9)
> This issue reproduces only when the scenario reported in
> https://bugzilla.redhat.com/show_bug.cgi?id=1098769 occurs.
> 
> 
> Re-opening.
> 
> Attaching the logs from engine and vdsm
Following comments on bug 1098769, this is not related to mixed storage domains.
Returning this BZ to ON_QA, with a reminder that the LVM issue reported in 1098769 still exists.

Comment 12 Allon Mureinik 2014-05-26 15:05:12 UTC
Reopening after discussing with Aharon.

The issue here is that if an attempted reconstruct fails because of a TIMEOUT, another reconstruct is not attempted.
Timeout should be treated like any other error.
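
For clarity, the behaviour described here can be summarised with a minimal, self-contained Java sketch (all class and constant names below are illustrative only, not the actual oVirt engine code): the engine retries reconstruct only when the failure maps to one of a fixed set of specific storage errors, and a generic network timeout is not in that set, so no further reconstruct is attempted.

import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch only; it does not mirror the real engine classes.
// It just models the behaviour discussed in this bug: a generic timeout
// is not one of the errors that triggers another reconstruct attempt.
public class ReconstructDecisionSketch {

    enum VdsError {
        MASTER_DOMAIN_NOT_REACHABLE, // example of a "specific" storage error
        POOL_UNKNOWN,                // example of a "specific" storage error
        VDS_NETWORK_ERROR            // generic error, e.g. a TimeoutException
    }

    // Hypothetical set of errors that would lead the engine to attempt reconstruct again.
    private static final Set<VdsError> RECONSTRUCT_TRIGGERS =
            EnumSet.of(VdsError.MASTER_DOMAIN_NOT_REACHABLE, VdsError.POOL_UNKNOWN);

    static boolean shouldRetryReconstruct(VdsError error) {
        return RECONSTRUCT_TRIGGERS.contains(error);
    }

    public static void main(String[] args) {
        // A timeout surfaces as a generic network error, so no retry is attempted:
        System.out.println(shouldRetryReconstruct(VdsError.VDS_NETWORK_ERROR));           // false
        System.out.println(shouldRetryReconstruct(VdsError.MASTER_DOMAIN_NOT_REACHABLE)); // true
    }
}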

Comment 14 Liron Aravot 2014-06-15 08:47:20 UTC
Allon,
Reconstruct is issued when specific errors are returned from VDSM.
In this case the operations get stuck on the VDSM side, which leads to a "general" timeout error in the engine (on ConnectStoragePool, specifically, in this case).
That error is currently not specific enough to determine that a reconstruct is needed, as we can't tell what caused the timeout.

Changing the engine code to handle that case might be complicated and would require handling different corner cases. As this is possibly the first time we have encountered this issue, in which we get a timeout rather than a specific error, we can consider whether it's worth solving on the engine side for the "general" case; in my opinion, currently it's not.
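
To make the trade-off concrete, here is a small, purely illustrative Java sketch (none of these names exist in the engine) of what "treating a timeout like any other error" would imply: because the timeout carries no information about its cause, a retry policy built on it needs its own guard, such as a bounded number of attempts, to avoid reconstruct loops on unrelated network problems.

import java.util.concurrent.TimeoutException;

// Purely illustrative, not oVirt engine code. It models the concern raised
// above: a timeout does not tell us whether the master domain is the problem,
// so blindly retrying reconstruct on every timeout needs an explicit bound.
public class TimeoutRetrySketch {

    interface ReconstructAction {
        void run() throws TimeoutException;
    }

    // Retries the action on timeout, up to maxAttempts, then gives up.
    static boolean runWithRetry(ReconstructAction action, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true; // the reconstruct attempt completed
            } catch (TimeoutException e) {
                // We cannot tell from the timeout alone whether the master domain
                // is unreachable or the host/network is simply slow.
                System.out.println("attempt " + attempt + " timed out, retrying");
            }
        }
        return false; // retry budget exhausted
    }

    public static void main(String[] args) {
        boolean recovered = runWithRetry(() -> {
            throw new TimeoutException("simulated stuck ConnectStoragePool");
        }, 3);
        System.out.println("recovered: " + recovered);
    }
}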

Comment 15 Allon Mureinik 2014-06-17 12:25:16 UTC
(In reply to Liron Aravot from comment #14)
> Changing the engine code to handle that case might be complicated and would
> require handling different corner cases. As this is possibly the first time
> we have encountered this issue, in which we get a timeout rather than a
> specific error, we can consider whether it's worth solving on the engine
> side for the "general" case; in my opinion, currently it's not.

Agreed.

