Created attachment 876873 [details]
engine, vdsm, libvirt and sanlock

Description of problem:
On a DC with 1 host, I blocked connectivity from the SPM host to the storage server where the master storage domain is located. The engine tries to perform a reconstruct and another domain takes over as master, but the DC remains stuck between the 'Contending'/'Non-Responsive' states. Even if connectivity to the storage server is restored, the DC remains stuck. Manual intervention (such as manually activating the master domain) doesn't help.

Version-Release number of selected component (if applicable):
rhevm-3.4.0-0.5.master.el6ev.noarch
vdsm-4.14.2-0.4.el6ev.x86_64
libvirt-0.10.2-29.el6_5.5.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.6.x86_64
sanlock-2.8-1.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
On a shared DC with 1 host in the cluster and more than one storage domain located on different storage servers (storage type doesn't matter):
1. Block connectivity from the SPM host to the storage server where the master domain is located

Actual results:
The DC becomes non-responsive. Reconstruct happens and another domain takes over as master:

2014-03-20 15:31:05,211 INFO [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-4-thread-15) [6e53e84e] Running command: ReconstructMasterDomainCommand internal: true. Entities affected : ID: 9dfe8b53-c098-4211-a677-a0d5269baf54 Type: Storage

Even though the reconstruct succeeded, storage domain activation fails because SPM start fails.
In vdsm.log I see that connectStorageServer fails with the following error:

Thread-23::INFO::2014-03-20 16:05:10,019::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'connection': '10.35.160.108: /RHEV/elad/1', 'iqn': '', 'portal': '', 'user': '', 'password': '******', 'id': '865f8d5c-1917-4c93-bfd8-117b9e3e2262', 'port': ''}], options=None)
Thread-23::DEBUG::2014-03-20 16:05:10,021::hsm::2317::Storage.HSM::(__prefetchDomains) nfs local path: /rhev/data-center/mnt/10.35.160.108:_RHEV_elad_1
Thread-13::DEBUG::2014-03-20 16:05:15,981::BindingXMLRPC::251::vds::(wrapper) client [10.35.161.52]
Thread-13::INFO::2014-03-20 16:05:15,981::logUtils::44::dispatcher::(wrapper) Run and protect: getSpmStatus(spUUID='aa648e96-ee6b-4d64-beb9-c538f3012a85', options=None)
Thread-13::INFO::2014-03-20 16:05:15,996::logUtils::47::dispatcher::(wrapper) Run and protect: getSpmStatus, Return response: {'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}
Thread-13::DEBUG::2014-03-20 16:05:16,001::BindingXMLRPC::251::vds::(wrapper) client [10.35.161.52]
Thread-13::INFO::2014-03-20 16:05:16,001::logUtils::44::dispatcher::(wrapper) Run and protect: getAllTasksStatuses(spUUID=None, options=None)
Thread-13::ERROR::2014-03-20 16:05:16,002::dispatcher::68::Storage.Dispatcher.Protect::(run) Secured object is not in safe state
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 60, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1176, in prepare
    raise self.error
SecureError: Secured object is not in safe state

=================

Even if I restore connectivity to the storage server, the data center fails to initialize. The host is stuck in the Contending state. Manual activation of the new master fails too.
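For context, a minimal sketch of the "secured object" guard behind that traceback. This is illustrative Python only (class and method names here are hypothetical, not vdsm's actual implementation): SPM-only verbs are refused while the host's SPM role is not started, which matches the log above, where getSpmStatus returns 'Free' and getAllTasksStatuses then raises SecureError:

```python
# Illustrative sketch only -- names are hypothetical, not real vdsm code.
class SecureError(Exception):
    pass

class StoragePool:
    def __init__(self):
        # SPM not started: the pool object is in an "unsafe" state and
        # SPM-only operations must be refused.
        self.spm_started = False

    def getAllTasksStatuses(self):
        if not self.spm_started:
            raise SecureError("Secured object is not in safe state")
        return {"allTasksStatus": {}}
```

While SPM start keeps failing, every SPM-only call fails the same way, so the engine never gets past this point regardless of how many times it retries.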
Expected results:
If connectivity to the storage server where the master domain is located is lost, reconstruct and storage domain activation should succeed.

Additional info:
engine, vdsm, libvirt and sanlock
Liron, isn't this a dup of another BZ we're working on?
Allon, the SPM failing to start with a blocked domain should be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1072900

Elad, adding needinfo:
1. Please clarify/separate the logs of the second part of the scenario (when connectivity is restored).
2. You attached logs from the wrong host (vdsb instead of vdsc); please attach the correct logs.
Created attachment 879393 [details]
1 (connectivity loss)

(In reply to Liron Aravot from comment #3)
> Allon, spm fails to start with blocked domain should be duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=1072900
>
> Elad, adding needinfo -
> 1. please clarify/separate logs of the second part of the scenario (when the
> connectivity is restored).
>
> 2. you attached logs from the wrong host, (vdsb instead of vdsc), please
> attach the correct logs.

Uploading 2 files:
1) While the host is disconnected from the storage server where the master domain is located
2) After restoring connectivity to this storage server

Connectivity was restored at 2014-03-27 ~12:22
Created attachment 879394 [details]
2 (after restoring connectivity)

Liron, I'm testing it with the new downstream build (av5). While the host was disconnected from the storage server that contained the master domain, storage domain activation failed, the same as with the build where I first encountered the bug.

It seems the second issue - that SPM start failed even after connectivity was restored - is now fixed. I'm attaching the logs; as mentioned in comment #4, connectivity was restored at 2014-03-27 ~12:22.

Liron, if you'd like, I can reproduce it with the old build.
Elad, so that I can analyze why you couldn't recover in that scenario, a reproduction would be great, to verify that we don't have another hidden issue.
I didn't manage to reproduce with av3 nor with the latest build (av5).

Closing as WORKSFORME
*** This bug has been marked as a duplicate of bug 1072900 ***
I managed to reproduce the issue.

After a failure in reconstruct master, reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1098769, I restored connectivity from the SPM host to the storage server where the master domain is located. Storage domain activation failed with a failure in connectStorageServer, even though connectivity to the storage server had been restored.

From engine.log:

2014-05-18 13:40:18,584 INFO [org.ovirt.engine.core.bll.storage.ActivateStorageDomainCommand] (org.ovirt.thread.pool-4-thread-37) [285cb901] ActivateStorage Domain. Before Connect all hosts to pool. Time:5/18/14 1:40 PM
2014-05-18 13:40:18,593 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-4-thread-44) [285cb901] START, ConnectStorageServerVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storageType = ISCSI, connectionList = [{ id: 455150d2-3bfd-4d51-a903-5782faf17979, connection: 10.35.160.107, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 9ea801e1-bdce-415b-b838-8e7141d83a71, connection: 10.35.160.105, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: b7fd305f-c95c-4081-90ee-be0f17b2ac73, connection: 10.35.160.106, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: d9eb9732-ef14-427a-9c45-206365dadb64, connection: 10.35.160.104, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 6c14ea26
2014-05-18 13:41:00,329 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-4-thread-18) [68837cc8] Command ConnectStorageServerVDSCommand(HostName = green-vdsa, HostId = f2954f23-a6a0-4c98-b80c-6c77231e432e, storagePoolId = 2afb1571-1fc2-4906-b9ea-33984997ecc2, storageType = ISCSI, connectionList = [{ id: 455150d2-3bfd-4d51-a903-5782faf17979, connection: 10.35.160.107, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: 9ea801e1-bdce-415b-b838-8e7141d83a71, connection: 10.35.160.105, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a7, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: b7fd305f-c95c-4081-90ee-be0f17b2ac73, connection: 10.35.160.106, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.b6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: d9eb9732-ef14-427a-9c45-206365dadb64, connection: 10.35.160.104, iqn: iqn.1992-04.com.emc:cx.ckm00121000438.a6, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-18 13:41:00,329 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-4-thread-18) [68837cc8] FINISH, ConnectStorageServerVDSCommand, log id: 55b9904b
2014-05-18 13:41:00,330 ERROR [org.ovirt.engine.core.bll.storage.ConnectSingleAsyncOperation] (org.ovirt.thread.pool-4-thread-18) [68837cc8] Failed to connect host green-vdsa to storage domain (name: 2-4-iscsi, id: 7dd40ede-91e0-4fe6-b05f-29d6e22a7b14). Exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022): org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)

In vdsm.log, the connectStorageServer thread ends with:

Thread-88::DEBUG::2014-05-18 13:40:57,012::lvm::373::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' is holding the operation mutex, waiting...

This issue reproduced only when the scenario reported in https://bugzilla.redhat.com/show_bug.cgi?id=1098769 happens.

Re-opening. Attaching the logs from engine and vdsm.
Created attachment 896766 [details] logs (18.5.14)
(In reply to Elad from comment #9)
> This issue reproduced only when the scenario reported here
> https://bugzilla.redhat.com/show_bug.cgi?id=1098769 happens.
>
> Re-opening.
>
> Attaching the logs from engine and vdsm

Following the comments on bug 1098769, this is not related to mixed storage domains. Returning this BZ to ON_QA, with a reminder that the LVM issue reported in 1098769 still exists.
Reopening after discussing with Aharon. The issue here is that if an attempted reconstruct fails because of a TIMEOUT, another reconstruct is not attempted. Timeout should be treated like any other error.
Allon,
Reconstruct is issued when specific errors are returned from VDSM. In this case the operations get stuck on the VDSM side, which leads to a "general" timeout error in the engine (on ConnectStoragePool, specifically, in this case). That error is currently not specific enough to determine that a reconstruct is needed, as we can't tell what caused the timeout.

Changing the engine code to handle that case might be complicated and would require handling various corner cases. As this is possibly the first time we've encountered this issue, in which we get a timeout rather than a specific error, we can consider whether it's worth solving on the engine side for the "general" case; in my opinion, currently it's not.
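To illustrate the distinction Liron describes, here is a hypothetical sketch of the decision (the real logic lives in the engine's Java code; the error names below are illustrative, not actual engine constants): reconstruct is attempted only for errors that specifically implicate the master domain, and a generic timeout carries no such information:

```python
# Hypothetical sketch -- error names are illustrative, not engine constants.
# Only errors that specifically implicate the master domain trigger a
# reconstruct-master attempt.
RECONSTRUCT_TRIGGERS = {
    "StoragePoolMasterNotFound",
    "StoragePoolWrongMaster",
}

def should_reconstruct(vdsm_error: str) -> bool:
    # A generic TimeoutException doesn't say *why* the verb timed out
    # (network, stuck LVM reload, dead storage server, ...), so it doesn't
    # qualify as a reconstruct trigger.
    return vdsm_error in RECONSTRUCT_TRIGGERS
```

Treating the timeout as a trigger (as comment #13 suggests) would mean adding it to that set, at the cost of possibly running reconstruct for timeouts that have nothing to do with the master domain.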
(In reply to Liron Aravot from comment #14)
> Changing the engine code to handle that case might be complicated and
> requires handling of different corner cases, as that's possibly the first
> time that we encountered that issue in which we get timeout and not specific
> error, we can consider wether it's worth solving on the engine side for
> "general" purpose, in my opinion - currently it's not.

Agreed.