Bug 1421428
Summary: Host can't find iSCSI master after blocking connection to NFS

Product: [oVirt] ovirt-engine
Component: BLL.Infra
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Version: 4.1.0.4
Hardware: x86_64
OS: Unspecified
Keywords: Regression
Target Milestone: ovirt-4.1.1
Target Release: ---
Reporter: Lilach Zitnitski <lzitnits>
Assignee: Martin Perina <mperina>
QA Contact: Pavel Stehlik <pstehlik>
CC: bugs, lzitnits, mperina, nsoffer, oourfali, pkliczew, tnisan, ylavi
Flags: lzitnits: needinfo+; rule-engine: ovirt-4.1+; ylavi: blocker+
Doc Type: If docs needed, set a value
Last Closed: 2017-03-12 14:12:26 UTC
Type: Bug
oVirt Team: Storage
Description
Lilach Zitnitski 2017-02-12 09:46:22 UTC

Created attachment 1249480 [details]
logs

engine and vdsm logs
Thanks Lilach,

The mount performed in connectStorageServer takes more than 3 minutes to complete, causing the engine to fail with a network exception. It seems we have two issues here:

1. ConnectHostToStoragePoolServersCommand - the command should proceed even if a VDSNetworkException occurs during connectStorageServer execution (though that issue has existed since this command was introduced).
2. The connectStorageServer mount is taking more than 3 minutes, which it never did before - causing the network exception to occur.

Nir, are you aware of any change in vdsm that might have caused this? Has anything changed in the way we mount the target?

A third issue seems to be infrastructural - after failing on the network exception (because the operation took more than 3 minutes), the host is detected as non-responsive by the engine although there is connectivity to it. Martin, can you please take a look?

2017-02-12 11:25:11,907+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-48) [5cf889dd] Host 'storage-ge8-vdsm1' is not responding.

Here is what happened:

1. The engine connected to vdsm:

2017-02-12 11:25:07,214 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:10.35.162.7:56976 (protocoldetector:72)

2. The engine triggered StoragePool.connect:

2017-02-12 11:25:08,424 ERROR (jsonrpc/2) [storage.StorageDomainCache] domain d7674368-1e73-4c12-8d55-3e64b4306f9b not found (sdc:157)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 155, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1599, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 1539, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('d7674368-1e73-4c12-8d55-3e64b4306f9b',)

There is a storage task failure:

2017-02-12 11:25:08,425 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='a279f80e-6869-42e9-9f47-261deacfd3ba') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 990, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1052, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 665, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1235, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1443, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

This is propagated to the engine:

2017-02-12 11:25:08,441+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-45) [] Command 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='d036238c-ac56-4450-bf76-1ecf8944f038', vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

In the logs we see that the host moved to Up:

2017-02-12 11:25:11,787+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [3431b35] EVENT_ID: VDS_DETECTED(13), Correlation ID: 45db1ffa, Call Stack: null, Custom Event ID: -1, Message: Status of host storage-ge8-vdsm1 was set to Up.

and 200 milliseconds later:

2017-02-12 11:25:11,907+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-48) [5cf889dd] Host 'storage-ge8-vdsm1' is not responding.

There was no more interaction from the engine, and the connection was closed at:

2017-02-12 11:28:18,317 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.162.7', 56976, 0, 0) at 0x3160488>: unexpected eof (betterAsyncore:113)

Later the operation repeated with exactly the same steps, starting at:

2017-02-12 11:28:18,381 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:10.35.162.7:57026 (protocoldetector:72)

Based on the above analysis I do not see an infra issue. Restoring needinfo on Nir.

Piotr, that's exactly the issue - the host keeps being detected as non-responsive even though it is responsive. It seems to be because of the timeout in ConnectStorageServer, and it is going through the InitVdsOnUp flow endlessly. We need to inspect why, when ConnectStorageServer fails with a timeout, the engine detects the host as non-responsive.
2017-02-12 11:12:20,634+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-31) [4e78481] Host 'storage-ge8-vdsm1' is not responding.
2017-02-12 11:25:11,907+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-48) [5cf889dd] Host 'storage-ge8-vdsm1' is not responding.
2017-02-12 11:28:31,070+02 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-7) [3d43f734] Host 'storage-ge8-vdsm1' is not responding.

Liron,

The first "not responding" was caused by:

2017-02-12 11:12:10,654 INFO (MainThread) [vds] Received signal 15, shutting down (vdsm:68)

In the second case we got the response:

2017-02-12 11:25:08,441+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-45) [] Command 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='d036238c-ac56-4450-bf76-1ecf8944f038', vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

and the same for the third time:

2017-02-12 11:28:22,548+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-4) [] Command 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='d036238c-ac56-4450-bf76-1ecf8944f038', vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

I do not see any timeout in the logs.

(In reply to Piotr Kliczewski from comment #7)
> The first not responding was caused by:
> 2017-02-12 11:12:10,654 INFO (MainThread) [vds] Received signal 15, shutting down (vdsm:68)

What causes the host to receive it?

> in the second case we got the response:
> [the ConnectStoragePoolVDSCommand failure quoted above]
> and the same for the third time:
> [the ConnectStoragePoolVDSCommand failure quoted above]

This is a failure to perform ConnectStoragePoolVDSCommand - it shouldn't cause the host to be detected as non-responsive.

> I do not see any timeout in the logs.
We do have timeouts; I asked whether, because of the timeouts, the connection is shut down and the host is detected as non-responsive (which shouldn't happen):

...engine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)

OK, I missed them, but looking at the vdsm logs I can see only 2 replies for this kind of request:

2017-02-12 11:14:14,038 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 104.87 seconds (__init__:515)

and

2017-02-12 11:14:16,347 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 1.24 seconds (__init__:515)

This means that vdsm did not reply to the others, which means the timeout is correct.

Piotr,

My question is why the engine detects the host as non-responsive while it is responsive. Some operations do fail because of storage access problems - but the host is responsive.

Liron,

This is the standard way of handling a lack of responses. When a response does not arrive within a specified period of time (300 seconds by default), we assume that there is something wrong with vdsm, so we kill the connection and attempt to reconnect. The state of the connection is tightly coupled with the host status: losing the connection means losing Up status. My question is why vdsm is not sending responses to this specific request type.

Piotr,

We see in the logs that the host moved to non-responsive 14 seconds after the command that failed on a timeout, after a different command - so is it indeed caused by the timeout?
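The response-timeout policy described above (no reply within 300 seconds kills the connection, and the host loses its Up status) can be sketched as follows. This is only an illustration of the described behavior, not the actual engine code; the function name is hypothetical:

```python
# Illustration (not engine code) of the policy described above: a request
# with no response within the timeout window marks the host connection bad.
RESPONSE_TIMEOUT = 300.0  # engine default per the discussion above, in seconds


def connection_looks_broken(sent_at, now, timeout=RESPONSE_TIMEOUT):
    """Return True when a pending request has waited longer than `timeout`;
    the engine would then kill the connection and attempt to reconnect,
    and the host would lose its Up status as a side effect."""
    return (now - sent_at) > timeout


# A connectStorageServer call stuck for over 5 minutes trips the check,
# while the 104.87-second reply seen in the vdsm logs would not:
assert connection_looks_broken(sent_at=0.0, now=301.0)
assert not connection_looks_broken(sent_at=0.0, now=104.87)
```

This makes the coupling visible: any verb that blocks past the window, even on a healthy host, is indistinguishable from a dead connection under this policy.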
Regardless, hosts should move to non-responsive only when the vds manager fails to execute specific commands (GetStats/GetCaps), not on a failure to execute any verb, which can happen whenever a verb accesses the storage (in the given case, connectStorageServer failed with a timeout because the mount operation took more than 3 minutes). If a timeout in any verb can cause a host to move to non-responsive, that's a regression; that wasn't the behavior previously.

Lilach, can you please attempt to reproduce on 4.0 as well?

Liron,

As I mentioned in comment #7, there are 3 failures, as you specified in comment #6. The first one was due to vdsm being killed (probably by sanlock); the next two were the timeouts. This is the behavior we have had there since xmlrpc: when a request times out, we evaluate whether there is an issue with the host. Do we know why there were no responses for this specific verb? Restoring needinfo.

(In reply to Piotr Kliczewski from comment #14)
> Do we know why there were no responses for this specific verb?

Piotr,

When a verb performs an operation that accesses the shared storage, it may always get stuck during execution (in this example, mount) when there is a storage server problem. In that case the host shouldn't be marked as non-responsive, as the engine-host connectivity is functional.

Liron,

I don't know about any change in vdsm NFS mounts that may cause mount to get stuck. I suspect this is a change in the way the test was performed (e.g. the type of iptables rules used), or a change in the platform. Anyway, the system should not assume that mount will fail fast; we should handle the case when mount gets stuck.
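One way to handle a mount that never returns, as suggested above, is to bound the external call with a timeout instead of letting it block the RPC worker indefinitely. A minimal Python sketch, not vdsm's actual code: the helper name is hypothetical, and `sleep` stands in for a mount hanging against a blocked NFS server:

```python
import subprocess


def run_bounded(cmd, timeout_sec):
    """Run an external command, but give up after timeout_sec seconds
    rather than hanging forever (as mount against a blocked NFS server
    can). Returns True on success, False if the command timed out;
    subprocess.run kills the child before raising TimeoutExpired."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout_sec)
        return True
    except subprocess.TimeoutExpired:
        return False


# `sleep 10` stands in for a stuck mount; `true` for one that succeeds.
assert run_bounded(["sleep", "10"], timeout_sec=1) is False
assert run_bounded(["true"], timeout_sec=5) is True
```

With a bound like this, a stuck mount becomes an ordinary, fast-reported verb failure instead of a silent 300-second RPC timeout that tears down the host connection.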
It would be useful to test this again with an older version (4.0.6?) - we cannot assume that tests done for the previous release (one year ago?) are still relevant. Platform changes, or changes in the test environment and flow, may affect the results.

Nir, thanks. I added needinfo? on Lilach to try to reproduce this on an earlier version. Regardless, the engine should proceed even if connectStorageServer takes more than 3 minutes and an exception is thrown.

(In reply to Liron Aravot from comment #13)
> Lilach, can you please attempt to reproduce on 4.0 as well?

This is not reproducing on 4.0.7; the hosts are able to activate with the block to the NFS servers in place.

(In reply to Lilach Zitnitski from comment #19)
> this is not reproducing on 4.0.7. the hosts are able to activate with block
> to the nfs servers.

Lilach, please attach engine and vdsm logs from 4.0.7, showing the same flow you are testing with 4.1. Also please explain in detail how you block access to storage, and how to reproduce this issue.

Created attachment 1256357 [details]
4.0-logs
engine and vdsm
The steps to reproduce:

1. Block the connection, using iptables, from all hosts to the NFS storage provider (if iSCSI is the master) or to the iSCSI storage provider (if the master is on NFS).
2. After the storage domains become unavailable, move all the hosts to maintenance.
3. Activate the hosts.

When activating the hosts in 4.0.7 no problems occur; after a few seconds all hosts are up again and the blocked storage domains are still unavailable. But when following the same steps on 4.1, the hosts cannot be activated.

(In reply to Lilach Zitnitski from comment #22)
> [steps to reproduce, quoted above]

Sounds like a regression to me - setting the correct keyword.

Thanks Lilach,

The reproduction seems a bit different: in 4.1 it seems that the host was the SPM when the connection was blocked (and it was fenced by sanlock), so it had a non-operating mount, while in 4.0.7 the host was activated after all connections were disconnected (so it had no mount).

Lilach, can you run the exact same scenario?

(In reply to Liron Aravot from comment #24)
> [reproduction analysis, quoted above]
> Lilach, can you run the exact same scenario?

Hi Liron,

I tried this scenario in 4.1 while the host is HSM and also while it's SPM, and this bug is not reproducing any more (in the newest 4.1.1 build).

Thanks Lilach,

Did you try to reproduce it using v4.0 as well? The original bug is on v4.1. Regardless, the engine code in those areas didn't change - the engine always assumed that connectStorageServer() would return fast even if it fails to mount (so the engine doesn't expect network exceptions in that flow). That obviously needs to be changed, but I'm not sure the correct target for it is 4.1.1 (as it didn't reproduce according to comment 25) - the cause in that case may be platform/host specific, so I'd reduce the severity of that issue. Martin/Piotr, we have the infrastructural issue (see comment 16); I'm restoring the needinfo? which was removed.

Just tested this scenario on a 4.0.7 system: blocked the connection to the NFS server from all hosts in the environment (while the iSCSI is the master SD), moved them to maintenance, and then tried to activate the hosts again - the activation succeeded.

There are two issues here:
1. The fact that a request to the host that does not return within 3 minutes moves the host into non-responsive - an infra issue which should be addressed in this bug.
2. The issue that a mount command takes more than 3 minutes - QE could not reproduce; we will open another bug for that issue, currently not a blocker.

But according to comment #27, the code works well on both 4.0.7 and 4.1. Can you explain what the issue is now?

(In reply to Oved Ourfali from comment #29)
> But according to comment #27, the code works well on both 4.0.7 and 4.1.
> Can you explain what the issue is now?

No issue from my side at the moment. I can't reproduce this bug on 4.1, and on 4.0 it always seemed to work.
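For reference, step 1 of the reproduction (blocking host-to-storage traffic with iptables) could be scripted. The sketch below only builds the iptables command line; the server address is a placeholder, not one from this bug's environment, and actually applying the rule requires root on each host:

```python
import subprocess  # only needed to actually apply the rule (see comment below)


def block_storage_cmd(server_ip, block=True):
    """Build the iptables invocation that drops all outgoing traffic to
    the given storage server: insert the rule to block the connection,
    delete it to restore. Mirrors step 1 of the reproduction steps."""
    action = "-I" if block else "-D"
    return ["iptables", action, "OUTPUT", "-d", server_ip, "-j", "DROP"]


# Example with a placeholder NFS server address:
nfs_block = block_storage_cmd("192.0.2.10")
assert nfs_block == ["iptables", "-I", "OUTPUT", "-d", "192.0.2.10", "-j", "DROP"]
# To actually apply on a host (requires root):
#   subprocess.run(nfs_block, check=True)
```

The DROP target matters for this scenario: unlike REJECT, it makes the storage server silently unreachable, so NFS mounts hang rather than fail fast, which is exactly the stuck-mount condition discussed above.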