Bug 1421428

Summary: Host can't find iSCSI master after blocking connection to NFS

Product: [oVirt] ovirt-engine
Component: BLL.Infra
Version: 4.1.0.4
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Status: CLOSED WORKSFORME
Keywords: Regression
Target Milestone: ovirt-4.1.1
Target Release: ---
Reporter: Lilach Zitnitski <lzitnits>
Assignee: Martin Perina <mperina>
QA Contact: Pavel Stehlik <pstehlik>
Docs Contact:
CC: bugs, lzitnits, mperina, nsoffer, oourfali, pkliczew, tnisan, ylavi
Flags: lzitnits: needinfo+
       rule-engine: ovirt-4.1+
       ylavi: blocker+
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-12 14:12:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
  logs (flags: none)
  4.0-logs (flags: none)

Description Lilach Zitnitski 2017-02-12 09:46:22 UTC
Description of problem:
If the connection between all hosts in the DC and the NFS server is blocked, and the hosts are in maintenance mode, they can't be activated even though the master storage domain is iSCSI and the connection to the iSCSI server is fine.
When trying to activate the hosts, the log shows that vdsm can't find the master domain and the hosts become non-responsive.
If the block is removed from iptables, the hosts manage to start within a few seconds.

Version-Release number of selected component (if applicable):
rhevm-4.1.0.4-0.1.el7.noarch
vdsm-4.19.4-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Block the connection from all hosts to the NFS server (make sure it's not the master SD); a scripted sketch follows below
2. Move the hosts to maintenance mode
3. Activate the hosts
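
For reference, a minimal sketch of how step 1 can be scripted on each host (illustrative only; the NFS server address is a placeholder, not taken from this report):

import subprocess

NFS_SERVER = "10.35.0.10"  # placeholder address of the (non-master) NFS storage server

def block_nfs():
    # Drop all outgoing traffic from this host to the NFS server.
    subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", NFS_SERVER, "-j", "DROP"])

def unblock_nfs():
    # Remove the rule again; per the report, hosts recover within seconds.
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", NFS_SERVER, "-j", "DROP"])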

Actual results:
Hosts can't be activated

Expected results:
Hosts are supposed to find the iSCSI master and become active

Additional info:

Even when only one host is in maintenance and its connection to NFS is blocked, it can't be activated, even though the iSCSI master storage domain is up and active in the DC.

vdsm.log

2017-02-12 11:02:45,865 ERROR (jsonrpc/1) [storage.Dispatcher] {'status': {'message': "Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'", 'code': 304}} (dispatcher:77)

engine.log

2017-02-12 11:02:45,927+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-17) [6f593f3] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM storage-ge8-vdsm1 command ConnectStoragePoolVDS failed: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

Comment 1 Lilach Zitnitski 2017-02-12 09:47:00 UTC
Created attachment 1249480 [details]
logs

engine and vdsm logs

Comment 2 Liron Aravot 2017-02-20 21:12:45 UTC
Thanks Lilach,

The mount performed in connectStorageServer takes more than 3 minutes to complete, causing the engine to fail with a network exception.

It seems we have two issues here:
1. ConnectHostToStoragePoolServersCommand - the command should proceed even if a VDSNetworkException occurs during connectStorageServer execution (but that issue has existed since this command was introduced).

2. The connectStorageServer mount is taking more than 3 minutes, which it never did before - causing the network exception to occur (see the sketch below).
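
To illustrate the second issue, a minimal sketch (not vdsm code; assumes Python 3, and the export path and mount point are placeholders) that times how long the mount of a blocked NFS export actually takes:

import os
import subprocess
import time

NFS_EXPORT = "10.35.0.10:/export/data"  # placeholder export, assumed blocked by iptables
MOUNT_POINT = "/tmp/nfs-test"

os.makedirs(MOUNT_POINT, exist_ok=True)
start = time.time()
try:
    # Give up after an arbitrary upper bound, used only for the measurement.
    subprocess.run(["mount", "-t", "nfs", NFS_EXPORT, MOUNT_POINT],
                   timeout=600, check=True)
except subprocess.TimeoutExpired:
    print("mount still hanging after %.0f seconds" % (time.time() - start))
except subprocess.CalledProcessError as e:
    print("mount failed after %.0f seconds: %s" % (time.time() - start, e))
else:
    print("mount completed in %.0f seconds" % (time.time() - start))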

Nir, are you aware of any change in vdsm that might have caused this? Has anything changed in the way we mount the target?

Comment 3 Liron Aravot 2017-02-20 21:17:08 UTC
A third issue seems to be infrastructural - after failing on a network exception (because the operation took more than 3 minutes), the host is detected as non-responsive by the engine, although there is connectivity to it.

Martin, can you please take a look?

2017-02-12 11:25:11,907+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-48) [5cf889dd] Host 'storage-ge8-vdsm1' is not responding.

Comment 4 Piotr Kliczewski 2017-02-21 09:10:49 UTC
Here is what happened:

1. The engine connected to vdsm

2017-02-12 11:25:07,214 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:10.35.162.7:56976 (protocoldetector:72)

2. Engine triggered StoragePool.connect

2017-02-12 11:25:08,424 ERROR (jsonrpc/2) [storage.StorageDomainCache] domain d7674368-1e73-4c12-8d55-3e64b4306f9b not found (sdc:157)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 155, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1599, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 1539, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('d7674368-1e73-4c12-8d55-3e64b4306f9b',)

there is storage task failure:

2017-02-12 11:25:08,425 ERROR (jsonrpc/2) [storage.TaskManager.Task] (Task='a279f80e-6869-42e9-9f47-261deacfd3ba') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 990, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1052, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 665, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1235, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1443, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

and this is propagated to the engine:

2017-02-12 11:25:08,441+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-45) [] Command 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='d036238c-ac56-4450-bf76-1ecf8944f038', vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

In the logs we can see that the host moved to Up:

2017-02-12 11:25:11,787+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [3431b35] EVENT_ID: VDS_DETECTED(13), Correlation ID: 45db1ffa, Call Stack: null, Custom Event ID: -1, Message: Status of host storage-ge8-vdsm1 was set to Up.

and 200 millis later:

2017-02-12 11:25:11,907+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-48) [5cf889dd] Host 'storage-ge8-vdsm1' is not responding.

There was no more interaction from the engine and the connection was closed at:

2017-02-12 11:28:18,317 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.35.162.7', 56976, 0, 0) at 0x3160488>: unexpected eof (betterAsyncore:113)

Later the operation repeated with exactly the same steps starting at:

2017-02-12 11:28:18,381 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:10.35.162.7:57026 (protocoldetector:72)

Based on the above analysis I do not see an infra issue.

Comment 5 Piotr Kliczewski 2017-02-21 09:11:59 UTC
Restoring needinfo on Nir

Comment 6 Liron Aravot 2017-02-21 13:12:01 UTC
Piotr, that's exactly the issue -
The host keeps being detected as non-responsive even though it is responsive. It seems to be because of the timeout in ConnectStorageServer, and it's going through the InitVdsOnUp flow endlessly.
We need to inspect why, when ConnectStorageServer fails with a timeout, the engine detects the host as non-responsive.

2017-02-12 11:12:20,634+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-31) [4e78481] Host 'storage-ge8-vdsm1' is not responding.


2017-02-12 11:25:11,907+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-48) [5cf889dd] Host 'storage-ge8-vdsm1' is not responding.

2017-02-12 11:28:31,070+02 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-7-thread-7) [3d43f734] Host 'storage-ge8-vdsm1' is not responding.

Comment 7 Piotr Kliczewski 2017-02-21 13:24:33 UTC
Liron,

The first not responding was caused by:

2017-02-12 11:12:10,654 INFO  (MainThread) [vds] Received signal 15, shutting down (vdsm:68)

in the second case we got the response:

2017-02-12 11:25:08,441+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-45) [] Command 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='d036238c-ac56-4450-bf76-1ecf8944f038', vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

and the same for the third time:

2017-02-12 11:28:22,548+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-7-thread-4) [] Command 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='d036238c-ac56-4450-bf76-1ecf8944f038', vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=00000001-0001-0001-0001-000000000311, msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

I do not see any timeout in the logs.

Comment 8 Liron Aravot 2017-02-21 13:28:48 UTC
(In reply to Piotr Kliczewski from comment #7)
> Liron,
> 
> The first not responding was caused by:
> 
> 2017-02-12 11:12:10,654 INFO  (MainThread) [vds] Received signal 15,
> shutting down (vdsm:68)

What causes the host to receive it?

> 
> in the second case we got the response:
> 
> 2017-02-12 11:25:08,441+02 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
> (org.ovirt.thread.pool-7-thread-45) [] Command
> 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1,
> ConnectStoragePoolVDSCommandParameters:{runAsync='true',
> hostId='d036238c-ac56-4450-bf76-1ecf8944f038',
> vdsId='d036238c-ac56-4450-bf76-1ecf8944f038',
> storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})'
> execution failed: IRSGenericException: IRSErrorException:
> IRSNoMasterDomainException: Cannot find master domain:
> 'spUUID=00000001-0001-0001-0001-000000000311,
> msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'
> 
> and the same for the third time:
> 
> 2017-02-12 11:28:22,548+02 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand]
> (org.ovirt.thread.pool-7-thread-4) [] Command
> 'ConnectStoragePoolVDSCommand(HostName = storage-ge8-vdsm1,
> ConnectStoragePoolVDSCommandParameters:{runAsync='true',
> hostId='d036238c-ac56-4450-bf76-1ecf8944f038',
> vdsId='d036238c-ac56-4450-bf76-1ecf8944f038',
> storagePoolId='00000001-0001-0001-0001-000000000311', masterVersion='24'})'
> execution failed: IRSGenericException: IRSErrorException:
> IRSNoMasterDomainException: Cannot find master domain:
> 'spUUID=00000001-0001-0001-0001-000000000311,
> msdUUID=d7674368-1e73-4c12-8d55-3e64b4306f9b'

This is a failure to perform ConnectStoragePoolVDSCommand - it shouldn't cause the host to be detected as non responsive.
> 
> I do not see any timeout in the logs.

We do have timeouts; I asked whether, because of the timeouts, the connection is shut down and the host is detected as non-responsive (which shouldn't happen):

ine.core.bll.storage.pool.ConnectHostToStoragePoolServersCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException
: VDSNetworkException: Message timeout which can be caused by communication issues (Failed with error VDS_NETWORK_ERROR and code 5022)

Comment 9 Piotr Kliczewski 2017-02-21 13:53:44 UTC
OK, I missed them, but looking at the vdsm logs I can see only 2 replies for this kind of request:

2017-02-12 11:14:14,038 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 104.87 seconds (__init__:515)

and

2017-02-12 11:14:16,347 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 1.24 seconds (__init__:515)

This means that vdsm did not reply to the others, which means that the timeout is correct.

Comment 10 Liron Aravot 2017-02-21 13:56:40 UTC
Piotr,
My question is why the engine detects the host as non responsive while it is responsive. Some operations do fail because of storage access problems - but the host is responsive.

Comment 11 Piotr Kliczewski 2017-02-21 14:08:42 UTC
Liron,

This is the standard way of handling a lack of responses. When a response does not arrive within the specified period of time (300 seconds by default), we assume that there is something wrong with vdsm, so we kill the connection and attempt to reconnect.

The state of the connection is tightly coupled with the host status. Losing the connection means losing the Up status.

My question is why vdsm is not sending responses to this specific request type.
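
Schematically, the handling described above looks roughly like this (an illustrative Python sketch; the actual engine code is Java and the class and method names here are invented):

import queue

RESPONSE_TIMEOUT = 300  # seconds, the default mentioned above

class VDSNetworkException(Exception):
    """Raised when no response arrives within the timeout."""

class HostConnection:
    def __init__(self, host_name):
        self.host_name = host_name
        self.status = "Up"
        self._responses = queue.Queue()

    def call(self, verb, timeout=RESPONSE_TIMEOUT):
        self._send(verb)
        try:
            # Wait for the matching response up to the timeout.
            return self._responses.get(timeout=timeout)
        except queue.Empty:
            # No response in time: assume something is wrong with vdsm, kill
            # the connection and attempt to reconnect. Because the connection
            # state is coupled to the host status, losing the connection also
            # means losing the Up status.
            self._close_and_reconnect()
            self.status = "NonResponsive"
            raise VDSNetworkException("Message timeout which can be caused by communication issues")

    def _send(self, verb):
        pass  # placeholder for the actual wire protocol

    def _close_and_reconnect(self):
        pass  # placeholder for closing the socket and scheduling a reconnect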

Comment 12 Liron Aravot 2017-02-21 14:57:39 UTC
Piotr,
We see in the logs that the host moved to non-responsive 14 seconds after the command that failed on a timeout after a different command - so is it indeed caused by the timeout?

Regardless, hosts should move to non-responsive only when the vds manager fails to execute specific commands (GetStats/GetCaps), not on a failure to execute any other verb, which can happen whenever a verb accesses the storage (in the given case, connectStorageServer failed with a timeout because the mount operation took more than 3 minutes).
If a timeout in any verb can cause a host to move to non-responsive - that's a regression; that wasn't the behavior previously.
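
A minimal sketch of the behaviour being proposed here (illustrative only; apart from GetStats/GetCaps, the names are invented, and "host" stands for any object with a status and a name):

MONITORING_VERBS = {"GetStats", "GetCaps"}

class CommandFailed(Exception):
    """The verb failed, but the host itself is still reachable."""

def handle_timeout(host, verb):
    if verb in MONITORING_VERBS:
        # Basic monitoring is not being answered - the host really is
        # unreachable, so mark it non-responsive.
        host.status = "NonResponsive"
    else:
        # A storage-bound verb got stuck (e.g. connectStorageServer waiting on
        # a hanging NFS mount). Fail that command only and keep the host Up,
        # since engine-host connectivity is still functional.
        raise CommandFailed("%s timed out on host %s" % (verb, host.name))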

Comment 13 Liron Aravot 2017-02-21 15:29:06 UTC
Lilach, can you please attempt to reproduce on 4.0 as well?

Comment 14 Piotr Kliczewski 2017-02-21 15:31:50 UTC
Liron,

As I mentioned in comment #7 there are 3 failures as you specified in comment #6.

The first one was due to vdsm being killed (probably by sanlock); the next two were the timeouts.

This is the behavior we have had there since xmlrpc. When a request times out, we evaluate whether there is an issue with the host.

Do we know why there were no responses for this specific verb?

Comment 15 Piotr Kliczewski 2017-02-21 15:33:07 UTC
Restoring needinfo

Comment 16 Liron Aravot 2017-02-21 15:45:00 UTC
(In reply to Piotr Kliczewski from comment #14)
> Liron,
> 
> As I mentioned in comment #7 there are 3 failures as you specified in
> comment #6.
> 
> The first one was due to vdsm being killed (probably by sanlock) next two
> were the timeouts.
> 
> This is the behavior we have there since xmlrpc. When a request timeouts we
> evaluate whether there is an issue with the host.
> 
> Do we know why there were no responses for this specific verb?

Piotr,
When a verb performs an operation that accesses the shared storage, it may always get stuck during execution (in this example - mount) when there is a storage server problem. In that case the host shouldn't be marked as non-responsive, as the engine-host connectivity is functional.

Comment 17 Nir Soffer 2017-02-21 16:21:15 UTC
Liron, I don't know about any change in vdsm nfs mounts that may cause mount to
get stuck.

I suspect that this is a change in the way the test was performed (e.g. the type
of iptables rules used), or a change in the platform.

Anyway, the system should not assume that mount will fail fast; we should handle
the case when mount gets stuck.

It would be useful to test this again with an older version (4.0.6?) - we cannot assume
that tests done for the previous release (one year ago?) are relevant now. Platform
changes, or changes to the test environment and flow, may affect the results.

Comment 18 Liron Aravot 2017-02-21 18:12:23 UTC
Nir, thanks.
I added needinfo? on Lilach to try and reproduce this on an earlier version.
Regardless, the engine should proceed even if connectStorageServer takes more than 3 minutes and an exception is thrown.

Comment 19 Lilach Zitnitski 2017-02-22 07:29:43 UTC
(In reply to Liron Aravot from comment #13)
> Lilach, can you please attempt to reproduce on 4.0 as well?

This is not reproducing on 4.0.7. The hosts are able to activate with the block to the NFS servers in place.

Comment 20 Nir Soffer 2017-02-22 07:33:18 UTC
(In reply to Lilach Zitnitski from comment #19)
> (In reply to Liron Aravot from comment #13)
> > Lilach, can you please attempt to reproduce on 4.0 as well?
> 
> this is not reproducing on 4.0.7. the hosts are able to activate with block
> to the nfs servers.

Lilach, please attach engine and vdsm logs from 4.0.7, showing the same flow
you are testing with 4.1.

Also please explain in detail how you block access to storage, and how to
reproduce this issue.

Comment 21 Lilach Zitnitski 2017-02-22 07:39:40 UTC
Created attachment 1256357 [details]
4.0-logs

engine and vdsm

Comment 22 Lilach Zitnitski 2017-02-22 08:06:33 UTC
The steps to reproduce:
1. Block the connection using iptables from all hosts to the NFS storage provider (if iSCSI is the master) or to the iSCSI storage provider (if the master is on NFS)
2. After the SDs are unavailable, move all the hosts to maintenance
3. Activate the hosts

When activating the hosts on 4.0.7 no problems occur; after a few seconds all hosts are up again and the blocked storage domains are still unavailable. But when following the same steps on 4.1, the hosts cannot be activated.

Comment 23 Yaniv Kaul 2017-02-22 11:13:56 UTC
(In reply to Lilach Zitnitski from comment #22)
> The steps to reproduce:
> 1. block connection using iptables from all hosts to nfs storage provider
> (if iscsi is master) or to iscsi storage provider (if the master is on nfs)
> 2. after the sd's are unavailable, move to maintenance all the hosts
> 3. activate the hosts 
> 
> when activating the hosts in 4.0.7 no problems occur, after a few seconds
> all hosts are up again and the blocked storage domains still unavailable.
> but when following the same steps on 4.1 the hosts cannot be activated.

Sounds like a regression to me - setting the correct keyword.

Comment 24 Liron Aravot 2017-02-22 15:05:24 UTC
Thanks Lilach,
The reproduction seems a bit different: in 4.1 it seems that the host was the SPM when the connection was blocked (and it was fenced by sanlock), so it had a non-operating mount, while in 4.0.7 the host was activated after all connections were disconnected (so it had no mount).

Lilach, can you run the exact same scenario?

Thanks,
Liron

Comment 25 Lilach Zitnitski 2017-03-07 11:27:42 UTC
(In reply to Liron Aravot from comment #24)
> Thanks Lilach,
> The reproduction seems a bit different, in 4.1 seems that the host was the
> SPM when the connection was blocked (and it was fenced by sanlock) so it had
> non operating mount, while the 4.0.7 the was was activated after all
> connections were disconnect (so it had no mount).
> 
> Lilach, can you run the exact same scenario?
> 
> Thanks,
> Liron

Hi Liron, I tried this scenario in 4.1 while the host is HSM and also while it's SPM, and this bug is not reproducing any more (in the newest 4.1.1 build).

Comment 26 Liron Aravot 2017-03-09 08:51:09 UTC
Thanks Lilach,
Did you try to reproduce it using v4.0 as well? The original bug is on v4.1.

Regardless, the engine code in those areas didn't change - the engine always assumed that connectStorageServer() will return fast even if it fails to mount (so the engine doesn't expect network exceptions on that flow). That obviously needs to be changed, but I'm not sure the correct target for that is 4.1.1 (as it didn't reproduce according to comment 25) - the cause in that case may be platform/host specific, so I'd reduce the severity of that issue.

Martin/Piotr, we have the infrastructural issue (see comment 16); I'm restoring the needinfo? which was removed.

Comment 27 Lilach Zitnitski 2017-03-09 09:29:39 UTC
Just tested this scenario on a 4.0.7 system: blocked the connection to the NFS server from all hosts in the environment (while iSCSI is the master SD), moved them to maintenance, and then tried to activate the hosts again - the activation succeeded.

Comment 28 Tal Nisan 2017-03-12 12:57:26 UTC
There are two issues here:
1. The fact that a request to the host that does not return within 3 minutes moves the host into non-responsive - an infra issue which should be addressed in this bug
2. The issue that a mount command takes more than 3 minutes - QE could not reproduce it; we will open another bug for that issue, currently not a blocker

Comment 29 Oved Ourfali 2017-03-12 13:02:28 UTC
But according to comment #27, the code works well on both 4.0.7 and 4.1.
Can you explain what the issue is now?

Comment 30 Lilach Zitnitski 2017-03-12 13:35:45 UTC
(In reply to Oved Ourfali from comment #29)
> But according to comment #27, the code works well on both 4.0.7 and 4.1.
> Can you explain what's the issue now?

No issue from my side at the moment. I can't reproduce this bug on 4.1, and on 4.0 it always seemed to work.