Bug 1421945

Summary:	Adding a secondary storage domain fails sporadically (NFS client not completely up on time?!)
Product:	[oVirt] ovirt-engine	Reporter:	Barak Korren <bkorren>
Component:	BLL.Storage	Assignee:	Allon Mureinik <amureini>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Raz Tamir <ratamir>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	future	CC:	alitke, amureini, bugs, tnisan, ylavi
Target Milestone:	ovirt-4.2.0	Flags:	rule-engine: ovirt-4.2+
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-30 08:05:48 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Storage	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Barak Korren 2017-02-14 07:26:21 UTC

Description of problem:
During OST testing, the addition of a secondary storage domain occasionally fails.

Version-Release number of selected component (if applicable):
master branch, and also the 4.1 branch.

How reproducible:
Run OST against a recent build

Actual results:

The following error is returned from engine (via SDK v4):

  Error: Fault reason is "Operation Failed". Fault detail is "[Storage domain
  cannot be reached. Please ensure it is accessible from the host(s).]". HTTP
  response code is 400.

The API call that is failing is as follows ('002_bootstrap.py' lines 401-409):

    dcs_service = api.system_service().data_centers_service()
    dc = dcs_service.list(search='name=%s' % DC_NAME)[0]
    dc_service = dcs_service.data_center_service(dc.id)
    attached_sds_service = dc_service.storage_domains_service()
    attached_sds_service.add(
        sdk4.types.StorageDomain(
            id=sd.id,
        ),
    )


Additional info:
It is VERY unlikely this is due to a network error, the OST network is just a local bridge. Not actual network hardware is being used. It is also unlikely that the storage itself is down or faulty, the storage is emulated by a local VM, and that exact same VM had been successfully configured to provide the primary storage domain just a few seconds earlier.

Engine and VDSM logs can be found at the followin URL:
http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_4.1/488/artifact/exported-artifacts/basic-suit-4.1-el7/test_logs/basic-suite-4.1/post-002_bootstrap.py/

Comment 1 Yaniv Kaul 2017-02-15 07:48:18 UTC

There are several interesting items here (http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_4.1/488/artifact/exported-artifacts/basic-suit-4.1-el7/test_logs/basic-suite-4.1/post-002_bootstrap.py/lago-basic-suite-4-1-host0/_var_log/vdsm/vdsm.log )

- We have the usual clutter of unfetched domains (to be fixed in 4.1.z), 

We have this:
2017-02-06 06:05:01,061 INFO  (jsonrpc/4) [dispatcher] Run and protect: downloadImage(methodArgs={u'url': u'http://glance.ovirt.org:9292/v1/images/36de0ae5-d5ce-4626-9fa7-272b324d3656', u'headers': {}, u'method': u'http'}, spUUID=u'067bd1e2-c0e3-41fb-8dd8-9d3b3e98df4f', sdUUID=u'446a8b9b-e71a-4489-901f-007f77e7f04d', imgUUID=u'7f78d8c6-bfc6-40b9-b530-dc6904480847', volUUID=u'06d9221b-1e0d-4d8d-b939-cd20d58a8acd') (logUtils:49)
2017-02-06 06:05:01,067 INFO  (jsonrpc/4) [dispatcher] Run and protect: downloadImage, Return response: None (logUtils:52)
2017-02-06 06:05:01,077 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Image.download succeeded in 0.02 seconds (__init__:515)

(Downloaded a 12MB image in 0.02 seconds? That's really really really fast)


There are multiple connectStorageServer to the same mount, which I can't explain:
1.
(jsonrpc/5) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'00000000-0000-0000-0000-000000000000', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:14,396 INFO  (jsonrpc/5) [storage.Mount] mounting 192.168.201.2:/exports/nfs/share1 at /rhev/data-center/mnt/192.168.201.2:_exports_nfs_share1 (mount:187)
2017-02-06 06:04:14,546 INFO  (jsonrpc/5) [IOProcessClient] Starting client ioprocess-1 (__init__:330)
2017-02-06 06:04:14,550 INFO  (ioprocess/14933) [IOProcess] Starting ioprocess (__init__:452)
2017-02-06 06:04:14,562 INFO  (jsonrpc/5) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'00000000-0000-0000-0000-000000000000'}]} (logUtils:52)
2017-02-06 06:04:14,563 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.17 seconds (__init__:515)

2. 
2017-02-06 06:04:16,021 INFO  (jsonrpc/7) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:16,023 INFO  (jsonrpc/7) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe'}]} (logUtils:52)
2017-02-06 06:04:16,024 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.00 seconds (__init__:515)

3.
2017-02-06 06:04:27,183 INFO  (jsonrpc/1) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:27,189 INFO  (jsonrpc/1) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe'}]} (logUtils:52)
2017-02-06 06:04:27,189 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.01 seconds (__init__:515)

4. 
2017-02-06 06:04:56,303 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:56,316 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe'}]} (logUtils:52)
2017-02-06 06:04:56,316 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.01 seconds (__init__:515)


I believe we have a bug for this for 4.2.


Now, there's the ISO domain:
2017-02-06 06:04:59,954 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'25f35b55-23ce-448c-8fc0-730cb8be8c1a', u'connection': u'192.168.201.2:/exports/nfs/iso', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'3', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:59,956 INFO  (jsonrpc/6) [storage.Mount] mounting 192.168.201.2:/exports/nfs/iso at /rhev/data-center/mnt/192.168.201.2:_exports_nfs_iso (mount:187)


Which did not seem to return at all.
I see it in supervdsm log, but no return:

MainProcess|jsonrpc/6::DEBUG::2017-02-06 06:04:59,957::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.2:/exports/nfs/iso', u'/rhev/data-center/mnt/192.168.201.2:_exports_nfs_iso') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=3', 'timeout': None, 'cgroup': None}
MainProcess|jsonrpc/6::DEBUG::2017-02-06 06:04:59,957::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.201.2:/exports/nfs/iso /rhev/data-center/mnt/192.168.201.2:_exports_nfs_iso (cwd None)


Is this because NFS was not entirely up?! See /var/log/messages, at the very same time:

Feb  6 06:04:59 lago-basic-suite-4-1-host0 systemd: Reached target Host and Network Name Lookups.
Feb  6 06:04:59 lago-basic-suite-4-1-host0 systemd: Starting Host and Network Name Lookups.
Feb  6 06:04:59 lago-basic-suite-4-1-host0 systemd: Reached target RPC Port Mapper.
Feb  6 06:04:59 lago-basic-suite-4-1-host0 systemd: Starting RPC Port Mapper.
Feb  6 06:04:59 lago-basic-suite-4-1-host0 systemd: Starting Preprocess NFS configuration...
Feb  6 06:05:00 lago-basic-suite-4-1-host0 systemd: Started Preprocess NFS configuration.
Feb  6 06:05:00 lago-basic-suite-4-1-host0 systemd: Starting NFS status monitor for NFSv2/3 locking....
Feb  6 06:05:00 lago-basic-suite-4-1-host0 rpc.statd[15488]: Version 1.3.0 starting
Feb  6 06:05:00 lago-basic-suite-4-1-host0 rpc.statd[15488]: Flags: TI-RPC
Feb  6 06:05:00 lago-basic-suite-4-1-host0 rpc.statd[15488]: Initializing NSM state
Feb  6 06:05:00 lago-basic-suite-4-1-host0 systemd: Started NFS status monitor for NFSv2/3 locking..

Strange...

Comment 2 Tal Nisan 2017-02-15 14:24:18 UTC

Adam, you're this week's QE contact, can you please have a look?

Comment 3 Adam Litke 2017-02-15 21:36:30 UTC

This looks like a timing issue between connectStorageServer and the subsequent attach operation but I can't yet find any smoking gun in the logs to substantiate this claim.

Comment 4 Yaniv Kaul 2017-02-23 11:13:14 UTC

(In reply to Adam Litke from comment #3)
> This looks like a timing issue between connectStorageServer and the
> subsequent attach operation but I can't yet find any smoking gun in the logs
> to substantiate this claim.

Adam - what else is needed?

Comment 5 Adam Litke 2017-03-13 13:25:15 UTC

I think the best thing to do to test for a timing issue would be to add a 5-10 second delay after calling connectStorageServer for all domains and before trying to attach those domains.  If that resolves the problem then we need to look at how the lago setup code is handling this flow to ensure it waits on the proper connection statuses from vdsm.

Comment 6 Yaniv Kaul 2017-03-13 13:27:43 UTC

(In reply to Adam Litke from comment #5)
> I think the best thing to do to test for a timing issue would be to add a
> 5-10 second delay after calling connectStorageServer for all domains and
> before trying to attach those domains.  If that resolves the problem then we
> need to look at how the lago setup code is handling this flow to ensure it
> waits on the proper connection statuses from vdsm.

This has very little to do with Lago - Lago is setting up NFS (and iSCSI) way ahead of time, on the storage server (which happens to be the Engine).
Then it only uses the RHV API for everything.
Where should we add the delay?

Comment 7 Adam Litke 2017-03-13 14:27:00 UTC

Right... Lago is trying to create new storage domains from previously created storage sources.  After lago calls the RHV API, engine connects the storage and then tries to work with it.  My guess is that in the special lago environment we are quickly orchestrating calls that are triggering a race condition somewhere.  Adding the delay is a temporary debugging technique to narrow down the race.  We can try a few things:

1. Place a delay in lago after setting up the first SD (which seems to be working reliably) and before setting up the second SD (which is failing intermittently).  If this masks the race then we probably need to look at vdsm's connection manager and storage domain monitoring code.

2. Place a delay in engine between the connectStorageServer call and the attach/create call.  If this masks the race then we need to look at the engine's storage connection logic.  connectStorageServer is an async operation and if I recall correctly there is no way to check the status since that was deemed to racy to trust.  If this turns out to be the problem spot then one way to fix it is to simply retry the attach a second time from lago.

Comment 8 Yaniv Kaul 2017-08-30 08:05:48 UTC

Closing - we have removed this flow from the tests. The issue was that the 'mount' command never returned (seen in supervdsm log). We had no further debug enabling us to understand why it never did.