Bug 1421945
| Summary: | Adding a secondary storage domain fails sporadically (NFS client not completely up on time?!) | | |
| --- | --- | --- | --- |
| Product: | [oVirt] ovirt-engine | Reporter: | Barak Korren <bkorren> |
| Component: | BLL.Storage | Assignee: | Allon Mureinik <amureini> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Raz Tamir <ratamir> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | future | CC: | alitke, amureini, bugs, tnisan, ylavi |
| Target Milestone: | ovirt-4.2.0 | Flags: | rule-engine: ovirt-4.2+ |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-30 08:05:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
## Description

**Barak Korren**, 2017-02-14 07:26:21 UTC
There are several interesting items here (http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_4.1/488/artifact/exported-artifacts/basic-suit-4.1-el7/test_logs/basic-suite-4.1/post-002_bootstrap.py/lago-basic-suite-4-1-host0/_var_log/vdsm/vdsm.log):

We have the usual clutter of unfetched domains (to be fixed in 4.1.z).

We have this:

```
2017-02-06 06:05:01,061 INFO  (jsonrpc/4) [dispatcher] Run and protect: downloadImage(methodArgs={u'url': u'http://glance.ovirt.org:9292/v1/images/36de0ae5-d5ce-4626-9fa7-272b324d3656', u'headers': {}, u'method': u'http'}, spUUID=u'067bd1e2-c0e3-41fb-8dd8-9d3b3e98df4f', sdUUID=u'446a8b9b-e71a-4489-901f-007f77e7f04d', imgUUID=u'7f78d8c6-bfc6-40b9-b530-dc6904480847', volUUID=u'06d9221b-1e0d-4d8d-b939-cd20d58a8acd') (logUtils:49)
2017-02-06 06:05:01,067 INFO  (jsonrpc/4) [dispatcher] Run and protect: downloadImage, Return response: None (logUtils:52)
2017-02-06 06:05:01,077 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Image.download succeeded in 0.02 seconds (__init__:515)
```

(Downloaded a 12MB image in 0.02 seconds? That's really really really fast.)

There are multiple connectStorageServer calls to the same mount, which I can't explain:

1.
```
(jsonrpc/5) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'00000000-0000-0000-0000-000000000000', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:14,396 INFO  (jsonrpc/5) [storage.Mount] mounting 192.168.201.2:/exports/nfs/share1 at /rhev/data-center/mnt/192.168.201.2:_exports_nfs_share1 (mount:187)
2017-02-06 06:04:14,546 INFO  (jsonrpc/5) [IOProcessClient] Starting client ioprocess-1 (__init__:330)
2017-02-06 06:04:14,550 INFO  (ioprocess/14933) [IOProcess] Starting ioprocess (__init__:452)
2017-02-06 06:04:14,562 INFO  (jsonrpc/5) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'00000000-0000-0000-0000-000000000000'}]} (logUtils:52)
2017-02-06 06:04:14,563 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.17 seconds (__init__:515)
```

2.

```
2017-02-06 06:04:16,021 INFO  (jsonrpc/7) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:16,023 INFO  (jsonrpc/7) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe'}]} (logUtils:52)
2017-02-06 06:04:16,024 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.00 seconds (__init__:515)
```

3.
```
2017-02-06 06:04:27,183 INFO  (jsonrpc/1) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:27,189 INFO  (jsonrpc/1) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe'}]} (logUtils:52)
2017-02-06 06:04:27,189 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.01 seconds (__init__:515)
```

4.

```
2017-02-06 06:04:56,303 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe', u'connection': u'192.168.201.2:/exports/nfs/share1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'4.1', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:56,316 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1a774d48-d7f5-4db8-b763-9689b3b3dfbe'}]} (logUtils:52)
2017-02-06 06:04:56,316 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call StoragePool.connectStorageServer succeeded in 0.01 seconds (__init__:515)
```

I believe we have a bug for this for 4.2.
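As an aside on why the repeated calls above succeed in ~0.00 seconds: a connect to an already-mounted export can short-circuit as soon as the mountpoint is found in `/proc/mounts`. The sketch below is an illustration of that idempotency only, not vdsm code; the helper names (`mountpoints`, `needs_mount`) are hypothetical.

```python
# Hypothetical illustration (not vdsm code): a repeat "connect" finds the
# mountpoint already listed in /proc/mounts and becomes a cheap no-op,
# which would explain repeated connectStorageServer calls returning almost
# instantly with status 0.

def mountpoints(proc_mounts_text):
    """Return the set of mountpoints (second field) from /proc/mounts content."""
    result = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            result.add(fields[1])
    return result

def needs_mount(mountpoint, proc_mounts_text):
    """True only if the given mountpoint is not already mounted."""
    return mountpoint not in mountpoints(proc_mounts_text)
```

Reading the real `/proc/mounts` and shelling out to `mount` would complete the picture; the point is only that a second connect to the same export can return without doing any work.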
Now, there's the ISO domain:

```
2017-02-06 06:04:59,954 INFO  (jsonrpc/6) [dispatcher] Run and protect: connectStorageServer(domType=1, spUUID=u'00000000-0000-0000-0000-000000000000', conList=[{u'id': u'25f35b55-23ce-448c-8fc0-730cb8be8c1a', u'connection': u'192.168.201.2:/exports/nfs/iso', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'protocol_version': u'3', u'password': '********', u'port': u''}], options=None) (logUtils:49)
2017-02-06 06:04:59,956 INFO  (jsonrpc/6) [storage.Mount] mounting 192.168.201.2:/exports/nfs/iso at /rhev/data-center/mnt/192.168.201.2:_exports_nfs_iso (mount:187)
```

Which did not seem to return at all. I see it in the supervdsm log, but no return:

```
MainProcess|jsonrpc/6::DEBUG::2017-02-06 06:04:59,957::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call mount with (u'192.168.201.2:/exports/nfs/iso', u'/rhev/data-center/mnt/192.168.201.2:_exports_nfs_iso') {'vfstype': 'nfs', 'mntOpts': 'soft,nosharecache,timeo=600,retrans=6,nfsvers=3', 'timeout': None, 'cgroup': None}
MainProcess|jsonrpc/6::DEBUG::2017-02-06 06:04:59,957::commands::69::root::(execCmd) /usr/bin/taskset --cpu-list 0-1 /usr/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.201.2:/exports/nfs/iso /rhev/data-center/mnt/192.168.201.2:_exports_nfs_iso (cwd None)
```

Is this because NFS was not entirely up?! See /var/log/messages, at the very same time:

```
Feb 6 06:04:59 lago-basic-suite-4-1-host0 systemd: Reached target Host and Network Name Lookups.
Feb 6 06:04:59 lago-basic-suite-4-1-host0 systemd: Starting Host and Network Name Lookups.
Feb 6 06:04:59 lago-basic-suite-4-1-host0 systemd: Reached target RPC Port Mapper.
Feb 6 06:04:59 lago-basic-suite-4-1-host0 systemd: Starting RPC Port Mapper.
Feb 6 06:04:59 lago-basic-suite-4-1-host0 systemd: Starting Preprocess NFS configuration...
Feb 6 06:05:00 lago-basic-suite-4-1-host0 systemd: Started Preprocess NFS configuration.
```
```
Feb 6 06:05:00 lago-basic-suite-4-1-host0 systemd: Starting NFS status monitor for NFSv2/3 locking....
Feb 6 06:05:00 lago-basic-suite-4-1-host0 rpc.statd[15488]: Version 1.3.0 starting
Feb 6 06:05:00 lago-basic-suite-4-1-host0 rpc.statd[15488]: Flags: TI-RPC
Feb 6 06:05:00 lago-basic-suite-4-1-host0 rpc.statd[15488]: Initializing NSM state
Feb 6 06:05:00 lago-basic-suite-4-1-host0 systemd: Started NFS status monitor for NFSv2/3 locking..
```

Strange... Adam, you're this week's QE contact, can you please have a look?

---

This looks like a timing issue between connectStorageServer and the subsequent attach operation but I can't yet find any smoking gun in the logs to substantiate this claim.

---

(In reply to Adam Litke from comment #3)
> This looks like a timing issue between connectStorageServer and the
> subsequent attach operation but I can't yet find any smoking gun in the logs
> to substantiate this claim.

Adam - what else is needed?

---

I think the best thing to do to test for a timing issue would be to add a 5-10 second delay after calling connectStorageServer for all domains and before trying to attach those domains. If that resolves the problem then we need to look at how the lago setup code is handling this flow to ensure it waits on the proper connection statuses from vdsm.

---

(In reply to Adam Litke from comment #5)
> I think the best thing to do to test for a timing issue would be to add a
> 5-10 second delay after calling connectStorageServer for all domains and
> before trying to attach those domains. If that resolves the problem then we
> need to look at how the lago setup code is handling this flow to ensure it
> waits on the proper connection statuses from vdsm.

This has very little to do with Lago - Lago is setting up NFS (and iSCSI) way ahead of time, on the storage server (which happens to be the Engine). Then it only uses the RHV API for everything. Where should we add the delay?

---

Right...
Lago is trying to create new storage domains from previously created storage sources. After Lago calls the RHV API, the engine connects the storage and then tries to work with it. My guess is that in the special Lago environment we are quickly orchestrating calls that are triggering a race condition somewhere. Adding the delay is a temporary debugging technique to narrow down the race. We can try a few things:

1. Place a delay in Lago after setting up the first SD (which seems to be working reliably) and before setting up the second SD (which is failing intermittently). If this masks the race then we probably need to look at vdsm's connection manager and storage domain monitoring code.
2. Place a delay in the engine between the connectStorageServer call and the attach/create call. If this masks the race then we need to look at the engine's storage connection logic.

connectStorageServer is an async operation and, if I recall correctly, there is no way to check the status since that was deemed too racy to trust. If this turns out to be the problem spot then one way to fix it is to simply retry the attach a second time from Lago.

---

Closing - we have removed this flow from the tests. The issue was that the 'mount' command never returned (seen in the supervdsm log). We had no further debug output enabling us to understand why it never did.
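The debugging direction discussed in the thread (wait for the storage side to be ready, bound the mount so it cannot hang forever) could be sketched outside of vdsm roughly as follows. This is a minimal illustration with assumed helper names (`wait_for_port`, `mount_nfs`), using plain `subprocess` rather than vdsm's supervdsm path:

```python
# Sketch only: poll until the NFS server's TCP port answers before mounting,
# and run mount with a hard timeout so a stuck mount surfaces as an error
# instead of never returning (as seen in the supervdsm log above).
import socket
import subprocess
import time

def wait_for_port(host, port, timeout=30.0):
    """Poll until TCP host:port accepts connections, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return
        except OSError:
            time.sleep(1.0)
    raise TimeoutError("%s:%d not reachable" % (host, port))

def mount_nfs(remote, mountpoint, mount_timeout=60):
    """Mount remote (e.g. '192.168.201.2:/exports/nfs/iso') at mountpoint."""
    host = remote.split(":", 1)[0]
    wait_for_port(host, 2049)  # NFS; rpcbind would be port 111
    # timeout= makes a hung mount raise subprocess.TimeoutExpired
    subprocess.run(
        ["mount", "-t", "nfs", "-o", "soft,timeo=600,retrans=6",
         remote, mountpoint],
        check=True, timeout=mount_timeout)
```

Had something like the port check been in place here, the race would likely have shown up as an explicit "not reachable" error at 06:04:59 rather than a silently hanging mount.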