Bug 1443156

Summary: HE agent is slow when faced with stale nfs mount
Product: [oVirt] ovirt-hosted-engine-ha Reporter: Jiri Belka <jbelka>
Component: General Assignee: bugs <bugs>
Status: CLOSED DUPLICATE QA Contact: Nikolai Sednev <nsednev>
Severity: high Docs Contact:
Priority: unspecified    
Version: 2.1.0.5 CC: bugs, msivak, nsoffer, rhodain, tnisan
Target Milestone: ovirt-4.2.5 Flags: rule-engine: ovirt-4.2?
rule-engine: ovirt-4.3+
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-02 08:53:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jiri Belka 2017-04-18 15:26:43 UTC
Description of problem:

We had an export domain (NFS) that was out of service, but it seems it wasn't detached correctly, so it became a stale NFS mount and the HE agent took ages to get past 'Extracting Engine VM OVF from the OVF_STORE'... This BZ was requested by stirabos@.

Apr 07 13:47:02 slot-2.rhev.lab.eng.brq.redhat.com kernel: nfs: server 10.34.63.204 not responding, timed out

...
MainThread::INFO::2017-04-07 13:35:48,633::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2017-04-07 13:57:09,121::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/10.34.63.199:___brq-setup/23c03bb6-9889-4cbf-b7ad-55b9a2c70653/images/9bda2694-08e0-443a-8c83-e986506e0ef9/8d084f1f-6941-44ce-a2bf-5832d5ad9362

  ^^ it took over 21 minutes!
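
The gap between the two OVF_STORE log lines above can be checked from their timestamps. This is only a small sketch: the two timestamp strings are copied from the log, while the helper name `parse_agent_ts` is made up for illustration and is not part of ovirt-hosted-engine-ha.

```python
from datetime import datetime, timedelta


def parse_agent_ts(text):
    """Parse an agent.log timestamp such as '2017-04-07 13:35:48,633'.

    Hypothetical helper for this example only; the format string assumes
    the 'date time,milliseconds' layout seen in the log lines above.
    """
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


# Timestamps copied from the two getEngineVMOVF log lines.
started = parse_agent_ts("2017-04-07 13:35:48,633")
finished = parse_agent_ts("2017-04-07 13:57:09,121")

elapsed = finished - started
print("Extracting the OVF took", elapsed)
```

So the agent spent roughly 21 minutes and 20 seconds between the two messages.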

...
MainThread::DEBUG::2017-04-07 19:40:19,729::hosted_engine::427::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Processing engine state <ovirt_hosted_engine_ha.agent.states.GlobalMaintenance object at 0x1ece640>
MainThread::INFO::2017-04-07 19:40:19,729::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1491586819.73 type=state_transition detail=EngineDown-GlobalMaintenance hostname='slot-2.rhev.lab.eng.brq.redhat.com'
MainThread::DEBUG::2017-04-07 19:40:19,729::brokerlink::274::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Sending request: notify time=1491586819.73 type=state_transition detail=EngineDown-GlobalMaintenance hostname='slot-2.rhev.lab.eng.brq.redhat.com'
MainThread::DEBUG::2017-04-07 19:40:19,729::util::80::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) socket_readline with 30.0 seconds timeout
MainThread::DEBUG::2017-04-07 19:40:49,756::util::91::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) Connection timeout while reading from socket
MainThread::ERROR::2017-04-07 19:40:49,756::brokerlink::280::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::DEBUG::2017-04-07 19:40:49,757::brokerlink::86::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(disconnect) Closing connection to ha-broker
MainThread::ERROR::2017-04-07 19:40:49,778::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 191, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 64, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
    hostname=socket.gethostname())
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 118, in notify
    .format(event_type, detail, options, e))
RequestError: Failed to send notification about state_transition, detail EngineDown-GlobalMaintenance, options {'hostname': 'slot-2.rhev.lab.eng.brq.redhat.com'}: Connection timed out

Version-Release number of selected component (if applicable):
vdsm-4.19.10.1-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.0.5-1.el7ev.noarch

How reproducible:
just happens

Steps to Reproduce:
1. deploy SHE
2. check via hosted-engine --vm-status that it's ok
3. add an NFS domain (e.g. export) and make it unavailable so that it becomes a stale mount point on the hosts
4. restart HE broker/agent

Actual results:
It takes ages (circa 20 mins) for the HE agent to get past 'Extracting Engine VM OVF from the OVF_STORE', and hosted-engine --vm-status shows, IIRC, 'unknown stale data' (??)

Expected results:
At least the HE agent/broker should not get stuck on a stale NFS mount point.

Additional info:

Comment 2 Allon Mureinik 2017-04-19 06:36:16 UTC
I sincerely doubt there's anything we can do here - the mount is stale, and NFS itself takes too long to return an error.

Nir, am I missing anything?

Comment 3 Nir Soffer 2017-04-21 08:42:56 UTC
(In reply to Allon Mureinik from comment #2)
> I sincerely doubt there's anything we can do here - the mount is stale, and
> NFS itself takes too long to return an error.

If the mount was stale, we would never succeed with extracting the ovf. Seems that
the nfs server was simply very slow. I don't know what we can do better in this
case.

Comment 4 Allon Mureinik 2017-04-24 12:10:13 UTC
(In reply to Nir Soffer from comment #3)
> (In reply to Allon Mureinik from comment #2)
> > I sincerely doubt there's anything we can do here - the mount is stale, and
> > NFS itself takes too long to return an error.
> 
> If the mount was stale, we would never succeed with extracting the ovf.
> Seems that
> the nfs server was simply very slow. I don't know what we can do better in
> this
> case.

So just CLOSE CANTFIX?

Comment 5 Nir Soffer 2017-04-25 06:50:22 UTC
I would check the logs first to understand this issue better.

Comment 6 Roman Hodain 2017-07-18 12:55:56 UTC
We have just hit this issue in our testing lab when our export domain and NFS domain got stale. The hosted engine SD is placed on an FC domain and the agent got stuck on both of the HE nodes. The problem is in

    ovirt_hosted_engine_ha/lib/heconflib.py

in method get_volume_path. We create the volume path like this:

    volume_path = os.path.join(
        volume_path,
        '*',
        sd_uuid,
        'images',
        img_uuid,
        vol_uuid,
    )

The volume path looks like this in our case:

   /rhev/data-center/mnt/*/27da7524-f4b7-41d9-bcc4-c524e4540568/images/1853ae71-943f-4b70-81cb-5e5bcb538524/f2208f13-2f76-46f9-89ba-44a1a0c2ac43

As there are also other mount points besides the HE SD, expanding the '*' wildcard has to look into every one of them, so we are delayed on the stale ISO and export domains.
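
To illustrate why the wildcard is a problem, here is a minimal sketch. It assumes the pattern is expanded with something like `glob.glob` (this comment does not show that call), and it uses a throwaway directory tree instead of /rhev/data-center/mnt. The names `exists_with_timeout` and `find_volume_path`, the fake UUIDs and the 5 second timeout are all made up for the example; this is not the fix that was applied to ovirt-hosted-engine-ha.

```python
import glob
import os
import tempfile
import threading


def exists_with_timeout(path, timeout):
    """Return os.path.exists(path), or None if the check did not finish
    within `timeout` seconds (which is what a stale NFS mount looks like)."""
    result = []
    worker = threading.Thread(
        target=lambda: result.append(os.path.exists(path)),
        daemon=True,  # a hung check must not block interpreter exit
    )
    worker.start()
    worker.join(timeout)
    return result[0] if result else None


def find_volume_path(mnt_root, sd_uuid, img_uuid, vol_uuid, timeout=5.0):
    """Hypothetical alternative to one big glob: probe one mount at a time
    and give up on a mount that does not answer within `timeout`."""
    for mount in sorted(os.listdir(mnt_root)):
        candidate = os.path.join(
            mnt_root, mount, sd_uuid, "images", img_uuid, vol_uuid)
        if exists_with_timeout(candidate, timeout):
            return candidate
    return None


# Fake mount tree: one unrelated mount and one holding the HE volume.
root = tempfile.mkdtemp()
sd, img, vol = "sd-uuid", "img-uuid", "vol-uuid"
os.makedirs(os.path.join(root, "export_mount"))
os.makedirs(os.path.join(root, "he_mount", sd, "images", img))
open(os.path.join(root, "he_mount", sd, "images", img, vol), "w").close()

# The wildcard makes glob look for sd-uuid under *every* mount directory.
pattern = os.path.join(root, "*", sd, "images", img, vol)
matches = glob.glob(pattern)

found = find_volume_path(root, sd, img, vol)
print(matches, found)
```

With a local tree both approaches return the same path. The difference only shows on a real stale mount: the glob blocks for as long as NFS does, while the per-mount probe costs at most `timeout` seconds for each unresponsive mount. The probe thread itself may stay blocked in the background, so this only bounds the wait of the caller.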

Changing severity to high to bring attention to this issue, as it affects HE availability in case of networking issues.

Comment 7 Nir Soffer 2018-01-07 17:30:53 UTC
Based on comment 6, moving to integration team.

Martin, can you check this?

Comment 9 Tal Nisan 2018-07-02 08:53:25 UTC

*** This bug has been marked as a duplicate of bug 1485883 ***