Bug 1639997 - Agent fails to start with error "Failed to start monitor ping"
Summary: Agent fails to start with error "Failed to start monitor ping"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.3.0
: ---
Assignee: Simone Tiraboschi
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-17 06:44 UTC by Sahina Bose
Modified: 2019-04-11 21:11 UTC (History)
4 users (show)

Fixed In Version: v2.3.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-13 07:44:59 UTC
oVirt Team: Integration
rule-engine: ovirt-4.3+


Attachments (Terms of Use)
Broker logs node2 (347.17 KB, text/plain)
2018-10-30 10:06 UTC, Abhishek Sahni
no flags Details
agent logs node2 (1.09 MB, text/plain)
2018-10-30 10:06 UTC, Abhishek Sahni
no flags Details
memory usage (34.07 KB, image/png)
2018-10-30 10:45 UTC, Abhishek Sahni
no flags Details
GUI - host screenshot (31.66 KB, image/png)
2018-10-30 10:49 UTC, Abhishek Sahni
no flags Details


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 95564 'None' MERGED text: rephrase brokerlink exceptions 2020-11-21 13:42:38 UTC
oVirt gerrit 95588 'None' ABANDONED err: rephrase brokerlink exceptions 2020-11-21 13:42:37 UTC

Description Sahina Bose 2018-10-17 06:44:24 UTC
Description of problem:
MainThread::ERROR::2018-10-16 15:58:41,492::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2018-10-16 15:58:41,492::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2018-10-16 15:58:51,881::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.18 started
MainThread::INFO::2018-10-16 15:58:51,907::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: tendrl25.lab.eng.blr.redhat.com
MainThread::INFO::2018-10-16 15:58:51,998::hosted_engine::522::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2018-10-16 15:58:52,000::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.70.43.254'}
MainThread::ERROR::2018-10-16 15:58:52,000::hosted_engine::538::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
MainThread::ERROR::2018-10-16 15:58:52,000::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 535, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 83, in start_monitor
    .format(type, options, e))
RequestError: Failed to start monitor ping, options {'addr': '10.70.43.254'}: [Errno 2] No such file or directory
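
For context, the following is a minimal sketch of the error path visible in the traceback above; it is not the actual ovirt-hosted-engine-ha source, and the simplified start_monitor() signature and the send_request callable are assumptions for illustration only. The point it shows: any low-level failure on the agent-to-broker connection is re-raised as a RequestError whose message embeds the monitor type, its options, and the original errno, which is why "[Errno 2] No such file or directory" ends up in agent.log.

# Illustrative sketch only; not the real brokerlink module.
class RequestError(Exception):
    """Raised when a request to the ha-broker cannot be completed."""

def start_monitor(monitor_type, options, send_request):
    """Ask the broker to start a monitor, wrapping low-level failures."""
    try:
        # send_request stands in for the agent -> broker socket call.
        return send_request("start_monitor", monitor_type, options)
    except (OSError, IOError) as e:
        # A missing/closed broker socket surfaces here as errno 2 and is
        # re-raised with the monitor type and options in the message.
        raise RequestError(
            "Failed to start monitor {0}, options {1}: {2}".format(
                monitor_type, options, e))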


Version-Release number of selected component (if applicable):
4.2.7

How reproducible:


Steps to Reproduce:
1. Reboot one of the nodes in a 3 node HC deployment
2. Move a 2nd node to maintenance and activate (without stopping gluster services)
3. HE VM is not available, and the ovirt-ha-agent fails to start

Actual results:
ovirt-ha-agent service fails with above error in logs

Expected results:


Additional info:

Comment 1 Sahina Bose 2018-10-17 11:47:24 UTC
I don't have steps to reproduce; however, I noticed this when one of the nodes in the HC environment goes down.

Comment 2 Abhishek Sahni 2018-10-30 10:06:08 UTC
Created attachment 1498874 [details]
Broker logs node2

Comment 3 Abhishek Sahni 2018-10-30 10:06:41 UTC
Created attachment 1498875 [details]
agent logs node2

Comment 4 Abhishek Sahni 2018-10-30 10:15:04 UTC
Hello Team,

I have observed a significant increase in memory usage on all the nodes as the agent and broker are consistently failing. However, everything was working fine before the upgrade. 

Setup Details,

Hyperconverged - 3 gluster nodes along with 2 independent nodes.
32GB memory available on each. 24 cores per node.

VMs running on node2 = 0

==================================================
PKGS:
 rpm -qa| grep glusterfs
glusterfs-api-3.12.14-1.el7.x86_64
glusterfs-fuse-3.12.14-1.el7.x86_64
glusterfs-rdma-3.12.14-1.el7.x86_64
glusterfs-events-3.12.14-1.el7.x86_64
glusterfs-cli-3.12.14-1.el7.x86_64
glusterfs-server-3.12.14-1.el7.x86_64
glusterfs-libs-3.12.14-1.el7.x86_64
glusterfs-3.12.14-1.el7.x86_64
glusterfs-client-xlators-3.12.14-1.el7.x86_64
glusterfs-gnfs-3.12.14-1.el7.x86_64
glusterfs-geo-replication-3.12.14-1.el7.x86_64
=========================
rpm -qa| grep ovirt
ovirt-host-dependencies-4.2.3-1.el7.x86_64
cockpit-ovirt-dashboard-0.11.33-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-node-ng-nodectl-4.2.0-0.20181003.0.el7.noarch
cockpit-machines-ovirt-176-2.el7.centos.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-hosted-engine-setup-2.2.26-1.el7.noarch
ovirt-node-ng-image-update-placeholder-4.2.6.2-1.el7.noarch
ovirt-setup-lib-1.1.5-1.el7.noarch
ovirt-release42-4.2.6.2-1.el7.noarch
ovirt-imageio-common-1.4.4-0.el7.x86_64
python-ovirt-engine-sdk4-4.2.8-2.el7.x86_64
ovirt-imageio-daemon-1.4.4-0.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
ovirt-node-ng-image-update-4.2.6.2-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.14-1.el7.noarch
ovirt-release-host-node-4.2.6.2-1.el7.noarch
==================================================

I think we should increase the priority of this bug.

Comment 5 Abhishek Sahni 2018-10-30 10:45:56 UTC
Created attachment 1498889 [details]
memory usage

Comment 6 Abhishek Sahni 2018-10-30 10:49:04 UTC
Created attachment 1498890 [details]
GUI - host screenshot

Comment 7 Sandro Bonazzola 2018-11-02 12:26:59 UTC
(In reply to Abhishek Sahni from comment #5)
> Created attachment 1498889 [details]
> memory usage

Not really useful. Can you please attach a sos report?

Comment 8 Abhishek Sahni 2018-11-20 11:45:06 UTC
Hello Sandro,

Thanks for your response. It seems there was an unknown lock on one of the nodes, i.e. node4, an independent node (not part of the gluster setup). I had tried restarting all three HC/gluster nodes (node1, node2, node3), and since there was no workaround I also restarted node4. At that moment the issue was resolved for me.

Comment 9 Polina 2018-12-20 09:07:05 UTC
Hi, the verification was done with:
   ovirt-engine-4.3.0-0.6.alpha2.el7.noarch
   vdsm-4.30.4-1.el7ev.x86_64
   vdsm-gluster-4.30.4-1.el7ev.x86_64
   ovirt-hosted-engine-setup-2.3.0-0.1.alpha.gitd1b8dbb.el7ev.noarch
Here is the exact scenario:
Pre-conditions:
   1. There are three active hosts.
   2. The HE VM storage is on gluster.
   3. The HE VM runs on host1.
   4. One guest VM runs on host1.
Steps:
   1. Reboot host2.
   2. Put host3 into maintenance.
   At this point everything is OK - the HE VM is Up, and the guest VM is Up as well.
   3. ssh to host1 and block access to the gluster storage services (replica 3) with an iptables DROP rule (e.g. iptables -I INPUT -s gluster01.lab.eng.tlv2.redhat.com -j DROP).
As a result, at this point we see monitoring errors in agent.log; ovirt-ha-broker.service is active and ovirt-ha-agent.service is down, which can be considered OK in this state.

MainThread::ERROR::2018-12-20 10:28:58,474::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 537, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 83, in start_monitor
    .format(type, options, e))
RequestError: Failed to start monitor ping, options {'addr': '10.46.17.254'}: [Errno 2] No such file or directory

MainThread::ERROR::2018-12-20 10:28:58,474::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2018-12-20 10:28:58,475::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down



systemctl status ovirt-ha-broker
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-12-20 10:23:48 IST; 7min ago
 Main PID: 116965 (ovirt-ha-broker)
    Tasks: 2
   Memory: 19.8M
   CGroup: /system.slice/ovirt-ha-broker.service
           └─116965 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Dec 20 10:23:48 lynx12.lab.eng.tlv2.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
[root@lynx12 ovirt-hosted-engine-ha]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-12-20 10:31:18 IST; 9s ago
  Process: 119251 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 119251 (code=exited, status=157)

Dec 20 10:31:18 lynx12.lab.eng.tlv2.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Dec 20 10:31:18 lynx12.lab.eng.tlv2.redhat.com systemd[1]: ovirt-ha-agent.service failed.


   4. Restore access to the gluster storage.

Result: for a while we still see these errors, but then the agent service is restarted and the HE VM is Up and running.

Comment 10 Sandro Bonazzola 2019-02-13 07:44:59 UTC
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 11 Robert McSwain 2019-04-11 19:57:00 UTC
Is there any understanding of how this can be worked around if restarting the gluster nodes and bringing down the agent/broker doesn't help and still results in the 

RequestError: Failed to start monitor ping, options {'addr': 'GA.TE.WA.Y'}: [Errno 2] No such file or directory

error in agent.log?

Comment 12 Simone Tiraboschi 2019-04-11 21:11:09 UTC
That error in agent.log simply means that the agent failed to communicate with the broker over a unix domain socket, because the socket is closed on the broker side.
This means that you have to check broker.log to understand why the broker is failing (99% of the time it's a storage issue), which is the root cause of this.
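
To make the symptom concrete, here is a minimal sketch, assuming a hypothetical socket path (the real broker socket path is not shown here), of how connecting to a unix domain socket whose file is gone yields exactly this errno; the agent only reports the symptom, so the root cause has to be looked up in broker.log.

# Illustrative sketch only; BROKER_SOCKET is a hypothetical placeholder path.
import socket

BROKER_SOCKET = "/path/to/broker.socket"

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    s.connect(BROKER_SOCKET)
except socket.error as e:
    # If the broker is down and its socket file does not exist, connect()
    # fails with errno 2 ("No such file or directory") -- the same errno
    # embedded in the RequestError seen in agent.log.
    print(e)
finally:
    s.close()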

