Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1639997

Summary: Agent fails to start with error "Failed to start monitor ping"
Product: [oVirt] ovirt-hosted-engine-ha
Component: Agent
Version: 2.3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Reporter: Sahina Bose <sabose>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Polina <pagranat>
Docs Contact:
CC: abhishek, bugs, rmcswain, stirabos
Target Milestone: ovirt-4.3.0
Target Release: ---
Flags: rule-engine: ovirt-4.3+
Whiteboard:
Fixed In Version: v2.3.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-13 07:44:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description             Flags
  Broker logs node2       none
  agent logs node2        none
  memory usage            none
  GUI - host screenshot   none

Description Sahina Bose 2018-10-17 06:44:24 UTC
Description of problem:
MainThread::ERROR::2018-10-16 15:58:41,492::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2018-10-16 15:58:41,492::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2018-10-16 15:58:51,881::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.18 started
MainThread::INFO::2018-10-16 15:58:51,907::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: tendrl25.lab.eng.blr.redhat.com
MainThread::INFO::2018-10-16 15:58:51,998::hosted_engine::522::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2018-10-16 15:58:52,000::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.70.43.254'}
MainThread::ERROR::2018-10-16 15:58:52,000::hosted_engine::538::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
MainThread::ERROR::2018-10-16 15:58:52,000::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 535, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 83, in start_monitor
    .format(type, options, e))
RequestError: Failed to start monitor ping, options {'addr': '10.70.43.254'}: [Errno 2] No such file or directory
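
Note that the "[Errno 2] No such file or directory" at the bottom is the underlying OS error; the frame at brokerlink.py line 83 merely wraps it into a RequestError. Below is a minimal sketch of that wrapping pattern, a hypothetical reconstruction rather than the actual brokerlink.py source (broker_proxy is a made-up stand-in for the agent's connection to the broker), whose message format matches the log line above:

    class RequestError(Exception):
        # Stand-in for the RequestError raised by the HA agent's brokerlink.
        pass

    def start_monitor(broker_proxy, monitor_type, options):
        # Ask the broker to start a monitor. Any failure while talking to
        # the broker over its unix domain socket (including a missing
        # socket file, i.e. [Errno 2]) is re-raised as a RequestError.
        try:
            return broker_proxy.start_monitor(monitor_type, options)
        except Exception as e:
            raise RequestError("Failed to start monitor {0}, options {1}: {2}"
                               .format(monitor_type, options, e))

So the ping monitor itself is not the problem; the agent simply could not reach the broker's socket.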


Version-Release number of selected component (if applicable):
4.2.7

How reproducible:


Steps to Reproduce:
1. Reboot one of the nodes in a 3 node HC deployment
2. Move a 2nd node to maintenance and then activate it (without stopping gluster services)
3. HE VM is not available, and the ovirt-ha-agent fails to start

Actual results:
ovirt-ha-agent service fails with the above error in the logs

Expected results:


Additional info:

Comment 1 Sahina Bose 2018-10-17 11:47:24 UTC
I don't have steps to reproduce; however, I noticed this when one of the nodes in the HC environment went down.

Comment 2 Abhishek Sahni 2018-10-30 10:06:08 UTC
Created attachment 1498874 [details]
Broker logs node2

Comment 3 Abhishek Sahni 2018-10-30 10:06:41 UTC
Created attachment 1498875 [details]
agent logs node2

Comment 4 Abhishek Sahni 2018-10-30 10:15:04 UTC
Hello Team,

I have observed a significant increase in memory usage on all the nodes as the agent and broker are consistently failing. Everything was working fine before the upgrade.

Setup details:

Hyperconverged - 3 gluster nodes along with 2 independent nodes.
32 GB memory available on each. 24 cores per node.

VMs running on node2 = 0

==================================================
PKGS:
rpm -qa | grep glusterfs
glusterfs-api-3.12.14-1.el7.x86_64
glusterfs-fuse-3.12.14-1.el7.x86_64
glusterfs-rdma-3.12.14-1.el7.x86_64
glusterfs-events-3.12.14-1.el7.x86_64
glusterfs-cli-3.12.14-1.el7.x86_64
glusterfs-server-3.12.14-1.el7.x86_64
glusterfs-libs-3.12.14-1.el7.x86_64
glusterfs-3.12.14-1.el7.x86_64
glusterfs-client-xlators-3.12.14-1.el7.x86_64
glusterfs-gnfs-3.12.14-1.el7.x86_64
glusterfs-geo-replication-3.12.14-1.el7.x86_64
=========================
rpm -qa | grep ovirt
ovirt-host-dependencies-4.2.3-1.el7.x86_64
cockpit-ovirt-dashboard-0.11.33-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-node-ng-nodectl-4.2.0-0.20181003.0.el7.noarch
cockpit-machines-ovirt-176-2.el7.centos.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-hosted-engine-setup-2.2.26-1.el7.noarch
ovirt-node-ng-image-update-placeholder-4.2.6.2-1.el7.noarch
ovirt-setup-lib-1.1.5-1.el7.noarch
ovirt-release42-4.2.6.2-1.el7.noarch
ovirt-imageio-common-1.4.4-0.el7.x86_64
python-ovirt-engine-sdk4-4.2.8-2.el7.x86_64
ovirt-imageio-daemon-1.4.4-0.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
ovirt-node-ng-image-update-4.2.6.2-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.14-1.el7.noarch
ovirt-release-host-node-4.2.6.2-1.el7.noarch
==================================================

I think we should increase the priority of this bug.

Comment 5 Abhishek Sahni 2018-10-30 10:45:56 UTC
Created attachment 1498889 [details]
memory usage

Comment 6 Abhishek Sahni 2018-10-30 10:49:04 UTC
Created attachment 1498890 [details]
GUI - host screenshot

Comment 7 Sandro Bonazzola 2018-11-02 12:26:59 UTC
(In reply to Abhishek Sahni from comment #5)
> Created attachment 1498889 [details]
> memory usage

Not really useful, can you please attach a sos report?

Comment 8 Abhishek Sahni 2018-11-20 11:45:06 UTC
Hello Sandro,

Thanks for your response. It seems there was an unknown lock on one of the nodes, node4, an independent node (not part of the gluster setup). I had tried restarting all three HC gluster nodes (node1, node2, node3), and since there was no workaround I restarted node4 as well. At that moment the issue was resolved for me.

Comment 9 Polina 2018-12-20 09:07:05 UTC
Hi, the verification was done with:
   ovirt-engine-4.3.0-0.6.alpha2.el7.noarch
   vdsm-4.30.4-1.el7ev.x86_64
   vdsm-gluster-4.30.4-1.el7ev.x86_64
   ovirt-hosted-engine-setup-2.3.0-0.1.alpha.gitd1b8dbb.el7ev.noarch
Here is the exact scenario:
Pre-conditions:
   1. There are three active hosts.
   2. The HE VM storage is on gluster.
   3. The HE VM runs on host1.
   4. One guest VM runs on host1.
Steps:
   1. Reboot host2.
   2. Put host3 into maintenance.
   Everything is OK: the HE VM is Up, and the guest VM is Up as well.
   3. SSH to host1 and block access to the gluster storage services (replica 3) with an iptables DROP rule (e.g., iptables -I INPUT -s gluster01.lab.eng.tlv2.redhat.com -j DROP).
As a result, at this point we see monitoring errors in agent.log; ovirt-ha-broker.service is active while ovirt-ha-agent.service is down, which can be considered OK in this state.

MainThread::ERROR::2018-12-20 10:28:58,474::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 537, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 83, in start_monitor
    .format(type, options, e))
RequestError: Failed to start monitor ping, options {'addr': '10.46.17.254'}: [Errno 2] No such file or directory

MainThread::ERROR::2018-12-20 10:28:58,474::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2018-12-20 10:28:58,475::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down



systemctl status ovirt-ha-broker
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-12-20 10:23:48 IST; 7min ago
 Main PID: 116965 (ovirt-ha-broker)
    Tasks: 2
   Memory: 19.8M
   CGroup: /system.slice/ovirt-ha-broker.service
           └─116965 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Dec 20 10:23:48 lynx12.lab.eng.tlv2.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
[root@lynx12 ovirt-hosted-engine-ha]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-12-20 10:31:18 IST; 9s ago
  Process: 119251 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 119251 (code=exited, status=157)

Dec 20 10:31:18 lynx12.lab.eng.tlv2.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Dec 20 10:31:18 lynx12.lab.eng.tlv2.redhat.com systemd[1]: ovirt-ha-agent.service failed.


   4. Return access to the gluster storage (e.g., by deleting the DROP rule added in step 3: iptables -D INPUT -s gluster01.lab.eng.tlv2.redhat.com -j DROP).

Result: for a while we still see these errors, but then the agent service is restarted and the HE VM is Up and running.

Comment 10 Sandro Bonazzola 2019-02-13 07:44:59 UTC
This bugzilla is included in the oVirt 4.3.0 release, published on February 4th, 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 11 Robert McSwain 2019-04-11 19:57:00 UTC
Is there any understanding of how this can be worked around when restarting the gluster nodes and bringing down the agent/broker doesn't help, and agent.log still shows the following error?

RequestError: Failed to start monitor ping, options {'addr': 'GA.TE.WA.Y'}: [Errno 2] No such file or directory

Comment 12 Simone Tiraboschi 2019-04-11 21:11:09 UTC
That error in agent.log simply means that the agent failed to communicate with the broker over a unix domain socket, because the socket is closed on the broker side.
This means that you have to check broker.log to understand why the broker is failing (99% of the time it's a storage issue), which is the root cause of this.
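
A minimal sketch of that failure mode in Python (the socket path below is an assumption for illustration; the real path depends on the installed ovirt-hosted-engine-ha version): connecting to a unix domain socket whose file does not exist fails with ENOENT, which is exactly the "[Errno 2] No such file or directory" wrapped into the RequestError above.

    import errno
    import socket

    # Assumed path for illustration only; check your installation for the
    # actual broker socket location.
    BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"

    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(BROKER_SOCKET)
    except socket.error as e:
        if e.errno == errno.ENOENT:
            # Broker is down, or it never created its socket: this is the
            # "[Errno 2] No such file or directory" seen in agent.log.
            print("broker socket missing: %s" % e)
        else:
            raise
    finally:
        s.close()

If connect() fails this way, the fix lies on the broker side (start with broker.log and the storage underneath it), not in the agent's configuration.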