Description of problem:

MainThread::ERROR::2018-10-16 15:58:41,492::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2018-10-16 15:58:41,492::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2018-10-16 15:58:51,881::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.18 started
MainThread::INFO::2018-10-16 15:58:51,907::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: tendrl25.lab.eng.blr.redhat.com
MainThread::INFO::2018-10-16 15:58:51,998::hosted_engine::522::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2018-10-16 15:58:52,000::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '10.70.43.254'}
MainThread::ERROR::2018-10-16 15:58:52,000::hosted_engine::538::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
MainThread::ERROR::2018-10-16 15:58:52,000::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 535, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 83, in start_monitor
    .format(type, options, e))
RequestError: Failed to start monitor ping, options {'addr': '10.70.43.254'}: [Errno 2] No such file or directory

Version-Release number of selected component (if applicable):
4.2.7

How reproducible:

Steps to Reproduce:
1. Reboot one of the nodes in a 3-node HC deployment
2. Move a 2nd node to maintenance and activate it again (without stopping gluster services)
3. The HE VM is not available, and ovirt-ha-agent fails to start

Actual results:
The ovirt-ha-agent service fails with the above error in its logs.

Expected results:

Additional info:
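For context on the "[Errno 2]" part of that RequestError: it is the errno a unix-domain socket connect() returns when the socket file it targets does not exist. A minimal sketch in plain Python (not the actual ovirt_hosted_engine_ha code; the path below is made up purely for illustration):

# Minimal sketch (not ovirt code): connecting to a unix-domain socket path
# that does not exist fails with errno 2, which is what the RequestError
# above wraps when the broker socket is gone.
import socket

def try_connect(path):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return "connected"
    except socket.error as e:
        # On a missing socket file this returns:
        # failed: [Errno 2] No such file or directory
        return "failed: %s" % e
    finally:
        s.close()

# Hypothetical, nonexistent path used only for illustration:
print(try_connect("/var/run/some-missing-broker.socket"))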
I don't have exact steps to reproduce; however, I noticed this when one of the nodes in the HC environment goes down.
Created attachment 1498874 [details] Broker logs node2
Created attachment 1498875 [details] agent logs node2
Hello Team,

I have observed a significant increase in memory usage on all the nodes because the agent and broker are consistently failing. Everything was working fine before the upgrade.

Setup details:
Hyperconverged - 3 gluster nodes plus 2 independent nodes. 32GB memory and 24 cores available on each node. VMs running on node2 = 0

Packages:

rpm -qa | grep glusterfs
glusterfs-api-3.12.14-1.el7.x86_64
glusterfs-fuse-3.12.14-1.el7.x86_64
glusterfs-rdma-3.12.14-1.el7.x86_64
glusterfs-events-3.12.14-1.el7.x86_64
glusterfs-cli-3.12.14-1.el7.x86_64
glusterfs-server-3.12.14-1.el7.x86_64
glusterfs-libs-3.12.14-1.el7.x86_64
glusterfs-3.12.14-1.el7.x86_64
glusterfs-client-xlators-3.12.14-1.el7.x86_64
glusterfs-gnfs-3.12.14-1.el7.x86_64
glusterfs-geo-replication-3.12.14-1.el7.x86_64

rpm -qa | grep ovirt
ovirt-host-dependencies-4.2.3-1.el7.x86_64
cockpit-ovirt-dashboard-0.11.33-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-node-ng-nodectl-4.2.0-0.20181003.0.el7.noarch
cockpit-machines-ovirt-176-2.el7.centos.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-hosted-engine-setup-2.2.26-1.el7.noarch
ovirt-node-ng-image-update-placeholder-4.2.6.2-1.el7.noarch
ovirt-setup-lib-1.1.5-1.el7.noarch
ovirt-release42-4.2.6.2-1.el7.noarch
ovirt-imageio-common-1.4.4-0.el7.x86_64
python-ovirt-engine-sdk4-4.2.8-2.el7.x86_64
ovirt-imageio-daemon-1.4.4-0.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
ovirt-node-ng-image-update-4.2.6.2-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.14-1.el7.noarch
ovirt-release-host-node-4.2.6.2-1.el7.noarch

I think we should increase the priority of this bug.
Created attachment 1498889 [details] memory usage
Created attachment 1498890 [details] GUI - host screenshot
(In reply to Abhishek Sahni from comment #5)
> Created attachment 1498889 [details]
> memory usage

Not really useful; can you please attach a sos report?
Hello Sandro, thanks for your response. It seems there was an unknown lock on one of the nodes, namely node4, an independent node (not part of the gluster setup). I had already tried restarting all three HC gluster nodes (node1, node2, node3), and since I had no other workaround I restarted node4 as well. For me, the issue was resolved at that moment.
Hi,

Verification was done with:
ovirt-engine-4.3.0-0.6.alpha2.el7.noarch
vdsm-4.30.4-1.el7ev.x86_64
vdsm-gluster-4.30.4-1.el7ev.x86_64
ovirt-hosted-engine-setup-2.3.0-0.1.alpha.gitd1b8dbb.el7ev.noarch

Here is the exact scenario.

Preconditions:
1. There are three active hosts.
2. The HE VM storage is on gluster.
3. The HE VM runs on host1.
4. One guest VM runs on host1.

Steps:
1. Reboot host2.
2. Move host3 to maintenance. Everything is OK - the HE VM is up and the guest VM is up as well.
3. ssh to host1 and block access to the gluster storage services (replica 3) with an iptables DROP rule (e.g. iptables -I INPUT -s gluster01.lab.eng.tlv2.redhat.com -j DROP). At this point we see monitoring errors in agent.log; ovirt-ha-broker.service is active and ovirt-ha-agent.service is down, which can be considered OK in this state.

MainThread::ERROR::2018-12-20 10:28:58,474::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 537, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 83, in start_monitor
    .format(type, options, e))
RequestError: Failed to start monitor ping, options {'addr': '10.46.17.254'}: [Errno 2] No such file or directory
MainThread::ERROR::2018-12-20 10:28:58,474::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2018-12-20 10:28:58,475::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

systemctl status ovirt-ha-broker
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-12-20 10:23:48 IST; 7min ago
 Main PID: 116965 (ovirt-ha-broker)
    Tasks: 2
   Memory: 19.8M
   CGroup: /system.slice/ovirt-ha-broker.service
           └─116965 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Dec 20 10:23:48 lynx12.lab.eng.tlv2.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.

[root@lynx12 ovirt-hosted-engine-ha]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-12-20 10:31:18 IST; 9s ago
  Process: 119251 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 119251 (code=exited, status=157)

Dec 20 10:31:18 lynx12.lab.eng.tlv2.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Dec 20 10:31:18 lynx12.lab.eng.tlv2.redhat.com systemd[1]: ovirt-ha-agent.service failed.

4. Restore access to the gluster storage.

Result: for a while we still see these errors, but then the agent service is restarted and the HE VM is up and running.
This bugzilla is included in the oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in the oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
Is there any understanding of how this can be worked around when restarting the gluster nodes and bringing down the agent/broker doesn't help, and agent.log still shows the error: RequestError: Failed to start monitor ping, options {'addr': 'GA.TE.WA.Y'}: [Errno 2] No such file or directory?
That error in agent.log simply means that the agent failed to communicate with the broker over its unix domain socket, because the socket is closed on the broker side. This means you have to check broker.log to understand why the broker is failing (99% of the time it's a storage issue), which is the root cause of this.
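To illustrate that, here is a minimal diagnostic sketch, assuming the broker socket lives under /var/run/ovirt-hosted-engine-ha/ and broker.log under /var/log/ovirt-hosted-engine-ha/ (both paths are assumptions and may differ per installation/version). It checks whether anything is listening on the broker socket and, if not, pulls the most recent ERROR lines from broker.log, which is where the real root cause will be:

# Minimal diagnostic sketch; paths below are assumptions, adjust to your setup.
import os
import socket

BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"  # assumed path
BROKER_LOG = "/var/log/ovirt-hosted-engine-ha/broker.log"        # assumed path

def broker_reachable(path):
    # True if something is listening on the broker's unix socket.
    if not os.path.exists(path):
        return False
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return True
    except socket.error:
        return False
    finally:
        s.close()

def last_errors(log_path, limit=20):
    # Last few ERROR lines from broker.log - the place to look for the root cause.
    try:
        with open(log_path) as f:
            errors = [line.rstrip() for line in f if "::ERROR::" in line]
        return errors[-limit:]
    except IOError:
        return []

if __name__ == "__main__":
    if broker_reachable(BROKER_SOCKET):
        print("broker socket is up; the agent should be able to reconnect")
    else:
        print("broker socket is down or missing; recent broker errors:")
        for line in last_errors(BROKER_LOG):
            print(line)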