Bug 1393012

Summary: when restarting vdsm, mom also restarting, and immediately failed.
Product: [oVirt] vdsm Reporter: Shira Maximov <mshira>
Component: ServicesAssignee: Martin Sivák <msivak>
Status: CLOSED CURRENTRELEASE QA Contact: Shira Maximov <mshira>
Severity: urgent Docs Contact:
Priority: high    
Version: 4.18.15CC: alukiano, bugs, gklein, mburman, mgoldboi, msivak
Target Milestone: ovirt-4.0.6Keywords: AutomationBlocker, Regression
Target Release: 4.18.18Flags: gklein: ovirt-4.0.z?
gklein: blocker?
gklein: planning_ack?
rule-engine: devel_ack+
rule-engine: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: VDSM takes some time to start and MOM is fast in trying to contact it. Consequence: MOM died with connection refused error. Fix: MOM is configured to try multiple times now (each 5 seconds, but no more than 10 times per two minutes). Exceeding the retry counter will cause MOM to shut down permanently as the issue is more substantial and needs sysadmins intervention. MOM can be re-enabled by using systemctl reset-failed mom-vdsm.service Result: MOM will start eventually. It might log one or two restarts.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-18 07:29:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Shira Maximov 2016-11-08 17:05:11 UTC
Description of problem:
when restarting vdsm, mom also restarting, and immediately failed.

[journalctl --unit vdsmd]
Nov 08 18:25:27 RHEL7.2Server systemd[1]: Stopping Virtual Desktop
Server Manager...
Nov 08 18:25:27 RHEL7.2Server vdsmd_init_common.sh[3053]: vdsm:
Running run_final_hooks
Nov 08 18:25:27 RHEL7.2Server systemd[1]: Starting Virtual Desktop
Server Manager...

[journalctl --unit mom-vdsm]
Nov 08 18:17:23 RHEL7.2Server systemd[1]: Starting MOM instance
configured for VDSM purposes...
Nov 08 18:25:16 RHEL7.2Server systemd[1]: Stopping MOM instance
configured for VDSM purposes...
Nov 08 18:25:29 RHEL7.2Server systemd[1]: Started MOM instance
configured for VDSM purposes.


But mom then immediately failed with:

2016-11-08 18:25:08,008 - mom.RPCServer - INFO - ping()
2016-11-08 18:25:08,010 - mom.RPCServer - INFO - getStatistics()
2016-11-08 18:25:17,028 - mom.RPCServer - INFO - RPC Server ending
2016-11-08 18:25:24,705 - mom.GuestManager - INFO - Guest Manager ending
2016-11-08 18:25:26,575 - mom.HostMonitor - INFO - Host Monitor ending

2016-11-08 18:25:29,869 - mom - INFO - MOM starting
2016-11-08 18:25:29,905 - mom.HostMonitor - INFO - Host Monitor starting
2016-11-08 18:25:29,905 - mom - INFO - hypervisor interface vdsmjsonrpcbulk
2016-11-08 18:25:30,029 - mom.vdsmInterface - ERROR - Cannot connect
to VDSM! [Errno 111] Connection refused
2016-11-08 18:25:30,030 - mom - ERROR - Failed to initialize MOM threads
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/__init__.py", line 29, in run
    hypervisor_iface = self.get_hypervisor_interface()
  File "/usr/lib/python2.7/site-packages/mom/__init__.py", line 217,
in get_hypervisor_interface
    return module.instance(self.config)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcbulkInterface.py",
line 47, in instance
    return JsonRpcVdsmBulkInterface()
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcbulkInterface.py",
line 29, in __init__
    super(JsonRpcVdsmBulkInterface, self).__init__()
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcInterface.py",
line 43, in __init__
    .orRaise(RuntimeError, 'No connection to VDSM.')
  File "/usr/lib/python2.7/site-packages/mom/optional.py", line 28, in orRaise
    raise exception(*args, **kwargs)
RuntimeError: No connection to VDSM.



Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.0.5.4-0.1.el7ev
vdsm-4.18.15-23.git9656d58.el7.centos.x86_64
mom-0.5.8-1.el7ev.noarch


How reproducible:


Steps to Reproduce:
1. install vdsm from master and mom-0.5.8-1.el7ev.noarch 
2. restart vdsmd service and look at mom logs
3.

Actual results:


Expected results:


Additional info:

Comment 1 Martin Sivák 2016-11-21 09:13:18 UTC
*** Bug 1396826 has been marked as a duplicate of this bug. ***

Comment 2 Martin Sivák 2016-11-21 11:55:15 UTC
I am moving this to VDSM component as that is where the fix will be published.

Comment 3 Martin Sivák 2016-11-23 08:50:40 UTC
*** Bug 1397268 has been marked as a duplicate of this bug. ***

Comment 4 Michael Burman 2016-11-28 06:55:35 UTC
I don't think this bug should be ON_QA, it actually should failed QA.

messages log still spammed with 'vdsm MOM WARN MOM not available, KSM stats will be missing.' on vdsm-4.18.17-1.el7ev.x86_64

Comment 5 Martin Sivák 2016-11-28 09:16:09 UTC
The fix will be part of 4.0.6. It is already merged too, but the automation is not smart enough to know that it is not present in the last weeks build..

Comment 6 Michael Burman 2016-12-11 09:40:24 UTC
I still see this bug on 

4.0.6.3-0.1.el7ev
vdsm-4.18.18-4.git198e48d.el7ev.x86_64

Dec 11 11:38:11 silver-vdsa journal: vdsm MOM WARNING MOM not available.
Dec 11 11:38:11 silver-vdsa journal: vdsm MOM WARNING MOM not available, KSM stats will be missing.
Dec 11 11:38:27 silver-vdsa journal: vdsm MOM WARNING MOM not available.
Dec 11 11:38:27 silver-vdsa journal: vdsm MOM WARNING MOM not available, KSM stats will be missing.
Dec 11 11:38:43 silver-vdsa journal: vdsm MOM WARNING MOM not available.


Don't think this report should be ON_QA

Comment 7 Martin Sivák 2016-12-12 09:06:57 UTC
Michael: Check /usr/lib/systemd/system/mom-vdsm.service, it should contain this:

https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=init/systemd/mom-vdsm.service.in;h=e83acb40fcfd12f78d27503e92f7e5e4068b731f;hb=e4be1485f2a905c247d935aed72b106026713478

The link will show you the content of the file we ship with 4.18.18 and it contains the proper fix.

So you either - haven't reloaded systemd, have an .rpmnew file or systemd override laying somewhere (/etc/systemd/system/mom-vdsm.*) or triggered the maximum failure count (So check /var/log/vdsm/mom.log and journalctl for mom-vdsm).

Comment 8 Michael Burman 2016-12-12 09:17:26 UTC
Martin,

cat /usr/lib/systemd/system/mom-vdsm.service 
[Unit]
Description=MOM instance configured for VDSM purposes
Requires=vdsmd.service
After=vdsmd.service
Conflicts=momd.service ksmtuned.service

[Service]
Type=simple
LimitCORE=infinity
ExecStart=/usr/sbin/momd -c /etc/vdsm/mom.conf
Restart=always
RestartSec=5
User=vdsm
Group=kvm
TimeoutStopSec=10
StartLimitBurst=10
StartLimitInterval=120

[Install]
WantedBy=multi-user.target

Looks like my vdsm vdsm-4.18.18-4.git198e48d.el7ev.x86_64 do contain the fix but messages log is still spammed with those errors. 

Do i need to do something after updating vdsm?

Comment 9 Martin Sivák 2016-12-12 09:44:21 UTC
Check whether mom is up or down:

systemctl status mom-vdsm.service and get the reason why it is down.

Check journalctl for mom to see why it was not restarted:

journalctl --unit mom-vdsm.service

Check mom log for errors:

cat /var/log/vdsm/mom.log

Check what happens when you try starting mom manually:

systemctl start mom-vdsm.service

If mom was not started because of triggering the burst limits, use the following to reset them:

systemctl reset-failed mom-vdsm.service

Comment 10 Martin Sivák 2016-12-12 09:45:03 UTC
Btw, you can also try systemctl daemon-reload to make sure systemd knows about the updated unit files.

Comment 11 Shira Maximov 2016-12-13 11:39:52 UTC
Try to verify on : 
vdsm-4.18.18-4.git198e48d.el7ev.x86_64
mom-0.5.8-1.el7ev.noarch

Verification steps: 
0. look at mom logs
1. restart Vdsm 
2. Check in mom logs again

immediately after upgrade vdsm,  mom is down with the same failure: 
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mom/__init__.py", line 29, in run
    hypervisor_iface = self.get_hypervisor_interface()
  File "/usr/lib/python2.7/site-packages/mom/__init__.py", line 217, in get_hypervisor_interface
    return module.instance(self.config)
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcbulkInterface.py", line 47, in instance
    return JsonRpcVdsmBulkInterface()
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcbulkInterface.py", line 29, in __init__
    super(JsonRpcVdsmBulkInterface, self).__init__()
  File "/usr/lib/python2.7/site-packages/mom/HypervisorInterfaces/vdsmjsonrpcInterface.py", line 43, in __init__
    .orRaise(RuntimeError, 'No connection to VDSM.')
  File "/usr/lib/python2.7/site-packages/mom/optional.py", line 28, in orRaise
    raise exception(*args, **kwargs)
RuntimeError: No connection to VDSM.


But after restarting vdsmd service, mom is running OK. 

I'm not sure that this is valid behaviour, Martin is this OK that mom didn't work after upgrading vdsm?

Comment 12 Shira Maximov 2016-12-13 15:32:30 UTC
Verifying this bug, 
msivak check and this is a valid behavior.