Bug 1676695 - "MOM not available, KSM stats will be missing" message on random hosts, after updating to ovirt-4.3.0
Summary: "MOM not available, KSM stats will be missing" message on random hosts, after...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.30.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ovirt-4.3.1
Assignee: Michal Skrivanek
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks: 1649328
 
Reported: 2019-02-12 21:40 UTC by Sergey
Modified: 2019-03-13 16:40 UTC (History)
CC List: 7 users

Fixed In Version: v4.30.9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-13 16:40:31 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.3+




Links
System ID Status Summary Last Updated
oVirt gerrit 97765 MERGED mom: change connect logic to MOM 2021-01-27 11:53:09 UTC

Description Sergey 2019-02-12 21:40:56 UTC
Description of problem:
We updated our oVirt infra from 4.2.8 to the latest 4.3.0 release and realized that no KSM statistics are available in the oVirt UI; it always shows, for all hosts:
Shared Memory: 0%
Memory Page Sharing: Inactive
It was working just before the update, and in fact KSM is working and sharing a lot of memory.
Every host has repeating errors in vdsm.log:
2019-02-13 00:01:02,363+0300 WARN  (jsonrpc/7) [throttled] MOM not available. (throttledlog:104)
2019-02-13 00:01:02,364+0300 WARN  (jsonrpc/7) [throttled] MOM not available, KSM stats will be missing. (throttledlog:104)

mom-vdsm is running and working with the old policy, and there are no strange errors in mom.log.
When activating a host, or trying to "Sync MoM policy" on a host, there is an error in vdsm.log:
2019-02-13 00:26:19,826+0300 WARN  (jsonrpc/6) [MOM] MOM not available, Policy could not be set. (momIF:111)
2019-02-13 00:26:19,826+0300 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.setMOMPolicyParameters succeeded in 0.00 seconds (__init__:312)

I've tried to reinstall one host from the UI, but I still get the same errors.
Also, today I added another cluster and deployed a fresh host. It logged the "MOM not available" errors only once after reboot, with no additional errors. I don't know if KSM is really working there, because the host has low memory usage, but it looks like vdsm is communicating correctly with momd on this host.

All hosts are CentOS 7.6, not oVirt Nodes.
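For reference, KSM activity can be confirmed directly from sysfs, independently of vdsm/MOM. A minimal check in Python, using the standard kernel KSM counters under /sys/kernel/mm/ksm:
==
# Confirm that KSM itself is running and sharing pages, regardless of
# whether vdsm can reach MOM. Paths are the standard kernel KSM sysfs files.
from pathlib import Path

KSM = Path("/sys/kernel/mm/ksm")

def counter(name):
    return int((KSM / name).read_text())

run = counter("run")                  # 1 = KSM daemon is scanning
shared = counter("pages_shared")      # KSM pages backing duplicates
sharing = counter("pages_sharing")    # guest pages deduplicated onto them

print("run=%d pages_shared=%d pages_sharing=%d" % (run, shared, sharing))
if run == 1 and sharing > 0:
    print("KSM is active and sharing memory")
==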

Version-Release number of selected component (if applicable):
vdsm.x86_64         4.30.8-1.el7
mom.noarch          0.5.12-1.el7.centos
ovirt-host.x86_64   4.3.0-2.el7


Comment 1 Sergey 2019-02-13 14:47:50 UTC
Checked on our test environment running as nested virtualization, and I can confirm the same errors. It was initially installed as 4.2.7, updated to 4.2.8, and finally updated to 4.3.0.

Comment 2 Sergey 2019-02-13 16:46:11 UTC
It looks like I've found the cause of this bug. Today I added a second host in the second cluster, where the first host was working normally, without MOM errors, and the new host started throwing MOM errors.
I also reinstalled the same host mentioned earlier in the first cluster a second time (it was needed to clear "Host needs to be reinstalled" after mistakenly changing "VNC encryption" on this cluster) and... no MOM errors on this host.

After some investigation I found that vdsm.conf differs slightly between working and non-working hosts.

On working:
==
[vars]
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
ssl = true
ssl_ciphers = HIGH

[addresses]
management_port = 54321
==
On non-working:
==
[vars]
ssl = true
ssl_ciphers = HIGH
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1

[addresses]
management_port = 54321
==

As you can see, the ssl_excludes param comes first in the working config. I doubted that parameter position could lead to such behavior, but decided to give it a try.
I picked a host with high memory load, moved the "ssl_excludes" param before the "ssl" param, and restarted VDSM, and it immediately started to report
Memory Page Sharing: Active
Shared Memory: 38%
with no more MOM errors in vdsm.log. So I tried it on another host, with the same results: stats working, no errors, MoM policy syncing.

I don't know how parameter order can break MOM connectivity, but it does.
I also want to note that the order is generated randomly: two newly installed hosts have different orders, and another host that was reinstalled twice had a different order each time.
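For what it's worth, key order within an INI section should be invisible to Python's configparser (which vdsm presumably uses to read vdsm.conf); a quick check with the two configs above:
==
# Key order inside an INI section is irrelevant to Python's configparser:
# both orderings from the comment above parse to identical values.
from configparser import ConfigParser

working = """
[vars]
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
ssl = true
ssl_ciphers = HIGH
"""

non_working = """
[vars]
ssl = true
ssl_ciphers = HIGH
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
"""

a, b = ConfigParser(), ConfigParser()
a.read_string(working)
b.read_string(non_working)
assert dict(a["vars"]) == dict(b["vars"])
print("identical parsed values regardless of key order")
==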

Comment 3 Andrej Krejcir 2019-02-14 09:08:42 UTC
This is probably a race condition. When the mom-vdsm service is started, it takes a short time to initialize before creating the UNIX domain socket for vdsm. When vdsmd is started, it tries to connect to the socket, and if it fails, it never tries again.

Restarting the vdsmd service when mom-vdsm is already running should fix the connection issue.
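A minimal sketch of what retry-on-connect could look like; the socket path and retry policy are illustrative assumptions, not vdsm's actual code:
==
# Illustrative retry loop for connecting to a UNIX domain socket that may
# not exist yet. The path and timing are assumptions, not vdsm's real code.
import socket
import time

MOM_SOCKET = "/run/vdsm/mom-vdsm.sock"    # hypothetical path

def connect_with_retries(path, attempts=10, delay=1.0):
    for _ in range(attempts):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock                   # MOM is up, connection established
        except OSError:
            sock.close()
            time.sleep(delay)             # MOM may still be initializing
    return None                           # give up: "MOM not available"

conn = connect_with_retries(MOM_SOCKET)
print("connected" if conn else "MOM not available")
==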

Comment 4 Michal Skrivanek 2019-02-14 09:29:43 UTC
Can we change MOM to use a systemd unit with e.g. Type=notify and let systemd know when it's ready? Or use socket activation in vdsm?
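For context, a Type=notify unit is considered started only once the daemon sends READY=1 over the socket named in NOTIFY_SOCKET; a minimal sketch of that protocol in Python (python-systemd's systemd.daemon.notify() does the same):
==
# Minimal sd_notify: a daemon under a Type=notify unit reports readiness
# by writing READY=1 to the datagram socket systemd passes in NOTIFY_SOCKET.
import os
import socket

def sd_notify(state="READY=1"):
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False                      # not running under systemd notify
    if addr.startswith("@"):              # abstract socket namespace
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.sendto(state.encode(), addr)
    return True

# Called only after the daemon's own listening socket exists, so units
# ordered After= this service would not start until it is truly ready.
sd_notify()
==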

Comment 5 Sergey 2019-02-14 09:34:09 UTC
Thanks, I tried restarting vdsm on 2 hosts, and it looks like you are right: the log messages are gone and the stats are available. But before 4.3.0 it always worked correctly after reboot, and now in most cases it does not. Are there any plans to implement a reconnect, or to force vdsm to start only after momd?

Comment 6 Martin Sivák 2019-02-14 09:44:10 UTC
MOM has a strict dependency on VDSM, so we can't change the ordering. VDSM does not need MOM to work, but some features are disabled when it is not there. VDSM used to report MOM as missing once or twice after boot, but then it got a connection and all worked well. If that changed in VDSM, it is actually a bug. MOM can be restarted and VDSM should handle that gracefully.

@Michal: We could implement the notify mechanism now that all users have systemd, but it won't help you, because VDSM has to be up for MOM to start working properly in oVirt. Socket activation... uh, maybe; it would add a systemd dependency to MOM, as you need to use the systemd API to take over the socket, IIRC.
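On the socket-activation point: systemd binds the socket itself and hands it to the service as file descriptors starting at 3, advertised via the LISTEN_PID and LISTEN_FDS environment variables, so clients can connect before the daemon has finished starting. A generic sketch of adopting such a socket in Python (this is the standard protocol, not an existing MOM feature):
==
# Generic systemd socket-activation handoff: adopt the already-bound
# listening socket(s) that systemd passes as fd 3 and up. This is the
# standard protocol, not something MOM implements today.
import os
import socket

SD_LISTEN_FDS_START = 3

def listen_fds():
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []                         # fds are not intended for us
    count = int(os.environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))

fds = listen_fds()
if fds:
    server = socket.socket(fileno=fds[0]) # systemd already called bind/listen
    conn, _ = server.accept()             # a client could connect immediately
==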

Comment 7 Michal Skrivanek 2019-02-14 10:13:21 UTC
Well, I do not see any reconnect logic in vdsm. It tries once at clientIF initialization. It's possible that what you've seen relied on vdsmd restarts or something like that.
IIUC it's not about MOM not working; it's about vdsm not establishing a connection to MOM, because MOM is still starting up and the socket is not ready when vdsm gets to it. And it never tries again. IMHO, delaying MOM's start (by having systemd wait on explicit notification) would solve this problem.

Comment 8 Michal Skrivanek 2019-02-14 10:54:49 UTC
Well, changing the connect logic could also work, as suggested in the patch.

Comment 9 Polina 2019-03-12 11:40:00 UTC
Verified on ovirt-engine-4.3.2-0.1.el7.noarch, vdsm-4.30.10-1.el7ev.x86_64, and qemu-guest-agent-2.12.0-2.el7.x86_64.

However, memory ballooning doesn't work with qemu-guest-agent-2.12.0-63.module+el8+2833+c7d6d092.x86_64; a separate bug will be filed.

Comment 10 Sandro Bonazzola 2019-03-13 16:40:31 UTC
This bug is included in the oVirt 4.3.1 release, published on February 28th, 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

