Description of problem:

Updated our oVirt infra to the latest 4.3.0 release from 4.2.8 and realized that no KSM statistics are available in the oVirt UI; all hosts always show:

Shared Memory: 0%
Memory Page Sharing: Inactive

It was working right before the update, and in fact KSM is working and sharing a lot of memory. Every host has repeating errors in vdsm.log:

2019-02-13 00:01:02,363+0300 WARN (jsonrpc/7) [throttled] MOM not available. (throttledlog:104)
2019-02-13 00:01:02,364+0300 WARN (jsonrpc/7) [throttled] MOM not available, KSM stats will be missing. (throttledlog:104)

mom-vdsm is running and working with the old policy; no strange errors in mom.log. When activating a host, or when trying to "Sync MoM policy" on a host, there is an error in vdsm.log:

2019-02-13 00:26:19,826+0300 WARN (jsonrpc/6) [MOM] MOM not available, Policy could not be set. (momIF:111)
2019-02-13 00:26:19,826+0300 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.setMOMPolicyParameters succeeded in 0.00 seconds (__init__:312)

I've tried to reinstall one host from the UI, but it still shows the same errors. Also, today I added another cluster and deployed a fresh host; it has "MOM not available" errors only once after reboot and no further errors. I don't know whether KSM is really working there, because the host has low memory usage, but it looks like vdsm is communicating correctly with momd on this host.

All hosts are CentOS 7.6, not ovirt-nodes.

Version-Release number of selected component (if applicable):
vdsm.x86_64 4.30.8-1.el7
mom.noarch 0.5.12-1.el7.centos
ovirt-host.x86_64 4.3.0-2.el7

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
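For reference, a minimal Python sketch (assuming the standard kernel KSM sysfs interface and a 4 KiB page size) to check on the host itself whether KSM is actually sharing memory, independent of what the UI reports:

==
# Minimal sketch: read KSM activity straight from sysfs on a host.
# Assumes the standard /sys/kernel/mm/ksm interface and 4 KiB pages.
import os

KSM_DIR = "/sys/kernel/mm/ksm"

def read_int(name):
    with open(os.path.join(KSM_DIR, name)) as f:
        return int(f.read().strip())

run = read_int("run")                # 1 means ksmd is scanning/merging
sharing = read_int("pages_sharing")  # pages currently deduplicated
shared = read_int("pages_shared")    # unique pages backing them
print("ksmd running: %s" % (run == 1))
print("pages_sharing=%d pages_shared=%d (~%.1f MiB saved)"
      % (sharing, shared, sharing * 4096.0 / (1024 * 1024)))
==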
Checked on our test environment running as nested virtualization, and I can confirm the same errors. It was initially installed as 4.2.7, updated to 4.2.8, and finally updated to 4.3.0.
Looks like I've found the cause of this bug. Today I added a second host to the second cluster, where the first host was working normally without MOM errors, and the new host started to throw MOM errors. I also reinstalled the host mentioned earlier in the first cluster a second time (needed to clear "Host needs to be reinstalled" after mistakenly changing "VNC encryption" on that cluster) and... no MOM errors on this host.

After some investigation I found that vdsm.conf differs slightly between working and non-working hosts.

On a working host:
==
[vars]
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
ssl = true
ssl_ciphers = HIGH

[addresses]
management_port = 54321
==

On a non-working host:
==
[vars]
ssl = true
ssl_ciphers = HIGH
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1

[addresses]
management_port = 54321
==

As you can see, the ssl_excludes parameter comes first in the working config. I doubted that parameter position could lead to such behavior, but decided to give it a try. I picked a host with high memory load, moved the "ssl_excludes" parameter before the "ssl" parameter, restarted VDSM, and it immediately started to report:

Memory Page Sharing: Active
Shared Memory: 38%

and there are no more MOM errors in vdsm.log. I tried it on another host with the same results: stats working, no errors, MoM policy syncing. I don't know how parameter order can break MOM connectivity, but it does. I also want to note that the order is generated randomly, because two newly installed hosts have a different order, and another host that was reinstalled twice has a different order each time.
This is probably a race condition. When the mom-vdsm service is started, it takes a short time to initialize before creating the UNIX domain socket for vdsm. When vdsmd is started, it tries to connect to the socket and, if that fails, it never tries again. Restarting the vdsmd service while mom-vdsm is already running should fix the connection issue.
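To illustrate the failure mode described above, here is a minimal sketch (the socket path is an assumption, not verified against mom.conf) of a single connect attempt to MOM's UNIX domain socket, which is roughly what happens when vdsmd starts before mom-vdsm has created the socket:

==
# Sketch of the one-shot connect that loses the race; not actual vdsm code.
import socket

MOM_SOCKET = "/var/run/vdsm/mom-vdsm.sock"  # assumed path; check mom.conf

def connect_once(path):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)       # fails if mom-vdsm has not created the socket yet
        return s
    except socket.error:
        s.close()
        return None           # never retried, so MOM stays "not available"

if connect_once(MOM_SOCKET) is None:
    print("MOM not available, KSM stats will be missing.")
==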
Can we change MOM to use a systemd unit with e.g. Type=notify and let systemd know when it's ready? Or use socket activation in vdsm?
Thanks, I tried restarting vdsm on 2 hosts and it looks like you are right: the log messages are gone and the stats are available. But before 4.3.0 it always worked correctly after reboot, and now in most cases it does not. Are there any plans to implement a reconnect, or to force vdsm to start only after momd?
MOM has a strict dependency on VDSM, so we can't change the ordering. VDSM does not need MOM to work, but some features are disabled when it is not there. VDSM used to report that MOM is missing once or twice after boot, but then it got a connection and all worked well. If that changed in VDSM, it is actually a bug. MOM can be restarted and VDSM should handle that gracefully.

@Michal: We could implement the notify mechanism now that all users have systemd, but it won't help here, because VDSM has to be up for MOM to start working properly in oVirt. Socket activation... maybe; it would add a systemd dependency to MOM, as you need to use the systemd API to take over the socket, IIRC.
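As an illustration of the notify idea (not MOM's actual code), a minimal sketch of the systemd notify protocol a Type=notify unit would use; the service would send READY=1 only after its vdsm-facing socket is listening:

==
# Sketch of the systemd notify protocol for a Type=notify unit; illustration only.
import os
import socket

def sd_notify_ready():
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return                      # not started by systemd with Type=notify
    if addr.startswith("@"):
        addr = "\0" + addr[1:]      # abstract namespace socket
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.sendto(b"READY=1", addr)
    finally:
        sock.close()

# mom-vdsm would call sd_notify_ready() once its socket is listening, so systemd
# would only consider the unit started when the socket actually exists.
==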
Well, I do not see any reconnect logic in vdsm. It tries once at clientIF initialization. It's possible that what you've seen relied on vdsmd restarts or something like that. IIUC it's not about MOM not working, it's about vdsm not establishing a connection to MOM because MOM is still starting up and the socket is not ready when vdsm gets to it. And it never tries again. IMHO delaying MOM start (by having systemd wait on an explicit notification) would solve this problem.
Well, reconnect logic could also work, as suggested in the patch.
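For illustration only (I haven't looked at the actual patch), reconnect logic on the vdsm side could look roughly like this; the socket path and retry policy are assumptions:

==
# Sketch of connect-with-retry instead of a single attempt; path and retry
# policy are assumptions, not taken from the patch.
import socket
import time

def connect_with_retry(path, attempts=10, delay=1.0):
    last_err = None
    for _ in range(attempts):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            s.connect(path)
            return s                 # mom-vdsm is up; stats and policy sync can work
        except socket.error as e:
            last_err = e
            s.close()
            time.sleep(delay)        # give mom-vdsm time to create its socket
    raise last_err

# e.g. conn = connect_with_retry("/var/run/vdsm/mom-vdsm.sock")
==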
Verified on ovirt-engine-4.3.2-0.1.el7.noarch, vdsm-4.30.10-1.el7ev.x86_64, and qemu-guest-agent-2.12.0-2.el7.x86_64. However, memory ballooning doesn't work with qemu-guest-agent-2.12.0-63.module+el8+2833+c7d6d092.x86_64; a separate bug will be filed.
This bugzilla is included in the oVirt 4.3.1 release, published on February 28th 2019. Since the problem described in this bug report should be resolved in the oVirt 4.3.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.