Bug 1676695
| Summary: | "MOM not available, KSM stats will be missing" message on random hosts, after updating to ovirt-4.3.0 | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Sergey <serg> |
| Component: | Core | Assignee: | Michal Skrivanek <michal.skrivanek> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Polina <pagranat> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.30.8 | CC: | bugs, fromani, guillaume.pavese, michal.skrivanek, msivak, rbarry, serg |
| Target Milestone: | ovirt-4.3.1 | Flags: | rule-engine: ovirt-4.3+ |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | v4.30.9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-03-13 16:40:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1649328 | | |
Description (Sergey, 2019-02-12 21:40:56 UTC)
Checked on our test environment running as a nested VM, and I can confirm the same errors. It was initially installed as 4.2.7, updated to 4.2.8, and finally updated to 4.3.0.

Looks like I've found the cause of this bug. Today I added a second host to a second cluster, where the first host was working normally, without MOM errors, and the new host started to throw MOM errors. I also reinstalled the host mentioned earlier in the first cluster a second time (needed to clear "Host needs to be reinstalled" after mistakenly changing "VNC encryption" on that cluster) and... no MOM errors on this host. After some investigation I found that vdsm.conf differs slightly between working and non-working hosts.

On a working host:

==
[vars]
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1
ssl = true
ssl_ciphers = HIGH

[addresses]
management_port = 54321
==

On a non-working host:

==
[vars]
ssl = true
ssl_ciphers = HIGH
ssl_excludes = OP_NO_TLSv1,OP_NO_TLSv1_1

[addresses]
management_port = 54321
==

As you can see, the ssl_excludes parameter comes first in the working config. I doubted that parameter position could cause such behavior, but decided to give it a try: I picked a host with high memory load, moved the "ssl_excludes" parameter before the "ssl" parameter, restarted VDSM, and it immediately started to report "Memory Page Sharing: Active, Shared Memory: 38%", with no more MOM errors in vdsm.log. I tried another host with the same result: stats working, no errors, MOM policy syncing. I don't know how parameter order can break MOM connectivity, but it does. I also want to note that the order is generated randomly: two newly installed hosts have different orders, and another host that was reinstalled twice had a different order each time.

This is probably a race condition. When the mom-vdsm service is started, it takes a short time to initialize before creating the UNIX domain socket for vdsm. When vdsmd is started, it tries to connect to the socket, and if that fails, it never tries again. Restarting the vdsmd service when mom-vdsm is already running should fix the connection issue.

Can we change MOM to use a systemd unit with e.g. Type=notify and let systemd know when it's ready? Or use socket activation in vdsm?

Thanks, I tried restarting vdsm on 2 hosts, and it looks like you are right: the log messages are gone and stats are available. But before 4.3.0 it always worked correctly after reboot, and now in most cases it does not. Are there any plans to implement reconnect, or to force vdsm to start only after momd?

MOM has a strict dependency on VDSM, so we can't change the ordering. VDSM does not need MOM to work, but some features are disabled when it is not there. VDSM used to report MOM as missing once or twice after boot, but then it got the connection and all worked well. If that changed in VDSM, it is actually a bug. MOM can be restarted and VDSM should handle that gracefully.

@Michal: We could implement the notify mechanism now that all users have systemd, but it won't help you, because VDSM has to be up for MOM to start working properly in oVirt. Socket activation... maybe, but it would add a systemd dependency to MOM, as you need to use the systemd API to take over the socket, IIRC.

Well, I do not see any reconnect logic in vdsm. It tries once at clientIF initialization. It's possible that what you've seen relied on vdsmd restarts or something like that. IIUC it's not about MOM not working; it's about vdsm not establishing a connection to MOM because MOM is still starting up and the socket is not ready when vdsm gets to it. And it never tries again.
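The race described above (vdsmd attempting the MOM UNIX socket exactly once, before mom-vdsm has finished creating it) is the kind of problem a simple retry loop addresses. The following is a minimal sketch of that idea in Python; it is not vdsm's actual clientIF code, and the socket path and retry parameters are illustrative assumptions.

```python
import socket
import time

# Path and retry parameters are assumptions for illustration, not vdsm defaults.
MOM_SOCKET = "/run/vdsm/mom-vdsm.sock"


def connect_to_mom(path=MOM_SOCKET, retries=10, delay=3.0):
    """Try to reach MOM's UNIX socket several times instead of giving up
    after the first failure (the single-attempt behaviour described above)."""
    for attempt in range(1, retries + 1):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock  # connected; the caller owns the socket from here on
        except OSError:
            sock.close()
            if attempt == retries:
                raise  # MOM never came up; propagate the last error
            time.sleep(delay)  # MOM may still be initializing; try again
```

Manually restarting vdsmd once mom-vdsm is up, as suggested in the thread, has the same effect as this loop: the connection attempt simply happens again after the socket exists.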
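For the Type=notify suggestion, the key point is that the service creates its listening socket before signalling readiness, so anything ordered after it always finds the socket present. Below is a minimal sketch under the assumption that the python-systemd bindings (systemd.daemon) are available and the unit uses Type=notify; this is not MOM's actual code, and the socket path is made up for the example.

```python
import os
import socket

from systemd import daemon  # python-systemd bindings; assumed to be installed

SOCKET_PATH = "/run/mom-demo.sock"  # illustrative path, not the real MOM socket


def main():
    # Remove a stale socket file from a previous run, then bind and listen
    # *before* telling systemd we are ready.
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)
    listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    listener.bind(SOCKET_PATH)
    listener.listen(1)

    # With Type=notify in the unit file, systemd keeps the service in the
    # "activating" state (and holds units ordered after it) until READY=1.
    daemon.notify("READY=1")

    # ... accept and serve connections here ...


if __name__ == "__main__":
    main()
```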
IMHO, delaying MOM start (by having systemd wait on explicit notification) would solve this problem well; reconnect logic could also work, as suggested in the patch.

Verified on ovirt-engine-4.3.2-0.1.el7.noarch, vdsm-4.30.10-1.el7ev.x86_64, and qemu-guest-agent-2.12.0-2.el7.x86_64. However, memory ballooning doesn't work with qemu-guest-agent-2.12.0-63.module+el8+2833+c7d6d092.x86_64; a separate bug will be filed.

This bugzilla is included in the oVirt 4.3.1 release, published on February 28th 2019. Since the problem described in this bug report should be resolved in the oVirt 4.3.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.