Bug 1583228
| Summary: | [downstream clone - 4.2.4] After updating to current RHV-H, vdsmd consistently fails to start on startup. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
| Component: | vdsm | Assignee: | Piotr Kliczewski <pkliczew> |
| Status: | CLOSED ERRATA | QA Contact: | Pavol Brilla <pbrilla> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.9 | CC: | adevolder, aperotti, cshao, daniel-oliveira, danken, dfediuck, huzhao, lsurette, lveyde, mjankula, mkalinin, mperina, nsoffer, pstehlik, qiyuan, rbarry, sbonazzo, srevivo, weiwang, yaniwang, ycui, ykaul, ylavi, yturgema, yzhao |
| Target Milestone: | ovirt-4.2.4 | Keywords: | ZStream |
| Target Release: | --- | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | vdsm v4.20.30 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1534197 | Environment: | |
| Last Closed: | 2018-06-27 10:02:46 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1534197 | | |
| Bug Blocks: | | | |
Description
RHV bug bot
2018-05-28 13:53:52 UTC
Please provide vdsm.log, supervdsm.log, and the output of "systemctl status vdsmd.service". These upgrade paths are heavily exercised, and I suspect something in the environment, since this has not been seen before. In particular, RHVH looks at files in /etc and copies any that have been modified to the new layer. I would guess that a modification to some configuration file made in 4.2.5 is being carried across.

(Originally by Ryan Barry)

QE did not reproduce this issue.

Test version:
From: rhvh-4.1.6 (redhat-virtualization-host-4.1-20170914.1)
To: rhvh-4.1.7 (redhat-virtualization-host-4.1-20171101.0)

Test steps:
1. Install redhat-virtualization-host-4.1-20170914.1, then add the host to RHVM.
2. Check that vdsmd.service is active.
3. Set up local repos on the host, then upgrade the host to redhat-virtualization-host-4.1-20171101.0 from the RHVM side.
4. After the host upgrade and reboot, check that vdsmd.service is still active.

Test results:
After step 4, vdsmd.service is active on startup after the upgrade, and the host is up on the RHVM side.
```
# service vdsmd status
Redirecting to /bin/systemctl status vdsmd.service
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2018-02-05 02:40:16 EST; 5min ago
  Process: 2237 ExecStartPre=/usr/libexec/vdsm/vdsmd_init_common.sh --pre-start (code=exited, status=0/SUCCESS)
 Main PID: 2791 (vdsm)
   CGroup: /system.slice/vdsmd.service
           └─2791 /usr/bin/python2 /usr/share/vdsm/vdsm

Feb 05 02:40:14 dhcp-8-202.nay.redhat.com vdsmd_init_common.sh[2237]: vdsm: Running prepare_transient_repository
Feb 05 02:40:15 dhcp-8-202.nay.redhat.com vdsmd_init_common.sh[2237]: vdsm: Running syslog_available
Feb 05 02:40:15 dhcp-8-202.nay.redhat.com vdsmd_init_common.sh[2237]: vdsm: Running nwfilter
Feb 05 02:40:15 dhcp-8-202.nay.redhat.com vdsmd_init_common.sh[2237]: vdsm: Running dummybr
Feb 05 02:40:16 dhcp-8-202.nay.redhat.com vdsmd_init_common.sh[2237]: vdsm: Running tune_system
Feb 05 02:40:16 dhcp-8-202.nay.redhat.com vdsmd_init_common.sh[2237]: vdsm: Running test_space
Feb 05 02:40:16 dhcp-8-202.nay.redhat.com vdsmd_init_common.sh[2237]: vdsm: Running test_lo
Feb 05 02:40:16 dhcp-8-202.nay.redhat.com systemd[1]: Started Virtual Desktop Server Manager.
Feb 05 02:40:18 dhcp-8-202.nay.redhat.com vdsm[2791]: vdsm throttled WARN MOM not available.
Feb 05 02:40:18 dhcp-8-202.nay.redhat.com vdsm[2791]: vdsm throttled WARN MOM not available, KSM stats will be missing.
```

(Originally by Huijuan Zhao)

Tested with other versions from comment 0; still cannot reproduce this issue.

Test versions:
From: rhvh-4.0.7.1 (redhat-virtualization-host-4.0-20170307.1)
To: rhvh-4.1.5.0 (redhat-virtualization-host-4.1-20170816.2)
To: rhvh-4.1.6.0 (redhat-virtualization-host-4.1-20170914.1)

(Originally by Huijuan Zhao)

It is not clear to me whether anything stopped vdsmd, as I do not see anything failing. There is also nothing I can see on the networking side, unless I am missing something here.
(Originally by edwardh)

The situation here essentially seems to be that vdsmd does not start. The service is enabled, and supervdsmd starts, but nothing happens after that. Is there some other step in the vdsm startup order which would give output?

(Originally by Ryan Barry)

(In reply to Ryan Barry from comment #16)
> The situation here essentially seems to be that vdsmd does not start. The service is enabled, and supervdsmd starts, but nothing happens after that. Is there some other step in the vdsm startup order which would give output?

If there had been a problem, at minimum I would expect to see it in the messages log, but there was nothing there. Could you please point to the time when VDSM was expected to start but did not? Did someone start it manually afterwards?

(Originally by edwardh)

Martin, maybe your team can help us understand the odd behaviour of systemd? Except for the peculiar timing of the vdsm-network finish, I do not see how this is related to Network.

(Originally by danken)

Yaniv, could you please take a look?

(Originally by Martin Perina)

The boot is done around 18:06. I don't see that supervdsm starts since then, and I also don't see a sign that libvirtd started. I see that all enabled services under /etc/systemd/system/multi-user.target.wants exist after reboot. Maybe this happens due to a race between the call to configure and the vdsmd start. I see that the configure happens, and that "Starting Virtualization daemon..." appears twice before it. Don't we run "vdsm-tool configure" in pre-start and then start the service again? We can't count on systemd to try to start the service only after execution of the node pre-start scripts.

snip -

```
Feb 19 18:08:08 hilpro1808 systemd: Starting Virtualization daemon...
..
Feb 19 18:08:09 hilpro1808 systemd: Started Virtualization daemon.
Feb 19 18:08:09 hilpro1808 systemd: Starting Virtual Desktop Server Manager network restoration...
...
Feb 19 18:08:14 hilpro1808 systemd: Stopped Virtual Desktop Server Manager network restoration.
Feb 19 18:08:14 hilpro1808 systemd: Stopping Virtual Desktop Server Manager network restoration...
Feb 19 18:08:14 hilpro1808 systemd: Stopping Virtualization daemon...
Feb 19 18:08:14 hilpro1808 systemd: Stopped Virtualization daemon.
Feb 19 18:08:14 hilpro1808 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Feb 19 18:08:14 hilpro1808 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Feb 19 18:08:14 hilpro1808 saslpasswd2: error deleting entry from sasldb: BDB0073 DB_NOTFOUND: No matching key/data pair found
Feb 19 18:08:14 hilpro1808 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Feb 19 18:08:14 hilpro1808 systemd: Starting Virtualization daemon...
Feb 19 18:08:14 hilpro1808 systemd: Started Virtualization daemon.
Feb 19 18:08:14 hilpro1808 vdsm-tool: Checking configuration status...
Feb 19 18:08:14 hilpro1808 vdsm-tool: abrt is already configured for vdsm
Feb 19 18:08:14 hilpro1808 vdsm-tool: lvm is configured for vdsm
Feb 19 18:08:14 hilpro1808 vdsm-tool: libvirt is already configured for vdsm
Feb 19 18:08:14 hilpro1808 vdsm-tool: SUCCESS: ssl configured to true. No conflicts
Feb 19 18:08:14 hilpro1808 vdsm-tool: Current revision of multipath.conf detected, preserving
Feb 19 18:08:14 hilpro1808 vdsm-tool: Running configure...
Feb 19 18:08:14 hilpro1808 vdsm-tool: Reconfiguration of abrt is done.
Feb 19 18:08:14 hilpro1808 vdsm-tool: Reconfiguration of passwd is done.
Feb 19 18:08:14 hilpro1808 vdsm-tool: Reconfiguration of libvirt is done.
Feb 19 18:08:14 hilpro1808 vdsm-tool: Done configuring modules to VDSM.
Feb 19 18:08:14 hilpro1808 systemd: Started Reconfigure vdsm.
...
```

Then there is no attempt to start "Virtualization daemon" again.

(Originally by ybronhei)

NGN does not have pre-start scripts. It's purely systemd.
And why could we not rely on the following service to run before supervdsmd?

```
[Unit]
Description=Reconfigure vdsm
Before=supervdsmd.service
After=chronyd.service ntpd.service

[Service]
ExecStartPre=/usr/bin/sleep 2
ExecStart=/usr/bin/vdsm-tool -v configure --force
Type=oneshot
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

(Originally by Ryan Barry)

Is this something that you currently do? Now I understand where "systemd: Started Reconfigure vdsm." comes from. This is fine, but supervdsmd doesn't require vdsmd. Maybe you should have Before=vdsmd.service instead. By the way, "chronyd.service ntpd.service" can be changed to time-sync.target. Can you patch this service file in the customer env and see if the issue reproduces after reboot?

(Originally by ybronhei)

We have done this since March of 2017. Our understanding was that supervdsmd runs before vdsmd, so this should be safe. To patch for the customer would require some manual steps before rebooting. Are they willing to do this, Allan? It should be about 3 commands.

(Originally by Ryan Barry)

This is not going to make it to 4.1.10 - please re-target.

(Originally by Yaniv Kaul)

I'll help them apply whatever steps they need to apply to test this. Just let me know exactly which steps I should have them attempt.

(Originally by Allan Voss)

```
yum update    # before rebooting
mkdir /tmp/updateimg
mount $(imgbase layer --volume-path $(imgbase layer --latest)) /tmp/updateimg
vi /tmp/updateimg/usr/lib/systemd/system/imgbase-config-vdsm.service
```

Change:

    Before=supervdsmd.service
    After=chronyd.service ntpd.service

To:

    Before=vdsmd.service
    After=time-sync.target

Then reboot.

(Originally by Ryan Barry)

Allan, any response from customer?

(Originally by Martin Perina)

(In reply to Ryan Barry from comment #27)
> NGN does not have pre-start scripts. It's purely systemd. And why could we not rely on the following service to run before supervdsmd?
>
> [Unit]
> Description=Reconfigure vdsm
> Before=supervdsmd.service
> After=chronyd.service ntpd.service
>
> [Service]
> ExecStartPre=/usr/bin/sleep 2
> ExecStart=/usr/bin/vdsm-tool -v configure --force
> Type=oneshot
> RemainAfterExit=yes
>
> [Install]
> WantedBy=multi-user.target

Ryan, you cannot make these kinds of changes. If you change vdsm startup, you will have to support vdsm startup issues yourself. We support only the upstream vdsm startup flow. If something in the upstream flow does not work for you, please open a vdsm bug.

vdsm-tool configure is not something that should run automatically on every boot. It should run after installing or updating vdsm, and it should already be managed by the vdsm upgrade code and host deploy.

(Originally by Nir Soffer)

Nir, this was added in 4.0.7 (bug 1429288) - it looks like the reconfigure service cancels vdsmd.service when systemctl tries to run both in parallel. Can't we make sure systemd starts vdsmd.service after reconfiguring?

(Originally by Yuval Turgeman)

In host deploy we run configure after we install the rpms, as part of the deploy flow. During upgrade we also do that automatically, to restart the service with the new code. In old RHEV-H we used to run configure in a pre-start script; here you added this systemd dependency. Why not use a pre-start script for that? I'm not so sure that this is the cause of the bug, but I can't find any other change that could cause vdsmd not to start at all. If it started and failed, we would see the vdsmd_init_common output in /var/log/messages.

(Originally by ybronhei)

(In reply to Yuval Turgeman from comment #38)
> Nir, this was added in 4.0.7 (bug 1429288) - it looks like the reconfigure service cancels vdsmd.service when systemctl tries to run both in parallel. Can't we make sure systemd starts vdsmd.service after reconfiguring?

configure cannot be used from a service like this; it should not run automatically on every boot.
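(Editor's note: the ordering Yuval asks about - starting vdsmd only after the reconfigure unit has finished - could in plain systemd terms be expressed as a drop-in. This is an illustrative sketch, not a change anyone in the thread committed to; the drop-in path is hypothetical.)

```
# Hypothetical drop-in: /etc/systemd/system/vdsmd.service.d/10-after-reconfigure.conf
[Unit]
# A oneshot unit counts as "started" only after its ExecStart exits,
# so After= delays vdsmd until the reconfigure has completed.
After=imgbase-config-vdsm.service
# Pull the reconfigure unit into the same transaction when vdsmd starts.
Wants=imgbase-config-vdsm.service
```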
It is expected that configuring will cause vdsm to stop, since it may cause services that vdsm requires to restart. Systemd kills vdsm if a required service is stopped.

The configure service also assumes that Before=supervdsmd will ensure that systemd waits until this service has completed before starting vdsm, but since the service uses the default "simple" type and not the "notify" type, I'm not sure it works as expected.

Finally, using "sleep 2" in the service is racy and is going to fail randomly in production.

We should go back to bug 1429288 and understand why configure is needed during boot. This bug should move back to RHV-H; this is not a vdsm bug.

(Originally by Nir Soffer)

(In reply to Yaniv Bronhaim from comment #39)
> in host deploy we run configure after we install the rpms as part of the deploy flow. during upgrade we do that also automatically to restart the service with the new code. in old rhev-h we used to run configure in a pre-start script. here you added this systemd dependency. why not use a pre-start script for that? I'm not so sure that this is the cause of the bug, but I can't find any other change that can cause vdsmd not to start at all. if it would start and fail, we could see the vdsmd_init_common output in /var/log/messages

Not sure this is the cause of this bug either, but I tried to start both services at (almost) the same time, and got:

```
[root@node-iso-15878 ~]# systemctl start vdsmd
Job for vdsmd.service canceled.
```

So neither the service nor its prescript runs, yet nothing is shown in the logs (unless systemd runs with debug log-level). IIRC, we added this as a service since vdsm-tool can't run inside chroot/nsenter, so our only option to make sure it runs after a new image is installed was to make it run on boot. Hope this makes sense.

(Originally by Yuval Turgeman)

(In reply to Nir Soffer from comment #40)
> configure cannot be used from a service like this, it should not run automatically on every boot.
> It is expected that configuring will cause vdsm to stop, since it may cause vdsm required services to restart. Systemd kills vdsm if a required service was stopped.
> The configure service also assumes that Before: supervdsm will ensure that systemd will wait until this service was completed before starting vdsm, but since the service uses the default "simple" type and not the "notify" type, I'm not sure it works as expected.
> Finally, using sleep 2 in the service is racy and is going to fail randomly in production.
> We should go back to bug 1429288 and understand why configure is needed during boot.
> This bug should move back to RHV-H, this is not a vdsm bug.

Not sure I understand why this is a problem - the reconfigure service is set to oneshot, so it must end before other services start, meaning that if we place vdsmd.service "After" our service, it should be fine, shouldn't it? As for those 2 seconds of sleep, they were added for a different reason (time sync), so they're probably not related to this issue.

(Originally by Yuval Turgeman)

(In reply to Nir Soffer from comment #37)
> Ryan, you cannot do these kinds of changes. If you change vdsm startup you will have to support vdsm startup issues yourself. We are supporting only the upstream vdsm startup flow.

Nir - this has been in place since 4.0.7 on NGN, and we had an identical configuration on vintage RHVH.

> If something in the upstream flow does not work for you please open a vdsm bug.
> vdsm-tool configure is not something that should run automatically on every boot. It should run after installing or updating vdsm, and it should already be managed by vdsm upgrade code and host deploy.

vdsm does not run this on RHVH and never has, since vdsm runs these as part of the RPM %post and %posttrans scriptlets, which doesn't mesh well with a prebuilt image. vdsm's scriptlets do not trigger when RHVH images are updated, since they ran at build time. This is the cause of the original bug, both on vintage RHVH and NGN. We already have an abstraction to grab the RPM scriptlets and run them, but as Yuval noted, "vdsm-tool configure" cannot be run via chroot or nsenter while we are updating (we mount the new filesystem to perform some tasks like this). This means that, to apply to the updated image on reboot, it must be run at boot time, and we must find a solution which works with this.

(In reply to Yaniv Bronhaim from comment #39)
> in host deploy we run configure after we install the rpms as part of the deploy flow. during upgrade we do that also automatically to restart the service with the new code. in old rhev-h we used to run configure in a pre-start script. here you added this systemd dependency. why not use a pre-start script for that?

NGN doesn't have ovirt_early, and we have not (so far, and hopefully ever) patched any files from other packages. I'm very open to options, but since this has been in since 4.0.7 with only one reported case, I'm hesitant to say that this is a structural problem with the systemd service. The semantics for Before= are not complex, and "Type=simple", per the systemd docs, waits for it to exit. Type=notify should not be needed, and a shell one-liner will not send messages across sd_notify in any case.
(In reply to Yuval Turgeman from comment #41)
> Not sure this is the cause of this bug either, but I tried to start both services at (almost) the same time, and got:

AFAIK from the systemd docs and previous tests, we shouldn't actually start them at the same time. Instead, "systemctl stop supervdsmd.service && systemctl start imgbase-config-vdsm.service && systemctl start supervdsmd.service" should start imgbase-config-vdsm then supervdsmd, which is what should be happening at startup. There was a systemd rebase in 7.4 which may have changed this, but this would be the first report.

(Originally by Ryan Barry)

(In reply to Yuval Turgeman from comment #42)
> Not sure I understand why this is a problem - the reconfigure service is set to oneshot - so it must end before other services start, meaning if we place vdsmd.service "After" our service, it should be fine, shouldn't it?

It is OK to defer running "vdsm-tool configure", but:
- it must run once on the first boot, or perhaps repeatedly until the configuration has completed;
- it must start and complete before starting the vdsm services or any services requiring vdsm;
- it must not run automatically after vdsm has been configured.

I think the simplest way to solve this is to change the vdsm pre-start script to check whether the host needs to be configured, and if it does, configure it. Node installation can signal the need to configure the host by adding a file in /var/lib/vdsm/need-configure. The vdsm pre-start script will remove this file if vdsm-tool configure was successful. Yaniv, what do you think?

(Originally by Nir Soffer)

If the configure needs to run only once after node installation, this solution sounds good.

(Originally by ybronhei)

What's the next step here?

(Originally by Yaniv Kaul)

(In reply to Yaniv Kaul from comment #48)
> What's the next step here?
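(Editor's note: Nir's flag-file proposal above can be sketched as a small POSIX-shell helper. The path /var/lib/vdsm/need-configure comes from his comment; the `maybe_configure` function name and its parameterization are illustrative, so the logic can be exercised with any command.)

```shell
#!/bin/sh
# maybe_configure FLAG CMD...
# Run CMD only if FLAG exists (i.e. node installation requested a
# reconfiguration), and remove FLAG only if CMD succeeds, so that a
# failed configure is retried on the next boot.
maybe_configure() {
    flag=$1
    shift
    # No flag file: nothing requested reconfiguration, nothing to do.
    [ -e "$flag" ] || return 0
    # Run the configure command; on failure, keep the flag for a retry.
    "$@" || return 1
    rm -f -- "$flag"
}

# Intended invocation from a pre-start script (illustrative):
# maybe_configure /var/lib/vdsm/need-configure vdsm-tool configure --force
```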
Piotr will try to provide a fix for that.

(Originally by Martin Perina)

Copy-paste from the console of the server:

```
Installed:
  redhat-virtualization-host-image-update.noarch 0:4.2-20180615.0.el7_5

Replaced:
  redhat-virtualization-host-image-update-placeholder.noarch 0:4.2-3.1.el7

Complete!

# mkdir /tmp/updateimg
# mount $(imgbase layer --volume-path $(imgbase layer --latest)) /tmp/updateimg
# less /tmp/updateimg/usr/lib/systemd/system/imgbase-config-vdsm.service
[Unit]
Description=Reconfigure vdsm
Before=supervdsmd.service
After=chronyd.service ntpd.service
ConditionPathExists=/var/lib/ngn-vdsm-need-configure

[Service]
ExecStartPre=/usr/bin/sleep 2
ExecStart=/usr/bin/vdsm-tool -v configure --force
ExecStartPost=/usr/bin/rm /var/lib/ngn-vdsm-need-configure
Type=oneshot
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

# ls -l /tmp/updateimg/var/lib/ngn-vdsm-need-configure
-rw-r--r--. 1 root root 0 Jun 15 16:26 /tmp/updateimg/var/lib/ngn-vdsm-need-configure
```

After reboot, vdsmd and supervdsmd are up and the flag file is gone:

```
# ls -l /var/lib/ngn-vdsm-need-configure
ls: cannot access /var/lib/ngn-vdsm-need-configure: No such file or directory
```

VERIFIED

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2072