Bug 1195029 - hypervkvpd service start up is racy
Summary: hypervkvpd service start up is racy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: hyperv-daemons
Version: 23
Hardware: x86_64
OS: Linux
unspecified
unspecified
Target Milestone: ---
Assignee: Vitaly Kuznetsov
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-02-22 08:10 UTC by Sitsofe Wheeler
Modified: 2015-12-07 20:30 UTC (History)
7 users (show)

Fixed In Version: 0-0.12.20150702git.fc23 hyperv-daemons-0-0.13.20150702git.fc23
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-07 20:30:03 UTC
Type: Bug


Attachments (Terms of Use)
Journal output from a problem boot (84.30 KB, text/plain)
2015-02-22 08:10 UTC, Sitsofe Wheeler
no flags Details
/usr/lib/udev/rules.d/99-hypervkvpd.rules (307 bytes, text/plain)
2015-02-22 08:14 UTC, Sitsofe Wheeler
no flags Details
/usr/lib/systemd/system/hypervkvpd.service (169 bytes, text/plain)
2015-02-22 08:15 UTC, Sitsofe Wheeler
no flags Details

Description Sitsofe Wheeler 2015-02-22 08:10:47 UTC
Created attachment 994218 [details]
Journal output from a problem boot

Description of problem:
If hypervkvpd is started too early then information (like integration services status and IP addresses) is not reported back to the Hyper-V host.

Version-Release number of selected component (if applicable):
hypervkvpd-0-0.9.20140714git.fc21.x86_64
kernel-core-3.18.7-200.fc21.x86_64

How reproducible:
Reproducible every time.

Steps to Reproduce:
(On a Windows 2012 R2 host)
1. Boot Fedora 21 cloud image.
2. Install the hypervkvpd package.
3. Run
systemctl enable hypervkvpd.service
4. Reboot.
5. After 30 seconds look at Summary tab of the VM in Hyper-V Manager.

Actual results:
Integration Services says nothing.

Expected results:
Integration Services to say "Update required".

Additional info:
This occurs because hypervkvpd is started just before the hv_utils module finishes loading:

Feb 22 08:08:51 systemd[1]: Starting Hyper-V KVP daemon...
Feb 22 08:08:51 KVP[407]: KVP starting; pid is:407
Feb 22 08:08:51 systemd[1]: Started Hyper-V KVP daemon.
Feb 22 08:08:51 systemd[1]: Starting NTP client/server...
Feb 22 08:08:52 systemd[1]: Starting Resets System Activity Logs...
Feb 22 08:08:52 chronyd[410]: chronyd version 1.31 starting
[...]
Feb 22 08:08:52 kernel: hv_vmbus: registering driver hyperv_keyboard
Feb 22 08:08:52 systemd[1]: Started Getty on tty1.
Feb 22 08:08:52 kernel: hv_utils: Registering HyperV Utility Driver 
Feb 22 08:08:52 kernel: hv_vmbus: registering driver hv_util

Attaching journal log.

Generally speaking if the hypervkvpd.service is later restarted by hand (e.g. by doing
systemctl restart hypervkvpd.service
) the correct intergration status will be reported from then on (in the attached journal output you can see "hv_utils: KVP: user-mode registering done." is ONLY printed after the service is restarted). However very occasionally (around 1 in 10 times) restarting the service will NOT fix the problem and the only option is to reboot the VM (not even removing and adding the module will reset things). It is this flakiness is what led me to misfile Bug #1194728 (just to be clear this problem has nothing to do with selinux - that was a wrong guess in that bug).

Comment 1 Sitsofe Wheeler 2015-02-22 08:14:25 UTC
Created attachment 994219 [details]
/usr/lib/udev/rules.d/99-hypervkvpd.rules

Comment 2 Sitsofe Wheeler 2015-02-22 08:15:19 UTC
Created attachment 994220 [details]
/usr/lib/systemd/system/hypervkvpd.service

Comment 3 Sitsofe Wheeler 2015-02-22 08:25:10 UTC
I've tried to workaround the problem by making the service wait for the hv_util driver to finish loading.

Initially I thought I could do this using Requires=sys-bus-vmbus-drivers-hv_util.device line in hypervkvpd.service but this caused boots to break on VMware systems where the hypervkvpd.service had been enabled (ConditionVirtualization doesn't stop the Requires line taking effect and since the Requires is never satisfied on VMware systems...).

The latest (attached) attempt forcefully loads the service from udev after hv_util is around. This is kinda bad because:

1. The service MUST no longer be installed to multi-user.target or it will load too soon. This kinda creates problems for people who are upgrading as they won't benefit from the fix...
2. There is no way to deactivate the service other than uninstalling the RPM/manually overriding udev rules.
3. The act of installing the RPM would "activate" the service on future boots.

Does anyone have something better? The only other thing I can think of is if service does the modprobe of hv_utils itself (but this makes a mess if the user compiled the module into the kernel).

Comment 4 Vitaly Kuznetsov 2015-02-26 13:30:14 UTC
The root cause of the issue is the fact that kernel/userspace daemon communication for kvp/vss (but not for fcopy) is done via netlink so all sorts of weird issues (packets being lost, racy startup, ...) are possible.

I'm going to send upstream patches to eliminate netlink communication for kvp/vss shortly.

Comment 5 K. Y. Srinivasan 2015-02-26 13:57:26 UTC
Thanks Vitaly, if you want I can send you the kvp kernel/daemon code that I have for RHEL 5.x based on char device.

Comment 6 K. Y. Srinivasan 2015-02-26 13:57:52 UTC
Thanks Vitaly, if you want I can send you the kvp kernel/daemon code that I have for RHEL 5.x based on char device.

Comment 7 Vitaly Kuznetsov 2015-07-27 09:14:06 UTC
Should be fixed in 0-0.12.20150702git.fc23 (requires kernel>=4.2-rc0).

Comment 8 Sitsofe Wheeler 2015-07-31 16:06:59 UTC
Will this also be fixed in Fedora 22?

Comment 9 Vitaly Kuznetsov 2015-08-04 07:21:25 UTC
(In reply to Sitsofe Wheeler from comment #8)
> Will this also be fixed in Fedora 22?

Probably not until kernel-4.2 (which has this new transport) lands in updates.

Comment 10 Sitsofe Wheeler 2015-08-04 08:20:15 UTC
Vitaly:
Understood. Will the existing hypervkvpd continue to work when a user boots the 4.2 kernel?

Comment 11 Vitaly Kuznetsov 2015-08-04 08:42:15 UTC
(In reply to Sitsofe Wheeler from comment #10)
> Vitaly:
> Understood. Will the existing hypervkvpd continue to work when a user boots
> the 4.2 kernel?

Yes (or at least it was supposed to). Newly introduced transport has an abstraction over both layers (old netlink and newly introduced chardev-based). Daemons upgrade under 4.2 kernel should also work.

Comment 12 Sitsofe Wheeler 2015-11-09 13:47:06 UTC
Unfortunately this issue is still here with
Fedora 23
kernel-core-4.2.5-300.fc23.x86_64

Looking through the journal only shows the following:
Nov 09 10:28:48 localhost.localdomain systemd[1]: Starting Basic System.
Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.
Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Software RAID monitoring and management.
Nov 09 10:28:48 localhost.localdomain systemd[1]: Started VGAuth Service for open-vm-tools.
Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Service for virtual machines hosted on VMware.
[...]
Nov 09 10:28:48 localhost.localdomain systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 09 10:28:48 localhost.localdomain systemd[1]: Startup finished in 819ms (kernel) + 900ms (initrd) + 917ms (userspace) = 2.637s.
Nov 09 10:28:48 localhost.localdomain kernel: hv_utils: Registering HyperV Utility Driver
Nov 09 10:28:48 localhost.localdomain kernel: hv_vmbus: registering driver hv_util

So I'm guessing by the time hv_utils showed up it was too late for the KVP daemon.

(How do I reopen this?)

Comment 13 Vitaly Kuznetsov 2015-11-09 18:04:02 UTC
(In reply to Sitsofe Wheeler from comment #12)
> Unfortunately this issue is still here with
> Fedora 23
> kernel-core-4.2.5-300.fc23.x86_64
> 
> Looking through the journal only shows the following:
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Starting Basic System.
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Software RAID
> monitoring and management.
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Started VGAuth Service for
> open-vm-tools.
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Service for
> virtual machines hosted on VMware.
> [...]
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Starting Update UTMP about
> System Runlevel Changes...
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Started Update UTMP about
> System Runlevel Changes.
> Nov 09 10:28:48 localhost.localdomain systemd[1]: Startup finished in 819ms
> (kernel) + 900ms (initrd) + 917ms (userspace) = 2.637s.
> Nov 09 10:28:48 localhost.localdomain kernel: hv_utils: Registering HyperV
> Utility Driver
> Nov 09 10:28:48 localhost.localdomain kernel: hv_vmbus: registering driver
> hv_util
> 
> So I'm guessing by the time hv_utils showed up it was too late for the KVP
> daemon.


Oh, it seems 

ConditionPathExists=/dev/vmbus/hv_fcopy

only prevents this service from starting too early, nobody will start it when this file appears. I think it should be possible to make systemd do magic like that, I'll take a look.

> 
> (How do I reopen this?)

Moved the bug back to ASSIGNED as there is still some work to do here.

Comment 14 Sitsofe Wheeler 2015-11-09 21:32:44 UTC
Something like this in hypervkvpd.path:

[Unit]
Description=Hyper-V KVP daemon
ConditionVirtualization=microsoft

[Path]
PathExists=/dev/vmbus/hv_kvp

[Install]
WantedBy=multi-user.target

for hypervkvpd.service might have the desired effect (see http://www.freedesktop.org/software/systemd/man/systemd.path.html for details).

Comment 15 Vitaly Kuznetsov 2015-11-12 12:07:19 UTC
Can you please give this scratch build a try?
http://koji.fedoraproject.org/koji/taskinfo?taskID=11803666

I added .path files for systemd so units are supposed to be activated when devices come up. Device disappearance/reappearance (e.g. when guest services are disabled and then enabled back from host) should also be supported. Unfortunately I discovered an ugly kernel bug while testing this:
https://lkml.org/lkml/2015/11/12/197

Comment 16 Sitsofe Wheeler 2015-11-13 09:22:34 UTC
You wouldn't happen to have a build that targets Fedora 23 with binary packages would you? I don't happen to have an appropriate build system around at the moment to rebuild the srpm...

Comment 17 Sitsofe Wheeler 2015-11-13 09:33:37 UTC
OK I've just found the fc24 RPM and installed that. Here's what I get:

Nov 13 09:30:59 localhost kernel: hv_vmbus: registering driver hv_storvsc
Nov 13 09:31:00 localhost.localdomain systemd[1]: Started Hyper-V KVP device.
Nov 13 09:31:00 localhost.localdomain systemd[1]: Starting Hyper-V KVP device.
Nov 13 09:31:00 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.
Nov 13 09:31:00 localhost.localdomain systemd[1]: Starting Hyper-V KVP daemon...
Nov 13 09:31:00 localhost.localdomain KVP[493]: KVP starting; pid is:493
Nov 13 09:31:00 localhost.localdomain KVP[493]: open /dev/vmbus/hv_kvp failed; error: 2 No such file or directory
Nov 13 09:31:00 localhost.localdomain systemd[1]: hypervkvpd.service: Main process exited, code=exited, status=1/FAILURE
Nov 13 09:31:00 localhost.localdomain systemd[1]: hypervkvpd.service: Unit entered failed state.
Nov 13 09:31:00 localhost.localdomain systemd[1]: hypervkvpd.service: Failed with result 'exit-code'.
Nov 13 09:31:00 localhost.localdomain kernel: hv_utils: Registering HyperV Utility Driver
Nov 13 09:31:00 localhost.localdomain kernel: hv_vmbus: registering driver hv_util
Nov 13 09:31:00 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.
Nov 13 09:31:00 localhost.localdomain systemd[1]: Starting Hyper-V KVP daemon...
Nov 13 09:31:00 localhost.localdomain KVP[514]: KVP starting; pid is:514
Nov 13 09:31:00 localhost.localdomain KVP[514]: KVP LIC Version: 3.1
Nov 13 09:30:22 localhost.localdomain kernel: hv_vmbus: registering driver hv_netvsc
Nov 13 09:30:22 localhost.localdomain kernel: hv_netvsc: hv_netvsc channel opened successfully
Nov 13 09:30:22 localhost.localdomain multipathd[399]: sda: spurious uevent, path already in pathvec
Nov 13 09:30:22 localhost.localdomain kernel: hv_netvsc vmbus_14: Send section size: 6144, Section count:2560
Nov 13 09:30:22 localhost.localdomain kernel: hv_netvsc vmbus_14: Device MAC 00:15:5d:03:b6:32 link state up
Nov 13 09:30:22 localhost.localdomain kernel: hv_netvsc: hv_netvsc channel opened successfully
Nov 13 09:30:22 localhost.localdomain kernel: hv_netvsc vmbus_15: Send section size: 6144, Section count:2560
Nov 13 09:30:22 localhost.localdomain kernel: hv_netvsc vmbus_15: Device MAC 00:15:5d:03:b6:33 link state up
Nov 13 09:30:22 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.

It seems a bit messy that it starts up and then quits with failure only to be started again later...

Comment 18 Sitsofe Wheeler 2015-11-13 09:39:23 UTC
I'm wondering if my first suggestion was wrong and it is better to use udev rules to help sort this out: https://lists.fedoraproject.org/pipermail/devel/2012-January/160917.html .

Comment 19 Sitsofe Wheeler 2015-11-13 10:01:16 UTC
/etc/udev/rules.d/hypervkvpd.rules:
SUBSYSTEM=="misc", KERNEL=="vmbus/hv_kvp", TAG+="systemd"

/etc/systemd/system/hypervkvpd.service.d/override.conf:
[Unit]
Wants=sys-devices-virtual-misc-vmbus\x21hv_kvp.device
After=sys-devices-virtual-misc-vmbus\x21hv_kvp.device

looks awful but it seems to work...

Nov 13 09:56:47 localhost.localdomain kernel: hv_utils: Registering HyperV Utility Driver
Nov 13 09:56:47 localhost.localdomain kernel: hv_vmbus: registering driver hv_util
Nov 13 09:56:47 localhost.localdomain kernel: hv_vmbus: registering driver hv_netvsc
[...]
Nov 13 09:56:10 localhost.localdomain systemd[1]: Found device /sys/devices/virtual/misc/vmbus!hv_kvp.
Nov 13 09:56:10 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.
Nov 13 09:56:10 localhost.localdomain systemd[1]: Starting Hyper-V KVP daemon...
Nov 13 09:56:10 localhost.localdomain KVP[475]: KVP starting; pid is:475
Nov 13 09:56:10 localhost.localdomain KVP[475]: KVP LIC Version: 3.1

(I've no idea why the clock jumps around as it does when starting up - the host's clock is set correctly and there's no NTP syncing happening...)

Comment 20 Vitaly Kuznetsov 2015-11-13 15:56:38 UTC
The tricky part is the fact that these devices can disappear and reappear again at any time (e.g. try disabling "Data Exchange" in your VM properties and then enabling it back). I'd like to have daemons restarted in such cases, but udev rules don't do that. I played a little bit with other systemd features and it seems adding
ConditionPathExists=/dev/vmbus/hv_kvp
to the unit solves both problems -- we don't have fault startup when the device isn't there and .path file does the job when it comes up.

Here is a scratch build with the change:
http://koji.fedoraproject.org/koji/taskinfo?taskID=11821296

Please be aware that in case you'll be testing enabling/disabling services from your Hyper-V host your guest kernel will probably crash unless you rebuild it with patches from Comment#15.

Thanks!

Comment 21 Sitsofe Wheeler 2015-11-14 22:09:30 UTC
The latest RPM produces the following:

journalctl -b | grep -i 'hv\|hyper\|KVP'
Nov 14 21:52:16 localhost kernel: hv_vmbus: registering driver hv_storvsc
Nov 14 21:52:17 localhost.localdomain systemd[1]: Started Hyper-V KVP device.
Nov 14 21:52:17 localhost.localdomain systemd[1]: Starting Hyper-V KVP device.
Nov 14 21:52:17 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.
Nov 14 21:52:17 localhost.localdomain kernel: hv_utils: Registering HyperV Utility Driver
Nov 14 21:52:17 localhost.localdomain kernel: hv_vmbus: registering driver hv_util
Nov 14 21:52:17 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.
Nov 14 21:52:17 localhost.localdomain systemd[1]: Starting Hyper-V KVP daemon...
Nov 14 21:52:17 localhost.localdomain KVP[478]: KVP starting; pid is:478
Nov 14 21:52:17 localhost.localdomain KVP[478]: KVP LIC Version: 3.1
Nov 14 21:52:17 localhost.localdomain kernel: hv_vmbus: registering driver hv_netvsc
Nov 14 21:52:17 localhost.localdomain kernel: hv_netvsc: hv_netvsc channel opened successfully
[...]
Nov 14 21:52:18 localhost.localdomain systemd[1]: Started Hyper-V KVP daemon.

It's not great seeing the (systemd hypervkvpd) service being started so many (4?) times but at least the KVP itself is only being started once (and I wish that daemon had chosen hypervkvpd as its syslog name) so arguably this is fixed.

One thing though - perhaps it's linked to your kernel patch but while disabling/enabling the "Data Exchange" integration service on the host didn't result in a kernel crash it DID result in the IP address never appearing in the host's networking tab ever again. The journal shows the service being started again but this has no impact - only manually restarting the service makes the IP address reporting work again.

Comment 22 Sitsofe Wheeler 2015-11-14 22:42:32 UTC
OK try 3 (from http://unix.stackexchange.com/questions/63232/what-is-the-correct-way-to-write-a-udev-rule-to-stop-a-service-under-systemd ):

/etc/udev/rules.d/hypervkvpd.rules:
SUBSYSTEM=="misc", KERNEL=="vmbus/hv_kvp", TAG+="systemd", ENV{SYSTEMD_WANTS}+="hypervkvpd.service"

/etc/systemd/system/hypervkvpd.service.d/override.conf:
[Unit]
BindsTo=sys-devices-virtual-misc-vmbus\x21hv_kvp.device
After=sys-devices-virtual-misc-vmbus\x21hv_kvp.device

(.path file was masked away)

Brief testing shows this appeared to address your disappear and reappear requirement (service was correctly stopped when Data Exchange was disabled and started after Data Exchange was subsequently enabled) a few times before the kernel blew up with an oops.

Comment 23 Vitaly Kuznetsov 2015-11-18 16:09:31 UTC
Thanks,

here is the scratch build with udev rules:
http://koji.fedoraproject.org/koji/taskinfo?taskID=11893296

I'll test it for a while and submit an official build if there are no issues.

Comment 24 Sitsofe Wheeler 2015-11-20 22:57:12 UTC
Thanks Vitaly - that package seems to work well for me (but I'm only testing the hyperkvpd portion of it).

Comment 25 Vitaly Kuznetsov 2015-11-23 10:45:09 UTC
(In reply to Sitsofe Wheeler from comment #24)
> Thanks Vitaly - that package seems to work well for me (but I'm only testing
> the hyperkvpd portion of it).

Thanks, I briefly tested all of them and everything seemed to work well. I'll be submitting builds for rawhide & f23 shortly.

Comment 26 Sitsofe Wheeler 2015-11-23 12:24:59 UTC
Vitaly:
Something that has just occurred to me - are people going to get into a mess if they were to install this package on a system with an old (pre 4.2 kernel)? This isn't such an issue for Fedora 23 (as it never came with an older kernel) but might be for others...

Comment 27 Vitaly Kuznetsov 2015-11-23 12:51:21 UTC
(In reply to Sitsofe Wheeler from comment #26)
> Vitaly:
> Something that has just occurred to me - are people going to get into a mess
> if they were to install this package on a system with an old (pre 4.2
> kernel)? This isn't such an issue for Fedora 23 (as it never came with an
> older kernel) but might be for others...

We could have added an explicit 'Requires: kernel >= 4.2' to rpms but I think it would be an overkill:
- there never was an older kernel in F23;
- current hyperv-daemons packages in F23 and rawhide are not able to work with older kernels anyway;
- it is not a bullet proof solution as it is possible to have a newer kernel installed but an older one booted.

In general, people installing packages from newer Fedora on old Fedoras or building their own kernels are on their own: it might and might not work in each particular case...

Comment 28 Sitsofe Wheeler 2015-11-23 13:54:35 UTC
Vitaly: Sounds reasonable!

Comment 29 Fedora Update System 2015-11-23 17:52:19 UTC
hyperv-daemons-0-0.13.20150702git.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-0d1a796b7d

Comment 30 Fedora Update System 2015-11-24 02:22:34 UTC
hyperv-daemons-0-0.13.20150702git.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update hyperv-daemons'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-0d1a796b7d

Comment 31 Fedora Update System 2015-12-07 20:29:57 UTC
hyperv-daemons-0-0.13.20150702git.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.