Description of problem:

According to the discussion on BZ 1507691, moving a storage domain to maintenance is allowed even if a volume is still in open status, because the volume is not being used by a VM nor by the storage domain once it is deactivated. In the vdsm log you will notice a warning indicating that the volume could not be deactivated; this is because it was still active on the host side as well. Although the volume is no longer 'active' from the RHV perspective, LVM still shows this volume as open.

This can happen after the introduction of 'vdsm disables lvmetad service' (BZ 1398918) when the LVM filters are _not_ set as per KCS https://access.redhat.com/solutions/2662261, and a raw disk (no partition table) is added to a VG inside the guest.

The problem is that there is no practical way to identify this issue, let the user know about it, and point them to the correct procedure for configuring LVM filters on the hosts to avoid this scenario.

As Nir detailed in another BZ:

~~~
1. user creates a raw volume
2. user runs a VM with the raw volume
3. inside the VM, the user adds the new disk to a VG without creating a
   partition table on the disk (e.g. instead of creating a partition
   /dev/sdb1 and adding /dev/sdb1 to the VG)

During the last boot of this host:

1. lvm activated all LVs during boot, including RHV LVs
2. when the raw volume was activated, lvm scanned it like any other disk
3. lvm discovered the guest LV inside the RHV raw volume and activated it
4. the user plugged the raw volume into a running VM
5. the user unplugged the raw volume from the VM
6. vdsm failed to deactivate the raw volume LV since it is held open by the
   guest LV
~~~

This RFE was opened to request that this condition, or similar ones, be reported to the engine and logged as an event so it can be consumed by the UI and shown on Hosts and Storage Domains. This way we could warn the user that additional steps need to be performed on the hosts' LVM filters to avoid such issues in the future.
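For illustration, a minimal sketch of how this condition could be spotted on a host by parsing `lvs` output. The VG name is a placeholder for the storage domain UUID, and this script is not part of vdsm:

~~~
#!/usr/bin/python
# Minimal sketch (not vdsm code): list LVs of a storage domain VG that the
# host still reports as open, which is the symptom described above.
import subprocess

SD_VG = "5a1b2c3d-placeholder-vg"   # hypothetical storage domain VG name

out = subprocess.check_output([
    "lvs", "--noheadings", "--separator", "|",
    "-o", "vg_name,lv_name,lv_attr", SD_VG])

for line in out.decode().splitlines():
    vg_name, lv_name, lv_attr = [f.strip() for f in line.split("|")]
    # The 6th character of lv_attr is "o" while the device is open, e.g. when
    # a guest LV discovered inside the raw volume is still active on the host.
    if len(lv_attr) >= 6 and lv_attr[5] == "o":
        print("LV %s/%s is still open on the host" % (vg_name, lv_name))
~~~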
The specific case in comment 0 is just one example. Once we improve the lvm filter configuration, this issue will be gone, but the general issue of reporting secondary failures (e.g. during flow cleanup, or during periodic checks) will always be a problem.

I think this is best solved by a central logging system, collecting errors and warnings from all hosts and providing statistics on the most common errors.

Engine can be integrated with such a system to show alerts about hosts or storage domains that had such failures.
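As a rough illustration of the "statistics on the most common errors" idea, assuming warnings have already been collected into (host, error_code) records; the record format and error codes below are made up:

~~~
# Sketch only: aggregate collected host warnings and rank the most common
# error codes, which is the kind of report a central logging system could show.
from collections import Counter

collected = [
    ("host1", "CANNOT_DEACTIVATE_LV"),
    ("host2", "CANNOT_DEACTIVATE_LV"),
    ("host1", "MULTIPATH_PATH_FAILED"),
]

for code, count in Counter(code for _host, code in collected).most_common():
    print("%s seen %d times" % (code, count))
~~~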
(In reply to Nir Soffer from comment #1)
> The specific case in comment 0 is just one example. Once we improve the lvm
> filter configuration, this issue will be gone, but the general issue of
> reporting secondary failures (e.g. during flow cleanup, or during periodic
> checks) will always be a problem.
>
> I think this is best solved by a central logging system, collecting errors
> and warnings from all hosts and providing statistics on the most common
> errors.
>
> Engine can be integrated with such a system to show alerts about hosts or
> storage domains that had such failures.

The above is being done in the common logging effort, already for 4.2. We can have additional collection of logs (from journal, etc.) and set an alert for it.
(In reply to Yaniv Kaul from comment #2)
> The above is being done in the common logging effort, already for 4.2.
> We can have additional collection of logs (from journal, etc.) and set an
> alert for it.

This is great - but we cannot depend on arbitrary text in a log file for reporting events. If we treat this event as an important event, it must be written in a machine-readable format (e.g. JSON) or reported to the daemon collecting events on the host.
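For example, a minimal sketch of what such a machine-readable event could look like; the file path, schema and event code below are assumptions, not an existing vdsm format:

~~~
# Sketch only: append one JSON object per line to a file that a collector
# daemon could tail, instead of relying on free-form log text.
import json
import time

event = {
    "timestamp": time.time(),
    "host": "host1.example.com",
    "severity": "warning",
    "code": "CANNOT_DEACTIVATE_LV",
    "message": "LV is still open, check LVM filter configuration",
}

with open("/var/log/host-events.json", "a") as f:
    f.write(json.dumps(event) + "\n")
~~~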
(In reply to Yaniv Kaul from comment #2)
> (In reply to Nir Soffer from comment #1)
> > The specific case in comment 0 is just one example. Once we improve the
> > lvm filter configuration, this issue will be gone, but the general issue
> > of reporting secondary failures (e.g. during flow cleanup, or during
> > periodic checks) will always be a problem.
> >
> > I think this is best solved by a central logging system, collecting
> > errors and warnings from all hosts and providing statistics on the most
> > common errors.
> >
> > Engine can be integrated with such a system to show alerts about hosts or
> > storage domains that had such failures.
>
> The above is being done in the common logging effort, already for 4.2.
> We can have additional collection of logs (from journal, etc.) and set an
> alert for it.

Hi Yaniv, is there a BZ for this effort you mentioned for 4.2? If so, shouldn't this BZ be blocked by it?

I will add more cases or scenarios we might need to include besides the one in the description as soon as I can come up with some. Thanks!
Not that I'm aware of. Please include specific items - as we use collectd for monitoring, maybe it's something we already monitor or maybe it's something we can easily add.
(In reply to Yaniv Kaul from comment #5)
> Not that I'm aware of. Please include specific items - as we use collectd
> for monitoring, maybe it's something we already monitor or maybe it's
> something we can easily add.

Forgot to set NEEDINFO on reporter for exact details of what's needed.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Dominik, I went over all the comments and, unless I'm missing something, this is about adding a bad bond event to the audit_log, so moving to the network team.
> Dominik, I went over all the comments and, unless I'm missing something,
> this is about adding a bad bond event to the audit_log, so moving to the
> network team.

Ack, we could check for bad bonds the same way the UI already does on every getCaps and add a rate-limited message to the audit log, as sketched below.

I removed the target milestone to re-schedule this in the network team.
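A minimal sketch of the rate-limiting idea; the helper, interval and message text are hypothetical, not existing engine code, and the actual bad-bond check would reuse whatever the UI already does on getCaps:

~~~
# Sketch only: re-emit a "bad bond" audit message for a given host/bond at
# most once per interval, so repeated getCaps polls do not flood the log.
import time

RATE_LIMIT = 3600  # seconds between repeated messages (hypothetical value)
_last_emitted = {}

def emit_audit_log(host, message):
    # Placeholder for whatever mechanism actually writes to audit_log.
    print("AUDIT[%s]: %s" % (host, message))

def report_bad_bond(host, bond):
    now = time.time()
    key = (host, bond)
    if now - _last_emitted.get(key, 0) >= RATE_LIMIT:
        _last_emitted[key] = now
        emit_audit_log(host, "bond %s has no active slaves" % bond)
~~~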
This bug didn't get any attention for a while; we didn't have the capacity to make any progress. If you deeply care about it or want to work on it, please assign/target accordingly.