Bug 1509007

Summary: [RFE] Add bad bond event into audit_log
Product: Red Hat Enterprise Virtualization Manager
Reporter: Javier Coscia <jcoscia>
Component: ovirt-engine
Assignee: Nobody <nobody>
Status: CLOSED WONTFIX
QA Contact: Michael Burman <mburman>
Severity: medium
Priority: medium
Docs Contact:
Version: 4.1.6
CC: danken, dholler, jcoscia, lsurette, mburman, mgoldboi, michal.skrivanek, mkalinin, mmirecki, mperina, mtessun, nsoffer, sradco, srevivo
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: sync-to-jira
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-24 15:24:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1240719, 1671876
Bug Blocks:

Description Javier Coscia 2017-11-02 16:43:54 UTC
Description of problem:

According to the discussion on BZ 1507691, switching a storage domain to maintenance is allowed even if a volume is still in open status, because the volume is used neither by a VM nor by the storage domain once it is deactivated. The vdsm log shows a warning that the volume could not be deactivated; this is because it was still active on the host side as well.

Although the volume is no longer 'active' from the RHV perspective, LVM still shows it as open.

This can happen after the change 'vdsm disables lvmetad service' (BZ 1398918) when LVM filters are _not_ set as per KCS https://access.redhat.com/solutions/2662261 and a raw disk (no partition table) is added to a VG inside the guest.

The problem is that there is no practical way to identify this condition, let the user know about it, and point them to the correct procedure for configuring LVM filters on the hosts to avoid this scenario.
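
As a side note, here is a minimal sketch (not an existing vdsm check; the file path and the single-line parsing are assumptions) of how a host could at least report whether any LVM filter is configured at all, which is the precondition the KCS article above addresses:

~~~
# Minimal sketch: report whether /etc/lvm/lvm.conf defines a filter.
# Assumes a plain single-line "filter = [...]" or "global_filter = [...]"
# entry; real lvm.conf parsing is more involved.
import re

LVM_CONF = "/etc/lvm/lvm.conf"

def lvm_filter_configured(path=LVM_CONF):
    pattern = re.compile(r"^\s*(global_)?filter\s*=")
    with open(path) as f:
        return any(pattern.match(line) for line in f)

if __name__ == "__main__":
    if not lvm_filter_configured():
        print("WARNING: no LVM filter configured; guest LVs inside raw "
              "volumes may be auto-activated by the host (see "
              "https://access.redhat.com/solutions/2662261)")
~~~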


As Nir detailed in another BZ:

~~~
1. user creates a raw volume
2. user runs a VM with the raw volume
3. inside the VM, the user adds the new disk to a VG without creating a partition table on the disk (i.e. the whole disk is used as a PV rather than, e.g., creating a partition /dev/sdb1 and adding /dev/sdb1 to the VG)

During the last boot of this host:
1. lvm activated all LVs during boot, including the RHV LVs
2. when the raw volume was activated, lvm scanned it like any other disk
3. lvm discovered the guest LV inside the RHV raw volume and activated it
4. user plugged the raw volume into a running VM
5. user unplugged the raw volume from the VM
6. vdsm failed to deactivate the raw volume LV since it is held open by the guest LV
~~~
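
For illustration, a hedged sketch of how step 6 could be detected on the host: the sixth character of the lvs lv_attr field is 'o' when the device is open, so listing LVs that are active and still open would flag exactly this case. Narrowing the list down to RHV storage-domain VGs is left out here and would be an additional step.

~~~
# Minimal sketch: list LVs that LVM still reports as active and open.
# In the lv_attr string, character 5 is 'a' (active) and character 6 is
# 'o' (device open). Filtering to RHV storage-domain VGs is omitted.
import subprocess

def open_lvs():
    out = subprocess.check_output(
        ["lvs", "--noheadings", "-o", "vg_name,lv_name,lv_attr"],
        universal_newlines=True)
    result = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        vg, lv, attr = fields
        if len(attr) >= 6 and attr[4] == "a" and attr[5] == "o":
            result.append((vg, lv))
    return result

if __name__ == "__main__":
    for vg, lv in open_lvs():
        print("still open: %s/%s" % (vg, lv))
~~~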



This RFE was opened to request that this condition, or similar ones, be reported to the engine and logged as an event, so it can be consumed by the UI and shown on Hosts and Storage Domains. This way we could warn the user that additional steps need to be performed on the hosts' LVM filters to avoid such issues in the future.

Comment 1 Nir Soffer 2017-11-02 21:17:53 UTC
The specific case in comment 0 is just one example. Once we improve the lvm
filter configuration, this issue will be gone, but the general issue of reporting
secondary failures (e.g. during flow cleanup, or during periodic checks) will
always be a problem.

I think this is best solved by a central logging system, collecting errors and
warnings from all hosts and providing statistics on most common errors.

Engine can integrate with such a system to show alerts about hosts or storage
domains that had such failures.

Comment 2 Yaniv Kaul 2017-11-22 14:12:50 UTC
(In reply to Nir Soffer from comment #1)
> The specific case in comment 0 is just one example. Once we improve the lvm
> filter configuration, this issue will be gone, but the general issue of
> reporting
> secondary failures (e.g. during flow cleanup, or during periodic checks) will
> always be a problem.
> 
> I think this is best solved by a central logging system, collecting errors
> and
> warnings from all hosts and providing statistics on most common errors.
> 
> Engine can integrated with such system to show alerts about hosts or storage
> domains that had such failures.

The above is being done in the common logging effort, already for 4.2.
We can have additional collection of logs (from journal, etc.) and set an alert for it.

Comment 3 Nir Soffer 2017-11-22 14:18:18 UTC
(In reply to Yaniv Kaul from comment #2)
> The above is being done in the common logging effort, already for 4.2.
> We can have additional collection of logs (from journal, etc.) and set an
> alert for it.

This is great - but we cannot depend on arbitrary text in a log file for reporting
events. If we treat this as an important event, it must be written in a machine-readable
format (e.g. JSON) or reported to a daemon collecting events on the host.
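
To illustrate the point, a minimal sketch of such machine-readable reporting (the file path, event names and fields are made up for the example; this is not an existing vdsm interface): each event is appended as one JSON object per line so a collector can consume it without scraping free-form log text.

~~~
# Minimal sketch: append structured events as one JSON object per line,
# so a host-level collector can pick them up without parsing vdsm.log.
# The path and field names are illustrative only.
import json
import time

EVENTS_FILE = "/run/vdsm/host-events.json"   # hypothetical location

def report_event(name, severity, **details):
    event = {
        "time": time.time(),
        "name": name,
        "severity": severity,
        "details": details,
    }
    with open(EVENTS_FILE, "a") as f:
        f.write(json.dumps(event) + "\n")

# Example: the failed deactivation from the description.
report_event("lv.deactivate.failed", "warning",
             vg="sd-uuid", lv="image-uuid",
             reason="device open, guest LV active on top")
~~~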

Comment 4 Javier Coscia 2018-02-09 20:21:28 UTC
(In reply to Yaniv Kaul from comment #2)
> (In reply to Nir Soffer from comment #1)
> > The specific case in comment 0 is just one example. Once we improve the lvm
> > filter configuration, this issue will be gone, but the general issue of
> > reporting
> > secondary failures (e.g. during flow cleanup, or during periodic checks) will
> > always be a problem.
> > 
> > I think this is best solved by a central logging system, collecting errors
> > and
> > warnings from all hosts and providing statistics on most common errors.
> > 
> > Engine can integrated with such system to show alerts about hosts or storage
> > domains that had such failures.
> 
> The above is being done in the common logging effort, already for 4.2.
> We can have additional collection of logs (from journal, etc.) and set an
> alert for it.

Hi Yaniv, is there a BZ for this effort you mentioned in 4.2? If so, shouldn't this BZ be blocked by it?

I will add more cases or scenarios we might need to include besides the one in the description as soon as I can come up with some.

Thanks!

Comment 5 Yaniv Kaul 2018-02-09 20:23:09 UTC
Not that I'm aware of. Please include specific items - as we use collectd for monitoring, maybe it's something we already monitor or maybe it's something we can easily add.
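
For what it's worth, if collectd is the vehicle, a hedged sketch of a collectd Python-plugin read callback that would expose the number of such conditions as a gauge (the plugin/type names are placeholders and count_open_lvs() stands in for a real check; nothing like this exists today):

~~~
# Hedged sketch of a collectd python-plugin read callback exposing the
# number of LVs still reported open as a gauge. Names are placeholders;
# count_open_lvs() stands in for a real host-side check.
import collectd

def count_open_lvs():
    return 0  # placeholder for a real check, e.g. parsing lvs output

def read_callback():
    values = collectd.Values(plugin="rhv_host", type="gauge",
                             type_instance="open_lvs")
    values.dispatch(values=[count_open_lvs()])

collectd.register_read(read_callback)
~~~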

Comment 6 Yaniv Kaul 2018-02-26 12:45:00 UTC
(In reply to Yaniv Kaul from comment #5)
> Not that I'm aware of. Please include specific items - as we use collectd
> for monitoring, maybe it's something we already monitor or maybe it's
> something we can easily add.

Forgot to set NEEDINFO on the reporter for the exact details of what's needed.

Comment 15 Sandro Bonazzola 2019-01-28 09:43:45 UTC
This bug has not been marked as a blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 18 Martin Perina 2019-03-25 14:29:35 UTC
Dominik, I went over all comments and unless I'm missing something this is about adding bad bond event into audit_log, so moving to network team

Comment 19 Dominik Holler 2019-03-25 17:23:42 UTC
> Dominik, I went over all comments and unless I'm missing something this is about adding bad bond event into audit_log, so moving to network team

Ack, we could check for bad bonds the same way the UI already does on every getCaps and add a rate-limited message to the audit log (see the sketch below).
I removed the target milestone to re-schedule this in the network team.
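
For reference, a minimal sketch of the rate limiting mentioned above, keyed per (host, bond) so a bad bond seen on every getCaps round does not flood the audit log (the class and the emit() call are illustrative, not existing engine code):

~~~
# Minimal sketch: rate-limit a repeated audit-log message per (host, bond)
# key. emit() is a placeholder for whatever writes the audit_log entry.
import time

class RateLimitedAuditLogger:
    def __init__(self, interval_seconds=3600):
        self.interval = interval_seconds
        self._last_sent = {}

    def maybe_log(self, host, bond, message, emit=print):
        key = (host, bond)
        now = time.monotonic()
        last = self._last_sent.get(key)
        if last is None or now - last >= self.interval:
            self._last_sent[key] = now
            emit("host %s: bond %s: %s" % (host, bond, message))

# Example: called from the periodic getCaps handling.
audit = RateLimitedAuditLogger(interval_seconds=3600)
audit.maybe_log("host1", "bond0", "bond has no active slaves")
~~~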

Comment 20 Michal Skrivanek 2020-03-18 15:47:12 UTC
This bug hasn't received any attention for a while and we didn't have the capacity to make any progress. If you deeply care about it or want to work on it, please assign/target it accordingly.

Comment 21 Michal Skrivanek 2020-03-18 15:51:57 UTC
This bug hasn't received any attention for a while and we didn't have the capacity to make any progress. If you deeply care about it or want to work on it, please assign/target it accordingly.