| Summary: | RFE: vdsm should monitor host health | ||
|---|---|---|---|
| Product: | [Retired] oVirt | Reporter: | Dan Yasny <dyasny> |
| Component: | vdsm | Assignee: | Dan Kenigsberg <danken> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | unspecified | CC: | abaron, acathrow, bazulay, iheim, mburns, ykaul |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | infra | ||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-03-12 15:55:54 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Note that Vdsm does not report whether it is Up or not - it reports its capabilities, and RHEV-M decides if it is worthy of its cluster.

When the second HDD is offline, what's missing from the OS perspective? Which mountpoint is missing? How can vdsm know that a hard disk exists but is offline?

(In reply to comment #2)
> Note that Vdsm does not report whether it is Up or not - it reports its capabilities, and RHEV-M decides if it is worthy of its cluster.
>
> When the second HDD is offline, what's missing from the OS perspective? Which mountpoint is missing? How can vdsm know that a hard disk exists but is offline?

Haven't looked too deeply into this, but /data was definitely gone. Depending on the partitioning layout and the number of local drives, this can be any of the mounts. I think we need some kind of general health check that would confirm we are able to run in a cluster, and getVdsCaps is missing some params there.

In this particular case, the host had to use local storage, but failed to create a local SD because /data wasn't readable (without the Data LV, it's part of the r/o root).

The test for this is easy enough - run mount -a and check $?.

if only /data wasn't available, then it's only interesting when the user wants to create local storage. This does not and should not render the host non-operational, as the node can be used in a cluster just fine.

When you try to create the storage domain, what error do you get?

(In reply to comment #4)
> if only /data wasn't available, then it's only interesting when the user wants to create local storage. This does not and should not render the host non-operational, as the node can be used in a cluster just fine.
>
> When you try to create the storage domain, what error do you get?

/data is r/o, vdsm fails on that.

But you're missing the point here: if you want to actually check every resource for cluster compatibility, then yes, some mountpoints are more relevant to specific use cases than others, but /data is also the place for ISO uploading and for host upgrades, unless I'm mistaken. Are those also not critical?

(In reply to comment #5)
> (In reply to comment #4)
> > if only /data wasn't available, then it's only interesting when the user wants to create local storage. This does not and should not render the host non-operational, as the node can be used in a cluster just fine.
> >
> > When you try to create the storage domain, what error do you get?
>
> /data is r/o, vdsm fails on that.
>
> But you're missing the point here: if you want to actually check every resource for cluster compatibility, then yes, some mountpoints are more relevant to specific use cases than others, but /data is also the place for ISO uploading and for host upgrades, unless I'm mistaken. Are those also not critical?

The ISO uploader has nothing to do with vdsm; it's a hack which pushes things into the ISO domain using ssh. vdsm doesn't care about that.
Host upgrade on RHEV-H means that if we want to upgrade the host it will fail; it doesn't mean that the host cannot currently function as a hypervisor (it can), so no, it is not critical.

Note that this is not to say that this shouldn't be checked, but it is definitely no reason to move a host to non-operational. This is at most a warning.

Now that I'm clear on the problem and its implications, I'm changing this to be an RFE and moving upstream, as this will not make it into 3.1.

TODO:
1. Add /data to monitored paths in getVdsStats
2. Make sure that monitoring includes 'liveness' and not only free space
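For illustration, a minimal sketch of the kind of per-mountpoint liveness probe that TODO item 2 above describes - something beyond a free-space counter that would also catch a read-only /data. The helper name and the probe-file approach are assumptions, not existing vdsm code:

```python
# Hypothetical liveness probe for a monitored mountpoint; not actual vdsm code.
import errno
import os
import tempfile


def check_mount_liveness(path):
    """Return (is_mounted, is_writable) for a mountpoint such as /data."""
    if not os.path.ismount(path):
        return False, False
    try:
        # Creating and removing a small probe file catches read-only
        # remounts, which a free-space check alone would miss.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".liveness-"):
            pass
        return True, True
    except OSError as exc:
        if exc.errno in (errno.EROFS, errno.EACCES, errno.EIO):
            return True, False
        raise


if __name__ == "__main__":
    print(check_mount_liveness("/data"))
```

Reporting both flags per monitored path would give the engine enough data to distinguish "mounted but read-only" from "missing", in line with the discussion above.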
(In reply to comment #6)
> The ISO uploader has nothing to do with vdsm; it's a hack which pushes things into the ISO domain using ssh. vdsm doesn't care about that.
> Host upgrade on RHEV-H means that if we want to upgrade the host it will fail; it doesn't mean that the host cannot currently function as a hypervisor (it can), so no, it is not critical.
>
> Note that this is not to say that this shouldn't be checked, but it is definitely no reason to move a host to non-operational. This is at most a warning.

First of all, yes, the host is in trouble and we should report that, instead of waiting for something to fail.
Second, this time it was /data that failed to mount; in another config, it might be another mountpoint. We need to assess which mountpoints are critical and start monitoring them, besides reporting issues with the less critical ones.

> Now that I'm clear on the problem and its implications, I'm changing this to be an RFE and moving upstream, as this will not make it into 3.1.
>
> TODO:
> 1. Add /data to monitored paths in getVdsStats
> 2. Make sure that monitoring includes 'liveness' and not only free space

I'm not sure this is the right approach here, at least not until we're done assessing the above.

(In reply to comment #7)
> (In reply to comment #6)
> > The ISO uploader has nothing to do with vdsm; it's a hack which pushes things into the ISO domain using ssh. vdsm doesn't care about that.
> > Host upgrade on RHEV-H means that if we want to upgrade the host it will fail; it doesn't mean that the host cannot currently function as a hypervisor (it can), so no, it is not critical.
> >
> > Note that this is not to say that this shouldn't be checked, but it is definitely no reason to move a host to non-operational. This is at most a warning.
>
> First of all, yes, the host is in trouble and we should report that, instead of waiting for something to fail.
> Second, this time it was /data that failed to mount; in another config, it might be another mountpoint. We need to assess which mountpoints are critical and start monitoring them, besides reporting issues with the less critical ones.
>
> > Now that I'm clear on the problem and its implications, I'm changing this to be an RFE and moving upstream, as this will not make it into 3.1.
> >
> > TODO:
> > 1. Add /data to monitored paths in getVdsStats
> > 2. Make sure that monitoring includes 'liveness' and not only free space
>
> I'm not sure this is the right approach here, at least not until we're done assessing the above.

Assessing whether a mount point is critical is the engine's job when it decides whether to put a host as non-operational or not. It is vdsm's job to give it all the data it needs to make a smart decision. There are cases where it is clear cut, but those are few and not policy driven. The same host, when attached to 2 different clusters, could be fine in one but non-operational in the other.

I'll give another example to make this clear:
A DC which connects to LUNs over FCP - Host A, which has no FC HBAs, would be non-operational in this DC.
Another DC which connects to NFS - the same host A could work just fine here.

The same goes for /data.

This is exactly why I am saying vdsmd should __report__ additional host metrics. The assessment we need to do is on what exactly needs to be added to reporting; once that additional data is reported, the engine can start looking at it.

So start adding here everything you think should be monitored (in general though, we should just have a set of counters the host can monitor and have the engine define which counters it wants to collect).
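A rough sketch of that counter idea - a host-side registry where the engine picks which counters to collect. The register_counter/collect helpers are hypothetical names, not an existing vdsm or engine API:

```python
# Hypothetical host-side counter registry; register_counter/collect are
# illustrative names, not an existing vdsm or engine API.
import os

_COUNTERS = {}


def register_counter(name, func):
    """Register a callable that returns the current value of one host counter."""
    _COUNTERS[name] = func


def collect(requested):
    """Collect only the counters the engine asked for, ignoring unknown names."""
    return {name: _COUNTERS[name]() for name in requested if name in _COUNTERS}


# The host registers whatever it is able to monitor...
register_counter("data_mounted", lambda: os.path.ismount("/data"))
register_counter(
    "root_free_bytes",
    lambda: os.statvfs("/").f_bavail * os.statvfs("/").f_frsize,
)

if __name__ == "__main__":
    # ...and the engine decides which counters it wants to look at.
    print(collect(["data_mounted", "root_free_bytes"]))
```

The point of the split is exactly the policy question above: the host only reports, and the engine decides per cluster which counters make a host non-operational.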
Closing old bugs. If this issue is still relevant/important in the current version, please re-open the bug.
Description of problem:
When some of the system LVs of a RHEV-H host are missing, the host cannot operate properly, but vdsm still reports it as "up".

Version-Release number of selected component (if applicable):
Version: 4.9
Release: 110.el6

How reproducible:
Always

Steps to Reproduce:
1. Install RHEV-H on a host with 2 hard drives
2. Take the second HDD offline
3. The host boots and comes up OK, but some of the LVs that touched the second PV will be offline (usually it's Data, and possibly others, depending on disk size and LV sizes)

Actual results:
The host boots OK, fails to mount everything in fstab, but moves on and shows as "Up" in RHEV-M.

Expected results:
The host should come up as non-operational with a relevant message ("Missing local volumes: X, Y, Z").

Additional info:
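A minimal sketch of how the expected "Missing local volumes: X, Y, Z" message could be derived, by comparing the mountpoints listed in /etc/fstab with what is actually mounted; the helper names are illustrative and not taken from vdsm:

```python
# Hypothetical check for the scenario above: list fstab mountpoints that are
# not actually mounted, to back a "Missing local volumes: ..." report.


def list_mountpoints(path):
    """Return the set of mountpoints named in an fstab/mtab-style file."""
    points = set()
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and not fields[0].startswith("#"):
                # Skip swap entries, which have no real mountpoint.
                if fields[1] not in ("none", "swap"):
                    points.add(fields[1])
    return points


def missing_local_volumes():
    """Mountpoints that /etc/fstab expects but /proc/mounts does not show."""
    return sorted(list_mountpoints("/etc/fstab") - list_mountpoints("/proc/mounts"))


if __name__ == "__main__":
    missing = missing_local_volumes()
    if missing:
        print("Missing local volumes: " + ", ".join(missing))
```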