Description of problem:
Troubleshooting host issues through the vdsm.log file is difficult. Easily parsing usable information for quick diagnostics of key data is not possible today. RHEV hypervisors need a simple, dedicated local log file that contains a set of self-diagnostic information about the network status, to assist in easily identifying common infrastructure issues.

Proposed solution:
On each host, the following network data should be logged at a regular, user-definable interval which defaults to 60 seconds:
- Physical NIC and bond link status (up/down)
- ICMP pings to the host's default gateways. Each interval should run a user-definable number of ICMP requests; the default should be 10 requests.

Additional info:
The new dedicated log file should be:
- In a simple format that is easily consumable by end users, support staff, and engineers. The format should make it simple to use CLI parsing tools such as grep or awk to find and list key information.
- Stored on each hypervisor host
- Included in SOS Report bundles
- Available for view/display within the RHEV-Manager UI when the host is online
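To make the "easily parsed with grep or awk" requirement concrete, here is a minimal sketch. The log layout, field names, and sample file path below are purely illustrative assumptions, not part of the RFE:

```shell
# Hypothetical connectivity-log sample; the layout and field names are
# illustrative only -- the RFE asks for a format that is this easy to parse.
cat > /tmp/connectivity.sample <<'EOF'
2014-06-01 12:00:00 eth0 link:up speed:1000 duplex:full
2014-06-01 12:01:00 eth0 link:down speed:0 duplex:unknown
2014-06-01 12:02:00 eth1 link:up speed:1000 duplex:full
EOF

# List every link-down event with its timestamp and interface.
awk '/link:down/ {print $1, $2, $3}' /tmp/connectivity.sample
# -> 2014-06-01 12:01:00 eth0
```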
- Why is that a big deal to have these stats in their own log? Wouldn't it be just as good to have a clear identifier for these stats and use `grep`? I find it easier to debug with one sequential log.
- How would you like to display it on Engine? Currently, Vdsm already reports link state; layer 3 ping statistics can be added and made available in dwh. Or are you interested in an upload of raw logs, in a mechanism parallel to logcollector?
- The rate limit on the number of pings is not clear. One ICMP ping per gateway per 15 seconds (which is Engine's polling rate) is frequent enough and adds a negligible load.
The view/display of logs should be handled using the current "Log viewer" plugin implemented in 3.4 by Keith Robertson's team (part of the support plug-in for RHEV).
A different log file will also mean additional log rotation and log-collector handling. Personally I'm for having this in the vdsm log itself.
(In reply to Barak from comment #3)
> Personally I'm for having this in the vdsm log itself.

So was I, but having spoken with Scott, I understand that the whole essence of this RFE is to produce a succinct per-host connectivity.log that would be easily visible via the Log Viewer ( https://access.redhat.com/site/articles/425603 ).
Sorry for the late response.

(In reply to Dan Kenigsberg from comment #1)
> - Why is that a big deal to have these stats in their own log? Wouldn't it
> be just as good to have a clear identifier for these stats and use `grep`?
> I find it easier to debug with one sequential log.

The idea is to create a simple, separate per-host log with this key data to ease monitoring and troubleshooting. This is an explicit request by product management.

> - How would you like to display it on Engine? Currently, Vdsm already
> reports link state; layer 3 ping statistics can be added, and be made
> available in dwh. Or are you interested in an upload of raw logs, in a
> mechanism parallel to logcollector?

As stated by Tomas in comment #2, we can use the "log viewer" plugin for that.

> - The rate limit on the number of pings is not clear. One ICMP ping per
> gateway per 15 seconds (which is Engine's polling rate) is frequent enough
> and adds a negligible load.

One ICMP ping is not enough, especially for cases where there are indirect network failures like link flapping or device reboots, hence the suggested timers of 10 ICMP requests each 60 seconds. Anyway, these values should be configurable.
(In reply to Nir Yechiel from comment #5)
> Sorry for the late response.

It's understandable.

> One ICMP ping is not enough, especially for cases where there are indirect
> network failures like link flapping or device reboots, hence the suggested
> timers of 10 ICMP requests each 60 seconds. Anyway, these values should be
> configurable.

Could you elaborate on your suggestion? Are you suggesting a ping per gateway per 6 seconds?
(In reply to Dan Kenigsberg from comment #6)
> (In reply to Nir Yechiel from comment #5)
> > Sorry for the late response.
>
> It's understandable
>
> > One ICMP ping is not enough, especially for cases where there are indirect
> > network failures like link flapping or device reboots, hence the suggested
> > timers of 10 ICMP requests each 60 seconds. Anyway, these values should be
> > configurable.
>
> Could you elaborate on your suggestion? Are you suggesting a ping per
> gateway per 6 seconds?

With the default suggested values there should be 10 continuous pings per gateway each 60 seconds.
Nir, I fail to understand what you mean by "continuous pings". Are you referring to an infinitely-running /usr/bin/ping ? If so, why do you need 10 of those? Are you referring to 10 parallel ICMP ECHO requests? What makes them "continuous"? Why parallel? I'd appreciate an elaboration, with motivation included.
(In reply to Dan Kenigsberg from comment #8)
> Nir, I fail to understand what you mean by "continuous pings". Are you
> referring to an infinitely-running /usr/bin/ping ? If so, why do you need 10
> of those? Are you referring to 10 parallel ICMP ECHO requests? What makes
> them "continuous"? Why parallel? I'd appreciate an elaboration, with
> motivation included.

10 requests, each with a one-second interval:

ping -c 10 -i 1 x.x.x.x

This should run every 60 seconds by default.

While we are tracking the local interfaces on the host, the only way for us to detect any kind of indirect network failure is using these pings. With the suggested timers, we should be able to detect repeating issues like port flapping fairly quickly.
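As an illustration of how each probe run could be condensed into a single log line, this sketch parses the standard iputils ping summary line. The summary text is hard-coded so the example is self-contained, and the gateway address is a placeholder:

```shell
# Summary line as printed by iputils ping after `ping -c 10 -i 1 <gw>`;
# hard-coded here so the sketch needs no network access.
summary='10 packets transmitted, 9 received, 10% packet loss, time 9012ms'

# Extract the packet-loss percentage for a compact one-line log entry.
loss=$(echo "$summary" | sed -n 's/.* \([0-9.]*%\) packet loss.*/\1/p')
echo "gateway=192.0.2.1 loss=$loss"   # -> gateway=192.0.2.1 loss=10%
```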
I'm a little bit afraid of the traffic this would cause on a network with a single gateway and 100 hypervisors connected. With the maximal allowable ICMP size being 64 kB (that is a theoretical limit; usually the size is much less), that would make 64 kB * 10 * 100 = ~63 MB of data processed every 60 seconds (and that's only the requests).

That seems like quite a lot of traffic for the network, as well as quite a lot of processing power needed for logging, etc.
(In reply to Tomas Dosek from comment #10)
> I'm a little bit afraid of the traffic this would cause on a network with a
> single gateway and 100 hypervisors connected. With the maximal allowable
> ICMP size being 64 kB (that is a theoretical limit; usually the size is much
> less), that would make 64 kB * 10 * 100 = ~63 MB of data processed every 60
> seconds (and that's only the requests).
>
> That seems like quite a lot of traffic for the network, as well as quite a
> lot of processing power needed for logging, etc.

Alternatives are to make the timers less aggressive or to force a smaller data size on each request. The default payload is 56 bytes, which becomes 64 bytes once the 8-byte ICMP header is added, but we really just need keep-alive probes here, and even a very small payload should work. Thoughts?
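For reference, a back-of-the-envelope sketch of the reduced load. The 16-byte payload is an arbitrary assumption for illustration; ping's -s flag (iputils) sets the ICMP payload size:

```shell
# A minimal keep-alive probe could shrink the payload with ping's -s flag
# (iputils; payload bytes, default 56), e.g.:
#   ping -c 10 -i 1 -s 16 <gateway>
# Rough request-only traffic per 60s interval for 100 hosts
# (payload + 8-byte ICMP header + 20-byte IPv4 header, no L2 overhead):
per_packet=$((16 + 8 + 20))        # 44 bytes per request
total=$((per_packet * 10 * 100))   # 10 requests x 100 hosts
echo "${total} bytes per 60s interval"   # -> 44000 bytes per 60s interval
```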
(In reply to Tomas Dosek from comment #2)
> The view/display of logs should be handled using current "Log viewer" plugin
> implemented in 3.4 by Keith Robertson's team (part of the support plug-in
> for RHEV).

The Log viewer plugin on RHEV-M is fine, but if there's a communication issue it might not be viable.
Also - a separate log is advisable to ease diagnostics. The VDSM log is unreadable by mere mortals; we need a separate log.
(In reply to Andrew Cathrow from comment #12)
> Log viewer plugin on RHEV-M is fine but if there's a communication issue it
> might not be viable.

Andy, as far as I understand, there is no alternative to keeping a host-local connectivity log. Reading it would be an issue anyway, regardless of where it is viewed.

We can report the additionally-requested stats (duplex, gateway pings) on getVdsStats and collect them in Engine (until connectivity is lost). But this changes the Vdsm/Engine API and Engine reports and cannot realistically happen in 3.4.z.
> > Log viewer plugin on RHEV-M is fine but if there's a communication issue it
> > might not be viable.
>
> Andy, as far as I understand, there is no alternative to keeping a
> host-local connectivity log. Reading it would be an issue anyway, regardless
> of where it is viewed.

At least if it needs to be viewed "offline" or through a serial console, etc., it will be a simple log, manageable by humans as a first step in diagnostics. There's nothing stopping us from viewing the log through "legacy" methods as well.

> We can report the additionally-requested stats (duplex, gateway pings) on
> getVdsStats and collect them in Engine (until connectivity is lost). But
> this changes the Vdsm/Engine API and Engine reports and cannot realistically
> happen in 3.4.z.

Why does this require anything to be collected in engine? We're looking for the most basic of information that can be easily captured from the command line. Nir mentioned the ping command in Comment 9. For link status, speed, and duplex, "ethtool {interface}" on the local host provides the requested info.
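A sketch of pulling just those three fields out of ethtool-style output. The sample text is hard-coded (formatted like typical `ethtool eth0` output) so the example is self-contained; on a real host you would pipe the actual command output in instead:

```shell
# Sample formatted like typical `ethtool eth0` output; hard-coded so the
# sketch runs without the tool or a real NIC.
cat > /tmp/ethtool.sample <<'EOF'
Settings for eth0:
        Speed: 1000Mb/s
        Duplex: Full
        Link detected: yes
EOF

# Condense link status, speed, and duplex into a single log-friendly line.
awk -F': ' '/Speed|Duplex|Link detected/ {
    gsub(/^[ \t]+/, "", $1)          # strip leading indentation
    printf "%s=%s ", $1, $2
}' /tmp/ethtool.sample
echo
```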
(In reply to Tomas Dosek from comment #10)
> I'm a little bit afraid of the traffic this would cause on a network with a
> single gateway and 100 hypervisors connected. With the maximal allowable
> ICMP size being 64 kB (that is a theoretical limit; usually the size is much
> less), that would make 64 kB * 10 * 100 = ~63 MB of data processed every 60
> seconds (and that's only the requests).
>
> That seems like quite a lot of traffic for the network, as well as quite a
> lot of processing power needed for logging, etc.

Even with the overhead of IP and Ethernet encapsulation, a default-size ICMP echo request (56 bytes of payload, 64 bytes with the ICMP header) is roughly 120 bytes on the wire. 120 bytes x 10 x 100 = ~117 KB. Since we're running the command, we have full control over the specific parameters. This is not excessive, IMO.
How do you envision this log being used?
It's a first step for basic host network and storage diagnostics that, unlike vdsm.log, doesn't require a deep understanding of RHEV to interpret. It helps easily determine the basic "first X steps" of troubleshooting.
You still haven't explained how you intend this to be used.
(In reply to Barak from comment #19)
> You still haven't explained how you intend this to be used.

It's a log file. It can be used through the GSS support plugin, or, if the engine is down, it can be accessed directly from the host like any other log file on the system.
Hi Dan,

What is the behavior of this log? If I ping (ping -c 10 -i 1 x.x.x.x) a wrong default gateway, or change the default gateway and then ping, should I see an error in this connectivity log? Should I see something in this log when I ping the default gateway? Right now I don't see anything related to pinging the default gateway; I see only the state of the NICs and their speed in this log. I would like to get more info about this connectivity log and its behavior.

Kind regards,
Michael
I am sorry that it was not clear: I did NOT implement any pinging of the gateway, since what we ACTUALLY need is to report Engine/Vdsm connectivity. This is achieved by logging when Engine disappears and when it reappears again.
Hi Dan,

I'm running /etc/init.d/ovirt-engine stop, and there is no report about losing connectivity (Engine/Vdsm) in connectivity.log; "recent_client:False/True" is the only thing that appears in the log. I guess that this is not the behavior we should expect.

Thank you,
Michael
Michael, once you stop Engine, connectivity.log should have recent_client:False within 20 seconds. Is that the case? "recent_client:False" means that Vdsm has not seen its client (=engine) recently.
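For the record, spotting those transitions from the CLI could look like the following; the sample file path and line layout are illustrative assumptions only:

```shell
# Hypothetical excerpt of the host's connectivity log; path and layout
# are assumptions for illustration.
cat > /tmp/connectivity.log.sample <<'EOF'
2014-08-04 10:00:00 recent_client:True
2014-08-04 10:05:00 recent_client:False
2014-08-04 10:07:00 recent_client:True
EOF

# When did vdsm last notice its client (the engine) go away?
grep 'recent_client:False' /tmp/connectivity.log.sample | tail -n 1
# -> 2014-08-04 10:05:00 recent_client:False
```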
OK, so this is the report the connectivity log displays when Engine/Vdsm lose connectivity? Or should the log display more than "recent_client:False"?
Verified on:
- oVirt Engine Version: 3.5.0-0.0.master.20140804172041.git23b558e.el6

- When stopping the engine service, connectivity.log reports "recent_client:False" after 10-20 sec, meaning the Engine (client) disconnected from the host (Vdsm).
- When starting the engine service, connectivity.log reports "recent_client:True" after 10-20 sec, meaning the Engine (client) connected to the host (Vdsm).
- connectivity.log also reports changes in interface operational state, speed, and duplex.
Actually, it needs another round. Please verify that with the added patch, a fresh installation of vdsm starts up and reports all interfaces properly on `vdsClient -s 0 getVdsStats`.
Verified on 3.5.0-0.10.master.el6ev and vdsm-4.16.3-2.el6.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html