Description of problem:
Troubleshooting host issues through the vdsm.log file is difficult. Easily parsing usable information for quick diagnostics of key data is not possible today. RHEV hypervisors need a simple, dedicated local log file that contains a set of self-diagnostic information about the network status, to assist in easily identifying common infrastructure issues.

Proposed solution:
On each host, the following network data should be logged at a regular, user-definable interval which defaults to 60 seconds:
- Physical NIC and bond link status (up/down)
- ICMP pings to the host's default gateways. Each interval should run a user-definable number of ICMP requests; the default should be 10 requests.

Additional info:
The new dedicated log file should be:
- In a simple format that is easily consumable by end users, support staff, and engineers. The format should make it simple to use CLI parsing tools such as grep or awk to find and list key information.
- Stored on each hypervisor host
- Included in SOS Report bundles
- Available for view/display within the RHEV-Manager UI when the host is online
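To make the "easily parsed with grep or awk" requirement concrete, here is a minimal sketch. The log layout, field names, and sample file path below are purely illustrative assumptions, not part of the RFE:

```shell
# Hypothetical connectivity-log sample; the layout and field names are
# illustrative only -- the RFE asks for a format that is this easy to parse.
cat > /tmp/connectivity.sample <<'EOF'
2014-06-01 12:00:00 eth0 link:up speed:1000 duplex:full
2014-06-01 12:01:00 eth0 link:down speed:0 duplex:unknown
2014-06-01 12:02:00 eth1 link:up speed:1000 duplex:full
EOF

# List every link-down event with its timestamp and interface.
awk '/link:down/ {print $1, $2, $3}' /tmp/connectivity.sample
# -> 2014-06-01 12:01:00 eth0
```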
- Why is that a big deal to have these stats in their own log? Wouldn't it be just as good to have a clear identifier for these stats and use `grep`? I find it easier to debug with one sequential log.
- How would you like to display it on Engine? Currently, Vdsm already reports link state; layer 3 ping statistics can be added and made available in dwh. Or are you interested in an upload of raw logs, in a mechanism parallel to logcollector?
- The rate limit on the number of pings is not clear. One ICMP ping per gateway per 15 seconds (which is Engine's polling rate) is frequent enough and adds a negligible load.
The view/display of logs should be handled using the current "Log viewer" plugin implemented in 3.4 by Keith Robertson's team (part of the support plug-in for RHEV).
A different log file will also mean additional log rotation and log-collector handling. Personally I'm for having this in the vdsm log itself.
(In reply to Barak from comment #3)
> Personally I'm for having this in the vdsm log itself.

So was I, but having spoken with Scott, I understand that the whole essence of this RFE is to produce a succinct per-host connectivity.log that would be easily visible via the Log Viewer ( https://access.redhat.com/site/articles/425603 ).
Sorry for the late response.

(In reply to Dan Kenigsberg from comment #1)
> - Why is that a big deal to have these stats in their own log? Wouldn't it
> be just as good to have a clear identifier for these stats and use `grep`?
> I find it easier to debug with one sequential log.

The idea is to create a simple, separate per-host log with this key data to ease monitoring and troubleshooting. This is an explicit request by product management.

> - How would you like to display it on Engine? Currently, Vdsm already
> reports link state; layer 3 ping statistics can be added, and be made
> available in dwh. Or are you interested in an upload of raw logs, in a
> mechanism parallel to logcollector?

As stated by Tomas in comment #2, we can use the "log viewer" plugin for that.

> - The rate limit on the number of pings is not clear. One ICMP ping per
> gateway per 15 seconds (which is Engine's polling rate) is frequent enough
> and adds a negligible load.

One ICMP ping is not enough, especially for cases where there are indirect network failures like link flapping or device reboots, hence the suggested timers of 10 ICMP requests each 60 seconds. Anyway, these values should be configurable.
(In reply to Nir Yechiel from comment #5)
> Sorry for the late response.

It's understandable.

> One ICMP ping is not enough, especially for cases where there are indirect
> network failures like link flapping or device reboots, hence the suggested
> timers of 10 ICMP requests each 60 seconds. Anyway, these values should be
> configurable.

Could you elaborate on your suggestion? Are you suggesting a ping per gateway per 6 seconds?
(In reply to Dan Kenigsberg from comment #6)
> (In reply to Nir Yechiel from comment #5)
> > Sorry for the late response.
>
> It's understandable
>
> > One ICMP ping is not enough, especially for cases where there are indirect
> > network failures like link flapping or device reboots, hence the suggested
> > timers of 10 ICMP requests each 60 seconds. Anyway, these values should be
> > configurable.
>
> Could you elaborate on your suggestion? Are you suggesting a ping per
> gateway per 6 seconds?

With the default suggested values there should be 10 continuous pings per gateway each 60 seconds.
Nir, I fail to understand what you mean by "continuous pings". Are you referring to an infinitely-running /usr/bin/ping ? If so, why do you need 10 of those? Are you referring to 10 parallel ICMP ECHO requests? What makes them "continuous"? Why parallel? I'd appreciate an elaboration, with motivation included.
(In reply to Dan Kenigsberg from comment #8)
> Nir, I fail to understand what you mean by "continuous pings". Are you
> referring to an infinitely-running /usr/bin/ping ? If so, why do you need 10
> of those? Are you referring to 10 parallel ICMP ECHO requests? What makes
> them "continuous"? Why parallel? I'd appreciate an elaboration, with
> motivation included.

10 requests, each with a one-second interval:

ping -c 10 -i 1 x.x.x.x

This should run every 60 seconds by default.

While we are tracking the local interfaces on the host, the only way for us to detect any kind of indirect network failure is using these pings. With the suggested timers, we should be able to detect repeating issues like port flapping fairly quickly.
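As an illustration of how each probe run could be condensed into a single log line, this sketch parses the standard iputils ping summary line. The summary text is hard-coded so the example is self-contained, and the gateway address is a placeholder:

```shell
# Summary line as printed by iputils ping after `ping -c 10 -i 1 <gw>`;
# hard-coded here so the sketch needs no network access.
summary='10 packets transmitted, 9 received, 10% packet loss, time 9012ms'

# Extract the packet-loss percentage for a compact one-line log entry.
loss=$(echo "$summary" | sed -n 's/.* \([0-9.]*%\) packet loss.*/\1/p')
echo "gateway=192.0.2.1 loss=$loss"   # -> gateway=192.0.2.1 loss=10%
```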
I'm a little bit afraid of the traffic this would cause on a network with a single gateway and 100 hypervisors connected. With the maximal allowable ICMP size being 64 kB (that is a theoretical limit; usually the size is much less), that would make 64 kB * 10 * 100 = ~63 MB of data processed every 60 seconds (and that's only the requests).

That seems like quite a lot of traffic for the network, as well as quite a lot of processing power needed for logging, etc.
(In reply to Tomas Dosek from comment #10)
> I'm a little bit afraid of the traffic this would cause on a network with a
> single gateway and 100 hypervisors connected. With the maximal allowable
> ICMP size being 64 kB (that is a theoretical limit; usually the size is much
> less), that would make 64 kB * 10 * 100 = ~63 MB of data processed every 60
> seconds (and that's only the requests).
>
> That seems like quite a lot of traffic for the network, as well as quite a
> lot of processing power needed for logging, etc.

Alternatives are to make the timers less aggressive or to force a smaller data size on each request. The default payload is 56 bytes, which becomes 64 bytes once the 8-byte ICMP header is added, but we really just need keep-alive probes here, and even a very small payload should work. Thoughts?
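For reference, a back-of-the-envelope sketch of the reduced load. The 16-byte payload is an arbitrary assumption for illustration; ping's -s flag (iputils) sets the ICMP payload size:

```shell
# A minimal keep-alive probe could shrink the payload with ping's -s flag
# (iputils; payload bytes, default 56), e.g.:
#   ping -c 10 -i 1 -s 16 <gateway>
# Rough request-only traffic per 60s interval for 100 hosts
# (payload + 8-byte ICMP header + 20-byte IPv4 header, no L2 overhead):
per_packet=$((16 + 8 + 20))        # 44 bytes per request
total=$((per_packet * 10 * 100))   # 10 requests x 100 hosts
echo "${total} bytes per 60s interval"   # -> 44000 bytes per 60s interval
```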
(In reply to Tomas Dosek from comment #2)
> The view/display of logs should be handled using current "Log viewer" plugin
> implemented in 3.4 by Keith Robertson's team (part of the support plug-in
> for RHEV).

The Log viewer plugin on RHEV-M is fine, but if there's a communication issue it might not be viable.
Also - a separate log is advisable to ease diagnostics. The VDSM log is unreadable by mere mortals; we need a separate log.
(In reply to Andrew Cathrow from comment #12)
> Log viewer plugin on RHEV-M is fine but if there's a communication issue it
> might not be viable.

Andy, as far as I understand, there is no alternative to keeping a host-local connectivity log. Reading it would be an issue anyway, regardless of where it is viewed.

We can report the additionally-requested stats (duplex, gateway pings) on getVdsStats and collect them in Engine (until connectivity is lost). But this changes the Vdsm/Engine API and Engine reports and cannot realistically happen in 3.4.z.
> > Log viewer plugin on RHEV-M is fine but if there's a communication issue it
> > might not be viable.
>
> Andy, as far as I understand, there is no alternative to keeping a
> host-local connectivity log. Reading it would be an issue anyway, regardless
> of where it is viewed.

At least if it needs to be viewed "offline" or through a serial console, etc., it will be a simple log, manageable by humans as a first step in diagnostics. There's nothing stopping us from viewing the log through "legacy" methods as well.

> We can report the additionally-requested stats (duplex, gateway pings) on
> getVdsStats and collect them in Engine (until connectivity is lost). But
> this changes the Vdsm/Engine API and Engine reports and cannot realistically
> happen in 3.4.z.

Why does this require anything to be collected in engine? We're looking for the most basic of information that can be easily captured from the command line. Nir mentioned the ping command in Comment 9. For link status, speed, and duplex, "ethtool {interface}" on the local host provides the requested info.
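A sketch of pulling just those three fields out of ethtool-style output. The sample text is hard-coded (formatted like typical `ethtool eth0` output) so the example is self-contained; on a real host you would pipe the actual command output in instead:

```shell
# Sample formatted like typical `ethtool eth0` output; hard-coded so the
# sketch runs without the tool or a real NIC.
cat > /tmp/ethtool.sample <<'EOF'
Settings for eth0:
        Speed: 1000Mb/s
        Duplex: Full
        Link detected: yes
EOF

# Condense link status, speed, and duplex into a single log-friendly line.
awk -F': ' '/Speed|Duplex|Link detected/ {
    gsub(/^[ \t]+/, "", $1)          # strip leading indentation
    printf "%s=%s ", $1, $2
}' /tmp/ethtool.sample
echo
```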
(In reply to Tomas Dosek from comment #10)
> I'm a little bit afraid of the traffic this would cause on a network with a
> single gateway and 100 hypervisors connected. With the maximal allowable
> ICMP size being 64 kB (that is a theoretical limit; usually the size is much
> less), that would make 64 kB * 10 * 100 = ~63 MB of data processed every 60
> seconds (and that's only the requests).
>
> That seems like quite a lot of traffic for the network, as well as quite a
> lot of processing power needed for logging, etc.

Even with the overhead of IP and Ethernet encapsulation, a default-size ICMP echo request (56 bytes of payload, 64 bytes with the ICMP header) is roughly 120 bytes on the wire. 120 bytes x 10 x 100 = ~117 KB. Since we're running the command, we have full control over the specific parameters. This is not excessive, IMO.
How do you envision this log being used?
It's a first step for basic host network and storage diagnostics that, unlike vdsm.log, doesn't require a deep understanding of RHEV to interpret. It helps easily determine the basic "first X steps" of troubleshooting.
You still haven't explained how you intend this to be used.
(In reply to Barak from comment #19)
> You still haven't explained how you intend this to be used.

It's a log file. It can be used through the GSS support plugin, or, if the engine is down, it can be accessed directly from the host like any other log file on the system.
Hi Dan,

What is the behavior of this log? If I ping (ping -c 10 -i 1 x.x.x.x) a wrong default gateway, or change the default gateway and then ping, should I see an error in this connectivity log? Should I see something in this log when I ping the default gateway? Right now I don't see anything related to pinging the default gateway; I see only the state of the NICs and their speed in this log. I would like to get more info about this connectivity log and its behavior.

Kind regards,
Michael
I am sorry that it was not clear: I did NOT implement any pinging of the gateway, since what we ACTUALLY need is to report Engine/Vdsm connectivity. This is achieved by logging when Engine disappears and when it reappears again.
Hi Dan,

I'm running /etc/init.d/ovirt-engine stop, and there is no report about losing connectivity (Engine/Vdsm) in connectivity.log; "recent_client:False/True" is the only thing that appears in the log. I guess that this is not the behavior we should expect.

Thank you,
Michael
Michael, once you stop Engine, connectivity.log should have recent_client:False within 20 seconds. Is that the case? "recent_client:False" means that Vdsm has not seen its client (=engine) recently.
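For the record, spotting those transitions from the CLI could look like the following; the sample file path and line layout are illustrative assumptions only:

```shell
# Hypothetical excerpt of the host's connectivity log; path and layout
# are assumptions for illustration.
cat > /tmp/connectivity.log.sample <<'EOF'
2014-08-04 10:00:00 recent_client:True
2014-08-04 10:05:00 recent_client:False
2014-08-04 10:07:00 recent_client:True
EOF

# When did vdsm last notice its client (the engine) go away?
grep 'recent_client:False' /tmp/connectivity.log.sample | tail -n 1
# -> 2014-08-04 10:05:00 recent_client:False
```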
OK, so this is the report the connectivity log displays when Engine/Vdsm lose connectivity? Or should the log display more than "recent_client:False"?
Verified on:
- oVirt Engine Version: 3.5.0-0.0.master.20140804172041.git23b558e.el6

- When stopping the engine service, connectivity.log reports "recent_client:False" after 10-20 sec, meaning the Engine (client) disconnected from the host (Vdsm).
- When starting the engine service, connectivity.log reports "recent_client:True" after 10-20 sec, meaning the Engine (client) connected to the host (Vdsm).
- connectivity.log also reports changes in interface operational state, speed, and duplex.
Actually, it needs another round. Please verify that with the added patch, a fresh installation of vdsm starts up and reports all interfaces properly on `vdsClient -s 0 getVdsStats`.
Verified on 3.5.0-0.10.master.el6ev and vdsm-4.16.3-2.el6.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html