Bug 1261457

Summary: New device is not reported by getVdsStats until AVERAGING_WINDOW is filled
Product: [oVirt] vdsm Reporter: Petr Horáček <phoracek>
Component: GeneralAssignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE QA Contact: meital avital <mavital>
Severity: low Docs Contact:
Priority: low    
Version: 4.17.2CC: amarchuk, bugs, danken, fromani, gklein, masayag, michal.skrivanek, oourfali, osvoboda, phoracek
Target Milestone: ovirt-3.6.3Keywords: Reopened
Target Release: 4.17.19Flags: ylavi: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack+
rule-engine: testing_ack?
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-17 07:22:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Petr Horáček 2015-09-09 11:31:22 UTC
Description of problem:
Although host_sample_stats_interval is set to 15 seconds, a new device is reported by getVdsStats after AVERAGING_WINDOW * host_sample_stats_interval (75) seconds. Obviously this happens because of we have to fill AVERAGING_WINDOW first.


Version-Release number of selected component (if applicable):
[vagrant@localhost ~]$ sudo yum info vdsm | grep 'Release\|Version'
Version     : 4.17.2
Release     : 116.git2e22be6.el7.centos


How reproducible:
Always

Steps to Reproduce:
http://fpaste.org/265115/41797950/

Actual results:
Time needed for a nic to appear is ~75 seconds

Expected results:
It should appear in <= 15 seconds

Comment 1 Moti Asayag 2015-10-18 07:50:31 UTC
(In reply to Petr Horáček from comment #0)
> Description of problem:
> Although host_sample_stats_interval is set to 15 seconds, a new device is
> reported by getVdsStats after AVERAGING_WINDOW * host_sample_stats_interval
> (75) seconds. Obviously this happens because of we have to fill
> AVERAGING_WINDOW first.
> 
> 
> Version-Release number of selected component (if applicable):
> [vagrant@localhost ~]$ sudo yum info vdsm | grep 'Release\|Version'
> Version     : 4.17.2
> Release     : 116.git2e22be6.el7.centos
> 
> 
> How reproducible:
> Always
> 
> Steps to Reproduce:
> http://fpaste.org/265115/41797950/

Steps to reproduce are no longer available on the link.
Please add them as a comment on the bug.

> 
> Actual results:
> Time needed for a nic to appear is ~75 seconds
> 
> Expected results:
> It should appear in <= 15 seconds

Comment 2 Petr Horáček 2015-10-21 13:46:57 UTC
Sorry about that, here is the script:

#!/bin/python
# BZ 1261457
# try how long it takes for a dummy to appear in stats
import os
import sys
import time

from vdsm import vdscli

c = vdscli.connect()

print('Wait for first dummy to appear')
os.system('ip link add dummy_1 type dummy')
while True:
    stats = c.getVdsStats()
    if 'dummy_1' in stats['info']['network']:
        break
    sys.stdout.write('.')
    sys.stdout.flush()
    time.sleep(0.5)

print('\nWait for second dummy to appear')
start = time.time()
os.system('ip link add dummy_2 type dummy')
while True:
    stats = c.getVdsStats()
    if 'dummy_2' in stats['info']['network']:
        break
    sys.stdout.write('.')
    sys.stdout.flush()
    time.sleep(0.5)

print('\nResult: %d' % (time.time() - start))

os.system('ip link del dummy_1')
os.system('ip link del dummy_2')



Output:
vagrant@vdsmbox ➜  ~  sudo python test.py  
Wait for first dummy to appear
..................................................................................
Wait for second dummy to appear
.....................................................................................................
Result: 75

Comment 3 Oved Ourfali 2015-10-25 12:55:37 UTC
After talking offline with Dan, moving to virt.

Comment 4 Yaniv Lavi 2015-10-29 12:33:15 UTC
In oVirt testing is done on single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

Comment 5 Michal Skrivanek 2015-11-13 16:07:46 UTC
toughts?

Comment 6 Francesco Romani 2015-11-16 13:36:14 UTC
While the flows should still work, I'm concerned that this added delay can break some customers' flow. I'd fix this for 3.6.1/2

Comment 7 Francesco Romani 2015-11-16 14:16:37 UTC
Looks like the real culprit is the reduced sampling interval, from
2s per cycle (2 * 5 = 10) to 15s per cycle (15 * 5 = 75). We still keep 5 samples, we still do the same logic to see if we should report it or not.

The workaround exists:

/etc/vdsm/vdsm.conf:

[vars]
host_sample_stats_interval = 2

However, we should now re-evaluate if this new behaviour could have nasty side-effects or not. Dan, could you please share your thoughts? I'm not very well versed in the network flows :)

Comment 8 Dan Kenigsberg 2015-12-04 07:26:14 UTC
I don't see any "nasty" consequences. It's just a matter of delayed reporting (and one broken functional network test).

Comment 9 Francesco Romani 2015-12-04 08:02:29 UTC
Thanks Dan!
So, because
- there is an easy workaround (https://bugzilla.redhat.com/show_bug.cgi?id=1261457#c7)
- the effect seems limited to network functional tests
I'm closing this bug.

Comment 10 Dan Kenigsberg 2015-12-06 08:09:04 UTC
I'm not sure I understand. In the past, Vdsm reported stats even if there where fewer than AVERAGING_WINDOW samples - it took the first and last of whatever was there. Why don't we do that anymore?

Why do we bother with AVERAGING_WINDOW? It's funny that *I* ask this, but I think that this was a very old mistake that I have made. There is no real benefit of smoothing short-intervaled samples over sampling less often. Can we set AVERAGING_WINDOW to 2?

Comment 11 Francesco Romani 2015-12-07 12:14:37 UTC
(In reply to Dan Kenigsberg from comment #10)
> I'm not sure I understand. In the past, Vdsm reported stats even if there
> where fewer than AVERAGING_WINDOW samples - it took the first and last of
> whatever was there. Why don't we do that anymore?

We should do like this, I tried hard to preserve the behaviour during my refactoring. I think the culprit is the fact that we need a sample to be present
*both* in the first and last sample to be present in the reported stats.
To do so, we need AVERAGING_WINDOW samples, so one full window, hence 
the delay = AVERAGING_WINDOW * host_sample_stats_interval

> Why do we bother with AVERAGING_WINDOW? It's funny that *I* ask this, but I
> think that this was a very old mistake that I have made. There is no real
> benefit of smoothing short-intervaled samples over sampling less often. Can
> we set AVERAGING_WINDOW to 2?

I see no problems at first glance. Please note this alone will not 'solve' the original issue, because we will have 2 * 15 = 30 > 15s

Comment 12 Dan Kenigsberg 2015-12-07 13:21:34 UTC
Let's have this bug track setting AVERAGING_WINDOW to 2, as a first stage.

Comment 13 Francesco Romani 2015-12-09 09:05:44 UTC
(In reply to Dan Kenigsberg from comment #12)
> Let's have this bug track setting AVERAGING_WINDOW to 2, as a first stage.

No objections, the patch is trivial and the current AVERAGING_WINDOW setting has no obvious value, but how could this unbreak the original report?

Comment 14 Francesco Romani 2015-12-09 09:06:21 UTC
lowering priority and severity, this affects only functional tests.

Comment 15 Dan Kenigsberg 2015-12-09 14:35:12 UTC
(In reply to Francesco Romani from comment #13)
> but how could this unbreak the original report?

Waiting 30 seconds for a result is better than waiting 75 seconds.

Comment 16 Red Hat Bugzilla Rules Engine 2016-01-28 12:15:17 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 17 Red Hat Bugzilla Rules Engine 2016-01-28 12:24:33 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 18 Red Hat Bugzilla Rules Engine 2016-01-28 12:29:29 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 19 Gil Klein 2016-02-17 07:22:36 UTC
This bug was fixed and is slated to be in the upcoming version. As we
are focusing our testing at this phase on severe bugs, this bug was
closed without going through its verification step. If you think this
bug should be verified by QE, please set its severity to high and move
it back to ON_QA