Bug 1261457
Summary: | New device is not reported by getVdsStats until AVERAGING_WINDOW is filled | ||
---|---|---|---|
Product: | [oVirt] vdsm | Reporter: | Petr Horáček <phoracek> |
Component: | General | Assignee: | Francesco Romani <fromani> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | meital avital <mavital> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 4.17.2 | CC: | amarchuk, bugs, danken, fromani, gklein, masayag, michal.skrivanek, oourfali, osvoboda, phoracek |
Target Milestone: | ovirt-3.6.3 | Keywords: | Reopened |
Target Release: | 4.17.19 | Flags: | ylavi:
ovirt-3.6.z?
rule-engine: planning_ack? rule-engine: devel_ack+ rule-engine: testing_ack? |
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-02-17 07:22:36 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Petr Horáček
2015-09-09 11:31:22 UTC
(In reply to Petr Horáček from comment #0) > Description of problem: > Although host_sample_stats_interval is set to 15 seconds, a new device is > reported by getVdsStats after AVERAGING_WINDOW * host_sample_stats_interval > (75) seconds. Obviously this happens because of we have to fill > AVERAGING_WINDOW first. > > > Version-Release number of selected component (if applicable): > [vagrant@localhost ~]$ sudo yum info vdsm | grep 'Release\|Version' > Version : 4.17.2 > Release : 116.git2e22be6.el7.centos > > > How reproducible: > Always > > Steps to Reproduce: > http://fpaste.org/265115/41797950/ Steps to reproduce are no longer available on the link. Please add them as a comment on the bug. > > Actual results: > Time needed for a nic to appear is ~75 seconds > > Expected results: > It should appear in <= 15 seconds Sorry about that, here is the script: #!/bin/python # BZ 1261457 # try how long it takes for a dummy to appear in stats import os import sys import time from vdsm import vdscli c = vdscli.connect() print('Wait for first dummy to appear') os.system('ip link add dummy_1 type dummy') while True: stats = c.getVdsStats() if 'dummy_1' in stats['info']['network']: break sys.stdout.write('.') sys.stdout.flush() time.sleep(0.5) print('\nWait for second dummy to appear') start = time.time() os.system('ip link add dummy_2 type dummy') while True: stats = c.getVdsStats() if 'dummy_2' in stats['info']['network']: break sys.stdout.write('.') sys.stdout.flush() time.sleep(0.5) print('\nResult: %d' % (time.time() - start)) os.system('ip link del dummy_1') os.system('ip link del dummy_2') Output: vagrant@vdsmbox ➜ ~ sudo python test.py Wait for first dummy to appear .................................................................................. Wait for second dummy to appear ..................................................................................................... Result: 75 After talking offline with Dan, moving to virt. In oVirt testing is done on single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone. toughts? While the flows should still work, I'm concerned that this added delay can break some customers' flow. I'd fix this for 3.6.1/2 Looks like the real culprit is the reduced sampling interval, from 2s per cycle (2 * 5 = 10) to 15s per cycle (15 * 5 = 75). We still keep 5 samples, we still do the same logic to see if we should report it or not. The workaround exists: /etc/vdsm/vdsm.conf: [vars] host_sample_stats_interval = 2 However, we should now re-evaluate if this new behaviour could have nasty side-effects or not. Dan, could you please share your thoughts? I'm not very well versed in the network flows :) I don't see any "nasty" consequences. It's just a matter of delayed reporting (and one broken functional network test). Thanks Dan! So, because - there is an easy workaround (https://bugzilla.redhat.com/show_bug.cgi?id=1261457#c7) - the effect seems limited to network functional tests I'm closing this bug. I'm not sure I understand. In the past, Vdsm reported stats even if there where fewer than AVERAGING_WINDOW samples - it took the first and last of whatever was there. Why don't we do that anymore? Why do we bother with AVERAGING_WINDOW? It's funny that *I* ask this, but I think that this was a very old mistake that I have made. There is no real benefit of smoothing short-intervaled samples over sampling less often. Can we set AVERAGING_WINDOW to 2? (In reply to Dan Kenigsberg from comment #10) > I'm not sure I understand. In the past, Vdsm reported stats even if there > where fewer than AVERAGING_WINDOW samples - it took the first and last of > whatever was there. Why don't we do that anymore? We should do like this, I tried hard to preserve the behaviour during my refactoring. I think the culprit is the fact that we need a sample to be present *both* in the first and last sample to be present in the reported stats. To do so, we need AVERAGING_WINDOW samples, so one full window, hence the delay = AVERAGING_WINDOW * host_sample_stats_interval > Why do we bother with AVERAGING_WINDOW? It's funny that *I* ask this, but I > think that this was a very old mistake that I have made. There is no real > benefit of smoothing short-intervaled samples over sampling less often. Can > we set AVERAGING_WINDOW to 2? I see no problems at first glance. Please note this alone will not 'solve' the original issue, because we will have 2 * 15 = 30 > 15s Let's have this bug track setting AVERAGING_WINDOW to 2, as a first stage. (In reply to Dan Kenigsberg from comment #12) > Let's have this bug track setting AVERAGING_WINDOW to 2, as a first stage. No objections, the patch is trivial and the current AVERAGING_WINDOW setting has no obvious value, but how could this unbreak the original report? lowering priority and severity, this affects only functional tests. (In reply to Francesco Romani from comment #13) > but how could this unbreak the original report? Waiting 30 seconds for a result is better than waiting 75 seconds. Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA. Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA. Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA. This bug was fixed and is slated to be in the upcoming version. As we are focusing our testing at this phase on severe bugs, this bug was closed without going through its verification step. If you think this bug should be verified by QE, please set its severity to high and move it back to ON_QA |