Bug 853725

Summary: [vdsm] VM disk stats failures are flooding the log (every 2ms for each VM) when the storage domain containing the VM disk is not reachable
Product: Red Hat Enterprise Linux 6
Reporter: Haim <hateya>
Component: vdsm
Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED WONTFIX
QA Contact: Haim <hateya>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 6.3
CC: abaron, bazulay, danken, fsimonce, iheim, lpeer, yeylon, ykaul
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-10-23 17:14:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- vdsm logs (flags: none)
- correct vdsm logs (flags: none)
- reproduce log (flags: none)

Description Haim 2012-09-02 12:54:44 UTC
Description of problem:

VM disk stats failures are flooding the log (every 2ms for each VM) when the storage domain containing the VM disk is not reachable:

[root@nott-vds2 ~]# xzgrep 'Stats function failed:'  /var/log/vdsm/vdsm.log.1.xz | wc -l 
3956

Storage was not accessible for 7 minutes and I got 3956 errors.

Expected results: there is no need to run disk stats when storage is already reported as faulty (repo stats).
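The expected behavior could be sketched as follows; `DiskStatsSampler`, `is_domain_valid`, and `Drive` are hypothetical names for illustration, not vdsm's actual API:

```python
# Minimal sketch, not vdsm code: consult the storage-domain health check
# (e.g. whatever backs repo stats) before sampling a disk, instead of
# letting every sample fail and log a traceback.

class DiskStatsSampler:
    def __init__(self, is_domain_valid):
        # is_domain_valid: callable(domain_id) -> bool
        self._is_domain_valid = is_domain_valid
        self.samples = []   # successfully collected stats
        self.skipped = 0    # samples skipped because the domain was faulty

    def sample(self, drive):
        # drive: any object with .domainID and .path attributes
        if not self._is_domain_valid(drive.domainID):
            self.skipped += 1  # domain faulty: do not even try
            return None
        stats = {"path": drive.path, "ok": True}  # stand-in for blockInfo()
        self.samples.append(stats)
        return stats
```

With a gate like this, an unreachable domain produces skipped samples rather than thousands of identical tracebacks.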

Comment 1 Haim 2012-09-02 12:58:10 UTC
Created attachment 609078 [details]
vdsm logs

Comment 2 Haim 2012-09-02 12:58:36 UTC
Thread-1997::ERROR::2012-09-02 17:30:02,522::utils::399::vm.Vm::(collect) vmId=`e5f954b1-99c0-49e6-8b30-4948e922a21c`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c765a8>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 395, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 272, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 89, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 492, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 82, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1664, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: failed to open path '/rhev/data-center/520180ca-5062-4abd-a973-6246282b2812/29741863-4295-4794-bfcb-7b9e7e8775f9/images/b01b5f3a-08b2-49e5-b350-cf5ffb2c320e/571744fd-259d-4ac8-8739-85e682160d54': No such file or directory

Comment 4 Ayal Baron 2012-09-02 13:40:58 UTC
All the log contains is 8 such messages, of which 4 are unique.
So I don't understand what the bug is:

IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/50fa7537-79da-468f-9cab-d671a23035c9'
IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/50fa7537-79da-468f-9cab-d671a23035c9'
IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/f8389e21-0de6-4508-9623-3ac600e99727'
IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/f8389e21-0de6-4508-9623-3ac600e99727'
IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/9f5af493-24e0-4e14-84cc-654988159e33'
IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/9f5af493-24e0-4e14-84cc-654988159e33'
IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/51dbcf41-fbb2-4cab-a76c-f73c4e1dd82f'
IOError: [Errno 2] No such file or directory: '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/51dbcf41-fbb2-4cab-a76c-f73c4e1dd82f'

Comment 5 Haim 2012-09-02 14:01:07 UTC
(In reply to comment #4)
> All the log contains is 8 such message out of which 4 are unique.
> So I don't understand what the bug is:
> 
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/50fa7537-79da-468f-9cab-
> d671a23035c9'
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/50fa7537-79da-468f-9cab-
> d671a23035c9'
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/f8389e21-0de6-4508-9623-
> 3ac600e99727'
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/f8389e21-0de6-4508-9623-
> 3ac600e99727'
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/9f5af493-24e0-4e14-84cc-
> 654988159e33'
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/9f5af493-24e0-4e14-84cc-
> 654988159e33'
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/51dbcf41-fbb2-4cab-a76c-
> f73c4e1dd82f'
> IOError: [Errno 2] No such file or directory:
> '/dev/29741863-4295-4794-bfcb-7b9e7e8775f9/51dbcf41-fbb2-4cab-a76c-
> f73c4e1dd82f'

how? 

[root@nott-vds2 ~]# xzgrep 'Stats function failed:'  /var/log/vdsm/vdsm.log.1.xz   | wc -l 
3956

Comment 6 Ayal Baron 2012-09-06 11:40:57 UTC
Then you've attached the wrong log, because the one attached doesn't contain even one 'Stats function failed:' message... ?

Comment 7 Haim 2012-09-06 14:29:37 UTC
Created attachment 610351 [details]
correct vdsm logs

Comment 8 Ayal Baron 2012-09-16 06:36:52 UTC
The problem is with the frequency of stats sampling:

Thread-2002::ERROR::2012-09-02 17:22:29,200::utils::399::vm.Vm::(collect) vmId=`c01dc87b-f962-4eda-815a-49684c69c228`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x1c765a8>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 395, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 272, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 89, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 492, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 82, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1664, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: failed to open path '/rhev/data-center/520180ca-5062-4abd-a973-6246282b2812/29741863-4295-4794-bfcb-7b9e7e8775f9/images/50a0888a-2ac7-4464-b61c-d9ab8b766515/b40766dd-b705-47a7-a441-4bb93d097707': No such file or directory

Comment 9 Yaniv Bronhaim 2012-09-27 16:57:11 UTC
Would you prefer changing the frequency of this print, or handling this specific libvirt error somehow (e.g. printing something other than "Stats function failed")?

Comment 10 Ayal Baron 2012-09-29 13:56:49 UTC
(In reply to comment #9)
> Do you prefer to change the frequency of this print or handling this
> specific  libvirt error somehow (like printing something else than stats
> function failed..)?

First, I'm not sure we need a traceback here.
Second, we have to log the problem; we just don't need to do it so frequently.
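The "log it, but less often" idea can be sketched as a throttled logger; the class, its interval, and the suppression counter are illustrative, not vdsm's actual implementation:

```python
import time

class ThrottledLogger:
    """Emit a message at most once per `interval` seconds per key,
    counting how many occurrences were suppressed in between."""

    def __init__(self, interval=60.0, now=time.monotonic):
        self._interval = interval
        self._now = now          # injectable clock, eases testing
        self._last = {}          # key -> (last emit time, suppressed count)
        self.emitted = []        # stand-in for the real logger sink

    def warn(self, key, msg):
        t = self._now()
        last, suppressed = self._last.get(key, (None, 0))
        if last is None or t - last >= self._interval:
            if suppressed:
                msg += " (%d similar messages suppressed)" % suppressed
            self.emitted.append(msg)
            self._last[key] = (t, 0)
        else:
            # too soon: remember the occurrence, emit nothing
            self._last[key] = (last, suppressed + 1)
```

Keying on the VM id (or VM id plus disk) would keep one line per VM per interval while still recording how many failures occurred.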

Comment 11 Ayal Baron 2012-09-29 14:03:51 UTC
Note that there are 20 unique VMs here, and for each one we get a log message every 10 seconds.  With 100 VMs it would be worse.  What we probably really need here is root-cause analysis and suppression of most of these events (replacing them with a single generic message).

Thread-9289::DEBUG::2012-09-02 16:01:02,661::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`8eea11d0-103f-4aa3-8694-405bfa450919`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,680::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`21efbc74-b677-487f-9db4-fdbbeae179ca`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,681::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`5f04f6ab-2cd9-43cb-98a8-78b5eea25ef7`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,681::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`12aec5c0-3e1d-48e2-9921-db6b344057eb`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,682::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`dd402cf2-6237-44f6-9d98-bf0dae55f1f8`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,683::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`74d21a9c-516b-44b6-8fe5-d74279345c86`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,684::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`8a7169ca-49c0-42c5-b1b7-0e49defc3ed8`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,684::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`41c16001-a028-4844-a1fe-40558b99215a`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,685::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`2bebcde4-0be2-447b-8eed-6368054dab0f`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,686::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`88a4b827-2d89-4501-9d0c-bdece761d0d6`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,687::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`d47149f9-ab9e-4501-83e8-1547d25c2fe8`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,687::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`7f4ec82f-1c18-4fbe-99d4-e53d60e5b05b`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,688::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`f78850bf-46cd-4606-86ab-1c72823edf76`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,689::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`c3ddabed-5ec5-4883-9b15-33db959ec656`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,690::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`052a871e-794a-4005-bc8a-e23f2bd353a0`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,691::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`1e9b7f40-cb7a-47c4-941f-8347c9171025`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,692::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`c01dc87b-f962-4eda-815a-49684c69c228`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,693::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`b5a5739a-3b9e-4b2a-91b3-bb7137b35e5b`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,694::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`23049264-993d-4427-89d0-2bcd7f04b92d`::Disk hdc stats not available
Thread-9289::DEBUG::2012-09-02 16:01:02,695::libvirtvm::241::vm.Vm::(_getDiskStats) vmId=`e5f954b1-99c0-49e6-8b30-4948e922a21c`::Disk hdc stats not available
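Collapsing the per-VM flood above into one generic message could look like the following sketch (function name and output format are illustrative):

```python
# Illustrative sketch: instead of one DEBUG line per VM per cycle,
# collect the affected VM ids and emit a single summary line.

def summarize_unavailable(vm_ids, disk="hdc", limit=3):
    """Return one log line covering all VMs whose disk stats are missing,
    naming at most `limit` of them explicitly."""
    vm_ids = sorted(vm_ids)
    shown = ", ".join(vm_ids[:limit])
    extra = len(vm_ids) - limit
    line = "Disk %s stats not available for %d VMs: %s" % (
        disk, len(vm_ids), shown)
    if extra > 0:
        line += " (and %d more)" % extra
    return line
```

For the 20 VMs in this log, that would turn 20 DEBUG lines per cycle into one.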

Comment 12 Ayal Baron 2012-09-29 14:09:58 UTC
Another issue here is that, for some reason, disk stats and disk latency are sampled separately. This seems inefficient, and when there is a storage problem both fail, but since their scope is separate, each one logs on its own.
Dan, why are these two methods separate? _sampleDisk, _sampleDiskLatency

Comment 13 Ayal Baron 2012-09-29 14:10:15 UTC
*** Bug 859103 has been marked as a duplicate of this bug. ***

Comment 14 Dan Kenigsberg 2012-09-29 19:55:23 UTC
(In reply to comment #12)
> Dan, why are these 2 methods separate? _sampleDisk, _sampleDiskLatency

They have different sampling intervals, and different means to acquire data. What would you like to consolidate here?
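The per-function intervals Dan describes can be sketched like this; `StatsFunction` is an illustrative stand-in for vdsm's sampling machinery, not its actual class:

```python
# Sketch: each stats function carries its own interval, and a shared
# collector only invokes the ones that are due at a given time. This is
# why samplers with different cadences stay as separate functions.

class StatsFunction:
    def __init__(self, fn, interval):
        self._fn = fn
        self.interval = interval  # seconds between invocations
        self._last = None         # time of the last invocation

    def maybe_collect(self, now):
        """Run the sampler if its interval has elapsed; report whether it ran."""
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            self._fn()
            return True
        return False
```

A fast disk sampler and a slow latency sampler can then share one collection loop while keeping independent cadences.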

Comment 15 Yaniv Bronhaim 2012-10-01 17:13:50 UTC
I can stop collecting storage stats once getRealDomain fails to look up the domain, and resume on the next successful getRealDomain call when the connection is back. How does that sound?
The sampling approach itself is fine; while the connection is up, it looks like we do want to run all those checks for each VM.
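That suggestion, as a sketch: a per-domain gate that skips sampling after a failed lookup until a later lookup succeeds. `DomainGate` and its method names are hypothetical, written only to illustrate the latch:

```python
# Illustrative latch, not vdsm code: record the last getRealDomain
# outcome per domain; sampling stays off while the last lookup failed
# and resumes after the next success.

class DomainGate:
    def __init__(self):
        self._degraded = set()  # domain ids whose last lookup failed

    def lookup_failed(self, dom_id):
        self._degraded.add(dom_id)

    def lookup_succeeded(self, dom_id):
        self._degraded.discard(dom_id)

    def should_sample(self, dom_id):
        return dom_id not in self._degraded
```

The periodic domain lookup updates the gate, and the stats threads only consult `should_sample`, so a dead domain stops generating failures (and log lines) entirely.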

Comment 16 Yaniv Bronhaim 2012-10-22 10:49:49 UTC
Created attachment 631400 [details]
reproduce log

I attached the log I get while blocking storage. I don't see the libvirtError you got; maybe I reproduce it differently, or maybe it was fixed by a different patch, but the exception doesn't appear.

If a sampling function throws an exception, it means something is wrong that we want to fix. After blocking a storage domain, no exception should be raised while sampling _highWrite.

Comment 17 Barak 2012-10-23 17:14:29 UTC
http://gerrit.ovirt.org/#/c/8288

It looks like the stack trace is displayed in the log only when libvirt is down.

When the SD is down there is a log message per VM every 10 seconds, which looks like valid behaviour.

See the comment in the patch.

Hence moving this BZ to CLOSED WONTFIX.

Comment 18 Yaniv Kaul 2012-10-23 17:32:28 UTC
(In reply to comment #17)
> http://gerrit.ovirt.org/#/c/8288
> 
> see comment 
> 
> It looks like the stack trace is displayed in log only when libvirt is down.
> 
> When the SD is down it has a log per VM every 10 secs, this looks like a
> valid behaviour.

It's not one message per VM every 10 seconds; it's per VM multiplied by the number of disks, every 10 seconds. And from the log above, it looks like it actually prints twice for every disk (once for stats and once for latency):
Thread-356342::DEBUG::2012-10-22 12:27:15,283::libvirtvm::224::vm.Vm::(_getDiskStats) vmId=`c75ec2fe-2f73-4ed4-98bf-0d7286160a87`::Disk hdc stats not available
Thread-356342::DEBUG::2012-10-22 12:27:15,283::libvirtvm::224::vm.Vm::(_getDiskStats) vmId=`c75ec2fe-2f73-4ed4-98bf-0d7286160a87`::Disk vda stats not available
Thread-356342::DEBUG::2012-10-22 12:27:15,283::libvirtvm::263::vm.Vm::(_getDiskLatency) vmId=`c75ec2fe-2f73-4ed4-98bf-0d7286160a87`::Disk hdc latency not available
Thread-356342::DEBUG::2012-10-22 12:27:15,283::libvirtvm::263::vm.Vm::(_getDiskLatency) vmId=`c75ec2fe-2f73-4ed4-98bf-0d7286160a87`::Disk vda latency not available

So for 50 VMs with 1 disk each, you'll get 10 lines/second (50 VMs x 2 lines per disk every 10 seconds), which is quite a lot, IMHO - it will flood the log pretty quickly.

> 
> 
> see comment in the patch.

The problem was not the stack trace, but the log flood.

> 
> Hence moving this BZ to CLOSED WONTFIX