Bug 670005

Summary: NFS domains report old lastCheck time when one is stuck
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Status: CLOSED ERRATA
Target Milestone: rc
Reporter: Mike Kolesnik <mkolesni>
Assignee: Dafna Ron <dron>
QA Contact: Dafna Ron <dron>
CC: abaron, bazulay, iheim, lpeer, tdosek
Doc Type: Bug Fix
Last Closed: 2011-12-06 07:04:13 UTC
Attachments:
- Log before problem started with repostats updated
- Log where problem started (after spm stop)
- Log of the same NFS scenario on vdsm-4.9-54.el6.x86_64
- log

Description Mike Kolesnik 2011-01-16 16:47:34 UTC
Description of problem:
When communication with one NFS domain is down (blocked with iptables), other NFS domains appear to be stuck, even though they are on other machines.


Version-Release number of selected component (if applicable):
vdsm-4.9-41.el6.x86_64


How reproducible:
Most of the time


Steps to Reproduce:
1. 1 Host with 1 NFS data domain, 1 NFS ISO domain, and 1 NFS Export domain, all on separate machines.
2. Disconnect the machine hosting the export domain using iptables.
  

Actual results:
One or more of the other domains is reported as having an old lastCheck


Expected results:
Only the disconnected domain should report an old lastCheck


Additional info:
The disconnected domain is 4ce55d5c-71dd-4d14-b9ad-d48ea45f7292, the active domains (which also get stuck) are: 86959292-be17-45d5-9c1d-0e24474ba951, 8afc7c15-3909-41d4-88fb-6db18cbc6a2c.

The original domain was stuck at (/var/log/vdsm/vdsm.log lines 43116-43127/43798):
Thread-9446::INFO::2011-01-16 19:29:56,396::dispatcher::101::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'delay': '0.00349593162537', 'lastCheck': 1295198992.309643, 'valid': True, 'code': 0}, '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'delay': '0.00527501106262', 'lastCheck': 1295198992.7263801, 'valid': True, 'code': 0}, '86959292-be17-45d5-9c1d-0e24474ba951': {'delay': '0.0064640045166', 'lastCheck': 1295198993.1063371, 'valid': True, 'code': 0}, '5d190e0d-fd81-44ee-97f3-637e97a61ac9': {'delay': '0.0292568206787', 'lastCheck': 1295198990.5932479, 'valid': False, 'code': 358}, '48819cb9-50bb-451b-9c35-da26c2ff48ac': {'delay': '0.031044960022', 'lastCheck': 1295198990.6218569, 'valid': False, 'code': 358}}



The other domains reported their most recent lastCheck times at
(/var/log/vdsm/vdsm.log lines 43211-43220/43798):
Thread-9479::INFO::2011-01-16 19:30:48,851::dispatcher::101::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'delay': '0.00349593162537', 'lastCheck': 1295198992.309643, 'valid': True, 'code': 0}, '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'delay': '0.00415015220642', 'lastCheck': 1295199042.7989571, 'valid': True, 'code': 0}, '86959292-be17-45d5-9c1d-0e24474ba951': {'delay': '0.00311708450317', 'lastCheck': 1295199043.1784101, 'valid': True, 'code': 0}, '5d190e0d-fd81-44ee-97f3-637e97a61ac9': {'delay': '0.0292568206787', 'lastCheck': 1295198990.5932479, 'valid': False, 'code': 358}, '48819cb9-50bb-451b-9c35-da26c2ff48ac': {'delay': '0.031044960022', 'lastCheck': 1295198990.6218569, 'valid': False, 'code': 358}}
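The two responses above can be compared mechanically: a domain is "stuck" when its lastCheck timestamp stops advancing. A minimal sketch of such a check (`stale_domains` is a hypothetical helper, not part of vdsm), fed the timestamps from the second repoStats response above:

```python
# Hypothetical helper (not part of vdsm): flag domains whose lastCheck
# timestamp is older than `max_age` seconds relative to `now`.
def stale_domains(repostats, now, max_age=30):
    stale = []
    for key, info in repostats.items():
        if key == 'status':          # skip the response status envelope
            continue
        if now - info['lastCheck'] > max_age:
            stale.append(key)
    return sorted(stale)

# lastCheck values taken from the second repoStats response above,
# with `now` approximating its log timestamp (19:30:48).
response = {
    'status': {'message': 'OK', 'code': 0},
    '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'lastCheck': 1295198992.309643},
    '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'lastCheck': 1295199042.7989571},
    '86959292-be17-45d5-9c1d-0e24474ba951': {'lastCheck': 1295199043.1784101},
    '5d190e0d-fd81-44ee-97f3-637e97a61ac9': {'lastCheck': 1295198990.5932479},
    '48819cb9-50bb-451b-9c35-da26c2ff48ac': {'lastCheck': 1295198990.6218569},
}
print(stale_domains(response, now=1295199048.8))
```

On these numbers the blocked domain 4ce55d5c and the two invalid domains come back stale, while 8afc7c15 and 86959292 updated within the last few seconds.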

Comment 2 Ayal Baron 2011-01-18 07:09:29 UTC
Created attachment 473980 [details]
Log before problem started with repostats updated

Comment 3 Ayal Baron 2011-01-18 07:11:24 UTC
Created attachment 473981 [details]
Log where problem started (after spm stop)

It looks like repoStats stops updating after spm stop.

Comment 4 Ayal Baron 2011-01-18 14:59:22 UTC
Note that in the attached logs an iSCSI domain got stuck after spmStop and incorrectly reports 'valid': False.

Comment 5 Saggi Mizrahi 2011-03-13 09:11:09 UTC
Should be fixed in build 50. Please test

Comment 6 Tomas Dosek 2011-03-21 09:35:44 UTC
Created attachment 486578 [details]
Log of the same NFS scenario on vdsm-4.9-54.el6.x86_64

Comment 7 Tomas Dosek 2011-03-21 09:37:39 UTC
vdsm-4.9-54.el6.x86_64 - failed to verify - instead of showing lastCheck on the "faulty" NFS domain, rhevm throws a lot of Java exceptions into the log (attached above). I assume that's not right, is it?

Comment 8 Tomas Dosek 2011-04-05 10:21:23 UTC
No response on needinfo - nack + reopening.

Comment 9 Ayal Baron 2011-04-09 20:23:00 UTC
Whether it's right or not is irrelevant; what you should check is the VDSM log (obviously RHEVM shouldn't throw Java exceptions, but what does that have to do with vdsm?).
You did not attach the vdsm log, so we cannot see whether the reported time was wrong.
Moving back to ON_QA.

Comment 10 Dafna Ron 2011-04-14 12:59:11 UTC
Created attachment 492098 [details]
log

verified on ic108
vdsm:

vdsm-debug-plugin-4.9-59.el6.x86_64
vdsm-4.9-59.el6.x86_64
vdsm-debuginfo-4.9-59.el6.x86_64
vdsm-cli-4.9-59.el6.x86_64


libvirt:

libvirt-debuginfo-0.8.7-16.el6.x86_64
libvirt-client-0.8.7-16.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
libvirt-python-0.8.7-16.el6.x86_64
libvirt-devel-0.8.7-16.el6.x86_64


The master domain is not stuck - only the one which is blocked, and this is OK since repoStats cannot get a reply from that domain.
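The verified behavior matches the intended design: each domain's check runs under a timeout, so one frozen NFS mount cannot stall the others. A rough sketch of that pattern (an assumed illustration, not vdsm's actual processPool code), running statvfs in a helper process and raising Timeout("Operation Stuck") when it hangs:

```python
import multiprocessing
import os
import time

class Timeout(Exception):
    pass

def _worker(q, func, args):
    # Runs in the helper process; a hung NFS call blocks only this process.
    q.put(func(*args))

def run_with_timeout(func, args=(), timeout=5.0):
    # Hypothetical sketch of the processPool idea: execute a potentially
    # blocking filesystem call externally and give up after `timeout` seconds.
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=_worker, args=(q, func, args))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()              # abandon the stuck call
        p.join()
        raise Timeout("Operation Stuck")
    return q.get()

# A responsive path answers quickly; a frozen mount would raise Timeout instead.
print(run_with_timeout(os.statvfs, ('/tmp',)).f_bsize > 0)
```

Because the blocking call lives in a separate process, the caller's monitoring loop keeps advancing lastCheck for the healthy domains even while one check is wedged in the kernel on a dead mount.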

However, there is a different issue in which deactivateStorageDomain fails with the operation stuck.
After speaking with Danken, this should also be solved in the next vdsm, 4.9-60.

This bug is verified. After the new vdsm is released I will check this issue again and, if needed, open a new bug for the deactivateStorageDomain issue.


Thread-1938::INFO::2011-04-14 14:56:22,199::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00239181518555', 'lastCheck': 1302782179.8576379, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'delay': '0.0031099319458', 'lastCheck': 1302782111.597779, 'valid': True, 'code': 0}}
Thread-1819::ERROR::2011-04-14 14:56:23,670::sp::85::Storage.StatsThread::(run) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 80, in run
    stats, code = self._statsfunc(self._domain)
  File "/usr/share/vdsm/storage/sp.py", line 1553, in _repostats
    if not domain.selftest():
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in selftest
    return fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 298, in selftest
    oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck
Thread-1941::DEBUG::2011-04-14 14:56:23,672::misc::1150::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Emitting event
Thread-1941::DEBUG::2011-04-14 14:56:23,673::misc::1160::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Calling registered method `run`
Thread-1941::DEBUG::2011-04-14 14:56:23,673::misc::1168::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Event emitted
Thread-1946::DEBUG::2011-04-14 14:56:28,461::clientIF::215::Storage.Dispatcher.Protect::(wrapper) [10.35.113.42]




Thread-2022::INFO::2011-04-14 14:58:34,955::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00276398658752', 'lastCheck': 1302782310.0283749, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'delay': '62.0645911694', 'lastCheck': 1302782255.7480941, 'valid': False, 'code': 200}}


After a few minutes the backend sends a deactivateStorageDomain command, which fails with an unexpected error.
[root@blond-vdsf ~]# tail -f /var/log/vdsm/vdsm.log |grep deactivate


Thread-2132::INFO::2011-04-14 15:00:51,332::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: deactivateStorageDomain, args: ( sdUUID=bf79e924-91f1-487d-b367-f7a2d9faee11 spUUID=6bb565ea-a5d3-4b65-8c45-8c661597ded0 msdUUID=00000000-0000-0000-0000-000000000000 masterVersion=1)
Thread-2132::DEBUG::2011-04-14 15:00:51,333::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.6bb565ea-a5d3-4b65-8c45-8c661597ded0`ReqID=`a7c7a26b-8775-4006-a7d0-3d9f611223c1`::Request was made in '/usr/share/vdsm/storage/spm.py' line '1163' at 'public_deactivateStorageDomain'
Thread-2132::DEBUG::2011-04-14 15:00:51,336::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.bf79e924-91f1-487d-b367-f7a2d9faee11`ReqID=`fb04ef39-ec7c-489b-b6ad-c2066c3223f7`::Request was made in '/usr/share/vdsm/storage/spm.py' line '1164' at 'public_deactivateStorageDomain'
Thread-2132::INFO::2011-04-14 15:00:51,340::sp::943::Storage.StoragePool::(deactivateSD) sdUUID=bf79e924-91f1-487d-b367-f7a2d9faee11 spUUID=6bb565ea-a5d3-4b65-8c45-8c661597ded0 msdUUID=00000000-0000-0000-0000-000000000000
Thread-2132::ERROR::2011-04-14 15:01:53,349::sp::974::Storage.StoragePool::(deactivateSD) Unexpected error
  File "/usr/share/vdsm/storage/sp.py", line 969, in deactivateSD
Thread-2132::INFO::2011-04-14 15:04:00,029::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: deactivateStorageDomain, Return response: {'status': {'message': 'OK', 'code': 0}}


Thread-2137::INFO::2011-04-14 15:00:59,750::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00240421295166', 'lastCheck': 1302782450.2348771, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'delay': '62.0642220974', 'lastCheck': 1302782399.8979051, 'valid': False, 'code': 200}}
Thread-1819::ERROR::2011-04-14 15:01:11,971::sp::85::Storage.StatsThread::(run) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 80, in run
    stats, code = self._statsfunc(self._domain)
  File "/usr/share/vdsm/storage/sp.py", line 1553, in _repostats
    if not domain.selftest():
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in selftest
    return fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 298, in selftest
    oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck
Thread-2132::ERROR::2011-04-14 15:01:53,349::sp::974::Storage.StoragePool::(deactivateSD) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 969, in deactivateSD
    dom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 185, in validate
    self.Metadata(cache=False)
  File "/usr/share/vdsm/storage/sd.py", line 482, in Metadata
    return self._metadata.metadata
  File "/usr/share/vdsm/storage/metadata.py", line 110, in __getmetadata
    md = self._get()
  File "/usr/share/vdsm/storage/fileSD.py", line 39, in _get
    return misc.stripNewLines(oop.directReadLines(self.metafile))
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck
Thread-2132::INFO::2011-04-14 15:01:53,351::sp::1405::Storage.StoragePool::(getDomains) Get storage pool domains: {'90228690-381d-4043-b8bd-bf8651981913': 'Active', 'bf79e924-91f1-487d-b367-f7a2d9faee11': 'Active'}
Thread-2132::DEBUG::2011-04-14 15:01:53,352::misc::498::Storage.Misc::(rotateFiles) dir: /var/log/vdsm/backup, prefixName: 90228690-381d-4043-b8bd-bf8651981913, versions: 30
Thread-2132::DEBUG::2011-04-14 15:01:53,352::misc::519::Storage.Misc::(rotateFiles) versions found: [6, 5, 4, 3, 2, 1, 0]
Thread-2132::INFO::2011-04-14 15:01:53,369::sp::1349::Storage.StoragePool::(setDomains) Set storage pool domains: 90228690-381d-4043-b8bd-bf8651981913:Active,bf79e924-91f1-487d-:

Comment 11 errata-xmlrpc 2011-12-06 07:04:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html