| Summary: | NFS domains report old lastCheck time when one is stuck | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Mike Kolesnik <mkolesni> | ||||||||||
| Component: | vdsm | Assignee: | Dafna Ron <dron> | ||||||||||
| Status: | CLOSED ERRATA | QA Contact: | Dafna Ron <dron> | ||||||||||
| Severity: | high | Docs Contact: | |||||||||||
| Priority: | low | ||||||||||||
| Version: | 6.0 | CC: | abaron, bazulay, iheim, lpeer, tdosek | ||||||||||
| Target Milestone: | rc | ||||||||||||
| Target Release: | --- | ||||||||||||
| Hardware: | Unspecified | ||||||||||||
| OS: | Unspecified | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | Environment: | ||||||||||||
| Last Closed: | 2011-12-06 07:04:13 UTC | Type: | --- | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Attachments: |
|
||||||||||||
Created attachment 473980 [details]
Log before problem started with repostats updated
Created attachment 473981 [details]
Log where problem started (after spm stop)
It looks like after spm stop repoStats stops updating
Note that in the logs attached an iSCSI domain got stuck after spmStop and reports incorrectly 'valid': False. Should be fixed in build 50. Please test Created attachment 486578 [details]
Log of the same NFS scenario on vdsm-4.9-54.el6.x86_64
vdsm-4.9-54.el6.x86_64 - failed to verify - instead of showing lastCheck on "faulty" NFS domain rhevm throws a lot of Java exceptions into log - attached those above. I assume that's not right, it it? No response on needinfo - nack + reopening. Whether it's right or not is irrelevant, what you should check for is in VDSM log (obviously RHEVM shouldn't have java exceptions, but what does that have to do with vdsm?) you did not attach the vdsm log to see whether time reported was wrong. Moving back to ON_QA. Created attachment 492098 [details]
log
verified on ic108
vdsm:
vdsm-debug-plugin-4.9-59.el6.x86_64
vdsm-4.9-59.el6.x86_64
vdsm-debuginfo-4.9-59.el6.x86_64
vdsm-cli-4.9-59.el6.x86_64
libvirt:
libvirt-debuginfo-0.8.7-16.el6.x86_64
libvirt-client-0.8.7-16.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
libvirt-python-0.8.7-16.el6.x86_64
libvirt-devel-0.8.7-16.el6.x86_64
The master domain is not stuck - only the one which is blocked and this is OK since repoStats cannot get a reply from domain.
However, there is a different issue in which deactivareStorageDomain fails with operation stuck.
After speaking with Danken, this should also be solved in the next vdsm 4.9.60
This bug is verified, after new vdsm is released I will check this issue again and if needed open a new bug for deactivateStorageDomain issue.
Thread-1938::INFO::2011-04-14 14:56:22,199::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code':
0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00239181518555', 'lastCheck': 1302782179.8576379, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'d
elay': '0.0031099319458', 'lastCheck': 1302782111.597779, 'valid': True, 'code': 0}}
Thread-1819::ERROR::2011-04-14 14:56:23,670::sp::85::Storage.StatsThread::(run) Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/sp.py", line 80, in run
stats, code = self._statsfunc(self._domain)
File "/usr/share/vdsm/storage/sp.py", line 1553, in _repostats
if not domain.selftest():
File "/usr/share/vdsm/storage/nfsSD.py", line 127, in selftest
return fileSD.FileStorageDomain.selftest(self)
File "/usr/share/vdsm/storage/fileSD.py", line 298, in selftest
oop.os.statvfs(self.domaindir)
File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
return self.runExternally(func, *args, **kwds)
File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
raise Timeout("Operation Stuck")
Timeout: Operation Stuck
Thread-1941::DEBUG::2011-04-14 14:56:23,672::misc::1150::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Emitting event
Thread-1941::DEBUG::2011-04-14 14:56:23,673::misc::1160::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Calling registered method `run`
Thread-1941::DEBUG::2011-04-14 14:56:23,673::misc::1168::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Event emitted
Thread-1946::DEBUG::2011-04-14 14:56:28,461::clientIF::215::Storage.Dispatcher.Protect::(wrapper) [10.35.113.42]
Thread-2022::INFO::2011-04-14 14:58:34,955::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code':
0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00276398658752', 'lastCheck': 1302782310.0283749, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'d
elay': '62.0645911694', 'lastCheck': 1302782255.7480941, 'valid': False, 'code': 200}}
after a few minutes backend will send a command to deactivateStorageDomain - which fails with unexpected error.
[root@blond-vdsf ~]# tail -f /var/log/vdsm/vdsm.log |grep deactivate
Thread-2132::INFO::2011-04-14 15:00:51,332::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: deactivateStorageDomain, args: ( sdUUID=bf79e924-91f1-487d-b367-f7a2d9faee11 spUUID=6bb565ea-a5d3-4b65-8c45-8c661597ded0 msdUUID=00000000-0000-0000-0000-000000000000 masterVersion=1)
Thread-2132::DEBUG::2011-04-14 15:00:51,333::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.6bb565ea-a5d3-4b65-8c45-8c661597ded0`ReqID=`a7c7a26b-8775-4006-a7d0-3d9f611223c1`::Request was made in '/usr/share/vdsm/storage/spm.py' line '1163' at 'public_deactivateStorageDomain'
Thread-2132::DEBUG::2011-04-14 15:00:51,336::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.bf79e924-91f1-487d-b367-f7a2d9faee11`ReqID=`fb04ef39-ec7c-489b-b6ad-c2066c3223f7`::Request was made in '/usr/share/vdsm/storage/spm.py' line '1164' at 'public_deactivateStorageDomain'
Thread-2132::INFO::2011-04-14 15:00:51,340::sp::943::Storage.StoragePool::(deactivateSD) sdUUID=bf79e924-91f1-487d-b367-f7a2d9faee11 spUUID=6bb565ea-a5d3-4b65-8c45-8c661597ded0 msdUUID=00000000-0000-0000-0000-000000000000
Thread-2132::ERROR::2011-04-14 15:01:53,349::sp::974::Storage.StoragePool::(deactivateSD) Unexpected error
File "/usr/share/vdsm/storage/sp.py", line 969, in deactivateSD
Thread-2132::INFO::2011-04-14 15:04:00,029::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: deactivateStorageDomain, Return response: {'status': {'message': 'OK', 'code': 0}}
Thread-2137::INFO::2011-04-14 15:00:59,750::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code':
0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00240421295166', 'lastCheck': 1302782450.2348771, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'d
elay': '62.0642220974', 'lastCheck': 1302782399.8979051, 'valid': False, 'code': 200}}
Thread-1819::ERROR::2011-04-14 15:01:11,971::sp::85::Storage.StatsThread::(run) Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/sp.py", line 80, in run
stats, code = self._statsfunc(self._domain)
File "/usr/share/vdsm/storage/sp.py", line 1553, in _repostats
if not domain.selftest():
File "/usr/share/vdsm/storage/nfsSD.py", line 127, in selftest
return fileSD.FileStorageDomain.selftest(self)
File "/usr/share/vdsm/storage/fileSD.py", line 298, in selftest
oop.os.statvfs(self.domaindir)
File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
return self.runExternally(func, *args, **kwds)
File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
raise Timeout("Operation Stuck")
Timeout: Operation Stuck
Thread-2132::ERROR::2011-04-14 15:01:53,349::sp::974::Storage.StoragePool::(deactivateSD) Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/sp.py", line 969, in deactivateSD
dom.validate()
File "/usr/share/vdsm/storage/fileSD.py", line 185, in validate
self.Metadata(cache=False)
File "/usr/share/vdsm/storage/sd.py", line 482, in Metadata
return self._metadata.metadata
File "/usr/share/vdsm/storage/metadata.py", line 110, in __getmetadata
md = self._get()
File "/usr/share/vdsm/storage/fileSD.py", line 39, in _get
return misc.stripNewLines(oop.directReadLines(self.metafile))
File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
return self.runExternally(func, *args, **kwds)
File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
raise Timeout("Operation Stuck")
Timeout: Operation Stuck
Thread-2132::INFO::2011-04-14 15:01:53,351::sp::1405::Storage.StoragePool::(getDomains) Get storage pool domains: {'90228690-381d-4043-b8bd-bf8651981913': 'Active', 'bf79e924-91f1-487d-b367-f7a2d9faee11': 'Active'}
Thread-2132::DEBUG::2011-04-14 15:01:53,352::misc::498::Storage.Misc::(rotateFiles) dir: /var/log/vdsm/backup, prefixName: 90228690-381d-4043-b8bd-bf8651981913, versions: 30
Thread-2132::DEBUG::2011-04-14 15:01:53,352::misc::519::Storage.Misc::(rotateFiles) versions found: [6, 5, 4, 3, 2, 1, 0]
Thread-2132::INFO::2011-04-14 15:01:53,369::sp::1349::Storage.StoragePool::(setDomains) Set storage pool domains: 90228690-381d-4043-b8bd-bf8651981913:Active,bf79e924-91f1-487d-:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2011-1782.html |
Description of problem: When communication of one NFS domain is down (block with iptables) then other NFS domains appear to be stuck, even though they're on other machines. Version-Release number of selected component (if applicable): vdsm-4.9-41.el6.x86_64 How reproducible: Most of the time Steps to Reproduce: 1. 1 Host with 1 NFS data domain, 1 NFS ISO domain, and 1 NFS Export domain, all on separate machines. 2. Disconnect the machine hosting the export domain using iptables. Actual results: One or more of the other domains is reported as having an old lastCheck Expected results: Only the disconnected domain should report an old lastCheck Additional info: The disconnected domain is 4ce55d5c-71dd-4d14-b9ad-d48ea45f7292, the active domains (which also get stuck) are: 86959292-be17-45d5-9c1d-0e24474ba951, 8afc7c15-3909-41d4-88fb-6db18cbc6a2c. The original domain was stuck at (/var/log/vdsm/vdsm.log lines 43116-43127/43798): Thread-9446::INFO::2011-01-16 19:29:56,396::dispatcher::101::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'delay': '0.00349593162537', 'lastCheck': 1295198992.309643, 'valid': True, 'code': 0}, '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'delay': '0.00527501106262', 'lastCheck': 1295198992.7263801, 'valid': True, 'code': 0}, '86959292-be17-45d5-9c1d-0e24474ba951': {'delay': '0.0064640045166', 'lastCheck': 1295198993.1063371, 'valid': True, 'code': 0}, '5d190e0d-fd81-44ee-97f3-637e97a61ac9': {'delay': '0.0292568206787', 'lastCheck': 1295198990.5932479, 'valid': False, 'code': 358}, '48819cb9-50bb-451b-9c35-da26c2ff48ac': {'delay': '0.031044960022', 'lastCheck': 1295198990.6218569, 'valid': False, 'code': 358}} The other domains reported their last lastCheck time at (/var/log/vdsm/vdsm.log lines 43211-43220/43798): Thread-9479::INFO::2011-01-16 19:30:48,851::dispatcher::101::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'delay': '0.00349593162537', 'lastCheck': 1295198992.309643, 'valid': True, 'code': 0}, '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'delay': '0.00415015220642', 'lastCheck': 1295199042.7989571, 'valid': True, 'code': 0}, '86959292-be17-45d5-9c1d-0e24474ba951': {'delay': '0.00311708450317', 'lastCheck': 1295199043.1784101, 'valid': True, 'code': 0}, '5d190e0d-fd81-44ee-97f3-637e97a61ac9': {'delay': '0.0292568206787', 'lastCheck': 1295198990.5932479, 'valid': False, 'code': 358}, '48819cb9-50bb-451b-9c35-da26c2ff48ac': {'delay': '0.031044960022', 'lastCheck': 1295198990.6218569, 'valid': False, 'code': 358}}