Description of problem:
When communication with one NFS domain is down (blocked with iptables), the other NFS domains appear to be stuck, even though they are served from other machines.

Version-Release number of selected component (if applicable):
vdsm-4.9-41.el6.x86_64

How reproducible:
Most of the time

Steps to Reproduce:
1. 1 host with 1 NFS data domain, 1 NFS ISO domain, and 1 NFS export domain, all on separate machines.
2. Disconnect the machine hosting the export domain using iptables.

Actual results:
One or more of the other domains is reported as having an old lastCheck.

Expected results:
Only the disconnected domain should report an old lastCheck.

Additional info:
The disconnected domain is 4ce55d5c-71dd-4d14-b9ad-d48ea45f7292; the active domains (which also get stuck) are 86959292-be17-45d5-9c1d-0e24474ba951 and 8afc7c15-3909-41d4-88fb-6db18cbc6a2c.

The original domain was stuck at (/var/log/vdsm/vdsm.log lines 43116-43127/43798):

Thread-9446::INFO::2011-01-16 19:29:56,396::dispatcher::101::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'delay': '0.00349593162537', 'lastCheck': 1295198992.309643, 'valid': True, 'code': 0}, '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'delay': '0.00527501106262', 'lastCheck': 1295198992.7263801, 'valid': True, 'code': 0}, '86959292-be17-45d5-9c1d-0e24474ba951': {'delay': '0.0064640045166', 'lastCheck': 1295198993.1063371, 'valid': True, 'code': 0}, '5d190e0d-fd81-44ee-97f3-637e97a61ac9': {'delay': '0.0292568206787', 'lastCheck': 1295198990.5932479, 'valid': False, 'code': 358}, '48819cb9-50bb-451b-9c35-da26c2ff48ac': {'delay': '0.031044960022', 'lastCheck': 1295198990.6218569, 'valid': False, 'code': 358}}

The other domains reported their last lastCheck time at (/var/log/vdsm/vdsm.log lines 43211-43220/43798):

Thread-9479::INFO::2011-01-16 19:30:48,851::dispatcher::101::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'delay': '0.00349593162537', 'lastCheck': 1295198992.309643, 'valid': True, 'code': 0}, '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'delay': '0.00415015220642', 'lastCheck': 1295199042.7989571, 'valid': True, 'code': 0}, '86959292-be17-45d5-9c1d-0e24474ba951': {'delay': '0.00311708450317', 'lastCheck': 1295199043.1784101, 'valid': True, 'code': 0}, '5d190e0d-fd81-44ee-97f3-637e97a61ac9': {'delay': '0.0292568206787', 'lastCheck': 1295198990.5932479, 'valid': False, 'code': 358}, '48819cb9-50bb-451b-9c35-da26c2ff48ac': {'delay': '0.031044960022', 'lastCheck': 1295198990.6218569, 'valid': False, 'code': 358}}
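To illustrate how the staleness shows up in these responses, here is a minimal sketch (not vdsm code): the UUIDs and lastCheck values are copied from the second repoStats entry above, while the approximate "now" timestamp and the 30-second threshold are assumptions made for the example. A stuck domain is simply one whose lastCheck stops advancing while the others keep updating.

# Minimal sketch, not part of vdsm: report how stale each domain's lastCheck
# is relative to the time the repoStats response was logged.

repo_stats = {
    '4ce55d5c-71dd-4d14-b9ad-d48ea45f7292': {'lastCheck': 1295198992.309643, 'valid': True},
    '8afc7c15-3909-41d4-88fb-6db18cbc6a2c': {'lastCheck': 1295199042.7989571, 'valid': True},
    '86959292-be17-45d5-9c1d-0e24474ba951': {'lastCheck': 1295199043.1784101, 'valid': True},
}

now = 1295199048.8  # roughly 2011-01-16 19:30:48, when the response above was logged

for sd_uuid, stats in repo_stats.items():
    age = now - stats['lastCheck']
    flag = 'STUCK?' if age > 30 else 'ok'  # 30s threshold is an arbitrary example value
    print(f"{sd_uuid}: lastCheck {age:5.1f}s ago, valid={stats['valid']} -> {flag}")

With these sample values, the disconnected domain 4ce55d5c-71dd-4d14-b9ad-d48ea45f7292 shows a lastCheck roughly 56 seconds old while the other two domains show ages of a few seconds, even though all three still report 'valid': True.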
Created attachment 473980 [details]
Log before problem started, with repoStats updated
Created attachment 473981 [details]
Log where problem started (after spm stop)

It looks like after spm stop, repoStats stops updating.
Note that in the attached logs an iSCSI domain got stuck after spmStop and incorrectly reports 'valid': False.
Should be fixed in build 50. Please test.
Created attachment 486578 [details]
Log of the same NFS scenario on vdsm-4.9-54.el6.x86_64
vdsm-4.9-54.el6.x86_64 - failed to verify. Instead of showing lastCheck on the "faulty" NFS domain, RHEV-M throws a lot of Java exceptions into its log - those are attached above. I assume that's not right, is it?
No response on needinfo - nack + reopening.
Whether it's right or not is irrelevant; what you should be checking is the VDSM log (obviously RHEV-M shouldn't have Java exceptions, but what does that have to do with vdsm?). You did not attach the vdsm log, so it is impossible to see whether the reported time was wrong. Moving back to ON_QA.
Created attachment 492098 [details]
log

Verified on ic108.

vdsm:
vdsm-debug-plugin-4.9-59.el6.x86_64
vdsm-4.9-59.el6.x86_64
vdsm-debuginfo-4.9-59.el6.x86_64
vdsm-cli-4.9-59.el6.x86_64

libvirt:
libvirt-debuginfo-0.8.7-16.el6.x86_64
libvirt-client-0.8.7-16.el6.x86_64
libvirt-0.8.7-16.el6.x86_64
libvirt-python-0.8.7-16.el6.x86_64
libvirt-devel-0.8.7-16.el6.x86_64

The master domain is not stuck - only the one which is blocked - and this is OK, since repoStats cannot get a reply from that domain. However, there is a different issue in which deactivateStorageDomain fails with "Operation Stuck". After speaking with Danken, this should also be solved in the next vdsm, 4.9.60. This bug is verified; after the new vdsm is released I will check this issue again and, if needed, open a new bug for the deactivateStorageDomain issue.

Thread-1938::INFO::2011-04-14 14:56:22,199::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00239181518555', 'lastCheck': 1302782179.8576379, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'delay': '0.0031099319458', 'lastCheck': 1302782111.597779, 'valid': True, 'code': 0}}

Thread-1819::ERROR::2011-04-14 14:56:23,670::sp::85::Storage.StatsThread::(run) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 80, in run
    stats, code = self._statsfunc(self._domain)
  File "/usr/share/vdsm/storage/sp.py", line 1553, in _repostats
    if not domain.selftest():
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in selftest
    return fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 298, in selftest
    oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

Thread-1941::DEBUG::2011-04-14 14:56:23,672::misc::1150::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Emitting event
Thread-1941::DEBUG::2011-04-14 14:56:23,673::misc::1160::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Calling registered method `run`
Thread-1941::DEBUG::2011-04-14 14:56:23,673::misc::1168::Event.StatsThread.onDomainConnectivityStateChange::(_emit) Event emitted
Thread-1946::DEBUG::2011-04-14 14:56:28,461::clientIF::215::Storage.Dispatcher.Protect::(wrapper) [10.35.113.42]
Thread-2022::INFO::2011-04-14 14:58:34,955::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00276398658752', 'lastCheck': 1302782310.0283749, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'delay': '62.0645911694', 'lastCheck': 1302782255.7480941, 'valid': False, 'code': 200}}

After a few minutes the backend sends a deactivateStorageDomain command, which fails with an unexpected error.
[root@blond-vdsf ~]# tail -f /var/log/vdsm/vdsm.log | grep deactivate
Thread-2132::INFO::2011-04-14 15:00:51,332::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: deactivateStorageDomain, args: ( sdUUID=bf79e924-91f1-487d-b367-f7a2d9faee11 spUUID=6bb565ea-a5d3-4b65-8c45-8c661597ded0 msdUUID=00000000-0000-0000-0000-000000000000 masterVersion=1)
Thread-2132::DEBUG::2011-04-14 15:00:51,333::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.6bb565ea-a5d3-4b65-8c45-8c661597ded0`ReqID=`a7c7a26b-8775-4006-a7d0-3d9f611223c1`::Request was made in '/usr/share/vdsm/storage/spm.py' line '1163' at 'public_deactivateStorageDomain'
Thread-2132::DEBUG::2011-04-14 15:00:51,336::resourceManager::154::ResourceManager.Request::(__init__) ResName=`Storage.bf79e924-91f1-487d-b367-f7a2d9faee11`ReqID=`fb04ef39-ec7c-489b-b6ad-c2066c3223f7`::Request was made in '/usr/share/vdsm/storage/spm.py' line '1164' at 'public_deactivateStorageDomain'
Thread-2132::INFO::2011-04-14 15:00:51,340::sp::943::Storage.StoragePool::(deactivateSD) sdUUID=bf79e924-91f1-487d-b367-f7a2d9faee11 spUUID=6bb565ea-a5d3-4b65-8c45-8c661597ded0 msdUUID=00000000-0000-0000-0000-000000000000
Thread-2132::ERROR::2011-04-14 15:01:53,349::sp::974::Storage.StoragePool::(deactivateSD) Unexpected error
  File "/usr/share/vdsm/storage/sp.py", line 969, in deactivateSD
Thread-2132::INFO::2011-04-14 15:04:00,029::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: deactivateStorageDomain, Return response: {'status': {'message': 'OK', 'code': 0}}

Thread-2137::INFO::2011-04-14 15:00:59,750::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: repoStats, Return response: {'status': {'message': 'OK', 'code': 0}, '90228690-381d-4043-b8bd-bf8651981913': {'delay': '0.00240421295166', 'lastCheck': 1302782450.2348771, 'valid': True, 'code': 0}, 'bf79e924-91f1-487d-b367-f7a2d9faee11': {'delay': '62.0642220974', 'lastCheck': 1302782399.8979051, 'valid': False, 'code': 200}}

Thread-1819::ERROR::2011-04-14 15:01:11,971::sp::85::Storage.StatsThread::(run) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 80, in run
    stats, code = self._statsfunc(self._domain)
  File "/usr/share/vdsm/storage/sp.py", line 1553, in _repostats
    if not domain.selftest():
  File "/usr/share/vdsm/storage/nfsSD.py", line 127, in selftest
    return fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 298, in selftest
    oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

Thread-2132::ERROR::2011-04-14 15:01:53,349::sp::974::Storage.StoragePool::(deactivateSD) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 969, in deactivateSD
    dom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 185, in validate
    self.Metadata(cache=False)
  File "/usr/share/vdsm/storage/sd.py", line 482, in Metadata
    return self._metadata.metadata
  File "/usr/share/vdsm/storage/metadata.py", line 110, in __getmetadata
    md = self._get()
  File "/usr/share/vdsm/storage/fileSD.py", line 39, in _get
    return misc.stripNewLines(oop.directReadLines(self.metafile))
  File "/usr/share/vdsm/storage/processPool.py", line 35, in wrapper
    return self.runExternally(func, *args, **kwds)
  File "/usr/share/vdsm/storage/processPool.py", line 63, in runExternally
    raise Timeout("Operation Stuck")
Timeout: Operation Stuck

Thread-2132::INFO::2011-04-14 15:01:53,351::sp::1405::Storage.StoragePool::(getDomains) Get storage pool domains: {'90228690-381d-4043-b8bd-bf8651981913': 'Active', 'bf79e924-91f1-487d-b367-f7a2d9faee11': 'Active'}
Thread-2132::DEBUG::2011-04-14 15:01:53,352::misc::498::Storage.Misc::(rotateFiles) dir: /var/log/vdsm/backup, prefixName: 90228690-381d-4043-b8bd-bf8651981913, versions: 30
Thread-2132::DEBUG::2011-04-14 15:01:53,352::misc::519::Storage.Misc::(rotateFiles) versions found: [6, 5, 4, 3, 2, 1, 0]
Thread-2132::INFO::2011-04-14 15:01:53,369::sp::1349::Storage.StoragePool::(setDomains) Set storage pool domains: 90228690-381d-4043-b8bd-bf8651981913:Active,bf79e924-91f1-487d-:
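For context on the repeated Timeout("Operation Stuck") in the tracebacks above: the statvfs and metadata reads are dispatched through storage/processPool.py, which raises that Timeout when the delegated call does not return in time. The usual reason for doing the work in a separate process is that a thread blocked in uninterruptible NFS I/O cannot be cancelled, while a helper process can be abandoned. Below is a minimal, hypothetical sketch of that general pattern; it assumes nothing about vdsm's actual processPool internals, and run_with_timeout, OperationStuck and the 10-second timeout are illustrative names only.

# Minimal sketch (not vdsm's processPool): run a potentially-hanging
# filesystem call, such as os.statvfs() on a dead NFS mount, in a separate
# process and give up after a timeout instead of blocking the caller forever.

import multiprocessing
import os


class OperationStuck(Exception):
    """Raised when the delegated call does not finish within the timeout."""


def _call(conn, func, args):
    # Child-process side: run the call and ship the result (or error) back.
    try:
        conn.send(('ok', func(*args)))
    except Exception as e:
        conn.send(('err', repr(e)))


def run_with_timeout(func, args=(), timeout=10):
    parent, child = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_call, args=(child, func, args))
    proc.start()
    try:
        if not parent.poll(timeout):
            # The call is still blocked (e.g. uninterruptible NFS I/O):
            # abandon it so the monitoring thread can carry on.
            raise OperationStuck("Operation Stuck")
        status, payload = parent.recv()
        if status == 'err':
            raise RuntimeError(payload)
        return payload
    finally:
        proc.terminate()
        proc.join()


if __name__ == '__main__':
    # Example: a statvfs that hangs on an unreachable NFS export would raise
    # OperationStuck here instead of freezing the calling thread.
    print(run_with_timeout(os.statvfs, ('/',), timeout=10))

After such a timeout the affected domain is reported with 'valid': False in repoStats, as seen in the log entries above, while the checks for the other domains can continue.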
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2011-1782.html