| Summary: | Domain monitor thread never wakes up from threading.Event.wait(timeout) | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Nir Soffer <nsoffer> | ||||
| Component: | vdsm | Assignee: | Saggi Mizrahi <smizrahi> | ||||
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |||||
| Severity: | high | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 3.2.0 | CC: | bazulay, dgibson, iheim, lpeer, yeylon | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.3.3 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Linux | ||||||
| Whiteboard: | infra | ||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2014-03-24 13:53:46 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
|
||||||
One unlikely option is that stopEvent.wait() raises an exception - since we don't have an exception handler for monitor thread main function, this can cause this silent death of the thread. To test this you can apply this patch: http://gerrit.ovirt.org/#/c/21487 If this is indeed the issue, the next time it happens we will have a backtrace showing the error. If the error happens again and we don't see the backtrace, it proves that the issue is not an unhandled exception. I've passed that patch on to the customer, and we'll see what happens. I've tried my best to reproduce this but It just can't. I would very much like a coredump of VDSM when the thread gets stuck. Also, information about the kernel, python and cpus. Saggi, We'll have to wait on the customer to encounter the problem again to get thread trace or core information. As to the rest: python-2.6.6-37.el6_4.x86_64 kernel-2.6.32-358.18.1.el6.x86_64 Linux wsc-voo-057 2.6.32-358.18.1.el6.x86_64 #1 SMP Fri Aug 2 17:04:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux There are 32-cpus, like this: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz stepping : 7 cpu MHz : 2701.000 cache size : 20480 KB physical id : 0 siblings : 16 core id : 0 cpu cores : 8 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid bogomips : 5400.12 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: Any other specifics you need? No, surprising that he has only one core. I was suspecting some weird pthread corner cases related to highly concurrent machines but I guess we'll have to wait until it gets stuck again. I missed that line skimming over the comment. My bad. Anyway, I still need a core dump or a semi-reliable way to reproduce this. Sorry, I wasn't clear. There are multiple cores - 32 (logical) CPUs. I just included the cpuinfo from 0 because they're all the same. Created attachment 831834 [details]
/proc/cpuinfo for wsc-voo-057
Attaching full /proc/cpuinfo to make things clearer.
(In reply to David Gibson from comment #4) > I've passed that patch on to the customer, and we'll see what happens. Any news from the customer? Do we see the new traceback in customer logs? Nothing so far, I'm afraid. The customer has had one more failure - but an entirely different failure mode, so no new trace. Note also that the method I was given for getting a python thread trace didn't seem to work. See query in comment #53 of bug 1022352. The build in python gdb scripts sometime fail to work There is a version I made here https://github.com/ficoos/gdb-python-scripts It has way better track record for working than the built in one for us the commands is called pystack Sorry, I was looking for an opportunity to try these scripts on the customer system when it hit the problem. However, the occasion hasn't arisen. In the meantime, the customer has updated their filer, and removed some load from the system which seems to have made the problem stop triggering. So for the time being we're not going to get any more information. The customer has indicated they'd be interested in doing some post mortem - and maybe running some further tests - but only after their rush period is over in a few weeks time. Dropping priority to medium, since the customer is no longer actively hitting this problem. Any updates ? Sorry for the delay - I've been caught up with high priority cases. No updates, I'm afraid. Customer hasn't reencountered problems since updating filer and rearranging things to reduce some loads. They're currently looking at updating to RHEV 3.3. in the fairly near future. I will ping them to see if they're still interested in doing some post mortem analysis. no new data, moving to CLOSE INSUFFICIENT_DATA, IN case this happens again please reopen. |
Description of problem: On customer machine with 3 storage domains, one monitor thread stops monitoring the domain without any reason. Checking the logs show that the thread appear to never return from threading.Event.wait(timeout). Log from the last event (wsc-voo-057-2013111311221384359735/var/log/vdsm/vdsm.log.4) Thread-26::DEBUG::2013-11-12 21:00:46,494::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:00:46,494::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-25::DEBUG::2013-11-12 21:00:46,495::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-25::DEBUG::2013-11-12 21:00:46,496::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-25::DEBUG::2013-11-12 21:00:46,496::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-25::DEBUG::2013-11-12 21:00:46,496::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-24::DEBUG::2013-11-12 21:00:46,497::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 3, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,497::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/rhev/data-center/mnt/147.249.124.211:_vol_RHEVM__Data_RHEV__NFS/d620559d-dbfc-4ab2-9c63-9118dbf8c5cc/dom_md/metadata bs=4096 count=1' (cwd None) Thread-24::DEBUG::2013-11-12 21:00:46,502::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '0+1 records in\n0+1 records out\n580 bytes (580 B) copied, 0.000255867 s, 2.3 MB/s\n'; <rc> = 0 Thread-24::DEBUG::2013-11-12 21:00:46,502::domainMonitor::221::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 4, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,503::domainMonitor::226::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 5, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,503::domainMonitor::238::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 6, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,504::domainMonitor::243::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 7, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,504::domainMonitor::249::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 8, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::254::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 9, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::259::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 10, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 11, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::270::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 12, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,511::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-24::DEBUG::2013-11-12 21:00:46,511::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-190290::DEBUG::2013-11-12 21:00:53,283::task::579::TaskManager.Task::(_updateState) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::moving from state init -> state preparing Thread-190290::INFO::2013-11-12 21:00:53,286::logUtils::40::dispatcher::(wrapper) Run and protect: repoStats(options=None) Thread-190290::INFO::2013-11-12 21:00:53,286::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'d620559d-dbfc-4ab2-9c63-9118dbf8c5cc': {'delay': '0.000255867', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 3}, '1cbea565-8360-43f6-8790-3fc8dffd96c6': {'delay': '0.000297996', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}, '660b187b-3b58-421c-9f64-bd1f536b5901': {'delay': '0.000274795', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}} Thread-190290::DEBUG::2013-11-12 21:00:53,287::task::1168::TaskManager.Task::(prepare) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::finished: {'d620559d-dbfc-4ab2-9c63-9118dbf8c5cc': {'delay': '0.000255867', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 3}, '1cbea565-8360-43f6-8790-3fc8dffd96c6': {'delay': '0.000297996', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}, '660b187b-3b58-421c-9f64-bd1f536b5901': {'delay': '0.000274795', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}} Thread-190290::DEBUG::2013-11-12 21:00:53,287::task::579::TaskManager.Task::(_updateState) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::moving from state preparing -> state finished Thread-190290::DEBUG::2013-11-12 21:00:53,287::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {} Thread-190290::DEBUG::2013-11-12 21:00:53,287::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {} Thread-190290::DEBUG::2013-11-12 21:00:53,287::task::974::TaskManager.Task::(_decref) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::ref 0 aborting False Thread-26::DEBUG::2013-11-12 21:00:56,495::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:00:56,497::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Next log (wsc-voo-057-2013111311221384359735/var/log/vdsm/vdsm.log.3) Thread-26::DEBUG::2013-11-12 21:01:04,766::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,766::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,767::domainMonitor::199::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 1, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,767::domainMonitor::199::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 1, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,767::domainMonitor::211::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 2, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,768::domainMonitor::211::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 2, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-25::DEBUG::2013-11-12 21:01:04,776::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 3, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,777::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 3, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,857::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/rhev/data-center/mnt/147.249.124.211:_vol_KVM__Images_exports/1cbea565-8360-43f6-8790-3fc8dffd96c6/dom_md/metadata bs=4096 count=1' (cwd None) Thread-26::DEBUG::2013-11-12 21:01:04,858::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/rhev/data-center/mnt/147.249.124.211:_vol_KVM__Images_NFS__ISO/660b187b-3b58-421c-9f64-bd1f536b5901/dom_md/metadata bs=4096 count=1' (cwd None) Thread-26::DEBUG::2013-11-12 21:01:04,888::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '0+1 records in\n0+1 records out\n367 bytes (367 B) copied, 0.00169669 s, 216 kB/s\n'; <rc> = 0 Thread-25::DEBUG::2013-11-12 21:01:04,890::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '0+1 records in\n0+1 records out\n361 bytes (361 B) copied, 0.00170088 s, 212 kB/s\n'; <rc> = 0 Thread-26::DEBUG::2013-11-12 21:01:04,890::domainMonitor::221::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 4, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,894::domainMonitor::221::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 4, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,901::domainMonitor::226::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 5, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,901::domainMonitor::226::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 5, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,903::domainMonitor::238::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 6, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,904::domainMonitor::238::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 6, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,904::domainMonitor::243::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 7, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,904::domainMonitor::243::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 7, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,905::domainMonitor::249::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 8, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,905::domainMonitor::249::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 8, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,905::domainMonitor::254::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 9, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,906::domainMonitor::254::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 9, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,906::domainMonitor::259::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 10, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,906::domainMonitor::259::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 10, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,907::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 11, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,907::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 11, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,907::domainMonitor::270::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 12, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,907::domainMonitor::270::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 12, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,908::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,908::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,913::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,913::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,913::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,914::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,914::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,914::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:01:04,914::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-25::DEBUG::2013-11-12 21:01:04,918::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-24 is never seen again in vdsm.log.3, vdsm.log.2 and vdsm.log, until vdsm is restarted on 2013-11-13 11:17:22. Version-Release number of selected component (if applicable): vdsm-4.10.2-25.1.el6ev.x86_64 List of all packages installed: https://bugzilla.redhat.com/attachment.cgi?id=826354 vdsm includes this logging patch: https://bugzilla.redhat.com/attachment.cgi?id=825842 How reproducible: Happens on customer machine 2 or 3 times according to logs. Code failing: self.log.debug("_monitorLoop before wait, host %s, storage domain %s", self.hostId, self.sdUUID) self.stopEvent.wait(self.interval) self.log.debug("_monitorLoop after wait, host %s, storage domain %s", self.hostId, self.sdUUID) We see the before wait log message, but never the after wait log message for Thread-24. Environment info: The host is booting from SAN, and we can see in the sanlock logs that there are io problems with the storage from time to time. We can see how all domain monitor threads wakes up and are blocked on the storage trying to log the "_monitorDomain 0" message for 20 seconds, and wake up in the same second that sanlock wakes up and complain about io problems. wsc-voo-057-2013111311221384359735/var/log/messages: Nov 12 21:00:18 wsc-voo-057 sanlock[4546]: 2013-11-12 21:00:18-0500 459268 [6607]: s4 renewed 459247 delta_length 21 too long wsc-voo-057-2013111311221384359735/var/log/vdsm/vdsm.log.4: Thread-25::DEBUG::2013-11-12 20:59:58,824::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 20:59:58,825::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-24::DEBUG::2013-11-12 20:59:58,899::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc Thread-25::DEBUG::2013-11-12 21:00:18,628::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6 Thread-26::DEBUG::2013-11-12 21:00:18,628::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901 Thread-24::DEBUG::2013-11-12 21:00:18,629::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc At the time when the thread disapper, no io issues reported, and we can see that the other 2 threads wakes up and log normally. All 3 storage domains use NFS, and the same storage server provides both NFS and SAN storage the hosts are booting from.