Bug 1034741 - Domain monitor thread never wakes up from threading.Event.wait(timeout)
Summary: Domain monitor thread never wakes up from threading.Event.wait(timeout)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.3.3
Assignee: Saggi Mizrahi
QA Contact:
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2013-11-26 12:31 UTC by Nir Soffer
Modified: 2018-12-05 16:40 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-24 13:53:46 UTC
oVirt Team: Infra
Target Upstream Version:


Attachments
/proc/cpuinfo for wsc-voo-057 (28.38 KB, text/plain)
2013-12-02 23:07 UTC, David Gibson

Description Nir Soffer 2013-11-26 12:31:43 UTC
Description of problem:

On a customer machine with 3 storage domains, one monitor thread stops monitoring its domain for no apparent reason. Checking the logs shows that the thread appears to never return from threading.Event.wait(timeout).

Log from the last event (wsc-voo-057-2013111311221384359735/var/log/vdsm/vdsm.log.4)

Thread-26::DEBUG::2013-11-12 21:00:46,494::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:00:46,494::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-25::DEBUG::2013-11-12 21:00:46,495::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-25::DEBUG::2013-11-12 21:00:46,496::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-25::DEBUG::2013-11-12 21:00:46,496::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-25::DEBUG::2013-11-12 21:00:46,496::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-24::DEBUG::2013-11-12 21:00:46,497::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 3, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,497::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/rhev/data-center/mnt/147.249.124.211:_vol_RHEVM__Data_RHEV__NFS/d620559d-dbfc-4ab2-9c63-9118dbf8c5cc/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-24::DEBUG::2013-11-12 21:00:46,502::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '0+1 records in\n0+1 records out\n580 bytes (580 B) copied, 0.000255867 s, 2.3 MB/s\n'; <rc> = 0
Thread-24::DEBUG::2013-11-12 21:00:46,502::domainMonitor::221::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 4, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,503::domainMonitor::226::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 5, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,503::domainMonitor::238::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 6, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,504::domainMonitor::243::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 7, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,504::domainMonitor::249::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 8, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::254::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 9, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::259::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 10, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 11, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::270::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 12, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,510::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,511::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-24::DEBUG::2013-11-12 21:00:46,511::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-190290::DEBUG::2013-11-12 21:00:53,283::task::579::TaskManager.Task::(_updateState) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::moving from state init -> state preparing
Thread-190290::INFO::2013-11-12 21:00:53,286::logUtils::40::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-190290::INFO::2013-11-12 21:00:53,286::logUtils::42::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'d620559d-dbfc-4ab2-9c63-9118dbf8c5cc': {'delay': '0.000255867', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 3}, '1cbea565-8360-43f6-8790-3fc8dffd96c6': {'delay': '0.000297996', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}, '660b187b-3b58-421c-9f64-bd1f536b5901': {'delay': '0.000274795', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}}
Thread-190290::DEBUG::2013-11-12 21:00:53,287::task::1168::TaskManager.Task::(prepare) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::finished: {'d620559d-dbfc-4ab2-9c63-9118dbf8c5cc': {'delay': '0.000255867', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 3}, '1cbea565-8360-43f6-8790-3fc8dffd96c6': {'delay': '0.000297996', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}, '660b187b-3b58-421c-9f64-bd1f536b5901': {'delay': '0.000274795', 'lastCheck': '6.8', 'code': 0, 'valid': True, 'version': 0}}
Thread-190290::DEBUG::2013-11-12 21:00:53,287::task::579::TaskManager.Task::(_updateState) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::moving from state preparing -> state finished
Thread-190290::DEBUG::2013-11-12 21:00:53,287::resourceManager::809::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-190290::DEBUG::2013-11-12 21:00:53,287::resourceManager::844::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-190290::DEBUG::2013-11-12 21:00:53,287::task::974::TaskManager.Task::(_decref) Task=`645c66bb-a154-4800-b0c6-f9eeff111376`::ref 0 aborting False
Thread-26::DEBUG::2013-11-12 21:00:56,495::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:00:56,497::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6

Next log (wsc-voo-057-2013111311221384359735/var/log/vdsm/vdsm.log.3)

Thread-26::DEBUG::2013-11-12 21:01:04,766::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,766::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,767::domainMonitor::199::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 1, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,767::domainMonitor::199::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 1, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,767::domainMonitor::211::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 2, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,768::domainMonitor::211::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 2, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-25::DEBUG::2013-11-12 21:01:04,776::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 3, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,777::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 3, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,857::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/rhev/data-center/mnt/147.249.124.211:_vol_KVM__Images_exports/1cbea565-8360-43f6-8790-3fc8dffd96c6/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-26::DEBUG::2013-11-12 21:01:04,858::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/dd iflag=direct if=/rhev/data-center/mnt/147.249.124.211:_vol_KVM__Images_NFS__ISO/660b187b-3b58-421c-9f64-bd1f536b5901/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-26::DEBUG::2013-11-12 21:01:04,888::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '0+1 records in\n0+1 records out\n367 bytes (367 B) copied, 0.00169669 s, 216 kB/s\n'; <rc> = 0
Thread-25::DEBUG::2013-11-12 21:01:04,890::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = '0+1 records in\n0+1 records out\n361 bytes (361 B) copied, 0.00170088 s, 212 kB/s\n'; <rc> = 0
Thread-26::DEBUG::2013-11-12 21:01:04,890::domainMonitor::221::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 4, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,894::domainMonitor::221::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 4, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,901::domainMonitor::226::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 5, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,901::domainMonitor::226::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 5, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,903::domainMonitor::238::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 6, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,904::domainMonitor::238::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 6, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,904::domainMonitor::243::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 7, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,904::domainMonitor::243::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 7, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,905::domainMonitor::249::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 8, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,905::domainMonitor::249::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 8, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,905::domainMonitor::254::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 9, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,906::domainMonitor::254::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 9, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,906::domainMonitor::259::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 10, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,906::domainMonitor::259::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 10, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,907::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 11, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,907::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 11, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,907::domainMonitor::270::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 12, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,907::domainMonitor::270::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 12, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,908::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,908::domainMonitor::275::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 13, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,913::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,913::domainMonitor::291::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 14, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,913::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,914::domainMonitor::304::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 15, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,914::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,914::domainMonitor::309::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 16, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:01:04,914::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-25::DEBUG::2013-11-12 21:01:04,918::domainMonitor::160::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop before wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6

Thread-24 is never seen again in vdsm.log.3, vdsm.log.2 and vdsm.log, until vdsm is restarted on 2013-11-13 11:17:22.

Version-Release number of selected component (if applicable):

vdsm-4.10.2-25.1.el6ev.x86_64

List of all packages installed:
https://bugzilla.redhat.com/attachment.cgi?id=826354

vdsm includes this logging patch:
https://bugzilla.redhat.com/attachment.cgi?id=825842

How reproducible:

Happened on the customer machine 2 or 3 times, according to the logs.

Code failing:

    self.log.debug("_monitorLoop before wait, host %s, storage domain %s", 	
                   self.hostId, self.sdUUID)
    self.stopEvent.wait(self.interval)
    self.log.debug("_monitorLoop after wait, host %s, storage domain %s",
                   self.hostId, self.sdUUID)

We see the before wait log message, but never the after wait log message for Thread-24.
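For reference, in Python 2.6 Event.wait(timeout) is built on Condition.wait, which does not block in a single timed OS call; it polls a per-waiter lock with progressively longer sleeps. A simplified sketch, from memory of CPython 2.x threading.py (details may differ slightly):

    import thread
    import time

    def timed_wait(waiter, timeout):
        # Roughly what Condition.wait(timeout) does under Event.wait(timeout):
        # poll the waiter lock with exponentially growing sleeps, capped at
        # 50 ms, until set() releases the lock or the timeout expires.
        endtime = time.time() + timeout
        delay = 0.0005
        while True:
            if waiter.acquire(0):      # non-blocking try; Event.set() releases it
                return True
            remaining = endtime - time.time()
            if remaining <= 0:
                return False           # timed out
            delay = min(delay * 2, remaining, 0.05)
            time.sleep(delay)

    # Example: nothing releases the lock, so the wait times out after ~2 seconds.
    lock = thread.allocate_lock()
    lock.acquire()
    print timed_wait(lock, 2)

So a hang here means the thread either died inside the loop or never observed the lock release or the deadline.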

Environment info:

The host boots from SAN, and we can see in the sanlock logs that there are I/O problems with the storage from time to time. We can see all the domain monitor threads wake up, get blocked on the storage for 20 seconds while trying to log the "_monitorDomain 0" message, and then proceed in the same second that sanlock wakes up and complains about I/O problems.

wsc-voo-057-2013111311221384359735/var/log/messages:
Nov 12 21:00:18 wsc-voo-057 sanlock[4546]: 2013-11-12 21:00:18-0500 459268 [6607]: s4 renewed 459247 delta_length 21 too long

wsc-voo-057-2013111311221384359735/var/log/vdsm/vdsm.log.4:
Thread-25::DEBUG::2013-11-12 20:59:58,824::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 20:59:58,825::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-24::DEBUG::2013-11-12 20:59:58,899::domainMonitor::163::Storage.DomainMonitorThread::(_monitorLoop) _monitorLoop after wait, host 10, storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc
Thread-25::DEBUG::2013-11-12 21:00:18,628::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 1cbea565-8360-43f6-8790-3fc8dffd96c6
Thread-26::DEBUG::2013-11-12 21:00:18,628::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain 660b187b-3b58-421c-9f64-bd1f536b5901
Thread-24::DEBUG::2013-11-12 21:00:18,629::domainMonitor::178::Storage.DomainMonitorThread::(_monitorDomain) _monitorDomain 0, host 10 storage domain d620559d-dbfc-4ab2-9c63-9118dbf8c5cc

At the time the thread disappeared, no I/O issues were reported, and we can see that the other 2 threads wake up and log normally.

All 3 storage domains use NFS, and the same storage server provides both the NFS storage and the SAN storage the hosts boot from.

Comment 3 Nir Soffer 2013-11-26 12:46:12 UTC
One unlikely option is that stopEvent.wait() raises an exception - since we don't have an exception handler around the monitor thread's main function, that would cause this kind of silent death of the thread.

To test this you can apply this patch:
http://gerrit.ovirt.org/#/c/21487

If this is indeed the issue, the next time it happens we will have a backtrace showing the error. If the error happens again and we don't see the backtrace, it proves that the issue is not an unhandled exception.
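The patch itself is on gerrit; the idea is roughly the following (a minimal sketch with hypothetical names, not the actual change):

    import logging

    log = logging.getLogger("Storage.DomainMonitorThread")

    def runMonitor(monitor):
        # Hypothetical wrapper around the monitor loop: if wait() or anything
        # else in the body raises, log the full traceback instead of letting
        # the thread die silently.
        try:
            while not monitor.stopEvent.isSet():
                monitor.monitorDomain()
                monitor.stopEvent.wait(monitor.interval)
        except Exception:
            log.exception("domain monitor for %s died unexpectedly",
                          monitor.sdUUID)
            raise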

Comment 4 David Gibson 2013-11-27 01:24:35 UTC
I've passed that patch on to the customer, and we'll see what happens.

Comment 6 Saggi Mizrahi 2013-12-01 13:37:47 UTC
I've tried my best to reproduce this, but I just can't.

I would very much like a core dump of VDSM when the thread gets stuck.
Also, information about the kernel, Python, and CPUs.
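For reference, a brute-force reproduction attempt can mimic the monitor pattern: many threads looping on Event.wait(timeout), flagging any wait that badly overshoots its interval. A sketch with made-up parameters, assuming the problem is in Event.wait itself:

    import threading
    import time

    def waiter(stopEvent, interval, name):
        # Loop on Event.wait(timeout) the way the domain monitors do, and
        # report any wait that takes far longer than the requested interval.
        while not stopEvent.isSet():
            before = time.time()
            stopEvent.wait(interval)
            elapsed = time.time() - before
            if elapsed > interval * 3:
                print "%s: wait(%d) took %.1f seconds" % (name, interval, elapsed)

    stopEvent = threading.Event()
    for i in range(32):
        t = threading.Thread(target=waiter, args=(stopEvent, 10, "t%d" % i))
        t.daemon = True
        t.start()

    time.sleep(3600)   # let it soak, ideally under I/O load like the customer host
    stopEvent.set()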

Comment 7 David Gibson 2013-12-01 23:42:52 UTC
Saggi,

We'll have to wait on the customer to encounter the problem again to get thread trace or core information.

As to the rest:
  python-2.6.6-37.el6_4.x86_64
  kernel-2.6.32-358.18.1.el6.x86_64

Linux wsc-voo-057 2.6.32-358.18.1.el6.x86_64 #1 SMP Fri Aug 2 17:04:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

There are 32 CPUs, like this:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping	: 7
cpu MHz		: 2701.000
cache size	: 20480 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5400.12
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

Any other specifics you need?

Comment 8 Saggi Mizrahi 2013-12-02 12:25:09 UTC
No. It's surprising that the host has only one core; I was suspecting some weird pthread corner case related to highly concurrent machines, but I guess we'll have to wait until it gets stuck again.

Comment 10 Saggi Mizrahi 2013-12-02 12:52:07 UTC
I missed that line skimming over the comment.
My bad.

Anyway, I still need a core dump or a semi-reliable way to reproduce this.

Comment 11 David Gibson 2013-12-02 23:05:58 UTC
Sorry, I wasn't clear. There are multiple cores - 32 (logical) CPUs. I just included the cpuinfo for CPU 0 because they're all the same.

Comment 12 David Gibson 2013-12-02 23:07:40 UTC
Created attachment 831834 [details]
/proc/cpuinfo for wsc-voo-057

Attaching full /proc/cpuinfo to make things clearer.

Comment 15 Nir Soffer 2013-12-09 12:00:32 UTC
(In reply to David Gibson from comment #4)
> I've passed that patch on to the customer, and we'll see what happens.

Any news from the customer? Do we see the new traceback in customer logs?

Comment 16 David Gibson 2013-12-09 22:53:53 UTC
Nothing so far, I'm afraid.

The customer has had one more failure - but an entirely different failure mode, so no new trace.

Note also that the method I was given for getting a python thread trace didn't seem to work.  See query in comment #53 of bug 1022352.

Comment 17 Saggi Mizrahi 2013-12-15 12:00:42 UTC
The built-in Python gdb scripts sometimes fail to work.

There is a version I made here:
https://github.com/ficoos/gdb-python-scripts

It has a much better track record for us than the built-in one.

The command is called pystack.
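If gdb is not an option, thread stacks can also be dumped from inside the process with sys._current_frames(), for example from a debug signal handler. A generic sketch, not something vdsm ships:

    import sys
    import threading
    import traceback

    def dumpThreadStacks():
        # Print the current Python stack of every live thread. The thread
        # names match the Thread-NN prefixes seen in vdsm.log.
        names = dict((t.ident, t.name) for t in threading.enumerate())
        for tid, frame in sys._current_frames().items():
            print "=== %s (tid %s) ===" % (names.get(tid, "?"), tid)
            print "".join(traceback.format_stack(frame))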

Comment 18 David Gibson 2014-01-20 02:54:19 UTC
Sorry, I was looking for an opportunity to try these scripts on the customer system when it hit the problem.  However, the occasion hasn't arisen.

In the meantime, the customer has updated their filer and removed some load from the system, which seems to have made the problem stop triggering. So for the time being we're not going to get any more information.

The customer has indicated they'd be interested in doing some post mortem - and maybe running some further tests - but only after their rush period is over in a few weeks' time.

Comment 19 David Gibson 2014-01-20 02:55:15 UTC
Dropping priority to medium, since the customer is no longer actively hitting this problem.

Comment 20 Barak 2014-03-04 12:19:38 UTC
Any updates?

Comment 21 David Gibson 2014-03-20 03:57:15 UTC
Sorry for the delay - I've been caught up with high priority cases.

No updates, I'm afraid. The customer hasn't re-encountered the problem since updating the filer and rearranging things to reduce load. They're currently looking at updating to RHEV 3.3 in the fairly near future.

I will ping them to see if they're still interested in doing some post mortem analysis.

Comment 22 Barak 2014-03-24 13:53:46 UTC
No new data; moving to CLOSED INSUFFICIENT_DATA.

In case this happens again, please reopen.

