Bug 808853 - VDSM thread and pipe problem, resulting failing to create /rhev/data-center/tasks sometimes
Summary: VDSM thread and pipe problem, resulting failing to create /rhev/data-center/t...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: unspecified
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
: ---
Assignee: wenchao xia
QA Contact: yeylon@redhat.com
URL:
Whiteboard: infra
Depends On: 814492
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-04-01 10:18 UTC by wenchao xia
Modified: 2016-04-18 06:44 UTC (History)
8 users (show)

Fixed In Version: v4.9.6-1-154-g51e4c78
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-22 12:27:10 UTC
oVirt Team: ---
Embargoed:


Attachments (Terms of Use)
execute clean.sh in it for some times to reproduce it. (1.48 KB, application/gzipped-tar)
2012-04-01 10:18 UTC, wenchao xia
no flags Details
test script to produce the error. (4.29 KB, application/gzipped-tar)
2012-04-18 08:03 UTC, wenchao xia
no flags Details
test file without vdsm producing the bug (1.95 KB, application/octet-stream)
2012-04-19 06:35 UTC, wenchao xia
no flags Details

Description wenchao xia 2012-04-01 10:18:31 UTC
Created attachment 574313 [details]
execute clean.sh in it for some times to reproduce it.

Description of problem:
    On fedaro 16, when I type "clean /rhev/data-center/*; service vdsmd restart", sometimes one of the sub directory hsm-tasks and mnt were not created.

Version-Release number of selected component (if applicable):
vdsm-cli-4.9.6-0.53.gitd4a0516.fc16.noarch
vdsm-4.9.6-0.53.gitd4a0516.fc16.x86_64
vdsm-python-4.9.6-0.53.gitd4a0516.fc16.noarch


How reproducible:
   sometimes

Steps to Reproduce:
1. clean /rhev/data-center/*
2. service vdsmd restart
3. ll /rhev/data-center/
  
Actual results:
   mnt or hsm-tasks was missing

Expected results:
    It should be predictable whether the directory exist

Additional info:
    When ./mnt is missing, the storageServer connect would fail, so I think this is a bug.

Comment 1 wenchao xia 2012-04-01 10:24:53 UTC
The booting procedure was stop before prepareForShutdown()


Last line of logs:
Thread-12::DEBUG::2012-04-01 18:08:38,368::supervdsm::71::SuperVdsmProxy::(_launchSupervdsm) Launching Super Vdsm
Thread-12::DEBUG::2012-04-01 18:08:38,368::__init__::1241::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.pyc efbdf456-6640-4fa6-b353-923ba5977f58 31433' (cwd None)
MainThread::INFO::2012-04-01 18:08:38,370::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
VM Channels Listener::INFO::2012-04-01 18:08:38,374::vmChannels::127::vds::(run) Starting VM channels listener thread.

Comment 2 wenchao xia 2012-04-05 10:31:26 UTC
the latest vdsm, vdsm-4.9.6-0.68.git19981e1.fc16.x86_64, seems fixed the problem that creating of mnt directory fails, but hsm-task was not created every time.

the sub thread was stopped at iSCSI scanning:


Thread-12::DEBUG::2012-04-05 17:08:28,062::__init__::1241::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
(never return)

I guess it is a race condition between this thread and main thread.

Comment 3 wenchao xia 2012-04-16 08:38:19 UTC
reproduced it on latest vdsm
acfe53407b450cab3b600e257bf487556fbece5e

it is caused by threading problem, when it happens, two python process appears:

vdsm     20138     1  0 16:31 ?        00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm     20141 20138  2 16:31 ?        00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
vdsm     20175 20141  0 16:31 ?        00:00:00 /usr/bin/python /usr/share/vdsm/vdsm


gdb attach 20175, and bt:
.....
#55 0x0000003ef146dc2c in ?? () from /usr/lib64/libpython2.7.so.1.0
#56 0x0000003ef1449193 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#57 0x0000003ef1457dff in ?? () from /usr/lib64/libpython2.7.so.1.0
#58 0x0000003ef1449193 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#59 0x0000003ef14da457 in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython2.7.so.1.0
#60 0x0000003ef1510362 in ?? () from /usr/lib64/libpython2.7.so.1.0
#61 0x0000003a9dc07d90 in start_thread () from /lib64/libpthread.so.0
#62 0x0000003a9d4ef48d in clone () from /lib64/libc.so.6

seems a result of threading create, but it is a process.

Comment 4 wenchao xia 2012-04-18 03:00:19 UTC
there is some discuss on Python subprocessing, it is a bit different, but also related to thread and pipe:
http://bugs.python.org/issue2320

There will be many changes in subprocessing behavior for new Python, I guess the problem have its chance to be fixed then. Should I add some locks to thread class in VDSM to walk around it?

Comment 5 wenchao xia 2012-04-18 07:58:41 UTC
walking around patch:
http://gerrit.ovirt.org/#change,3638

Comment 6 wenchao xia 2012-04-18 08:03:19 UTC
Created attachment 578264 [details]
test script to produce the error.

    Test script to reproduce the error, it happens with little chance. Usage: tar xzvf it, and sudo ./test_clean.sh. It requires vdsm using /rhev/data-center, which was planned to be changed.
    My test Env is Python2.7.2, fedora 16.

Comment 7 wenchao xia 2012-04-19 06:35:33 UTC
Created attachment 578512 [details]
test file without vdsm producing the bug

run it directly for about 10 times, thread 1 sometimes hang. It is verified on python 2.7.2 and python 2.7.2+(ubuntu).

Comment 8 wenchao xia 2012-04-19 12:14:31 UTC
(In reply to comment #3)
> #60 0x0000003ef1510362 in ?? () from /usr/lib64/libpython2.7.so.1.0
> #61 0x0000003a9dc07d90 in start_thread () from /lib64/libpthread.so.0
> #62 0x0000003a9d4ef48d in clone () from /lib64/libc.so.6
> 
> seems a result of threading create, but it is a process.

    Sorry for the mistake, it is normal because the caller of fork() is a sub thread, so it's entry point is clone().
    Back to the issue, http://bugs.python.org/issue2320 does not hit the situation 100%, because error still happen when close_fds=True. In new release python2.7.3 and python 3.2, there are many changes about subprocess likely mechnism change, no special patch fixing the bug was found, but my test program passed.

Comment 9 Dan Kenigsberg 2012-06-22 12:27:10 UTC
we decided to simply require a newer python version 2.7.3, without this bug.


Note You need to log in before you can comment on or make changes to this bug.