Created attachment 574313 [details]
execute clean.sh in it for some times to reproduce it.
Description of problem:
On fedaro 16, when I type "clean /rhev/data-center/*; service vdsmd restart", sometimes one of the sub directory hsm-tasks and mnt were not created.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. clean /rhev/data-center/*
2. service vdsmd restart
3. ll /rhev/data-center/
mnt or hsm-tasks was missing
It should be predictable whether the directory exist
When ./mnt is missing, the storageServer connect would fail, so I think this is a bug.
The booting procedure was stop before prepareForShutdown()
Last line of logs:
Thread-12::DEBUG::2012-04-01 18:08:38,368::supervdsm::71::SuperVdsmProxy::(_launchSupervdsm) Launching Super Vdsm
Thread-12::DEBUG::2012-04-01 18:08:38,368::__init__::1241::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.pyc efbdf456-6640-4fa6-b353-923ba5977f58 31433' (cwd None)
MainThread::INFO::2012-04-01 18:08:38,370::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
VM Channels Listener::INFO::2012-04-01 18:08:38,374::vmChannels::127::vds::(run) Starting VM channels listener thread.
the latest vdsm, vdsm-4.9.6-0.68.git19981e1.fc16.x86_64, seems fixed the problem that creating of mnt directory fails, but hsm-task was not created every time.
the sub thread was stopped at iSCSI scanning:
Thread-12::DEBUG::2012-04-05 17:08:28,062::__init__::1241::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
I guess it is a race condition between this thread and main thread.
reproduced it on latest vdsm
it is caused by threading problem, when it happens, two python process appears:
vdsm 20138 1 0 16:31 ? 00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm 20141 20138 2 16:31 ? 00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
vdsm 20175 20141 0 16:31 ? 00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
gdb attach 20175, and bt:
#55 0x0000003ef146dc2c in ?? () from /usr/lib64/libpython2.7.so.1.0
#56 0x0000003ef1449193 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#57 0x0000003ef1457dff in ?? () from /usr/lib64/libpython2.7.so.1.0
#58 0x0000003ef1449193 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#59 0x0000003ef14da457 in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython2.7.so.1.0
#60 0x0000003ef1510362 in ?? () from /usr/lib64/libpython2.7.so.1.0
#61 0x0000003a9dc07d90 in start_thread () from /lib64/libpthread.so.0
#62 0x0000003a9d4ef48d in clone () from /lib64/libc.so.6
seems a result of threading create, but it is a process.
there is some discuss on Python subprocessing, it is a bit different, but also related to thread and pipe:
There will be many changes in subprocessing behavior for new Python, I guess the problem have its chance to be fixed then. Should I add some locks to thread class in VDSM to walk around it?
walking around patch:
Created attachment 578264 [details]
test script to produce the error.
Test script to reproduce the error, it happens with little chance. Usage: tar xzvf it, and sudo ./test_clean.sh. It requires vdsm using /rhev/data-center, which was planned to be changed.
My test Env is Python2.7.2, fedora 16.
Created attachment 578512 [details]
test file without vdsm producing the bug
run it directly for about 10 times, thread 1 sometimes hang. It is verified on python 2.7.2 and python 2.7.2+(ubuntu).
(In reply to comment #3)
> #60 0x0000003ef1510362 in ?? () from /usr/lib64/libpython2.7.so.1.0
> #61 0x0000003a9dc07d90 in start_thread () from /lib64/libpthread.so.0
> #62 0x0000003a9d4ef48d in clone () from /lib64/libc.so.6
> seems a result of threading create, but it is a process.
Sorry for the mistake, it is normal because the caller of fork() is a sub thread, so it's entry point is clone().
Back to the issue, http://bugs.python.org/issue2320 does not hit the situation 100%, because error still happen when close_fds=True. In new release python2.7.3 and python 3.2, there are many changes about subprocess likely mechnism change, no special patch fixing the bug was found, but my test program passed.
we decided to simply require a newer python version 2.7.3, without this bug.