Bug 808853

Summary: VDSM thread and pipe problem, resulting in occasional failure to create /rhev/data-center/tasks
Product: [Retired] oVirt
Reporter: wenchao xia <xiawenc>
Component: vdsm
Assignee: wenchao xia <xiawenc>
Status: CLOSED CURRENTRELEASE
QA Contact: yeylon <yeylon>
Severity: high
Docs Contact:
Priority: unspecified
Version: unspecified
CC: abaron, acathrow, bazulay, danken, dyasny, iheim, srevivo, ykaul
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version: v4.9.6-1-154-g51e4c78
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-22 12:27:10 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 814492
Bug Blocks:
Attachments:
- execute clean.sh in it a few times to reproduce it. (flags: none)
- test script to reproduce the error. (flags: none)
- test file reproducing the bug without vdsm. (flags: none)

Description wenchao xia 2012-04-01 10:18:31 UTC
Created attachment 574313 [details]
execute clean.sh in it a few times to reproduce it.

Description of problem:
    On Fedora 16, when I run "clean /rhev/data-center/*; service vdsmd restart", sometimes one of the subdirectories hsm-tasks or mnt is not created.

Version-Release number of selected component (if applicable):
vdsm-cli-4.9.6-0.53.gitd4a0516.fc16.noarch
vdsm-4.9.6-0.53.gitd4a0516.fc16.x86_64
vdsm-python-4.9.6-0.53.gitd4a0516.fc16.noarch


How reproducible:
   sometimes

Steps to Reproduce:
1. clean /rhev/data-center/*
2. service vdsmd restart
3. ll /rhev/data-center/
  
Actual results:
   mnt or hsm-tasks is missing

Expected results:
    It should be predictable whether the directories exist

Additional info:
    When ./mnt is missing, connecting the storage server fails, so I think this is a bug.
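The expected behavior described above amounts to race-safe, idempotent creation of both subdirectories at startup. A minimal sketch of that intent (hypothetical helper and paths for illustration, not vdsm's actual code):

```python
import os

def ensure_dirs(base):
    """Create the hsm-tasks and mnt directories idempotently.

    Creation must succeed whether or not the directories already
    exist, so a restart always leaves both in place.
    """
    created = []
    for name in ("hsm-tasks", "mnt"):
        path = os.path.join(base, name)
        try:
            os.makedirs(path)
        except OSError:
            pass  # already exists; must not be treated as an error
        created.append(os.path.isdir(path))
    return created
```

Calling this twice in a row should return `[True, True]` both times; the bug report is precisely that vdsm's startup sometimes ends with one of the two missing.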

Comment 1 wenchao xia 2012-04-01 10:24:53 UTC
The boot procedure was stopped before prepareForShutdown()


Last line of logs:
Thread-12::DEBUG::2012-04-01 18:08:38,368::supervdsm::71::SuperVdsmProxy::(_launchSupervdsm) Launching Super Vdsm
Thread-12::DEBUG::2012-04-01 18:08:38,368::__init__::1241::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /usr/bin/python /usr/share/vdsm/supervdsmServer.pyc efbdf456-6640-4fa6-b353-923ba5977f58 31433' (cwd None)
MainThread::INFO::2012-04-01 18:08:38,370::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
VM Channels Listener::INFO::2012-04-01 18:08:38,374::vmChannels::127::vds::(run) Starting VM channels listener thread.

Comment 2 wenchao xia 2012-04-05 10:31:26 UTC
The latest vdsm, vdsm-4.9.6-0.68.git19981e1.fc16.x86_64, seems to fix the failure to create the mnt directory, but hsm-tasks is still not created every time.

The sub-thread was stuck at iSCSI scanning:


Thread-12::DEBUG::2012-04-05 17:08:28,062::__init__::1241::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
(never return)

I guess it is a race condition between this thread and the main thread.

Comment 3 wenchao xia 2012-04-16 08:38:19 UTC
Reproduced it on the latest vdsm:
acfe53407b450cab3b600e257bf487556fbece5e

It is caused by a threading problem; when it happens, two python processes appear:

vdsm     20138     1  0 16:31 ?        00:00:00 /bin/bash -e /usr/share/vdsm/respawn --minlifetime 10 --daemon --masterpid /var/run/vdsm/respawn.pid /usr/share/vdsm/vdsm
vdsm     20141 20138  2 16:31 ?        00:00:00 /usr/bin/python /usr/share/vdsm/vdsm
vdsm     20175 20141  0 16:31 ?        00:00:00 /usr/bin/python /usr/share/vdsm/vdsm


gdb attach 20175, and bt:
.....
#55 0x0000003ef146dc2c in ?? () from /usr/lib64/libpython2.7.so.1.0
#56 0x0000003ef1449193 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#57 0x0000003ef1457dff in ?? () from /usr/lib64/libpython2.7.so.1.0
#58 0x0000003ef1449193 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#59 0x0000003ef14da457 in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython2.7.so.1.0
#60 0x0000003ef1510362 in ?? () from /usr/lib64/libpython2.7.so.1.0
#61 0x0000003a9dc07d90 in start_thread () from /lib64/libpthread.so.0
#62 0x0000003a9d4ef48d in clone () from /lib64/libc.so.6

It looks like the result of thread creation, but it is actually a separate process.

Comment 4 wenchao xia 2012-04-18 03:00:19 UTC
There is some discussion of a Python subprocess issue that is a bit different but also involves threads and pipes:
http://bugs.python.org/issue2320

Newer Python releases will change subprocess behavior considerably, so the problem may be fixed there. Should I add some locks to the thread class in VDSM to work around it?
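One possible shape of the lock-based workaround being asked about is to serialize all child spawns behind a single lock, so no two threads fork while a pipe pair is half-inherited. This is a hedged sketch with hypothetical names, not the actual vdsm patch:

```python
import subprocess
import threading

# Hypothetical workaround: a global lock serializing Popen calls.
# On affected Python versions, two threads forking concurrently could
# let one child inherit the other's pipe write end, so a reader would
# never see EOF and block forever.
_popen_lock = threading.Lock()

def safe_popen(args, **kwargs):
    """Run subprocess.Popen under a global lock to avoid the race."""
    with _popen_lock:
        return subprocess.Popen(args, **kwargs)
```

Usage is a drop-in replacement, e.g. `safe_popen(["echo", "hi"], stdout=subprocess.PIPE)`. The cost is that process creation is serialized across all threads, which is why requiring a fixed Python was ultimately preferred.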

Comment 5 wenchao xia 2012-04-18 07:58:41 UTC
Workaround patch:
http://gerrit.ovirt.org/#change,3638

Comment 6 wenchao xia 2012-04-18 08:03:19 UTC
Created attachment 578264 [details]
test script to reproduce the error.

    Test script to reproduce the error; it triggers only occasionally. Usage: extract it with tar xzvf and run sudo ./test_clean.sh. It requires vdsm to use /rhev/data-center, which was planned to change.
    My test environment is Python 2.7.2 on Fedora 16.

Comment 7 wenchao xia 2012-04-19 06:35:33 UTC
Created attachment 578512 [details]
test file reproducing the bug without vdsm

Run it directly about 10 times; thread 1 sometimes hangs. Verified on Python 2.7.2 and Python 2.7.2+ (Ubuntu).
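The attachment itself is not quoted here, but the pattern such a standalone reproducer exercises — several threads concurrently forking children with pipes — can be sketched as follows (hypothetical code, not the attached file; it completes normally on Python versions with the fixed subprocess module and could hang on 2.7.2):

```python
import subprocess
import threading

def spawn_and_read(results, idx):
    # Each thread forks a child with a stdout pipe and reads it to EOF.
    # With the buggy subprocess, a sibling child forked concurrently
    # could inherit this pipe's write end, so EOF never arrives.
    p = subprocess.Popen(["true"], stdout=subprocess.PIPE, close_fds=True)
    p.communicate()
    results[idx] = p.returncode

def run_once(nthreads=8):
    """Spawn children from many threads at once; return their exit codes."""
    results = [None] * nthreads
    threads = [threading.Thread(target=spawn_and_read, args=(results, i))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

On a fixed interpreter every run returns a list of zero exit codes; on an affected one, a join can block indefinitely, matching the "thread 1 sometimes hangs" symptom.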

Comment 8 wenchao xia 2012-04-19 12:14:31 UTC
(In reply to comment #3)
> #60 0x0000003ef1510362 in ?? () from /usr/lib64/libpython2.7.so.1.0
> #61 0x0000003a9dc07d90 in start_thread () from /lib64/libpthread.so.0
> #62 0x0000003a9d4ef48d in clone () from /lib64/libc.so.6
> 
> seems a result of threading create, but it is a process.

    Sorry for the mistake; it is normal, because the caller of fork() is a sub-thread, so its entry point is clone().
    Back to the issue: http://bugs.python.org/issue2320 does not match this situation 100%, because the error still happens when close_fds=True. Python 2.7.3 and Python 3.2 contain many changes to the subprocess machinery; I found no specific patch fixing this bug, but my test program passed on them.

Comment 9 Dan Kenigsberg 2012-06-22 12:27:10 UTC
We decided to simply require a newer Python version, 2.7.3, which does not have this bug.