Bug 727246

Summary: [lvm] some lvm commands are stuck in deadlock due to one process blocked in semaphore
Product: Red Hat Enterprise Linux 6
Component: lvm2
Version: 6.2
Status: CLOSED WORKSFORME
Reporter: Haim <hateya>
Assignee: Peter Rajnoha <prajnoha>
QA Contact: Corey Marthaler <cmarthal>
CC: abaron, agk, cpelland, danken, dwysocha, hateya, heinzm, jbrassow, mbroz, mgoldboi, prajnoha, prockai, smizrahi, thornber, yeylon, ykaul, zkabelac
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-09-22 13:46:52 UTC
Bug Blocks: 682015

Description Haim 2011-08-01 16:12:21 UTC
Description of problem:

We have a machine with iSCSI storage connected to several LUNs, and some LVM commands have been stuck in the S< state for the last 4-5 hours. This state has stalled our hypervisor management system, 'VDSM'.

From a short debugging session held with mbroz, it appears that one process was blocked on a semaphore (with the others waiting on it in flock()), and 'dmsetup udevcomplete_all' unblocked it. It also appears that we had lots of cookies/semaphores listed under 'dmsetup udevcookies'.
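Each udev cookie is backed by a System V semaphore, so the 'semid' values in the listing below should also appear in the system-wide semaphore table; they can be cross-checked with, for instance:

[root@nott-vds3 ~]# ipcs -s    # list active System V semaphore arrays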


[root@nott-vds3 ~]# dmsetup udevcookies
cookie       semid      value      last_semop_time
0xd4d1360    114720771  1          Wed Jul  6 18:13:44 2011
0xd4d91b8    114753540  1          Wed Jul  6 18:13:44 2011
0xd4de72c    321388549  2          Thu Jul 14 16:48:08 2011
0xd4d79ba    114917382  1          Wed Jul  6 18:14:00 2011
0xd4d421d    272072711  1          Mon Jul 11 14:43:46 2011
0xd4d952e    280821768  1          Mon Jul 11 15:47:15 2011
0xd4d8465    288587785  1          Mon Jul 11 16:36:44 2011
0xd4d8fa5    321421322  1          Thu Jul 14 18:21:06 2011
0xd4df60f    321454091  2          Sun Jul 17 11:42:52 2011
0xd4d8b03    321486860  2          Sun Jul 17 12:10:32 2011
0xd4d7c6d    321519629  26         Sun Jul 17 17:58:12 2011
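A stale cookie can be released individually with 'dmsetup udevcomplete <cookie>', or all at once with 'dmsetup udevcomplete_all' (the latter is what unblocked the stuck process here); for instance, for the last entry above:

[root@nott-vds3 ~]# dmsetup udevcomplete 0xd4d7c6d
[root@nott-vds3 ~]# dmsetup udevcomplete_all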

[root@nott-vds3 ~]# ps -w `pgrep lvm`
  PID TTY      STAT   TIME COMMAND
 1619 ?        S<     0:00 /sbin/lvm lvchange --config  devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 filter = [ "a%1Daffi1309719|1Dafna-Export1308598|1EXPORT1310355|1EXP_Dom_Bckup|1Exp-Daf
 1931 ?        S<     0:00 /sbin/lvm vgck --config  devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 filter = [ "a%1Daffi1309719|1Dafna-Export1308598|1EXPORT1310355|1EXP_Dom_Bckup|1Exp-Daf1310
 8895 ?        S<     0:00 /sbin/lvm lvchange --config  devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 filter = [ "a%/dev/mapper/1Daffi1309719|/dev/mapper/1Dafna-Export1308598|/dev/mapper/1E
10774 ?        S<     0:00 /sbin/lvm pvs --config  devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 filter = [ "a%1Daffi1309719|1Dafna-Export1308598|1EXPORT1310355|1EXP_Dom_Bckup|1Exp-Daf13100
11433 ?        S<     0:00 /sbin/lvm pvs --config  devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 filter = [ "a%1Daffi1309719|1Dafna-Export1308598|1EXPORT1310355|1EXP_Dom_Bckup|1Exp-Daf13100
17550 ?        S<     0:00 /sbin/lvm vgck --config  devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 filter = [ "a%1Daffi1309719|1Dafna-Export1308598|1EXPORT1310355|1EXP_Dom_Bckup|1Exp-Daf1310
19435 ?        S<     0:00 /sbin/lvm vgck --config  devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 filter = [ "a%1Daffi1309719|1Dafna-Export1308598|1EXPORT1310355|1EXP_Dom_Bckup|1Exp-Daf1310


1619
[<ffffffff811bbd8d>] flock_lock_file_wait+0x18d/0x350
[<ffffffff811bc10b>] sys_flock+0x1bb/0x1d0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
1931
[<ffffffff811bbd8d>] flock_lock_file_wait+0x18d/0x350
[<ffffffff811bc10b>] sys_flock+0x1bb/0x1d0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
8895
[<ffffffff811f7c15>] sys_semtimedop+0x725/0x8b0
[<ffffffff811f7db0>] sys_semop+0x10/0x20
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
10774
[<ffffffff811bbd8d>] flock_lock_file_wait+0x18d/0x350
[<ffffffff811bc10b>] sys_flock+0x1bb/0x1d0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
11433
[<ffffffff811bbd8d>] flock_lock_file_wait+0x18d/0x350
[<ffffffff811bc10b>] sys_flock+0x1bb/0x1d0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
17550
[<ffffffff811bbd8d>] flock_lock_file_wait+0x18d/0x350
[<ffffffff811bc10b>] sys_flock+0x1bb/0x1d0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
19435
[<ffffffff811bbd8d>] flock_lock_file_wait+0x18d/0x350
[<ffffffff811bc10b>] sys_flock+0x1bb/0x1d0
[<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
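The per-PID kernel stacks above were presumably collected via /proc/<pid>/stack; a minimal loop that produces the same layout:

[root@nott-vds3 ~]# for p in $(pgrep lvm); do echo $p; cat /proc/$p/stack; done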
 

Currently there is no reproducer, but I will try to create one (a script).

Comment 2 Peter Rajnoha 2011-08-02 07:36:33 UTC
(In reply to comment #0)
> currently there is no reproducer, but i will try to create one (script).

If you manage to reproduce that, please try to grab the -vvvv debug log as well. Thanks.
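For example (hypothetical invocation - the real command line is whatever VDSM runs; the verbose output goes to stderr):

/sbin/lvm pvs -vvvv 2> /tmp/lvm-vvvv.log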

Comment 3 Haim 2011-08-08 11:13:38 UTC
I think I've managed to reproduce it using the following script:


#!/usr/bin/python
# Python 2, as shipped on RHEL 6 ('commands' is Python 2 stdlib).

import commands
import os
import time

def attachPid():
    # Poll until an lvm process shows up, then stop every one found.
    while True:
        pids = commands.getoutput('pgrep lvm')
        if not pids:
            print "No LVM process found"
            time.sleep(0.1)  # retry in a loop instead of recursing
            continue
        for pid in pids.split('\n'):
            pid = int(pid)
            os.kill(pid, 19)  # signal 19 is SIGSTOP on x86
            print "LVM process found (%s) - SIGSTOP sent to it" % pid
        return

attachPid()

Please run this script alongside several heavy LVM commands (see the example load loop below) and leave it running for some time.
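For example, a crude load loop to run in parallel (the command choice is arbitrary; pvs and vgck are among the commands seen stuck above, and vgck with no arguments checks all VGs):

[root@rhev-a8c-02 tmp]# while true; do pvs > /dev/null; vgck; done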

[root@rhev-a8c-02 tmp]# ps -ww `pgrep lvm`
  PID TTY      STAT   TIME COMMAND
20728 ?        Z<     0:00 [lvm] <defunct>

Comment 4 Milan Broz 2011-09-21 10:21:19 UTC
This reproducer (comment #3) makes no sense to me.

You are using os.kill(pid, 19) - this sends SIGSTOP/SIGCONT (the mapping of signal 19 is architecture-dependent), and of course it causes the program to stop. This is not a bug.

What are you trying to do here? What do you expect from sending SIGSTOP to lvm?

man 2 signal:

 SIGCONT   19,18,25    Cont    Continue if stopped
 SIGSTOP   17,19,23    Stop    Stop process

 The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.

Comment 5 Milan Broz 2011-09-21 10:22:02 UTC
(err, man 7 signal)

Comment 6 Milan Broz 2011-09-21 11:26:59 UTC
If this is the problem we tried to debug some time ago, there were some active lvm semaphores then. But I think that was a relic of some previous testing or an upgrade.

(As for the reproducer above - it was stopped, so obviously some semaphores were still present, but after "pkill -SIGCONT lvm" it simply continued, so this was not the previously observed problem.)