Bug 644612 - lvm operations deadlock while waiting for user input
Summary: lvm operations deadlock while waiting for user input
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: lvm2
Version: 8.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.4
Assignee: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 756082
 
Reported: 2010-10-19 20:21 UTC by Corey Marthaler
Modified: 2023-03-08 07:25 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-01 03:02:29 UTC
Type: ---
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
log from hayes-02 (293.81 KB, text/plain)
2010-10-19 20:26 UTC, Corey Marthaler

Description Corey Marthaler 2010-10-19 20:21:38 UTC
Description of problem:
While testing the latest 6.0.z rpms, I hit this deadlock.

 ---------------------------------------------
 TEST CASE=23
  base legs 1; convert to 1 legs;
  base log disk; convert to core log;
  active 0; sync=1; pvs=sufficient
 ---------------------------------------------
 creating a base mirror on hayes-02
 Waiting until all mirrors become fully syncd...
    0/1 mirror(s) are fully synced: ( 54.29% )
    1/1 mirror(s) are fully synced: ( 100.00% )
 deactivating base volume before convert on hayes-02
 Converting from 1 leg(s) disk log; to 1 leg(s) core log on hayes-02
 
 lvconvert --mirrorlog core -m 1 centipede/centi_base /dev/etherd/e1.1p1
[HANG]

An 'lvs' cmd also hung.

I'll post the kernel dump from this machine.

Version-Release number of selected component (if applicable):
2.6.32-71.el6.x86_64

lvm2-2.02.72-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010
lvm2-libs-2.02.72-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010
lvm2-cluster-2.02.72-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010
udev-147-2.29.el6    BUILT: Tue Aug 31 16:44:10 CDT 2010
device-mapper-1.02.53-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010
device-mapper-libs-1.02.53-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010
device-mapper-event-1.02.53-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010
device-mapper-event-libs-1.02.53-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010
cmirror-2.02.72-8.el6_0.1    BUILT: Mon Oct 11 10:45:21 CDT 2010

How reproducible:
Only once so far

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Corey Marthaler 2010-10-19 20:24:58 UTC
Oct 19 14:59:37 hayes-02 kernel: lvconvert     S ffff88011fc24100     0 32645  32644 0x00000080
Oct 19 14:59:37 hayes-02 kernel: ffff880218e87a88 0000000000000086 0000000000000000 ffffffff8126644c
Oct 19 14:59:37 hayes-02 kernel: ffff880218e87a08 0000000800000082 ffff88021a5a8ab0 0000000100a12629
Oct 19 14:59:37 hayes-02 kernel: ffff880218e590e8 ffff880218e87fd8 0000000000010518 ffff880218e590e8
Oct 19 14:59:37 hayes-02 kernel: Call Trace:
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff8126644c>] ? __bitmap_weight+0x8c/0xb0
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff814c8fc5>] schedule_timeout+0x225/0x2f0
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff81401829>] sk_wait_data+0xd9/0xe0
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff8145045b>] tcp_recvmsg+0x2cb/0xe80
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff812069f1>] ? avc_has_perm+0x71/0x90
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff814007c9>] sock_common_recvmsg+0x39/0x50
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff81400361>] sock_aio_read+0x181/0x190
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff8116c65a>] do_sync_read+0xfa/0x140
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff81091ca0>] ? autoremove_wake_function+0x0/0x40
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff8120bf0f>] ? selinux_file_permission+0xbf/0x150
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff810d40a2>] ? audit_syscall_entry+0x272/0x2a0
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff811ff3b6>] ? security_file_permission+0x16/0x20
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff8116d151>] vfs_read+0x181/0x1a0
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff810d40a2>] ? audit_syscall_entry+0x272/0x2a0
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff8116d1c1>] sys_read+0x51/0x90
Oct 19 14:59:37 hayes-02 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b

Comment 2 Corey Marthaler 2010-10-19 20:26:50 UTC
Created attachment 454434
log from hayes-02

Comment 4 Corey Marthaler 2010-10-20 16:42:15 UTC
Hit this again last night, bumping priority.

Comment 5 Corey Marthaler 2010-10-20 20:14:56 UTC
I should have known. The cmd isn't hung, it's just waiting for user input.

[root@hayes-01 ~]# lvconvert -m 1 --mirrorlog core centipede/mirror /dev/etherd/e1.1p9
Full resync required to convert inactive mirror mirror to core log. Proceed? [y/n]:

However, should that really lock all other lvm operations in the meantime? Especially if the convert was scripted, as in this test? Once in this state, how do I get out of it?

[root@hayes-01 ~]# pvs
[STUCK]

[root@hayes-01 ~]# lvs -a -o +devices
[STUCK]
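
For the scripted case raised above, one possible workaround (not a fix for the locking itself) is to make sure the command can never wait on a terminal. The sketch below assumes the general lvm `-y|--yes` option answers this particular prompt in the affected build, which this report does not confirm; the helper names are hypothetical.

```python
import subprocess


def lvconvert_noninteractive(args):
    """Build an lvconvert command line that assumes 'yes' for any prompt."""
    return ["lvconvert", "--yes", *args]


def run_noninteractive(args):
    # stdin is connected to /dev/null, so a prompt has no terminal to wait on.
    return subprocess.run(
        lvconvert_noninteractive(args),
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )
```

This only keeps a test script from blocking; an interactive user who leaves the prompt open would still hold the locks.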

Comment 6 Alasdair Kergon 2010-10-21 10:40:38 UTC
Indeed - locks should not be held while waiting for user input:)
Some amount of code restructuring may be required to fix this.

Comment 9 Suzanne Logcher 2011-03-28 21:14:11 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains 
unresolved, it has been rejected as it is not proposed as an 
exception or blocker.

Red Hat invites you to ask your support representative to 
propose this request, if appropriate and relevant, in the 
next release of Red Hat Enterprise Linux.

Comment 10 Corey Marthaler 2011-06-02 15:13:43 UTC
Adding QA ack for 6.2. 

Devel will need to provide unit testing results however before this bug can be
ultimately verified by QA.

Comment 21 Jonathan Earl Brassow 2014-12-16 21:51:34 UTC
I am moving this one to rhel7 for consideration.  It is not high enough priority to target rhel6 anymore.

Comment 24 Zdenek Kabelac 2015-12-07 10:15:30 UTC
Hmm - with all the ongoing changes this seems to be falling off the radar.

We now seem to be converting everything into the process_each_vg loop.

However, to handle prompts better, we should rather just pick the
VG(s) matching the selection rule, then:

Drop lock

Prompt

Reacquire lock - check that the VG being processed has not been changed
and continue with the operation.

Doing this via the 'process_each_vg' loop seems to increase the complexity and cost of such an operation.

Passing this to David as he now masters this code.
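
The drop lock / prompt / reacquire / compare sequence described in this comment could look roughly like the sketch below. It is not lvm2 code: `VolumeGroup`, its `seqno` counter, and `convert_with_prompt` are hypothetical stand-ins, and a plain `threading.Lock` takes the place of the VG lock.

```python
import threading
from dataclasses import dataclass


@dataclass
class VolumeGroup:
    """Hypothetical stand-in for VG metadata; seqno increases on every change."""
    name: str
    seqno: int = 0


def convert_with_prompt(vg, lock, ask, apply_change):
    """Prompt without holding the lock, then revalidate before continuing."""
    with lock:
        seen = vg.seqno  # remember the state the question is based on

    # No lock is held here, so other commands are not blocked by the prompt.
    if not ask(f"Proceed with conversion in {vg.name}? [y/n]: "):
        return "declined"

    with lock:
        if vg.seqno != seen:  # the VG changed while we waited for the answer
            return "changed"
        apply_change(vg)
        vg.seqno += 1
        return "done"
```

A caller that gets "changed" back has to decide whether to abort or to re-read the VG and ask again.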

Comment 25 David Teigland 2016-01-19 15:09:28 UTC
This will be an incremental project, more an evolution of the code.  There are two ways we should go about fixing this:

1. Modify the process_each loops to do two loops, and prompting between the two loops without locks held.  This has been done for the new pvcreate (still needing review), which shows how it can be done for other cases.

2. Reconsider if some of the prompts are really needed.  In some cases prompts are really warranted, but in other cases we could probably remove the prompt without causing any real problem (the prompt about resync seems like one that could be removed without any harm.)
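
As a rough illustration of option 1, the sketch below splits the work into two locked loops with the prompting done in between, unlocked. It is not the pvcreate implementation mentioned above; the function and parameter names are hypothetical, and items are assumed to be hashable names such as PV or LV names.

```python
import threading


def process_each_two_pass(items, lock, needs_prompt, ask, apply_change):
    """First loop collects questions under the lock, second loop applies answers."""
    with lock:
        questions = [item for item in items if needs_prompt(item)]

    # Prompt between the two loops, with the lock released.
    approved = {item for item in questions if ask(f"Really process {item}? [y/n]: ")}

    processed = []
    with lock:
        for item in items:
            if item in questions and item not in approved:
                continue  # the user said no, skip this item
            apply_change(item)
            processed.append(item)
    return processed
```

The second loop here trusts what the first loop saw; a real implementation would presumably also need to recheck each item after retaking the lock.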

Comment 26 David Teigland 2016-01-22 16:15:13 UTC
The specific case in the description was 'lvconvert -m1' asking the question:

Full resync required to convert inactive mirror mirror to core log. Proceed? [y/n]:

We will not have process_each_lv reworked to handle a prompting-without-locks phase in the immediate future, but it may be possible to simply remove that prompt.
Someone who is more familiar with this specific resync case and the possible consequences of removing the prompt would have to answer that.

Comment 27 Jonathan Earl Brassow 2020-08-19 21:05:33 UTC
(In reply to David Teigland from comment #26)
> The specific case in the description was 'lvconvert -m1' asking the question:
> 
> Full resync required to convert inactive mirror mirror to core log. Proceed?
> [y/n]:
> 
> We will not have process_each_lv reworked to handle a
> prompting-without-locks phase in the immediate future, but it may be
> possible to simply remove that prompt.
> Someone who is more familiar with this specific resync case and the possible
> consequences of removing the prompt would have to answer that.

When a user switches to a core log, it is going to require a full sync for EVERY activation thereafter - a known limitation of using core log.  I'm fine with removing the prompt.  I'm also fine closing this bug WONTFIX (legacy mirror implementation); but would prefer the prompt removal.

Comment 28 Zdenek Kabelac 2020-10-01 12:17:51 UTC
Side note - this is an outstanding global issue - it's not a mirror bug as such.

We don't yet have any solution for the case where lvm2 detects a 'promptable' issue and the command asks the user for y|n.
During such a prompt we hold all the locks to avoid having to read the system state again.

Dropping WRITE locks while the prompt is shown may get very complicated if any change has happened while the lock
was released.

And it's not just the state of the VG as such - during the secondary 'rescan' we may discover a different set of devices,
which raises more prompts.

Comment 29 Corey Marthaler 2020-10-01 16:25:42 UTC
Zdenek is correct, this is not a "mirror" issue. Normally I'd be fine closing this, but this bug turns * 10 yrs old * this month. :)

Proposing for rhel8 consideration to force the issue, as this still exists with the latest release.


[root@hayes-02 ~]# lvs
  LV        VG           Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  POOL      cache_sanity -wi-a-----  4.00g                                                    
  POOL_meta cache_sanity -wi-a----- 12.00m                                                    
  suspend   cache_sanity -wi-a-----  4.00g                                                    

[root@hayes-02 ~]# lvconvert --type cache-pool --cachepolicy smq --cachemode writethrough -c 32 --poolmetadata cache_sanity/POOL_meta cache_sanity/POOL
  WARNING: Converting cache_sanity/POOL and cache_sanity/POOL_meta to cache pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Do you really want to convert cache_sanity/POOL and cache_sanity/POOL_meta? [y/n]: 



### On another term:
[root@hayes-02 ~]# pvscan

[STUCK]


kernel-4.18.0-235.el8    BUILT: Thu Sep  3 13:19:50 CDT 2020
lvm2-2.03.09-5.el8    BUILT: Wed Aug 12 15:51:50 CDT 2020
lvm2-libs-2.03.09-5.el8    BUILT: Wed Aug 12 15:51:50 CDT 2020
device-mapper-1.02.171-5.el8    BUILT: Wed Aug 12 15:51:50 CDT 2020
device-mapper-libs-1.02.171-5.el8    BUILT: Wed Aug 12 15:51:50 CDT 2020

Comment 33 RHEL Program Management 2020-11-01 03:02:29 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

