Bug 175158

Summary: scsi devices busy in cluster
Product: [Retired] Red Hat Cluster Suite
Component: lvm2-cluster
Version: 4
Hardware: i386
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Nate Straz <nstraz>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: agk
Doc Type: Bug Fix
Last Closed: 2005-12-07 21:42:30 UTC

Description Nate Straz 2005-12-06 23:52:02 UTC
Description of problem:

The lvm_config test case distributes the work of creating LVs across multiple
nodes in a cluster.  So node A will create the PVs, node B will create a VG,
node C will activate the VG, etc.  At the beginning of each iteration the current
LVM structures are cleaned up (all the way to removing PVs) and the disks are
partitioned again.
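
For reference, one iteration boils down to roughly the following (the device, VG, and LV names here are made up, and blockdev --rereadpt stands in for the test's readpart helper; the real test drives each step on a different node over qarsh):

  lvremove -f /dev/testvg/testlv          # drop the old LV (node A)
  vgchange -an testvg                     # deactivate the VG (node B)
  vgremove testvg                         # remove the VG (node B)
  pvremove /dev/sda1 /dev/sdb1            # wipe the PVs (node C)
  echo ',,8e' | sfdisk /dev/sda           # re-partition ("dice") each disk
  blockdev --rereadpt /dev/sda            # every node re-reads the partition table
  pvcreate /dev/sda1 /dev/sdb1            # new PVs (node D)
  vgcreate testvg /dev/sda1 /dev/sdb1     # new VG (node E)
  lvcreate -l 100 -n testlv testvg        # new LV (back to node A)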

I've run this on a 5 node cluster.  Generally one node will still be able to
re-read the partition tables, but the other four nodes won't.

Version-Release number of selected component (if applicable):
lvm2-cluster-2.01.14-1.0.RHEL4

How reproducible:
Every time.

Steps to Reproduce:
1. Form a GULM cluster (haven't seen it on DLM yet)
2. Run lvm_config test case from STS test suite

  
Actual results:
[nstraz@try sts]$ lvm2/bin/lvm_config -s 29547 -R /home/msp/nstraz/morph-cluster.xml -l /usr/tests/sts -r /usr/tests/sts -i 15
Random seed being used: 29547
iter=1
### REMOVE old and then CREATE new volume ###
Skipping mirroring attempts
iter=2
### REMOVE old and then CREATE new volume ###
4 disk(s) to be used: /dev/sda /dev/sdb /dev/sdc /dev/sdd

cleaning up lvm if need be on morph-01...
  No volume groups found
  No volume groups found
  No volume groups found

cleaning up lvm if need be on morph-02...
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.

cleaning up lvm if need be on morph-03...
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.

cleaning up lvm if need be on morph-04...
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.

cleaning up lvm if need be on morph-05...
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.
  connect() failed on local socket: Connection refused
  Locking type 2 initialisation failed.

killing clvmd to get around BZ 138396
Killing clvmd on morph-01...
Killing clvmd on morph-02...
Killing clvmd on morph-03...
Killing clvmd on morph-04...
Killing clvmd on morph-05...

LEGS=4
dicenum=1
dicing /dev/sda into 1... 
dicing /dev/sdb into 1... 
dicing /dev/sdc into 1... 
dicing /dev/sdd into 1... 
re-reading disks on morph-01...
re-reading disks on morph-02...
re-reading disks on morph-03...
re-reading disks on morph-04...
re-reading disks on morph-05...

starting clvmd again to get around BZ 138396
starting clvmd on morph-01
starting clvmd on morph-02
starting clvmd on morph-03
starting clvmd on morph-04
starting clvmd on morph-05
wait for clvm services on all nodes...

creating lvm devices...
make PV out of all avail partitions on morph-03
make VG out of all avail partitions on morph-05
activate VG on morph-01
make needed LVs on morph-03
lvcreate -l 173636 --stripes 4 --stripesize 64 -n stripe_4_78480 stripe_4_7848

REDUCE size of lv from 173636 to 164504 on morph-04
EXTEND size of lv from 164504 to 165773 on morph-04

REDUCE size of lv from 165772 to 50192 on morph-02
EXTEND size of lv from 50192 to 59005 on morph-02

REDUCE size of lv from 59004 to 23774 on morph-01
EXTEND size of lv from 23772 to 31200 on morph-01

REDUCE size of lv from 31200 to 4090 on morph-03
EXTEND size of lv from 4088 to 120893 on morph-04

REDUCE size of lv from 120892 to 73101 on morph-01
EXTEND size of lv from 73100 to 126808 on morph-05

REDUCE size of lv from 126808 to 122689 on morph-02
EXTEND size of lv from 122688 to 138797 on morph-02

iter=3
### REMOVE old and then CREATE new volume ###
Skipping mirroring attempts
iter=4
### REMOVE old and then CREATE new volume ###
Skipping mirroring attempts
iter=5
### REMOVE old and then CREATE new volume ###
4 disk(s) to be used: /dev/sda /dev/sdb /dev/sdc /dev/sdd

cleaning up lvm if need be on morph-01...
removing lv /dev/stripe_4_7848/stripe_4_78480 on node morph-01
deactivating vg stripe_4_7848 on node morph-01
removing vg stripe_4_7848 on node morph-01
removing pv /dev/sda1 on node morph-01
removing pv /dev/sdb1 on node morph-01
removing pv /dev/sdc1 on node morph-01
removing pv /dev/sdd1 on node morph-01

cleaning up lvm if need be on morph-02...
  No volume groups found
  No volume groups found
  No volume groups found

cleaning up lvm if need be on morph-03...
  No volume groups found
  No volume groups found
  No volume groups found

cleaning up lvm if need be on morph-04...
  No volume groups found
  No volume groups found
  No volume groups found

cleaning up lvm if need be on morph-05...
  No volume groups found
  No volume groups found
  No volume groups found

killing clvmd to get around BZ 138396
Killing clvmd on morph-01...
Killing clvmd on morph-02...
Killing clvmd on morph-03...
Killing clvmd on morph-04...
Killing clvmd on morph-05...

LEGS=3
dicenum=1
dicing /dev/sda into 1... 
dice of /dev/sda into 1 on morph-02 failed
Using default ID type
Checking that no-one is using this disk right now ...
BLKRRPART: Device or resource busy

This disk is currently in use - repartitioning is probably a bad idea.
Umount all file systems, and swapoff all swap partitions on this disk.
Use the --no-reread flag to suppress this check.
Use the --force flag to overrule all checks.
Dice failed
vol_config failed


Expected results:
Every iteration should continue like the first iteration and we should get
through all 15 iterations.

Additional info:
Here is some information I've been collecting when this happens.  On morph-01
the disks are still usable.  The stats for the partitions appear to have been
reset on morph-01 (the first block of output below).

$ for i in morph-0{1,2,3,4,5}; do qarsh root@$i 'for d in sda sdb sdc sdd; do /usr/tests/sts/bin/readpart /dev/$d; done; cat /sys/block/sd*/sd*1/stat'; done
      68      272        0        0
       0        0        0        0
       0        0        0        0
       0        0        0        0
Unable to BLKRRPART on  /dev/sda: Device or resource busy
Unable to BLKRRPART on  /dev/sdb: Device or resource busy
Unable to BLKRRPART on  /dev/sdc: Device or resource busy
Unable to BLKRRPART on  /dev/sdd: Device or resource busy
     949     3880       12       56
     897     7280       12      112
    1249     2035       12       20
    1249     2035       12       20
Unable to BLKRRPART on  /dev/sda: Device or resource busy
Unable to BLKRRPART on  /dev/sdb: Device or resource busy
Unable to BLKRRPART on  /dev/sdc: Device or resource busy
Unable to BLKRRPART on  /dev/sdd: Device or resource busy
     917     3740       11       56
     848     6848       10       80
    1312     2058       10       24
    1312     2058       10       24
Unable to BLKRRPART on  /dev/sda: Device or resource busy
Unable to BLKRRPART on  /dev/sdb: Device or resource busy
Unable to BLKRRPART on  /dev/sdc: Device or resource busy
Unable to BLKRRPART on  /dev/sdd: Device or resource busy
     943     3856        9       44
     891     7208        9       80
    1243     2030        9       15
    1243     2030        9       15
Unable to BLKRRPART on  /dev/sda: Device or resource busy
Unable to BLKRRPART on  /dev/sdb: Device or resource busy
Unable to BLKRRPART on  /dev/sdc: Device or resource busy
Unable to BLKRRPART on  /dev/sdd: Device or resource busy
     963     3916        6       24
     895     7240        6       48
    1359     2139        6        9
    1359     2139        6        9

If there are other things I should collect, please let me know.  I would like
to collect the reference counts on the block devices, or find out who still has
them open, but I don't know how to do that.
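
(For the record, a few ways to check who still has a block device open - a
rough sketch: lsof/fuser only show userspace openers, dmsetup covers kernel-side
device-mapper users, and the /sys holders directory only exists on newer kernels.)

  lsof /dev/sda /dev/sda1           # userspace processes with the disk/partition open
  fuser -v /dev/sda1
  dmsetup ls                        # leftover device-mapper devices
  dmsetup table                     # ...and which underlying devices they map
  ls /sys/block/sda/sda1/holders    # kernel-side claimants (newer kernels only)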

Comment 1 Nate Straz 2005-12-07 00:02:10 UTC
Tried it with DLM on the same cluster and ran into the same problem.
I have another cluster running the same tests without problems.  Same storage
hardware and HBAs too.

Comment 2 Nate Straz 2005-12-07 20:38:08 UTC
I did some digging with systemtap and found why EBUSY is being returned.
In rescan_partitions(), bdev->bd_part_count == 1 on the failing nodes.

Here is the systemtap script I used:

/* rescan.stp
   This is a simple script to figure out what is happening with block device
   rescans.
 */

global count_rescan

probe kernel.function("blkdev_ioctl")
{
  log("blkdev_ioctl cmd " . string($cmd));
}

probe kernel.function("blkdev_reread_part")
{
  log("entered blkdev_reread_part");
}

probe kernel.function("rescan_partitions")
{
  log("entered rescan_partitions");
  log("part_count = " . string($bdev->bd_part_count));
}

probe kernel.function("invalidate_partition")
{
  log("entered invalidate_partition");
}

probe begin { log("starting probe") }

probe end
{
        log("ending probe");
}


Passing output:
[root@morph-01 ~]# stap /tmp/rescan.stp
starting probe
blkdev_ioctl cmd 21286
blkdev_ioctl cmd 21286
blkdev_ioctl cmd 4703
entered blkdev_reread_part
entered rescan_partitions
part_count = 0
entered invalidate_partition
blkdev_ioctl cmd 4712
blkdev_ioctl cmd -2147216782
blkdev_ioctl cmd 21286
ending probe

Failing output:
[root@morph-02 ~]# stap /tmp/rescan.stp
starting probe
blkdev_ioctl cmd 21286
blkdev_ioctl cmd 21286
blkdev_ioctl cmd 4703
entered blkdev_reread_part
entered rescan_partitions
part_count = 1
blkdev_ioctl cmd 21286
ending probe
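
(Side note: bd_part_count is bumped whenever some partition of the disk is held
open, so the same EBUSY can be provoked by hand, roughly like this - using
blockdev --rereadpt as a stand-in for readpart:)

  sleep 600 < /dev/sda1 &        # hold a partition open in the background
  blockdev --rereadpt /dev/sda   # BLKRRPART now fails with EBUSY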


Comment 3 Nate Straz 2005-12-07 20:54:00 UTC
Ah ha!  I found the holder!  On the nodes that weren't working, there was still
an entry in the DM table.  After using dmsetup to remove it, I was back in business.
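
For anyone else hitting this, the check and cleanup are roughly as follows (the
mapping name is whatever dmsetup ls reports, typically vgname-lvname; the one
below is a guess based on the VG/LV in this run):

  dmsetup ls                                    # stale mapping still listed on the bad nodes
  dmsetup remove stripe_4_7848-stripe_4_78480   # drop it
  blockdev --rereadpt /dev/sda                  # partition re-read works again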

Comment 4 Nate Straz 2005-12-07 21:42:30 UTC
ARG!  Found that lvm.conf was wrong on one node.  It didn't have cluster locking
enabled.
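
A quick way to catch this in the future might be to compare the locking setup
across nodes; with clvmd, locking_type in /etc/lvm/lvm.conf needs to be the
cluster locking type (type 2, the external cluster locking library, in this
release), not the default local type 1:

  for i in morph-0{1,2,3,4,5}; do
      qarsh root@$i "grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf"
  done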