Bug 683208 - lvcreate in 6 node cluster no longer getting propagated to other nodes
Summary: lvcreate in 6 node cluster no longer getting propagated to other nodes
Keywords:
Status: CLOSED DUPLICATE of bug 673981
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2-cluster
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Milan Broz
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-08 20:23 UTC by Debbie Johnson
Modified: 2018-11-14 14:08 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-03-09 22:35:02 UTC
Target Upstream Version:


Attachments
clvmd debug logs (69.08 KB, application/x-gzip)
2011-03-08 20:24 UTC, Debbie Johnson
dmsetup info -c and pvscans -vvv (88.73 KB, application/x-gzip)
2011-03-08 22:01 UTC, Debbie Johnson

Description Debbie Johnson 2011-03-08 20:23:14 UTC
Description of problem:
Creating BZ as this is becoming very urgent to customer to get resolved.
Customer has a 6 node cluster and when they now do a lvcreate or lvremove
it does not get propagated to the other nodes.  Deleting the lv will cause the other nodes will only mark the logical volume as 'inactive' in a lvscan.
Creating the lv it does not show the lv at all except for the created node.
clvmd is running on all nodes.  clvmd -R is issued.  Only way to have cluster nodes consistent is to move all virtual machines using that volumes group off a host, and issuing a 'service clvmd stop/start' will the new logical volume be detected and available on that host. The process has to be repeated on all other nodes. 
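
For reference, a minimal sketch of that workaround on a single host, assuming Xen live migration with xm (the cluster runs an el5xen kernel); DOMAIN and OTHER_HOST are placeholders, not names from this report:

# migrate every guest that uses the affected volume group off this host
xm migrate --live DOMAIN OTHER_HOST    # repeat for each guest on the host

# restart clvmd so this node re-reads the cluster LVM metadata
service clvmd stop
service clvmd start

# the new/removed LV should now be reflected on this node
lvscan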

Version-Release number of selected component (if applicable):
2.6.18-238.el5xen

How reproducible:
every time

Steps to Reproduce:
1.
2.
3.
  
Actual results:
The new LV is only shown on the node where it was created.

Expected results:
The LV appears on all nodes.

Additional info:

I have lots of data: lvmdumps, sosreports, and clvmd -d output collected after we had the customer run through the procedure below.

The plan is to restart clvmd with -d, run lvcreate to reproduce the issue, wait for it to propagate to all nodes, then kill clvmd and grab the logs (the steps are written out as commands after the list below).

Please do the following:

a) "killall clvmd" on each node
b) script /tmp/clvmd-$(uname -n).out
c) clvmd -d
d) repro issue
e) ctrl+c the clvmd
f) exit script
g) attach script
h) clvmd --  to start it back up, not in debug mode
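
Written out as commands, this is roughly the per-node sequence above; it is not an automated script, since clvmd -d stays in the foreground and steps d) through f) are interactive:

killall clvmd                       # a) stop the running daemon
script /tmp/clvmd-$(uname -n).out   # b) start capturing terminal output
clvmd -d                            # c) run clvmd in the foreground with debug output
# d) reproduce the issue (lvcreate on one node), then
# e) Ctrl+C to stop clvmd, and
# f) type "exit" to end the script session
# g) attach /tmp/clvmd-<hostname>.out to this bug
clvmd                               # h) start clvmd back up, not in debug mode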


Will attach the clvmd logs.  If you need anything else, just ask.

Comment 1 Debbie Johnson 2011-03-08 20:24:47 UTC
Created attachment 483007 [details]
clvmd debug logs

Comment 2 Debbie Johnson 2011-03-08 20:34:41 UTC
Here is some of the latest testing and what was done while the logs I uploaded were being captured...

Created a new lvol (LV_CLVMD_TEST) on CVG_101 at 10:32am on cglnxhv03.  Issued
clvmd -R at 12:30am (got sidetracked).  Stopped the debug clvmd 20 minutes later.  I
don't see that lvol on any other node.  You will see a few lvscans in the logs.


Currently only cglnxhv03 displays the new lvol; the other 5 do not.
[root@cglnxhv11 ~]# ssh n09 lvscan|grep CLVMD
[root@cglnxhv11 ~]# ssh n07 lvscan|grep CLVMD
[root@cglnxhv11 ~]# ssh n05 lvscan|grep CLVMD
[root@cglnxhv11 ~]# ssh n03 lvscan|grep CLVMD
  ACTIVE            '/dev/CVG_101/LV_CLVMD_TEST' [2.00 GB] inherit
[root@cglnxhv11 ~]# ssh n01 lvscan|grep CLVMD
[root@cglnxhv11 ~]# 
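
A quick way to run the same check across the nodes in one pass; this is a sketch assuming the ssh access shown in the transcript above and the same node aliases:

for node in n01 n03 n05 n07 n09; do
    echo "== $node =="
    ssh "$node" "lvscan | grep CLVMD"
done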


I used this to create the lvol:
[root@cglnxhv03 ~]# lvcreate -L 2G -n LV_CLVMD_TEST CVG_101
  Logical volume "LV_CLVMD_TEST" created

Later followed by:
[root@cglnxhv03 tmp]# clvmd -R
This was only run on cglnxhv03.

Removing the .cache file did not help.  A new one was generated.
[root@cglnxhv11 cache]# lvscan|grep CLVMD
[root@cglnxhv11 cache]# rm .cache 
[root@cglnxhv11 cache]# lvscan|grep CLVMD
[root@cglnxhv11 cache]# ls -la
total 32
drwx------ 2 root root  4096 Mar  8 08:58 .
drwx------ 5 root root  4096 Feb  3 16:30 ..
-rw------- 1 root root 15037 Mar  8 08:58 .cache

Comment 3 Debbie Johnson 2011-03-08 22:01:39 UTC
Created attachment 483039 [details]
dmsetup info -c and pvscans -vvv

Comment 4 Debbie Johnson 2011-03-08 22:07:35 UTC
What was done before the problem started:

Prior to this issue, one of the activities was removing 3 multipathed LUNs from the cluster.  They would have been named WARR_05FC_MP, WARR_060B_MP, WARR_061A_MP.

I know one was part of an existing volume group that had to be reduced; the other two were in isolated volume groups that were removed entirely.  The LVM work was completed first (with clvmd -R run frequently), followed by the removal of the multipath definitions from multipath.conf; multipath was then flushed, the SCSI devices were deleted, multipathd was restarted, and the LUNs were unzoned.
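
For context, a sketch of the kind of command sequence that describes; the volume group names and the sdX path device below are placeholders, and only the WARR_* map names come from this report:

# LVM work first (placeholder VG names)
vgreduce SOME_VG /dev/mapper/WARR_05FC_MP   # shrink the VG that kept existing
vgremove ISOLATED_VG                        # drop the isolated VGs entirely
pvremove /dev/mapper/WARR_05FC_MP
clvmd -R                                    # tell all clvmd daemons to reload their device cache

# then remove the multipath maps and underlying SCSI devices
# (after deleting the WARR_* entries from /etc/multipath.conf)
multipath -f WARR_05FC_MP                   # flush the multipath map
echo 1 > /sys/block/sdX/device/delete       # delete each SCSI path device
service multipathd restart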

Comment 10 Debbie Johnson 2011-03-09 14:18:35 UTC
Milan,

These dumps have nothing to do with the clvmd logs.  The clvmd logs are from a
# lvcreate -L 2G -n LV_CLVMD_TEST CVG_101
I will ask Robert to ask the customer to get the lvmdumps now if you wish.

Please let me know what you need to go along with the clvmd -d logs.

Deb

Comment 13 Debbie Johnson 2011-03-09 15:22:29 UTC
Milan,

Thanks.  Will do.  Thanks so much for getting to this so quickly.  So the action plan is:

1) Install lvm2-2.02.74-5.el5_6.1 lvm2-cluster-2.02.74-3.el5_6.1 on each cluster node (a sketch of this step follows the plan).

2) Start with a clean configuration on all nodes and a fresh boot.  Then:

Please do the following:

a) "killall clvmd" on each node
b) script /tmp/clvmd-$(uname -n).out
c) clvmd -d
d) repro issue
e) ctrl+c the clvmd
f) exit script
g) attach script
h) clvmd --  to start it back up, not in debug mode

3) If the new LV was propagated to every node, let us know that the problem is resolved.  If it was not propagated to every node, then send us:

The clvmd logs from all nodes and an lvmdump from each node, collected after this
test.
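
A sketch of step 1 across the cluster, assuming passwordless root ssh and that these package versions are available from the configured yum repositories; the node aliases are examples following the earlier transcript:

for node in n01 n03 n05 n07 n09 n11; do
    ssh "$node" "yum -y update lvm2-2.02.74-5.el5_6.1 lvm2-cluster-2.02.74-3.el5_6.1"
done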

Does this action plan cover what we need to have done?

Deb

Comment 14 Milan Broz 2011-03-09 16:16:14 UTC
Yes.  I think Dave and Chrissie can help here as well to find the source of the problem.

Comment 15 Debbie Johnson 2011-03-09 22:10:05 UTC
Milan,

This BZ can be closed.  The problem was fixed by upgrading the packages.
Thank you so much for your help.

From customer:

After updating these 2 packages and rebooting each node, we are no longer experiencing this issue.  I verified that lvcreate, lvremove, and lvrename results are now seen by all other nodes as expected. I used several different volume groups, and issued commands from different nodes to test.

Comment 16 Milan Broz 2011-03-09 22:35:02 UTC
Then it was almost certainly bug #673981.

Please reopen if this appears again.

*** This bug has been marked as a duplicate of bug 673981 ***

