Description of problem:
The errors seen here in the syslog are nothing new, but I couldn't find an open bug that resembles this issue, though it looks similar to "fixed" bug 501473. This was on the hayes cluster, which uses AOE devices, but that shouldn't affect communication with the clog server.

pv_shuffle output:
[lvm_cluster_pvmove] [lvm_cluster_pvmove] === Iteration 1 of 10 started on hayes-01 at Thu Jun 18 15:21:40 CDT 2009 ===
[lvm_cluster_pvmove] [lvm_cluster_pvmove] FREE PVS: /dev/etherd/e1.1p1 /dev/etherd/e1.1p10 /dev/etherd/e1.1p11 /dev/etherd/e1.1p12
[lvm_cluster_pvmove] [lvm_cluster_pvmove] INUSE PVS:
[lvm_cluster_pvmove] [lvm_cluster_pvmove]   pv_shuffle_A=/dev/etherd/e1.1p6 /dev/etherd/e1.1p7 /dev/etherd/e1.1p8 /dev/etherd/e1.1p9
[lvm_cluster_pvmove] [lvm_cluster_pvmove]   pv_shuffle_B=/dev/etherd/e1.1p14 /dev/etherd/e1.1p2 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4
[lvm_cluster_pvmove] [lvm_cluster_pvmove] NOT INUSE PVS:
[lvm_cluster_pvmove] [lvm_cluster_pvmove]   pv_shuffle_A=/dev/etherd/e1.1p5
[lvm_cluster_pvmove] [lvm_cluster_pvmove]   pv_shuffle_B=/dev/etherd/e1.1p13
[lvm_cluster_pvmove] [lvm_cluster_pvmove]
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Adding /dev/etherd/e1.1p10 to pv_shuffle_A
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Adding /dev/etherd/e1.1p11 to pv_shuffle_B
[lvm_cluster_pvmove] [lvm_cluster_pvmove]
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Moving data from /dev/etherd/e1.1p9 to /dev/etherd/e1.1p10 on hayes-01
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Moving data from /dev/etherd/e1.1p4 to /dev/etherd/e1.1p11 on hayes-01
[lvm_cluster_pvmove] [lvm_cluster_pvmove] hayes-01: pvmove -v /dev/etherd/e1.1p9 /dev/etherd/e1.1p10
[lvm_cluster_pvmove] [lvm_cluster_pvmove] hayes-01: pvmove -v /dev/etherd/e1.1p4 /dev/etherd/e1.1p11
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Finding volume group "pv_shuffle_A"
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Executing: /sbin/modprobe dm-log-clustered
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Archiving volume group "pv_shuffle_A" metadata (seqno 4).
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Finding volume group "pv_shuffle_B"
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Executing: /sbin/modprobe dm-log-clustered
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Archiving volume group "pv_shuffle_B" metadata (seqno 4).
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Creating logical volume pvmove0
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Moving 896 extents of logical volume pv_shuffle_A/linear
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Creating logical volume pvmove0
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Moving 896 extents of logical volume pv_shuffle_B/linear
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Updating volume group metadata
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Updating volume group metadata
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-03: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-02: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-01: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Temporary pvmove mirror activation failed.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-03: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-02: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-01: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Temporary pvmove mirror activation failed.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-03: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-02: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-01: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Unable to drop cached metadata for VG pv_shuffle_A.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] ABORTING: Failed to write new data locations to disk.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] ABORTING: Restoring original configuration before pvmove failed. Run pvmove --abort.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-03: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-02: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-01: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Unable to drop cached metadata for VG pv_shuffle_B.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] ABORTING: Failed to write new data locations to disk.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] ABORTING: Restoring original configuration before pvmove failed. Run pvmove --abort.
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-03: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-02: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-01: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 5).
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-03: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-02: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Error locking on node hayes-01: Command timed out
[lvm_cluster_pvmove] [lvm_cluster_pvmove] Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 5).

Jun 18 15:21:16 hayes-01 qarshd[17393]: Running cmdline: pvmove -v /dev/etherd/e1.1p9 /dev/etherd/e1.1p10
Jun 18 15:21:16 hayes-01 xinetd[2835]: START: qarsh pid=17395 from=10.15.80.47
Jun 18 15:21:16 hayes-01 qarshd[17395]: Talking to peer 10.15.80.47:42875
Jun 18 15:21:16 hayes-01 qarshd[17395]: Running cmdline: dmsetup status pv_shuffle_A-pvmove0 2>&1 | grep -q core
Jun 18 15:21:16 hayes-01 xinetd[2835]: EXIT: qarsh status=0 pid=17395 duration=0(sec)
Jun 18 15:21:16 hayes-01 xinetd[2835]: START: qarsh pid=17399 from=10.15.80.47
Jun 18 15:21:16 hayes-01 qarshd[17399]: Talking to peer 10.15.80.47:42876
Jun 18 15:21:16 hayes-01 qarshd[17399]: Running cmdline: pvmove -v /dev/etherd/e1.1p4 /dev/etherd/e1.1p11
Jun 18 15:21:17 hayes-01 xinetd[2835]: START: qarsh pid=17460 from=10.15.80.47
Jun 18 15:21:17 hayes-01 qarshd[17460]: Talking to peer 10.15.80.47:42877
Jun 18 15:21:17 hayes-01 qarshd[17460]: Running cmdline: dmsetup status pv_shuffle_A-pvmove0 2>&1 | grep -q core
Jun 18 15:21:32 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_RESUME/49] - retrying
Jun 18 15:21:32 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_GET_SYNC_COUNT/50] - retrying
Jun 18 15:21:32 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_GET_SYNC_COUNT/51] - retrying
Jun 18 15:21:47 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_RESUME/52] - retrying
[...]
Jun 18 15:42:33 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_RESUME/301] - retrying
Jun 18 15:42:33 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_GET_SYNC_COUNT/302] - retrying
Jun 18 15:42:33 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_GET_SYNC_COUNT/303] - retrying
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_RESUME/304] - retrying
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_RESUME] to server: -3
Jun 18 15:42:48 hayes-01 kernel: device-mapper: raid1: log resume failed
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_GET_RESYNC_WORK] to server: -3
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Userspace cluster log server not found.
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_CTR] to server: -3
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Userspace cluster log server not found
Jun 18 15:42:48 hayes-01 kernel: device-mapper: table: 253:5: mirror: Error creating mirror dirty log
Jun 18 15:42:48 hayes-01 kernel: device-mapper: ioctl: error adding target to table
Jun 18 15:42:48 hayes-01 xinetd[2835]: EXIT: qarsh status=0 pid=17399 duration=1292(sec)
Jun 18 15:42:48 hayes-01 xinetd[2835]: EXIT: qarsh status=0 pid=17393 duration=1292(sec)
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_GET_SYNC_COUNT/305] - retrying
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_GET_SYNC_COUNT] to server: -3
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Userspace cluster log server not found.
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: [gFnfLkER] Request timed out: [DM_CLOG_GET_SYNC_COUNT/306] - retrying
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_GET_SYNC_COUNT] to server: -3
Jun 18 15:42:48 hayes-01 kernel: device-mapper: dm-log-clustered: Userspace cluster log server not found.
Jun 18 15:42:50 hayes-01 kernel: device-mapper: dm-log-clustered: Attempting to contact cluster log server...
Jun 18 15:42:50 hayes-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_CTR] to server: -3
Jun 18 15:42:50 hayes-01 kernel: device-mapper: dm-log-clustered: Attempting to contact cluster log server...
Jun 18 15:42:50 hayes-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_CTR] to server: -3
Jun 18 15:42:50 hayes-01 kernel: device-mapper: dm-log-clustered: Attempting to contact cluster log server...

clogd is still running on all nodes in the cluster:
[root@hayes-01 ~]# ps -ef | grep clogd
root      5275     1  0 10:11 ?        00:00:00 clogd
[root@hayes-02 ~]# ps -ef | grep clogd
root      5286     1  0 10:11 ?        00:00:00 clogd
[root@hayes-03 ~]# ps -ef | grep clogd
root      5336     1  0 10:12 ?        00:00:00 clogd

[root@hayes-01 ~]# dmsetup ls
pv_shuffle_A-pvmove0    (253, 4)
pv_shuffle_A-linear     (253, 2)
pv_shuffle_B-pvmove0    (253, 5)
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)
pv_shuffle_B-linear     (253, 3)

[root@hayes-01 ~]# dmsetup status
[HANG]

[root@hayes-03 ~]# dmsetup info
Name:              pv_shuffle_A-pvmove0
State:             ACTIVE
/dev/mapper/pv_shuffle_A-pvmove0: open failed: No such file or directory
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-K55oFeoo5Edjvy3BTskg5LC8XSxkhpeNw1df0vzuhJ3jalSNWQZb502BgFnfLkER

Name:              pv_shuffle_A-linear
State:             SUSPENDED
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-K55oFeoo5Edjvy3BTskg5LC8XSxkhpeNUhfUM7yV8V9QNEZWbkQYd0x2lSHzlcQn

Name:              pv_shuffle_B-pvmove0
State:             ACTIVE
/dev/mapper/pv_shuffle_B-pvmove0: open failed: No such file or directory
Tables present:    None
Open count:        0
Event number:      0
Major, minor:      253, 5
Number of targets: 0
UUID: LVM-3BcsvJ7TmsqCqrpn6gDjvSGxgmLTA95VvEt8Zq6Y2B1t6PSCReD6bWVy8QoLGbWK

Name:              VolGroup00-LogVol01
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-daGgA9bZOYmMtbX3yGBmK0mVSb2pdnLo6TdzQIlAXLtrUnMriWHC8M8JGZaBsJak

Name:              VolGroup00-LogVol00
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-daGgA9bZOYmMtbX3yGBmK0mVSb2pdnLoM2U5LYl3xqXXNYTL2VKyTgHTuv23RgMt

Name:              pv_shuffle_B-linear
State:             SUSPENDED
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-3BcsvJ7TmsqCqrpn6gDjvSGxgmLTA95VxPP5scFn23HjHe48yKILXA4onUVdwj8J

I'll attach the logs and dumps of what's currently running.

Version-Release number of selected component (if applicable):
[root@hayes-03 ~]# /usr/tests/sts-rhel5.4/lvm2/bin/lvm_rpms
2.6.18-150.el5
lvm2-2.02.46-8.el5            BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
device-mapper-1.02.32-1.el5   BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5          BUILT: Tue May 5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009
Created attachment 348558 [details]
log from hayes-01
Created attachment 348559 [details]
log from hayes-02
Created attachment 348560 [details]
log from hayes-03
This may be a bigger deal than just pvmoves. I'm seeing this during other operations as well, like lvreduce.

--- ALTERATION ITERATION 2 ---------------------------

VOLUME REDUCTION from 33072 to 22789 on taft-04
  WARNING: Reducing active logical volume to 89.02 GB
  THIS MAY DESTROY YOUR DATA (filesystem etc.)

[DEADLOCK]

Name:              mirror_1_4509-mirror_1_45090
State:             SUSPENDED
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      1
Major, minor:      253, 75
Number of targets: 1
UUID: LVM-mLiEzEQbpF3luJ6Z2sZz6PA53YR9sAlvHh7pPeGHEezzgW2sNMUP6aq0uFJf0t4H

Appears to be a regression, as I've now seen this quite a few times.
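For reference, a hedged sketch of the kind of command the reduction step above runs (the test's literal invocation isn't captured here; the VG/LV names are inferred from the dm device name mirror_1_4509-mirror_1_45090, and the extent count from the "33072 to 22789" line):

  # Inferred reproduction, not the test's exact command line:
  # shrink the active, mirrored LV from 33072 to 22789 extents.
  lvreduce -f -l 22789 mirror_1_4509/mirror_1_45090

  # After the deadlock, the mirror's dm device is left SUSPENDED:
  dmsetup info mirror_1_4509-mirror_1_45090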
Are you still seeing this with the latest build? The log server looks to be hung. Have you tried getting information from 'killall -USR1 clogd'? An strace? Attaching gdb? We need to find out where/why it is hung. I could check out your machines too, if you can repro the problem.
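If it helps, the data collection suggested above might look something like this on each node (standard invocations, not output from this report; 'thread apply all bt' just collects a backtrace from every thread):

  killall -USR1 clogd                                      # ask clogd to dump its state
  strace -f -p $(pidof clogd)                              # is it stuck in a syscall?
  gdb -p $(pidof clogd) -batch -ex 'thread apply all bt'   # stacks from all threads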
Easily reproduced, attaching logs containing debug info...
Created attachment 354550 [details] log from grant-01
Created attachment 354552 [details] log from grant-02
Created attachment 354553 [details] log from grant-03
commit fa15cb16f0de9ddf06ddb50f2fcf26dd9c257931
Author: Jonathan Brassow <jbrassow>
Date:   Mon Jul 27 13:07:36 2009 -0500

    clogd/dm-log-clustered.ko: Fix for bugs 506843, 479749, 507400

    Device-mapper userspace logs (like the clustered log) are identified by
    a universally unique identifier (UUID). This identifier is used to
    associate requests from the kernel to a specific log in userspace. The
    UUID must be unique everywhere, since multiple machines may use this
    identifier when sharing a log, as is the case for cluster logs.

    Sometimes, device-mapper/LVM may re-use a UUID. This is the case during
    'pvmove's, when moving from one segment of an LV to another, or when
    resizing a mirror, etc. In these cases, a new log is created with the
    same UUID and loaded in the "inactive" slot. When a device-mapper
    "resume" is issued, the "live" table is deactivated and the new
    "inactive" table becomes "live". (The "inactive" table can also be
    removed via a device-mapper 'clear' command.)

    The above two issues were colliding in the 3 bugs (possibly more)
    mentioned in the title. More than one log was being created with the
    same UUID, and there was no way to distinguish between them. So,
    sometimes the wrong log would be swapped out during the exchange.

    The solution is to create a 'uuid_instance', or perhaps a 'luid': a
    local unique identifier to go along with the UUID. This new identifier
    is used to determine exactly which log is being referenced by the
    kernel when the log exchange is made. The identifier is not
    universally safe, but it does not need to be, since
    create/destroy/suspend/resume operations are bound to a specific
    machine; and these are the operations that make up the exchange.
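To make the live/inactive table swap the commit describes concrete, here is a minimal demonstration using plain dmsetup and the generic "zero" target (device name, UUID, and sizes are made up for illustration; in the failing case the device is a cmirror whose old and new clustered logs share one UUID, which is exactly what the new luid disambiguates):

  # Create a device with a fixed UUID and an initial "live" table:
  echo '0 204800 zero' | dmsetup create demo --uuid DEMO-0001

  # Load a replacement table; it sits in the "inactive" slot under
  # the very same UUID:
  echo '0 409600 zero' | dmsetup load demo

  # suspend/resume retires the live table and makes the inactive one
  # live. For a clustered-log mirror, this is the exchange during
  # which the wrong same-UUID log could previously be swapped in:
  dmsetup suspend demo
  dmsetup resume demo
  dmsetup table demo     # now shows the 409600-sector table

  dmsetup remove demo    # clean up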
Marking fix verified in cmirror-1.1.39-2.el5. Ran pv_shuffle and lvm_config all night without seeing this issue; however, we did end up hitting bug 506986.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1340.html