Created attachment 376062 [details]
screenshot of machine console with stack trace

Description of problem:
In a cluster using clvmd/cmirror, if cmirror is started after clvmd, neither startup fails, but a subsequent lvcreate of a _mirrored_ volume will panic all cluster nodes.

Version-Release number of selected component (if applicable):
lvm2-2.02.46-8.el5
kernel-2.6.18-164.el5
rgmanager-2.0.52-1.el5
cmirror-1.1.39-2.el5
lvm2-cluster-2.02.46-8.el5_4.1
cman-2.0.115-1.el5
kmod-cmirror-0.1.22-1.el5

How reproducible:
Always

Steps to Reproduce:
1. Install 2 virtual machines with RHEL 5.4 on a Fedora 12 host, using kvm/libvirt and iSCSI for shared storage.
2. Configure a simple cluster (with no resources).
3. Install lvm2-cluster and cmirror.
4. Start services by hand in this order: iscsid, iscsi, multipathd, cman, rgmanager, clvmd, cmirror (the last two should be started in the reverse order, i.e. cmirror _before_ clvmd).

clvmd throws a warning during startup, but nonetheless it (apparently) starts fine:

# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:   Error locking on node testcnode1: device-mapper: reload ioctl failed: Invalid argument
  1 logical volume(s) in volume group "testvg" now active
  2 logical volume(s) in volume group "vg_root" now active
                                                           [  OK  ]
[root@testhost1 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]

5. Create a test volume group:
# vgcreate testvg /dev/mpath/disk00p1 /dev/mpath/disk01p1 /dev/mpath/disk02p1 /dev/mpath/disk03p1 /dev/mpath/disk04p1

6. Try to create a mirrored volume:
# lvcreate -n lvtest02 -L 1g -m 1 testvg

Actual results:
Both cluster nodes panic.

Expected results:
Successful mirrored clustered volume creation. clvmd should probably start cmirror, or require that it be started before clvmd itself.

Additional info:
I can provide a crash dump taken using kdump; I'll attach a screenshot of the console with the stack trace.
There's something more to this than cmirror being started after clvmd. That alone won't cause any problems.

[root@taft-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:   2 logical volume(s) in volume group "VolGroup00" now active
                                                           [  OK  ]
[root@taft-01 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]
[root@taft-01 ~]# vgcreate vg /dev/sd[bcdefgh]1
  Clustered volume group "vg" successfully created
[root@taft-01 ~]# lvcreate -n mirror -L 100M -m 1 vg
  Logical volume "mirror" created

I see locking errors during your clvmd startup:
"Error locking on node testcnode1: device-mapper: reload ioctl failed: Invalid argument"

Did all the nodes in your cluster have clvmd/cmirror started? What is your locking_type set to?
Hi, I'm using locking_type = 3. The cluster is a 2-node cluster. I usually run 'service clvmd start' on both nodes at the same time, after the cluster is up (and cmirror is up).

From the tests I did today, I see that "Error locking on node testcnode1: device-mapper: reload ioctl failed: Invalid argument" happens only if:
a) I have at least one already-existing mirrored volume in my VG, AND
b) I start clvmd when cmirror is NOT up (clvmd before cmirror).

But by now I've hit a panic even just starting clvmd, so I'm running some more tests to see if I find any clue.
I have now been able to reproduce what appears to be the bug. I had left-over cmirror fragments (from a device failure test that had hung), and I rebooted all 4 of my taft machines. When they came back up I started clvmd w/o starting cmirror. That caused the expected mirror locking errors. I then stopped clvmd, started cmirror, and then attempted to restart clvmd, at which point I hit what appears to be this panic on 3/4 machines.

[root@taft-01 ~]# service cman start
Starting cluster:
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
                                                           [  OK  ]
[root@taft-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs:   Error locking on node taft-01-bond: device-mapper: reload ioctl failed: Invalid argument
  1 logical volume(s) in volume group "helter_skelter" now active
  2 logical volume(s) in volume group "VolGroup00" now active
                                                           [FAILED]
[root@taft-01 ~]# service clvmd stop
Deactivating VG helter_skelter:   0 logical volume(s) in volume group "helter_skelter" now active
                                                           [  OK  ]
Stopping clvm:                                             [  OK  ]
[root@taft-01 ~]# service cmirror start
Loading clustered mirror log module:                       [  OK  ]
Starting clustered mirror log server:                      [  OK  ]
[root@taft-01 ~]# service clvmd start
Starting clvmd:                                            [  OK  ]
Activating VGs: [PANIC]

Dec 15 10:28:11 taft-01 clvmd: Cluster LVM daemon started - connected to CMAN
Dec 15 10:28:20 taft-01 kernel: device-mapper: dm-log-clustered: Unable to send cluster log request [DM_CLOG_CTR]3
Dec 15 10:28:20 taft-01 kernel: device-mapper: dm-log-clustered: Userspace cluster log server not found
Dec 15 10:28:20 taft-01 kernel: device-mapper: table: 253:4: mirror: Error creating mirror dirty log
Dec 15 10:28:20 taft-01 kernel: device-mapper: ioctl: error adding target to table
Dec 15 10:28:35 taft-01 dmeventd[7203]: dmeventd ready for processing.
Dec 15 10:28:41 taft-01 clogd[7267]: Starting clogd:
Dec 15 10:28:41 taft-01 clogd[7267]:  Built: Jul 27 2009 16:39:04
dlm: Using TCP for communications
dlm: connecting to 3
dlm: got connection from 3
dlm: got connection from 4
dlm: connecting to 2
Dec 15 10:28:47 taft-01 kernel: dlm: Using TCP for communications
Dec 15 10:28:47 taft-01 kernel: dlm: connecting to 3
Dec 15 10:28:47 taft-01 kernel: dlm: got connection from 3
Dec 15 10:28:47 taft-01 kernel: dlm: got connection from 4
Dec 15 10:28:47 taft-01 kernel: dlm: connecting to 2

general protection fault: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth dm_log_clustered(U) lockd sunrpd
Pid: 7298, comm: clvmd Tainted: G      2.6.18-160.el5 #1
RIP: 0010:[<ffffffff88551209>]  [<ffffffff88551209>] :dm_log_clustered:cluster_ctr+0x150/0x293
RSP: 0018:ffff81020c24bb78  EFLAGS: 00010286
RAX: 00000000ffffffb4 RBX: 00010102464c457f RCX: 0000000000000000
RDX: 0000000000000081 RSI: ffff81021c8a7aa8 RDI: ffff8102144e4ce8
RBP: ffff81021c8a7a80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000003000333a33 R12: 000000000012c3ff
R13: 0000000000000000 R14: 0000000000000000 R15: ffff81020b8eda10
FS:  00000000417eb940(0063) GS:ffffffff803c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000415030 CR3: 000000021d9c0000 CR4: 00000000000006e0
Process clvmd (pid: 7298, threadinfo ffff81020c24a000, task ffff8101ffd93080)
Stack:  000000007fffffff 000000030001dc10 ffffc2000007e080 ffff8102142babe0
 ffff81021c8a7aa8 0000000000000000 0000040000000163 ffffffff885545a0
 000000000000000a ffffc200100f4160 ffff8102142babe0 ffff8102142babe0
Call Trace:
 [<ffffffff881019e9>] :dm_log:dm_dirty_log_create+0x109/0x14d
 [<ffffffff8810dd85>] :dm_mirror:mirror_ctr+0x6f/0x578
 [<ffffffff880e9443>] :dm_mod:dm_table_add_target+0x152/0x29f
 [<ffffffff880eb3e0>] :dm_mod:table_load+0xfb/0x1c9
 [<ffffffff880eb2e5>] :dm_mod:table_load+0x0/0x1c9
 [<ffffffff880ebe8e>] :dm_mod:ctl_ioctl+0x210/0x25b
 [<ffffffff800420db>] do_ioctl+0x55/0x6b
 [<ffffffff800302a0>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800b6193>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8004c741>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Code: 48 8b 03 0f 18 08 48 81 fb 10 c9 55 88 75 c1 48 89 de 48 89
I was able to reproduce this yet again with the current state of my volumes.
commit 6819f2d19bf02448b3b099c90450031d60192d06
Author: Jonathan Brassow <jbrassow>
Date:   Tue Dec 22 03:04:18 2009 -0600

    dm-log-clustered: Remove an entry from a list before freeing memory (bz544253)

    Error handling logic failed to remove an item from a list before
    freeing it.  Later, when the list is accessed, a panic occurs.
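To illustrate the class of bug the commit above fixes, here is a minimal userspace sketch of the unlink-before-free pattern. This is not the actual dm-log-clustered code; the struct and function names are made up, and a plain circular intrusive list stands in for the kernel's list_head.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular intrusive list node, standing in for the
 * kernel's struct list_head. */
struct node {
    struct node *prev, *next;
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_add(struct node *head, struct node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink a node from its list.  This is the step the buggy error
 * path skipped: freeing a node while it is still linked leaves the
 * list holding a dangling pointer, and the next traversal
 * dereferences freed memory -- the panic seen in this bug. */
static void list_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void destroy_node(struct node *n)
{
    list_del(n);   /* remove from the list... */
    free(n);       /* ...and only then release the memory */
}
```

After `destroy_node()`, traversing the list never touches the freed node; with the `list_del()` call omitted, the same traversal would walk into freed memory.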
This appears to still be a problem, as the customer is seeing the issue with 5.5 Beta.
I can see a possibility for a failure like this, but to know if the customer is seeing what I expect, I need their syslog messages (/var/log/messages). Specifically, I'm looking for:

"Unable to allocate memory for constructor string"

The 'messages' file should be useful in any event...
Ok, this seems to stand out :)

Feb 22 08:48:00 b99ar00e clogd[15209]: Kernel/Userspace version mismatch!

It seems that 'cmirror' is up-to-date (or they wouldn't get this message), but their kernel module must not be... Thankfully, the cmirror kernel module prints the date that it was built. If the module being loaded is the same as the latest package installed (from comment #11: '* Tue Dec 22 2009 Chris Feist <cfeist> - 0.1.22-3.el5'), then they should have the same build date. However, we see:

Feb 22 15:17:51 b99ar00e kernel: device-mapper: dm-log-clustered: (built Jul 27 2009 16:28:42) installed

Ironically, since the userspace package and the kernel package are not in step, it actually /forces/ this bug to appear in the old, unfixed kernel module!

I think the argument can be made that when the userspace log daemon detects the version mismatch, it should complain as it does now, but _not_ send back a response to the kernel saying it will not accept its message. Userspace should happily play along with the older module, but inform the user that they had better upgrade their kernel module.

Thoughts?
clogd: Fix rhbz#544253: lvcreate mirrored volume panics all cluster ...

Two bug fixes for 5.5 were colliding to cause this issue. In one previous bug fix, the version number of the communication structure used between kernel and userspace was checked for a mismatch. If a mismatch was found, an error was sent to the kernel to abandon the creation of the mirror. The other bug fix (the original fix for 544253) addressed a list member that wasn't being removed from the list before the memory was freed on an error condition.

If the user had booted an older kernel (one that didn't contain the original fix for 544253) /and/ had the updated userspace package that would reject communication if there was a version mismatch, the panic that 544253 was meant to fix would actually be forced - every time!

This update causes the version checking to complain about a mismatch but tolerate it. This way, we don't panic the machine if a user is forced to use an older kernel without the original fix for 544253.
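The behavioural change described above — warn on a protocol version mismatch instead of rejecting the request — can be sketched as follows. This is an illustration, not the actual clogd source; the constant, function name, and message wording are assumptions.

```c
#include <stdio.h>

/* Hypothetical userspace protocol version for illustration only. */
#define CLOG_VERSION 2u

/* Before the fix: a mismatch caused the daemon to return an error,
 * which on kernels lacking the list-unlink fix drove the broken
 * error path and panicked the node.  After the fix: log a warning
 * and process the request anyway. */
static int check_version(unsigned kernel_version)
{
    if (kernel_version != CLOG_VERSION)
        fprintf(stderr,
                "Kernel/Userspace version mismatch! "
                "(kernel=%u, userspace=%u) -- continuing anyway; "
                "please upgrade the cmirror kernel module\n",
                kernel_version, CLOG_VERSION);
    return 0;   /* tolerate: never reject a request on mismatch alone */
}
```

The key design point is that the mismatch is demoted from a hard error to a diagnostic, so a newer userspace keeps interoperating with an older kernel module while still telling the admin to upgrade.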
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0309.html
Just a late note that this fix was tested as part of the 5.5 cmirror/lvm2-cluster regression testing.