Bug 506986 - clvmd segfaults during pvs option checks
Summary: clvmd segfaults during pvs option checks
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: lvm2-cluster
Version: 5.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: LVM and device-mapper development team
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-06-19 16:48 UTC by Corey Marthaler
Modified: 2018-11-29 19:40 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 09:02:14 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2010:0299 0 normal SHIPPED_LIVE lvm2-cluster bug fix and enhancement update 2010-03-29 14:26:30 UTC

Description Corey Marthaler 2009-06-19 16:48:54 UTC
Description of problem:

I didn't see a core file for this, though I thought there was one from earlier testing that day. I'll attempt to reproduce this and grab one.

Jun 18 15:43:30 grant-02 qarshd[30501]: Running cmdline: pvs --noheadings -o pv_used -O pvseg_size
Jun 18 15:43:30 grant-02 kernel: clvmd[18383]: segfault at 0000000000000010 rip 0000000000421bf1 rsp 00000000425b2af0 error 4
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30489 duration=1(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30492 duration=1(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30498 duration=0(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30501 duration=0(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30495 duration=1(sec)
Jun 18 15:44:31 grant-02 xinetd[9565]: START: qarsh pid=30517 from=10.15.80.47
Jun 18 15:44:31 grant-02 qarshd[30517]: Talking to peer 10.15.80.47:51737
Jun 18 15:44:31 grant-02 qarshd[30517]: Running cmdline: pvs --noheadings -o pv_name -O pv_fmt

Version-Release number of selected component (if applicable):
2.6.18-150.el5

lvm2-2.02.46-8.el5    BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5    BUILT: Tue May  5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009

Comment 1 Tom Coughlan 2009-07-21 17:44:21 UTC
Corey,

Can we infer that this is not reproducible? ;)

If it is not <easily> reproducible then it may be best to defer it to 5.5, considering where we are in the 5.4 schedule. 

Tom

Comment 2 Corey Marthaler 2009-07-28 16:04:27 UTC
I haven't reproduced this lately. Let's move it off to 5.5.

Comment 3 Corey Marthaler 2009-07-31 21:37:19 UTC
Just reproduced it today. Still no core was created even though I had everything set up to collect core files.

Jul 31 16:14:46 grant-01 qarshd[25333]: Running cmdline: pvs --noheadings -o pv_tags -O pv_name
Jul 31 16:14:46 grant-01 kernel: clvmd[8947]: segfault at 0000000000000010 rip 0000000000421bf1 rsp 0000000041655af0 error 4


2.6.18-159.el5

lvm2-2.02.46-8.el5    BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
device-mapper-1.02.32-1.el5    BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.39-2.el5    BUILT: Mon Jul 27 15:39:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009

Comment 4 Corey Marthaler 2009-08-05 14:29:16 UTC
Finally reproduced this and got a core to attach.

Here's the 'clvmd -d' output on the node (grant-03) that crashed.

CLVMD[4296a940]: Aug  4 19:21:20 CLVMD[2483c820]: Aug  4 19:21:20 read on PIPE 10: 4 bytes: status: 0
LVM thread waiting for work
CLVMD[2483c820]: Aug  4 19:21:20 background routine status was 0, sock_client=0x114f6090
CLVMD[4296a940]: Aug  4 19:21:20 process_work_item: local
CLVMD[4296a940]: Aug  4 19:21:20 CLVMD[2483c820]: Aug  4 19:21:20 Send local reply
process_local_command: LOCK_VG (0x33) msg=0x112cba70, msglen =33, client=0x114424f0
CLVMD[4296a940]: Aug  4 19:21:20 Dropping metadata for VG VolGroup00
CLVMD[4296a940]: Aug  4 19:21:20 Reply from node grant-03: 0 bytes
CLVMD[4296a940]: Aug  4 19:21:20 Got 1 replies, expecting: 1
CLVMD[4296a940]: Aug  4 19:21:20 process_work_item: local
CLVMD[4296a940]: Aug  4 19:21:20 process_local_command: LOCK_VG (0x33) msg=0x112cceb0, msglen =30, client=0x11441fd0
CLVMD[4516e940]: Aug  4 19:21:20 Got post command condition...
CLVMD[4296a940]: Aug  4 19:21:20 Refreshing context
CLVMD[2483c820]: Aug  4 19:21:20 read on PIPE 24: 4 bytes: status: 0
CLVMD[2483c820]: Aug  4 19:21:20 background routine status was 0, sock_client=0x114424f0
CLVMD[2483c820]: Aug  4 19:21:20 Send local reply
CLVMD[4516e940]: Aug  4 19:21:20 Waiting for next pre command
CLVMD[2483c820]: Aug  4 19:21:20 Read on local socket 9, len = 0
CLVMD[2483c820]: Aug  4 19:21:20 EOF on local socket: inprogress=0
CLVMD[2483c820]: Aug  4 19:21:20 Waiting for child thread
CLVMD[4476d940]: Aug  4 19:21:20 Got pre command condition...
CLVMD[4476d940]: Aug  4 19:21:20 Subthread finished
CLVMD[2483c820]: Aug  4 19:21:20 Joined child thread
CLVMD[2483c820]: Aug  4 19:21:20 ret == 0, errno = 9. removing client
CLVMD[2483c820]: Aug  4 19:21:20 add_to_lvmqueue: cmd=0x112ce6e0. client=0x114f6090, msg=(nil), len=0, csid=(nil), xid=10574
  /etc/lvm/lvm.conf: mmap failed: Bad file descriptor
  /etc/lvm/lvm.conf: munmap failed: Invalid argument
  /etc/lvm/lvm.conf: close failed: Bad file descriptor
  Failed to load config file /etc/lvm/lvm.conf
CLVMD[2483c820]: Aug  4 19:21:20 Read on local socket 23, len = 30
CLVMD[2483c820]: Aug  4 19:21:20 check_all_clvmds_running
CLVMD[4516e940]: Aug  4 19:21:20 Got pre command condition...
CLVMD[4516e940]: Aug  4 19:21:20 doing PRE command LOCK_VG 'P_#global' at 6 (client=0x114424f0)
CLVMD[4516e940]: Aug  4 19:21:20 sync_unlock: 'P_#global' lkid:1040029
Segmentation fault (core dumped)

Comment 7 Christine Caulfield 2009-10-21 13:57:26 UTC
I think this needs the attention of the lvm-team ...

(gdb) bt
#0  0x0000000000421bf1 in persistent_filter_wipe ()
#1  0x000000000041d83a in dev_iter_create ()
#2  0x0000000000416b85 in lvmcache_label_scan ()
#3  0x0000000000410ad0 in do_refresh_cache ()
#4  0x000000000040d3ea in do_command ()
#5  0x000000000040e973 in ?? ()
#6  0x0000003992a064a7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039922d3c2d in clone () from /lib64/libc.so.6

Comment 8 Alasdair Kergon 2009-10-22 17:41:58 UTC
Quote from Jon Brassow on lvm-devel:


clvmd: Fix segfault.

I don't know /why/ it is failing... and I don't even know
exactly where the problem gets started, but here is what
I've found so far...

From the core dump:
Program terminated with signal 11, Segmentation fault.
#0  persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54
54              struct pfilter *pf = (struct pfilter *) f->private;
(gdb) bt
#0  persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54
#1  0x000000000041d83a in dev_iter_create (f=0x0, dev_scan=1) at device/dev-cache.c:740
#2  0x0000000000416b85 in lvmcache_label_scan (cmd=0x112aacd0, full_scan=2) at cache/lvmcache.c:463
#3  0x0000000000410ad0 in do_refresh_cache () at lvm-functions.c:606
#4  0x000000000040d3ea in do_command (client=0x11441fd0, msg=<value optimized out>, msglen=30,
    buf=0x4296a0c0, buflen=1481, retlen=0x4296a0cc) at clvmd-command.c:127
#5  0x000000000040e973 in lvm_thread_fn (arg=<value optimized out>) at clvmd.c:1554
#6  0x0000003992a064a7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039922d3c2d in clone () from /lib64/libc.so.6

'f' is being dereferenced in persistent_filter_wipe - initially provided by
lvmcache_label_scan, here:

463             if (!(iter = dev_iter_create(cmd->filter, (full_scan == 2) ? 1 : 0))) {
464                     log_error("dev_iter creation failed");
465                     goto out;
466             }

'cmd->filter' isn't set because 'refresh_toolcontext' failed in
do_refresh_cache:

604             ret = refresh_toolcontext(cmd);
605             init_full_scan_done(0);
606             lvmcache_label_scan(cmd, 2);
(gdb) p ret
$2 = 0

'refresh_toolcontext' failed at
'_init_lvm_conf->_load_config_file->read_config_file->read_config_fd'

client=0x114f6090, msg=(nil), len=0, csid=(nil), xid=10574
  /etc/lvm/lvm.conf: mmap failed: Bad file descriptor
  /etc/lvm/lvm.conf: munmap failed: Invalid argument
  /etc/lvm/lvm.conf: close failed: Bad file descriptor
  Failed to load config file /etc/lvm/lvm.conf

Is the file descriptor being lost, corrupted, etc?

This patch is a bandaid for the problem.

Index: LVM2/daemons/clvmd/lvm-functions.c
===================================================================
--- LVM2.orig/daemons/clvmd/lvm-functions.c
+++ LVM2/daemons/clvmd/lvm-functions.c
@@ -593,20 +593,23 @@ int do_check_lvm1(const char *vgname)
 
 int do_refresh_cache()
 {
-	int ret;
 	DEBUGLOG("Refreshing context\n");
 	log_notice("Refreshing context");
 
 	pthread_mutex_lock(&lvm_lock);
 
-	ret = refresh_toolcontext(cmd);
+	if (!refresh_toolcontext(cmd)) {
+		pthread_mutex_unlock(&lvm_lock);
+		return -1;
+	}
+
 	init_full_scan_done(0);
 	lvmcache_label_scan(cmd, 2);
 	dm_pool_empty(cmd->mem);
 
 	pthread_mutex_unlock(&lvm_lock);
 
-	return ret==1?0:-1;
+	return 0;
 }

Comment 9 Corey Marthaler 2009-11-03 22:02:23 UTC
Is the fix in comment #8 possible in the 5.5 time frame? Hit this again today.

Comment 10 Corey Marthaler 2009-11-04 17:31:46 UTC
Grabbed another core, it's located here:
/home/msp/cmarthal/core.7046

The running versions of lvm2/lvm2-cluster were:
lvm2-2.02.46-10.el5    BUILT: Fri Sep 18 09:38:06 CDT 2009
lvm2-cluster-2.02.46-10.el5    BUILT: Fri Sep 18 09:39:48 CDT 2009

Comment 11 Milan Broz 2009-11-24 17:44:52 UTC
The patch from comment #8 should be in the 2.02.54 build, but I'm not sure it is the final fix. Is it reproducible with recent code?

Comment 14 Milan Broz 2010-01-14 11:32:34 UTC
This should be included in the recent 5.5 build; marking MODIFIED.

Comment 16 Corey Marthaler 2010-02-01 23:35:41 UTC
Fix verified in lvm2-2.02.56-6.el5/lvm2-cluster-2.02.56-6.el5.

Comment 18 errata-xmlrpc 2010-03-30 09:02:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0299.html

