Description of problem:
I didn't see a core file for this; I thought there was one from earlier testing that day. I'll attempt to reproduce this and grab one.

Jun 18 15:43:30 grant-02 qarshd[30501]: Running cmdline: pvs --noheadings -o pv_used -O pvseg_size
Jun 18 15:43:30 grant-02 kernel: clvmd[18383]: segfault at 0000000000000010 rip 0000000000421bf1 rsp 00000000425b2af0 error 4
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30489 duration=1(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30492 duration=1(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30498 duration=0(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30501 duration=0(sec)
Jun 18 15:43:30 grant-02 xinetd[9565]: EXIT: qarsh status=0 pid=30495 duration=1(sec)
Jun 18 15:44:31 grant-02 xinetd[9565]: START: qarsh pid=30517 from=10.15.80.47
Jun 18 15:44:31 grant-02 qarshd[30517]: Talking to peer 10.15.80.47:51737
Jun 18 15:44:31 grant-02 qarshd[30517]: Running cmdline: pvs --noheadings -o pv_name -O pv_fmt

Version-Release number of selected component (if applicable):
2.6.18-150.el5
lvm2-2.02.46-8.el5            BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
device-mapper-1.02.32-1.el5   BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5          BUILT: Tue May  5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009
Corey,

Can we infer that this is not reproducible? ;) If it is not <easily> reproducible, then it may be best to defer it to 5.5, considering where we are in the 5.4 schedule.

Tom
I haven't reproduced this lately. Let's move it off to 5.5
Just reproduced it today. Still no core was created, even though I had everything set up to collect core files.

Jul 31 16:14:46 grant-01 qarshd[25333]: Running cmdline: pvs --noheadings -o pv_tags -O pv_name
Jul 31 16:14:46 grant-01 kernel: clvmd[8947]: segfault at 0000000000000010 rip 0000000000421bf1 rsp 0000000041655af0 error 4

2.6.18-159.el5
lvm2-2.02.46-8.el5            BUILT: Thu Jun 18 08:06:12 CDT 2009
lvm2-cluster-2.02.46-8.el5    BUILT: Thu Jun 18 08:05:27 CDT 2009
device-mapper-1.02.32-1.el5   BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.39-2.el5          BUILT: Mon Jul 27 15:39:05 CDT 2009
kmod-cmirror-0.1.21-14.el5    BUILT: Thu May 21 08:28:17 CDT 2009
Finally reproduced this and got a core to attach. Here's the 'clvmd -d' output on the node (grant-03) that crashed (some lines are interleaved between threads):

CLVMD[4296a940]: Aug  4 19:21:20
CLVMD[2483c820]: Aug  4 19:21:20 read on PIPE 10: 4 bytes: status: 0
LVM thread waiting for work
CLVMD[2483c820]: Aug  4 19:21:20 background routine status was 0, sock_client=0x114f6090
CLVMD[4296a940]: Aug  4 19:21:20 process_work_item: local
CLVMD[4296a940]: Aug  4 19:21:20
CLVMD[2483c820]: Aug  4 19:21:20 Send local reply
process_local_command: LOCK_VG (0x33) msg=0x112cba70, msglen =33, client=0x114424f0
CLVMD[4296a940]: Aug  4 19:21:20 Dropping metadata for VG VolGroup00
CLVMD[4296a940]: Aug  4 19:21:20 Reply from node grant-03: 0 bytes
CLVMD[4296a940]: Aug  4 19:21:20 Got 1 replies, expecting: 1
CLVMD[4296a940]: Aug  4 19:21:20 process_work_item: local
CLVMD[4296a940]: Aug  4 19:21:20 process_local_command: LOCK_VG (0x33) msg=0x112cceb0, msglen =30, client=0x11441fd0
CLVMD[4516e940]: Aug  4 19:21:20 Got post command condition...
CLVMD[4296a940]: Aug  4 19:21:20 Refreshing context
CLVMD[2483c820]: Aug  4 19:21:20 read on PIPE 24: 4 bytes: status: 0
CLVMD[2483c820]: Aug  4 19:21:20 background routine status was 0, sock_client=0x114424f0
CLVMD[2483c820]: Aug  4 19:21:20 Send local reply
CLVMD[4516e940]: Aug  4 19:21:20 Waiting for next pre command
CLVMD[2483c820]: Aug  4 19:21:20 Read on local socket 9, len = 0
CLVMD[2483c820]: Aug  4 19:21:20 EOF on local socket: inprogress=0
CLVMD[2483c820]: Aug  4 19:21:20 Waiting for child thread
CLVMD[4476d940]: Aug  4 19:21:20 Got pre command condition...
CLVMD[4476d940]: Aug  4 19:21:20 Subthread finished
CLVMD[2483c820]: Aug  4 19:21:20 Joined child thread
CLVMD[2483c820]: Aug  4 19:21:20 ret == 0, errno = 9. removing client
CLVMD[2483c820]: Aug  4 19:21:20 add_to_lvmqueue: cmd=0x112ce6e0.
client=0x114f6090, msg=(nil), len=0, csid=(nil), xid=10574
/etc/lvm/lvm.conf: mmap failed: Bad file descriptor
/etc/lvm/lvm.conf: munmap failed: Invalid argument
/etc/lvm/lvm.conf: close failed: Bad file descriptor
Failed to load config file /etc/lvm/lvm.conf
CLVMD[2483c820]: Aug  4 19:21:20 Read on local socket 23, len = 30
CLVMD[2483c820]: Aug  4 19:21:20 check_all_clvmds_running
CLVMD[4516e940]: Aug  4 19:21:20 Got pre command condition...
CLVMD[4516e940]: Aug  4 19:21:20 doing PRE command LOCK_VG 'P_#global' at 6 (client=0x114424f0)
CLVMD[4516e940]: Aug  4 19:21:20 sync_unlock: 'P_#global' lkid:1040029
Segmentation fault (core dumped)
I think this needs the attention of the lvm-team ...

(gdb) bt
#0  0x0000000000421bf1 in persistent_filter_wipe ()
#1  0x000000000041d83a in dev_iter_create ()
#2  0x0000000000416b85 in lvmcache_label_scan ()
#3  0x0000000000410ad0 in do_refresh_cache ()
#4  0x000000000040d3ea in do_command ()
#5  0x000000000040e973 in ?? ()
#6  0x0000003992a064a7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039922d3c2d in clone () from /lib64/libc.so.6
Quote from Jon Brassow on lvm-devel:

clvmd: Fix segfault.

I don't know /why/ it is failing... and I don't even know exactly where the problem gets started, but here is what I've found so far...

From the core dump:

Program terminated with signal 11, Segmentation fault.
#0  persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54
54              struct pfilter *pf = (struct pfilter *) f->private;
(gdb) bt
#0  persistent_filter_wipe (f=0x0) at filters/filter-persistent.c:54
#1  0x000000000041d83a in dev_iter_create (f=0x0, dev_scan=1) at device/dev-cache.c:740
#2  0x0000000000416b85 in lvmcache_label_scan (cmd=0x112aacd0, full_scan=2) at cache/lvmcache.c:463
#3  0x0000000000410ad0 in do_refresh_cache () at lvm-functions.c:606
#4  0x000000000040d3ea in do_command (client=0x11441fd0, msg=<value optimized out>, msglen=30, buf=0x4296a0c0, buflen=1481, retlen=0x4296a0cc) at clvmd-command.c:127
#5  0x000000000040e973 in lvm_thread_fn (arg=<value optimized out>) at clvmd.c:1554
#6  0x0000003992a064a7 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039922d3c2d in clone () from /lib64/libc.so.6

'f' is being dereferenced in persistent_filter_wipe - initially provided by lvmcache_label_scan, here:

463         if (!(iter = dev_iter_create(cmd->filter, (full_scan == 2) ? 1 : 0))) {
464                 log_error("dev_iter creation failed");
465                 goto out;
466         }

'cmd->filter' isn't set because 'refresh_toolcontext' failed in do_refresh_cache:

604         ret = refresh_toolcontext(cmd);
605         init_full_scan_done(0);
606         lvmcache_label_scan(cmd, 2);

(gdb) p ret
$2 = 0

'refresh_toolcontext' failed at _init_lvm_conf->_load_config_file->read_config_file->read_config_fd:

client=0x114f6090, msg=(nil), len=0, csid=(nil), xid=10574
/etc/lvm/lvm.conf: mmap failed: Bad file descriptor
/etc/lvm/lvm.conf: munmap failed: Invalid argument
/etc/lvm/lvm.conf: close failed: Bad file descriptor
Failed to load config file /etc/lvm/lvm.conf

Is the file descriptor being lost, corrupted, etc? This patch is a bandaid for the problem.
Index: LVM2/daemons/clvmd/lvm-functions.c
===================================================================
--- LVM2.orig/daemons/clvmd/lvm-functions.c
+++ LVM2/daemons/clvmd/lvm-functions.c
@@ -593,20 +593,23 @@ int do_check_lvm1(const char *vgname)
 
 int do_refresh_cache()
 {
-	int ret;
 	DEBUGLOG("Refreshing context\n");
 	log_notice("Refreshing context");
 
 	pthread_mutex_lock(&lvm_lock);
 
-	ret = refresh_toolcontext(cmd);
+	if (!refresh_toolcontext(cmd)) {
+		pthread_mutex_unlock(&lvm_lock);
+		return -1;
+	}
+
 	init_full_scan_done(0);
 	lvmcache_label_scan(cmd, 2);
 	dm_pool_empty(cmd->mem);
 
 	pthread_mutex_unlock(&lvm_lock);
 
-	return ret==1?0:-1;
+	return 0;
 }
Is the fix in comment #8 possible in the 5.5 time frame? Hit this again today.
Grabbed another core; it's located here:

/home/msp/cmarthal/core.7046

The running versions of lvm2/lvm2-cluster were:
lvm2-2.02.46-10.el5           BUILT: Fri Sep 18 09:38:06 CDT 2009
lvm2-cluster-2.02.46-10.el5   BUILT: Fri Sep 18 09:39:48 CDT 2009
The patch from comment #8 should be in the 2.02.54 build, but I'm not sure if it is the final fix. Is it reproducible with recent code?
This should be included in the recent 5.5 build; marking MODIFIED.
Fix verified in lvm2-2.02.56-6.el5/lvm2-cluster-2.02.56-6.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0299.html