Bug 113702

Summary: LTC5820-Segmentation fault while lvremove --autobackup y /dev/test/snap001
Product: Red Hat Enterprise Linux 3 Reporter: IBM Bug Proxy <bugproxy>
Component: lvm Assignee: Heinz Mauelshagen <heinzm>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0 CC: sct
Target Milestone: ---   
Target Release: ---   
Hardware: powerpc   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-07-02 19:51:54 UTC Type: ---

Description IBM Bug Proxy 2004-01-16 17:46:42 UTC
The following has been reported by IBM LTC:
Segmentation fault while running lvremove --autobackup y /dev/test/snap001
Hardware Environment:
   p630 

Software Environment:
   OS: RHEL3 Update1 
   Kernel : 2.4.21-6.EL

Steps to Reproduce:
1. create a volume group named test
2. create a logical volume lv001, and make a snapshot of lv001, named
snap001
3. after backup, remove this lv adding an "autobackup" option

[root@plinuxt12 root]# lvremove --autobackup y /dev/test/snap001
Segmentation fault

Actual Results:
   segmentation fault error

Expected Results:
    no segmentation fault, snap001 could be removed

Additional Information:
    without "autobackup" option, snap001 could be removed
[root@plinuxt12 root]# lvremove /dev/test/snap001

Reassigning to myself to try to recreate it.

After creating the snapshot, vgdisplay -v shows an error:

--- Logical volume ---
LV Name                /dev/test/lv001
VG Name                test
LV Write Access        read/write
LV snapshot status     source of
                       /dev/test/snap001 [active]
free(): invalid pointer 0x10027288!
LV Status              available
LV #                   3
# open                 1
LV Size                400 MB
Current LE             100
Allocated LE           100
Allocation             next free
Read ahead sectors     1024
Block device           58:2

/var/log/messages shows this error:
Jan 14 08:17:24 thinh kernel: Trying to vfree() nonexistent vm area
(d000000000360000)
 

not much from the coredump:
[root@thinh root]# gdb /sbin/lvremove core.1897
GNU gdb Red Hat Linux (5.3.90-0.20030710.40rh)
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "ppc-redhat-linux-gnu"...(no debugging symbols
found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".

Core was generated by `lvremove --autobackup y /dev/test/snap001'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/liblvm-10.so.1...done.
Loaded symbols for /lib/liblvm-10.so.1
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
#0  0x10002044 in ?? ()
(gdb) where
#0  0x10002044 in ?? ()
(gdb)

When lvremove is run without the --autobackup option, I also notice that an
error occurs
(the same as the vgdisplay error?):
Jan 14 08:07:24 thinh kernel: Trying to vfree() nonexistent vm area
(d000000000360000)
Jan 14 08:08:11 thinh last message repeated 5 times

Comment 1 IBM Bug Proxy 2004-01-17 03:28:20 UTC
----- Additional Comments From zhouwu.com  2004-01-16 22:16 -------
Hi, Thinh/Glen

    I am doing some debugging on this defect. I found that after the call
"while ( ( c = getopt_long ( argc, argv, options, long_options, NULL)) != EOF)"
at line 84 of lvremove.c, the value of optarg is 0 (NULL), which causes the
segmentation fault at line 93:
    if ( strcmp ( optarg, "y") == 0);

    Following is the debugging process:

[root@plinuxt9 tools]# gdb -q ./lvremove
Using host libthread_db library "/lib/tls/libthread_db.so.1".
(gdb) set args --autobackup y /dev/test/snap001
(gdb) b main
Breakpoint 1 at 0x10001da0: file lvremove.c, line 57.
(gdb) r
Starting program: /usr/src/redhat/BUILD/LVM/1.0.3/tools/lvremove --autobackup 
y /dev/test/snap001

Breakpoint 1, main (argc=4, argv=0xffffe9b4) at lvremove.c:57
57         int c = 0;
(gdb) l
52      int opt_d = 0;
53      #endif
54
55      int main ( int argc, char **argv)
56      {
57         int c = 0;
58         int c1 = 0;
59         int opt_A = 1;
60         int opt_A_set = 0;
61         int opt_f = 0;
(gdb) ......
......
(gdb) 
82         LVMTAB_CHECK;
83
84         while ( ( c = getopt_long ( argc, argv, options,
85                                     long_options, NULL)) != EOF) {
86            switch ( c) {
87               case 'A':
88                  opt_A_set++;
89                  if ( opt_A > 1) {
90                     fprintf ( stderr, "%s -- A option already given\n\n", cmd);
91                     return LVM_EINVALID_CMD_LINE;
(gdb) b 84
Breakpoint 2 at 0x10001f60: file lvremove.c, line 84.
(gdb) c
Continuing.

Breakpoint 2, main (argc=4, argv=0xffffe9b4) at lvremove.c:84
84         while ( ( c = getopt_long ( argc, argv, options,
(gdb) n
86            switch ( c) {
(gdb) n
88                  opt_A_set++;
(gdb) n
89                  if ( opt_A > 1) {
(gdb) n
93                  if ( strcmp ( optarg, "y") == 0);
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x0fec6a60 in strcmp () from /lib/tls/libc.so.6
(gdb) p optarg
$1 = 0x0
(gdb) b 93
Breakpoint 3 at 0x10002020: file lvremove.c, line 93.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /usr/src/redhat/BUILD/LVM/1.0.3/tools/lvremove --autobackup 
y /dev/test/snap001

Breakpoint 1, main (argc=4, argv=0xffffe9b4) at lvremove.c:57
57         int c = 0;
(gdb) c
Continuing.

Breakpoint 2, main (argc=4, argv=0xffffe9b4) at lvremove.c:84
84         while ( ( c = getopt_long ( argc, argv, options,
(gdb) c
Continuing.

Breakpoint 3, main (argc=4, argv=0xffffe9b4) at lvremove.c:93
93                  if ( strcmp ( optarg, "y") == 0);
(gdb) p optarg
$2 = 0x0
(gdb) 
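
As a side note on the crash at line 93: independent of how the option table is declared, the comparison could be guarded so that a NULL optarg is reported instead of being passed to strcmp(). A minimal sketch, assuming a hypothetical helper named parse_yes_no (it does not exist in the LVM sources and is only meant to illustrate the check):

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the LVM sources: interpret the argument
 * of a y/n option. Returns 1 for "y", 0 for "n", and -1 when the argument
 * is missing (NULL) or unrecognized, instead of passing NULL to strcmp().
 */
int parse_yes_no(const char *arg)
{
    if (arg == NULL)
        return -1;      /* this is the case hit in the gdb session above */
    if (strcmp(arg, "y") == 0)
        return 1;
    if (strcmp(arg, "n") == 0)
        return 0;
    return -1;
}
```

In the option loop, a negative result could presumably be routed to the existing LVM_EINVALID_CMD_LINE return instead of crashing; that wiring is an assumption and is not shown here.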

Comment 2 IBM Bug Proxy 2004-01-17 05:13:23 UTC
----- Additional Comments From zhouwu.com  2004-01-17 00:03 -------
Hi, Thinh/Glen

    Investigating the source of lvremove.c, I found a mismatch between the
source and "man lvremove": lvremove.c defines the "autobackup" option as
taking no argument (no_argument), but "man lvremove" shows that it needs an
argument.

    The following is the definition of long_options in lvremove.c (starting
from line 66):
   struct option long_options[] = {
      { "autobackup", no_argument, NULL, 'A'},
      DEBUG_LONG_OPTION
      { "force",      no_argument, NULL, 'f'},
      { "help",       no_argument, NULL, 'h'},
      { "verbose",    no_argument, NULL, 'v'},
      { NULL,         0,           NULL, 0}
   };

   But in lvrename.c, which also has an "autobackup" option, the option is
defined as required_argument:
   struct option long_options[] = {
      { "autobackup", required_argument, NULL, 'A'},
      DEBUG_LONG_OPTION
      { "help",       no_argument,       NULL, 'h'},
      { "verbose",    no_argument,       NULL, 'v'},
      { "version",    no_argument,       NULL, 22},
      { NULL, 0, NULL, 0}
   };

   After changing "no_argument" to "required_argument" and recompiling
lvremove, snap001 can now be removed:

[root@plinuxt9 tools]# ./lvremove --autobackup y /dev/test/snap001 
lvremove -- do you really want to remove "/dev/test/snap001"? [y/n]: y
lvremove -- doing automatic backup of volume group "test"
lvremove -- logical volume "/dev/test/snap001" successfully removed
[root@plinuxt9 tools]# 
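
The difference between the two declarations can be reproduced outside LVM. The sketch below is illustrative only: parse_autobackup and struct parse_result are made-up names, the short option string "A:" is an assumption (the real option string of lvremove.c is not quoted in this report), and the NULL optarg in the no_argument case is the glibc behavior seen in the gdb session of the previous comment.

```c
#include <getopt.h>
#include <stddef.h>

/* Result of the first getopt_long() call on a command line. */
struct parse_result {
    int opt;            /* value returned by getopt_long()            */
    const char *arg;    /* optarg right after the call (may be NULL)  */
    int next_index;     /* optind right after the call                */
};

/*
 * Parse argv once with "autobackup" declared as has_arg, which should be
 * no_argument or required_argument. The option table and the short option
 * string "A:" are assumptions made for this sketch.
 */
struct parse_result parse_autobackup(int has_arg, int argc, char **argv)
{
    struct option long_options[] = {
        { "autobackup", has_arg, NULL, 'A' },
        { NULL,         0,       NULL, 0   }
    };
    struct parse_result r;

    optind = 0;         /* on glibc, 0 forces the scanner to reinitialize */
    r.opt = getopt_long(argc, argv, "A:", long_options, NULL);
    r.arg = optarg;
    r.next_index = optind;
    return r;
}
```

With the command line from this report, the no_argument table returns 'A' with a NULL argument and leaves "y" unconsumed, while the required_argument table returns 'A' with the argument "y".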

Comment 3 IBM Bug Proxy 2004-01-17 05:38:00 UTC
----- Additional Comments From zhouwu.com  2004-01-17 00:28 -------
  Oops, the "vfree() nonexistent vm area" message still appears with the above
patch applied. After successfully removing the snapshot LV, "dmesg" still shows
the following messages:

Trying to vfree() nonexistent vm area (d0000000002a9000)
Trying to vfree() nonexistent vm area (d0000000002a9000)
Trying to vfree() nonexistent vm area (d0000000002a9000) 

Comment 4 IBM Bug Proxy 2004-01-17 15:38:04 UTC
----- Additional Comments From gjlynx.com(prefers email via gjohnson.com)  2004-01-17 10:27 -------
Is this project/patch OSSC approved? 

Comment 5 IBM Bug Proxy 2004-01-19 13:47:09 UTC
----- Additional Comments From zhouwu.com  2004-01-18 21:05 -------
Hi, Glen

   This patch has not received OSSC approval yet. It is intended to resolve the
segmentation fault in lvremove.c. Could you please help me get OSSC approval,
or tell me how to get it? Please advise, thanks very much!

Wu 

Comment 6 IBM Bug Proxy 2004-01-22 02:41:50 UTC
----- Additional Comments From thinh.com  2004-01-21 16:25 -------
Wu, you are right. It does require an argument. Your patch is good.

The vfree() errors are something else that may come from one of the LVM
functions or macro being used, but it is a different issue than this bug. 

Comment 7 IBM Bug Proxy 2004-01-24 04:44:57 UTC
----- Additional Comments From khoa.com  2004-01-23 09:30 -------
I agree that the patch above does address the root cause of the segmentation
fault and that it is separate from the vfree() warning, so in this sense,
the patch is valid.  But I hope that we address the vfree() warning as well.
Wu/Thinh - are you going to look into this as well ? 

Comment 8 IBM Bug Proxy 2004-02-02 15:12:03 UTC
----- Additional Comments From zhouwu.com  2004-02-01 20:26 -------
Hi, Khoa/Thinh

   Thanks for reviewing this patch, and sorry for the late reply: I am just
back from the Chinese Spring Festival. If time allows, I am quite willing to
look into this defect. Should there be any new findings, I will add my
comments as soon as possible.

   BTW, I have one question on which I need your advice. Do I need to get the
OSSC approval Glen mentioned, and how do I get such an approval? Your advice
will be highly appreciated. Thanks again!

Comment 9 Stephen Tweedie 2004-02-02 20:46:09 UTC
Thanks for the effort on this.  The sooner you can get a patch cleared
and submitted to Red Hat bugzilla, the better --- for now, we have no
access to the patch you refer to, so the conversation that we're
viewing in our bugzilla lacks a bit of context!

Comment 10 IBM Bug Proxy 2004-02-06 14:55:17 UTC
----- Additional Comments From khoa.com  2004-02-06 09:54 -------
I've put this on RHEL3 QU3 list as a Sev 2. 

Comment 11 IBM Bug Proxy 2004-02-07 01:24:03 UTC
----- Additional Comments From zhouwu.com  2004-02-06 20:24 -------
The vfree() nonexistent message is printed at line 268 of mm/vmalloc.c.
To determine the call trace, I added a "BUG();" after that line. After
recompiling the kernel, enabling xmon and rerunning the reproduction steps
above, an additional "lvdisplay -v /dev/test/lv001" panics the kernel, and I
get the following backtrace in /var/log/messages:
===================== Error message start here ===========================
Feb  5 05:57:02 plinuxt9 kernel: Trying to vfree() nonexistent vm area 
(d0000000002bd000)
Feb  5 05:57:02 plinuxt9 kernel: kernel BUG at vmalloc.c:269!
Feb  5 05:57:02 plinuxt9 kernel: loop lvm-mod autofs e100 ipt_state 
iptable_filter iptable_nat ip_conntrack ip_tables sg ext3 jbd sym53c8xx sd_mod 
scsi_mod  
Feb  5 05:57:02 plinuxt9 kernel: NIP: c000000000092974 XER: 0000000000000000 
LR: c000000000092960 REGS: c0000001ef77f4c0 TRAP: 0700    Not tainted
Feb  5 05:57:02 plinuxt9 kernel: NIP is at .vfree [kernel] 0x104 (2.4.21-
9.ELlvm)
Feb  5 05:57:02 plinuxt9 kernel: MSR: 9000000000089032 EE: 1 PR: 0 FP: 0 ME: 1 
IR/DR: 11
Feb  5 05:57:03 plinuxt9 kernel: TASK = c0000001ef77c000[2130] 'lvdisplay' Last 
syscall: 54
Feb  5 05:57:03 plinuxt9 kernel: last math 0000000000000000 CPU: 1
Feb  5 05:57:03 plinuxt9 kernel: GPR00: 0000000000000000 c0000001ef77f740 
c0000000005f2000 000000000000001d 
Feb  5 05:57:03 plinuxt9 kernel: GPR04: 0000000000000003 c0000001fa768000 
0000000000000000 c0000000005853b0 
Feb  5 05:57:03 plinuxt9 kernel: GPR08: c0000000005853a8 c0000000003b0a20 
c000000000582410 c000000000582410 
Feb  5 05:57:03 plinuxt9 kernel: GPR12: c00000000068feb3 c0000001ef77c000 
00000000ffffe7f8 0000000000000000 
Feb  5 05:57:03 plinuxt9 kernel: GPR16: 000000000ff97fb8 0000000000000000 
0000000000000000 0000000000000000 
Feb  5 05:57:03 plinuxt9 kernel: GPR20: 0000000000000001 c00000000050eec0 
b000000000009032 0000000000000003 
Feb  5 05:57:03 plinuxt9 kernel: GPR24: c0000001ef77fcd0 0000000000000000 
000000000ffec9a4 c0000001ef77f8c0 
Feb  5 05:57:03 plinuxt9 kernel: GPR28: 00000000ffffe5e8 d0000000002bd000 
c0000000003e4b58 0000000000000000 
Feb  5 05:57:03 plinuxt9 kernel: Call Trace: 
Feb  5 05:57:03 plinuxt9 kernel: [<c000000000092960>] .vfree [kernel] 0xf0
Feb  5 05:57:03 plinuxt9 kernel: [<c000000000032044>] .put_lv_t [kernel] 0x50
Feb  5 05:57:04 plinuxt9 kernel: [<c0000000000328fc>] .do_lvm_ioctl [kernel] 
0x20c
Feb  5 05:57:04 plinuxt9 kernel: [<c000000000034a44>] .sys32_ioctl [kernel] 
0x128
=============================================================================

So the error is in function put_lv_t of arch/ppc64/kernel/ioctl32.c:

static void put_lv_t(lv_t *l)
{
	if (l->lv_current_pe) vfree(l->lv_current_pe);
	if (l->lv_block_exception) vfree(l->lv_block_exception);
	kfree(l);
} 

Comment 12 IBM Bug Proxy 2004-02-07 01:38:33 UTC
----- Additional Comments From zhouwu.com  2004-02-06 20:40 -------
So I continued to investigate where l->lv_current_pe and
l->lv_block_exception are vmalloc'ed beforehand. That happens in function
get_lv_t of the same ioctl32.c, which is meant to do the conversion between
32-bit and 64-bit native ioctls.

In this file, function do_lvm_ioctl handles the ioctl conversion related to
LVM. It first gets arg from 32-bit user space, converts it to a 64-bit
structure, and then calls sys_ioctl to handle the ioctl command:
============  start from line 2600 ======================================
        old_fs = get_fs(); set_fs (KERNEL_DS);
        err = sys_ioctl (fd, cmd, (unsigned long)karg);
        set_fs (old_fs);

It then converts the resulting "karg" back to the 32-bit structure.

The vmalloc is done in function get_lv_t before this sys_ioctl, and the vfree
is done after the sys_ioctl. After adding some printk calls, I found that
lv_block_exception was not vmalloc'ed, but after the sys_ioctl it holds a
non-zero value, which leads to the "vfree() nonexistent" error.
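
To make the sequence described above easier to follow, here is a userspace model of it. It is only a sketch under stated assumptions: toy_vmalloc, toy_vfree, struct toy_lv, the two handlers and convert_and_call are all invented for illustration, the registry merely imitates the kernel's bookkeeping of vmalloc areas, and the real get_lv_t/put_lv_t do considerably more.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_AREAS 8

/* Toy registry standing in for the kernel's record of vmalloc areas. */
static void *areas[MAX_AREAS];

/* Allocate a buffer and record it, loosely imitating vmalloc(). */
void *toy_vmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL)
        return NULL;
    for (int i = 0; i < MAX_AREAS; i++) {
        if (areas[i] == NULL) {
            areas[i] = p;
            return p;
        }
    }
    free(p);            /* registry full: give up */
    return NULL;
}

/* Free a recorded buffer; return -1 for a pointer never handed out here. */
int toy_vfree(void *p)
{
    if (p == NULL)
        return 0;
    for (int i = 0; i < MAX_AREAS; i++) {
        if (areas[i] == p) {
            areas[i] = NULL;
            free(p);
            return 0;
        }
    }
    fprintf(stderr, "Trying to vfree() nonexistent vm area (%p)\n", p);
    return -1;
}

/* Only the two pointer members that matter here; the real lv_t is larger. */
struct toy_lv {
    void *lv_current_pe;
    void *lv_block_exception;
};

static char driver_table[16];   /* memory owned by the "driver", not by us */

/* Handler that leaves the caller's pointers alone. */
void handler_clean(struct toy_lv *lv) { (void)lv; }

/* Handler that leaves a driver pointer behind in lv_block_exception. */
void handler_leaky(struct toy_lv *lv) { lv->lv_block_exception = driver_table; }

/*
 * Model of the get_lv_t / sys_ioctl / put_lv_t sequence. Only lv_current_pe
 * is allocated up front; the return value counts the frees that hit a
 * pointer this code never allocated.
 */
int convert_and_call(void (*handler)(struct toy_lv *))
{
    struct toy_lv lv = { toy_vmalloc(64), NULL };   /* "get_lv_t"  */
    int bad = 0;

    handler(&lv);                                   /* "sys_ioctl" */

    /* "put_lv_t": free whatever is non-NULL, as the kernel code does */
    if (lv.lv_current_pe && toy_vfree(lv.lv_current_pe) != 0)
        bad++;
    if (lv.lv_block_exception && toy_vfree(lv.lv_block_exception) != 0)
        bad++;
    return bad;
}
```

In this model convert_and_call(handler_clean) reports no bad frees, while convert_and_call(handler_leaky) reports one and prints a message of the same shape as the kernel warning.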

Comment 13 IBM Bug Proxy 2004-02-07 01:59:02 UTC
----- Additional Comments From zhouwu.com  2004-02-06 21:00 -------
Continuing the investigation into lvm_chr_ioctl in drivers/md/lvm.c, I found
that it calls lvm_do_lv_status_byindex(vg_ptr, arg) when the ioctl "cmd"
equals "LV_STATUS_BYINDEX".

The function lvm_do_lv_status_byindex looks like this:
====================== code of function lvm_do_lv_status_byindex ===========
/*
 * character device support function logical volume status by index
 */
static int lvm_do_lv_status_byindex(vg_t *vg_ptr,void *arg)
{
	lv_status_byindex_req_t lv_status_byindex_req;
	void *saved_ptr1;
	void *saved_ptr2;
	lv_t *lv_ptr;

	if (vg_ptr == NULL) return -ENXIO;
	if (copy_from_user(&lv_status_byindex_req, arg,
			   sizeof(lv_status_byindex_req)) != 0)
		return -EFAULT;

	if (lv_status_byindex_req.lv == NULL)
		return -EINVAL;
	if ( ( lv_ptr = vg_ptr->lv[lv_status_byindex_req.lv_index]) == NULL)
		return -ENXIO;

	/* Save usermode pointers */
	if (copy_from_user(&saved_ptr1, &lv_status_byindex_req.lv->lv_current_pe, sizeof(void*)) != 0)
	        return -EFAULT;
	if (copy_from_user(&saved_ptr2, &lv_status_byindex_req.lv->lv_block_exception, sizeof(void*)) != 0)
	        return -EFAULT;

	if (copy_to_user(lv_status_byindex_req.lv, lv_ptr, sizeof(lv_t)) != 0)
		return -EFAULT;
	if (saved_ptr1 != NULL) {
		if (copy_to_user(saved_ptr1,
				 lv_ptr->lv_current_pe,
				 lv_ptr->lv_allocated_le *
		       		 sizeof(pe_t)) != 0)
			return -EFAULT;
	}

	/* Restore usermode pointers */
	if (copy_to_user(&lv_status_byindex_req.lv->lv_current_pe, &saved_ptr1, sizeof(void *)) != 0)
	        return -EFAULT;

	return 0;
} /* lvm_do_lv_status_byindex() */
======================================================================

From the above, we can see that it first saves the lv_current_pe member of
lv_status_byindex_req.lv into saved_ptr1, and lv_block_exception into
saved_ptr2; then it copies the indexed lv_ptr member of vg_ptr into
lv_status_byindex_req.lv; after that it copies saved_ptr1 back into
lv_status_byindex_req.lv->lv_current_pe.

Judging from the structure of this function, it should also copy saved_ptr2
back into lv_status_byindex_req.lv->lv_block_exception, but there is no code
to handle this. Because I don't know exactly what "lv_current_pe" and
"lv_block_exception" are used for, I can only try adding similar copy-back
code for lv_block_exception.

After adding the above copy-back code, recompiling the kernel and rerunning
the reproduction steps, there is no "vfree() nonexistent" error this time.
But I don't know why. Could anyone help with this?
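
One possible reading of why the copy-back matters, as a userspace model rather than kernel code: struct lv_model and copy_lv_status are invented names, the structure is cut down to three members, and the copying of the PE table contents into saved_ptr1 is left out. The sketch only shows what the caller's structure ends up holding with and without restoring the second pointer.

```c
#include <stddef.h>

/* Cut-down stand-in for lv_t: two pointer members plus one ordinary field. */
struct lv_model {
    void *lv_current_pe;
    void *lv_block_exception;
    unsigned int lv_size;
};

/*
 * Copy the driver's view of an LV over the caller's structure while keeping
 * the caller's own buffer pointers. With restore_exception == 0 this mirrors
 * the function quoted above, which restores only lv_current_pe.
 */
void copy_lv_status(struct lv_model *user, const struct lv_model *kernel,
                    int restore_exception)
{
    /* Save the caller's pointers before they are overwritten. */
    void *saved_ptr1 = user->lv_current_pe;
    void *saved_ptr2 = user->lv_block_exception;

    *user = *kernel;    /* whole-structure copy, pointers included */

    /* Restore the caller's pointers. */
    user->lv_current_pe = saved_ptr1;
    if (restore_exception)
        user->lv_block_exception = saved_ptr2;
}
```

Without the second restore, a caller that passed in a NULL lv_block_exception gets the driver's pointer back in that field, and a later "free everything that is non-NULL" step would then act on memory the caller never allocated. With the restore, the field stays NULL and nothing is freed.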

Thanks very much!

BTW, Seen from the changelog at the head of lvm.c:
=======================================================================
 *    28/07/1999 - implemented snapshot logical volumes
 *                 - lvm_chr_ioctl
 *                   - LV_STATUS_BYINDEX
 *                   - LV_STATUS_BYNAME

maybe this could provide some clues. 

Comment 14 IBM Bug Proxy 2004-03-25 03:55:35 UTC
----- Additional Comments From zhouwu.com  2004-03-24 22:58 -------
This defect still exists in the RHEL3 U2 Beta released on 03/16.

Comment 15 IBM Bug Proxy 2004-05-06 07:19:40 UTC
----- Additional Comments From bherren.com(prefers email via benh.com)  2004-05-06 03:20 -------
What is the status here? Are we waiting for Red Hat, or shall one of
us try to find a fix?

Comment 16 Heinz Mauelshagen 2004-05-06 16:07:39 UTC
the code on ppc64 to copy LV metadata in/out is to be found in
linux/arch/ppc64/kernel/ioctl32.c rather than in linux/drivers/md/lvm.c.
With no access to ppc64 right now, i'll carry on spotting...


Comment 17 IBM Bug Proxy 2004-05-28 09:46:53 UTC
----- Additional Comments From zhouwu.com  2004-05-28 05:41 -------
Hello all, 

   Latest status for BZ #5820 and #5821 from my side:

1. Neither defect has been fixed yet in the official RHEL3 U2.

2. BZ #5820 is an application bug; it is fixed by the patch I provided in
Comment #9. Judging from the CVS of lvm-1.0.8, this defect is also fixed
there, in the same way as in my patch. It seems that the Red Hat people can't
see the patch on our side; could anyone who has the right privileges put it on
Red Hat's side?

3. BZ #5821 is a kernel bug as far as I can tell. It was found at the same
time as this one, so much of the discussion about 5821 also took place here.
But to help resolve the problem, I'd like to separate the two: application
defect discussion here, and kernel defect discussion at 5821. So I will only
report briefly here on the kernel defect: I have created a patch which
remedies the vfree and kernel panic errors. I am now running the LTP test
cases on the patched kernel; at the time of writing, they are running
smoothly.

Could anyone tell me what you think about this plan? These two defects have
been open for 4 months already, and in fact we could have made some progress
on them.

Thanks for reading this long comment :)

- Wu 

Comment 18 Heinz Mauelshagen 2004-05-28 09:59:05 UTC
1. Yes, planned for RHEL U3
2. Yes, fixed in lvm-1_0_8-2. No further activity needed on our side IMO.
3. Please provide the kernel patch once you're done with your LTP
testing and I'll integrate it. Thanks, Heinz.

Comment 19 IBM Bug Proxy 2004-05-31 02:31:11 UTC
----- Additional Comments From zhouwu.com  2004-05-30 22:26 -------
Hello Heinz,

   I have attached the patch in LTC BZ# 5821 (RH113704). You can have a look
at it there. Please note that the patch has not gone through any review, so
please double-check it before integrating it.

  In the above comment, you mention a new version of the LVM utilities. Could
you please tell me where I can get it? The SEGV fault still exists after
applying that patch, so I would like to have a look at the new LVM version.
Thanks a lot.

- Wu 

Comment 20 IBM Bug Proxy 2004-06-07 14:24:05 UTC
----- Additional Comments From khoa.com  2004-06-07 04:37 -------
We need Red Hat to review and accept the patch from Wu.  Thanks. 

Comment 21 IBM Bug Proxy 2004-06-28 17:38:43 UTC
----- Additional Comments From thinh.com  2004-06-28 13:34 -------
Can anyone verify/update this bug?  Is it fixed?

Comment 22 IBM Bug Proxy 2004-07-01 02:14:10 UTC
----- Additional Comments From zhouwu.com  2004-06-30 22:13 -------
No segv fault with the latest 1.0.8-3 lvm package. It is fixed. Close it. 
Thanks.