665521 – RHEL4.9:ext2 filesystem cannot work normally with HP IO cards on ia64 platform

Keywords:
Status:	CLOSED DUPLICATE of bug 662839
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	e2fsprogs
Sub Component:
Version:	4.9
Hardware:	ia64
OS:	Linux
Priority:	low
Severity:	urgent
Target Milestone:	beta
Target Release:	---
Assignee:	Eric Sandeen
QA Contact:	BaseOS QE - Apps
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-12-24 09:40 UTC by ShangLi
Modified:	2011-02-15 21:58 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-02-15 21:58:22 UTC
Target Upstream Version:
Embargoed:

Attachments
Disk Information of rx2660 (1.13 KB, text/plain) 2010-12-24 09:40 UTC, ShangLi
sysrq-t on rhel4.9beta system of rx2660 (57.38 KB, text/plain) 2010-12-24 09:41 UTC, ShangLi
sysrq-w on rhel4.9beta system of rx2660 (3.76 KB, text/plain) 2010-12-24 09:42 UTC, ShangLi

Description ShangLi 2010-12-24 09:40:58 UTC

Created attachment 470602
Disk Information of rx2660

Description of problem:
Situation 1:
Writing a large file (> 1 GB, using dd or cp) to an ext2 filesystem puts the writing process into uninterruptible sleep once the written file has grown to about 369 MB. The process in the “D” state cannot be killed, and the system then cannot reboot normally.
Situation 2:
I have also tested ext2 with RHEL4.9 Beta on other ia64 servers, such as the rx7640 and bl860c. There the write throughput is abnormally high, and once the ext2 filesystem is unmounted and remounted, all files in it are lost and show up as invalid entries.


Version-Release number of selected component (if applicable):
System:                      IO cards:             FW version:
HP Integrity Server-rx2660   P600 in slot (1)      2.04
                             A7173A in slot (2)    fw_1.03.35.70-efi_1.05.04.00
                             AB429A in slot (3)    fw_5.03.02-efi_2.2
                             Core-IO LSI-1068
System FW:
Current firmware revisions
 MP FW     : F.02.25
 BMC FW    : 05.26
 EFI FW    : ROM A 07.14, ROM B 07.14
 System FW : ROM A 04.15, ROM B 04.11, Boot ROM A
 PDH FW    : 50.07
 UCIO FW   : 03.0b
 PRS FW    : 00.08 UpSeqRev: 02, DownSeqRev: 01


How reproducible:
Steps to Reproduce:
1. Do a Default or Everything install of RHEL4.8 GA onto the disk attached to the Core-IO LSI-1068.
2. Update RHEL4.8 GA to RHEL4.9 Beta.
3. Log in to the system and run mkfs.ext2 on a disk attached to one of the IO cards (the issue occurs on the P600, A7173A, and AB429A; disks on all of them reproduce it).
4. Mount the disk at a mount point, then dd or cp a large file (> 1 GB) into the mount point directory, or touch a file and mkdir several directories.
5. Check the dd process status with ps ax, note the elapsed time, calculate the write throughput, and check the status of the files in the ext2 filesystem.
6. Unmount the ext2 filesystem, then remount it.
7. Check the status of the files in the ext2 filesystem again.
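
The steps above (from mkfs.ext2 onward) can be sketched as a small shell function. This is only an illustration, not the script used for the report: the function names `run` and `repro_ext2`, the file names `bigfile` and `testdir`, and the `DRY_RUN` switch are assumptions. With `DRY_RUN=1` (the default here) the commands are only printed, since mkfs.ext2 destroys whatever is on the target device.

```shell
#!/bin/sh
# Sketch of reproduction steps 3-7. Names below are illustrative assumptions.
# DRY_RUN=1 (default) prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro_ext2() {
    dev=$1   # scratch partition on the IO card under test (will be erased)
    mnt=$2   # existing, empty mount point

    run mkfs.ext2 "$dev"                                    # step 3
    run mount -t ext2 "$dev" "$mnt"                         # step 4
    run dd if=/dev/zero of="$mnt/bigfile" bs=1M count=2048  # step 4: file > 1 GB
    run mkdir "$mnt/testdir"                                # step 4
    run ls -l "$mnt"                                        # step 5
    run umount "$mnt"                                       # step 6
    run mount -t ext2 "$dev" "$mnt"                         # step 6
    run ls -l "$mnt"                                        # step 7
}
```

Calling `repro_ext2 /dev/cciss/c0d0p1 /root/p600` prints the command sequence; setting `DRY_RUN=0` as root would actually run it against that partition.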

  
Actual results:
Situation 1:
Writing a large file (> 1 GB, using dd or cp) to ext2 puts the writing process into uninterruptible sleep once the written file has grown to about 369 MB. The process in the “D” state cannot be killed, and the system then cannot reboot normally.
Situation 2:
The write throughput is abnormally high, and once the ext2 filesystem is unmounted and remounted, all files in it are lost and show up as invalid entries.


Expected results:
Situation 1: dd or cp of a large file to ext2 completes normally.
Situation 2: After remounting ext2, all existing files are intact.


Additional info:
Situation 1: Testing on the rx2660-12 server
[root@minxm ~]# uname -a
Linux minxm.rx2660-12 2.6.9-92.EL #1 SMP Mon Nov 29 14:42:44 EST 2010 ia64 ia64 ia64 GNU/Linux
[root@minxm ~]# fdisk  /dev/cciss/c0d0
[root@minxm ~]# mkfs.ext2  /dev/cciss/c0d0p1
[root@minxm ~]# mount  -t  ext2  /dev/cciss/c0d0p1  /root/p600/  
[root@minxm ~]#  time dd if=/dev/zero  of=/root/p600/dd  bs=1M count=5k

Then check the written file size and process status:
[root@minxm ~]# cd  /root/p600/
[root@minxm p600]# ll
total 375744
-rw-r--r--  1 root root 384335872 Dec 21 22:41 dd
drwx------  2 root root     16384 Dec 21 22:40 lost+found
[root@minxm p600]# ll  -h
total 367M
-rw-r--r--  1 root root 367M Dec 21 22:41 dd
drwx------  2 root root  16K Dec 21 22:40 lost+found
[root@minxm p600]# ps ax |grep dd
24134 ttyS0    D+     0:00 dd if /dev/zero of /root/p600/dd bs 1M count 5k
24148 pts/0    S+     0:00 grep dd
[root@minxm p600]# kill  -9  24134
[root@minxm p600]# ps ax |grep dd
24134 ttyS0    D+     0:00 dd if /dev/zero of /root/p600/dd bs 1M count 5k
24154 pts/0    S+     0:00 grep dd
[root@minxm p600]# reboot    
Broadcast message from root (pts/0) (Tue Dec 21 06:27:26 2010):
The system is going down for reboot NOW!
INIT: Switching to runlevel: 6
INIT: Sending processes the TERM signal
(System will hang in reboot.)
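
A process in the “D” state ignores signals, including SIGKILL, until the I/O it is waiting on completes, which is why kill -9 has no effect above. As a minimal sketch for spotting such processes, the helper below filters `ps ax` output on the STAT column; the function name `dstate_pids` is an assumption, and it expects the column layout shown in the transcript (PID TTY STAT TIME COMMAND).

```shell
# Print the PIDs of processes in uninterruptible sleep ("D" state).
# Reads `ps ax` output (PID TTY STAT TIME COMMAND) on standard input;
# the header line is skipped because its first field is not numeric.
dstate_pids() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^D/ { print $1 }'
}
```

Typical use would be `ps ax | dstate_pids`.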


Situation 2: Testing on the rx7640-3 server
[root@maxcv ~]# uname -a
Linux maxcv.rx7640-3-p0.test 2.6.9-92.EL #1 SMP Mon Nov 29 14:42:44 EST 2010 ia64 ia64 ia64 GNU/Linux
[root@maxcv ~]# fdisk -lu /dev/sdb
Disk /dev/sdb: 72.8 GB, 72839168000 bytes
255 heads, 63 sectors/track, 8855 cylinders, total 142264000 sectors
Units = sectors of 1 * 512 = 512 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1              63    58605119    29302528+  83  Linux
[root@maxcv ~]# mkfs.ext2 /dev/sdb1
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
3662848 inodes, 7325632 blocks
366281 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
224 block groups
32768 blocks per group, 32768 fragments per group
16352 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 22 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
[root@maxcv ~]# mount  -t  ext2  /dev/sdb1  /root/ext2/
[root@maxcv ~]# ll  -h  dddd
-rw-r--r--  1 root root  12G  Dec 24 06:05 dddd
[root@maxcv ~]# time  cp  /root/dddd  /root/ext2/
real    0m17.049s
user    0m0.362s
sys     0m16.372s
(The write throughput is about 722 MB/s.)
[root@maxcv ~]# cd  /root/ext2/
[root@maxcv ext2]# mkdir 123
[root@maxcv ext2]# ll -h
total 13G
drwxr-xr-x  2 root root 4.0K Dec 24 09:29 123
-rw-r--r--  1 root root  12G Dec 24 09:29 dddd
drwx------  2 root root  16K Dec 24 09:28 lost+found
[root@maxcv ext2]# cd
[root@maxcv ~]# umount  /root/ext2
[root@maxcv ~]# mount  -t  ext2  /dev/sdb1 /root/ext2/
[root@maxcv ~]# cd  /root/ext2/
[root@maxcv ext2]# ll
total 16
?---------  ? ?    ?        ?            ? 123
?---------  ? ?    ?        ?            ? dddd
drwx------  2 root root 16384 Dec 24 09:28 lost+found
[root@maxcv ext2]# mkdir 456
[root@maxcv ext2]# ll
[root@maxcv ext2]# ll
total 20
?---------  ? ?    ?        ?            ? 123
drwxr-xr-x  2 root root  4096 Dec 24 09:31 456
?---------  ? ?    ?        ?            ? dddd
drwx------  2 root root 16384 Dec 24 09:28 lost+found
[root@maxcv ext2]# rm -rf dddd
rm: cannot remove `dddd': Stale NFS file handle
[root@maxcv ext2]# cd
[root@maxcv ~]# umount  /root/ext2
[root@maxcv ~]# mount  -t  ext2  /dev/sdb1 /root/ext2/
[root@maxcv ~]# cd  /root/ext2
[root@maxcv ext2]# ll
total 16
?---------  ? ?    ?        ?            ? 123
?---------  ? ?    ?        ?            ? 456
?---------  ? ?    ?        ?            ? dddd
drwx------  2 root root 16384 Dec 24 09:28 lost+found
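
In the listings above, entries whose inode could not be read are printed with a leading “?” in the mode column. A rough way to count them after a remount is sketched below. The function name `count_unreadable_entries` is an assumption, and the match relies on the `?---------` format shown in this report; other versions of ls may print unreadable entries differently.

```shell
# Count entries that `ls -l` could not stat, recognised here by a leading "?"
# in the mode column (the format shown in this report).
# Reads `ls -l` output on standard input and always prints a number.
count_unreadable_entries() {
    awk '/^\?/ { n++ } END { print n + 0 }'
}
```

For example, `ls -l /root/ext2 | count_unreadable_entries` would report 3 for the final listing above.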

The file metadata seems to have gone wrong.
For the disk information of the rx2660, see the attachment “diskinfo-rx2660”.
For the sysrq output of the rx2660, see the attachments “sysrq-w-rx2660” and “sysrq-t-rx2660”.
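
The throughput figure quoted for Situation 2 is simply file size divided by elapsed time. The helper below is a sketch of that arithmetic; the function name `throughput_mib_s` is an assumption, and it reports MiB/s (1 MiB = 1048576 bytes). The exact byte size of dddd is not given in this report, only the rounded 12G from ll -h; assuming exactly 12 GiB copied in 17.049 s, the result is roughly 720 MiB/s, close to the quoted figure.

```shell
# Throughput in MiB/s from a size in bytes ($1) and elapsed seconds ($2).
throughput_mib_s() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes / 1048576 / secs }'
}
```

On a live system the byte count could come from `ls -l` and the seconds from the `real` line of `time`.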

Comment 1 ShangLi 2010-12-24 09:41:49 UTC

Created attachment 470603
sysrq-t on rhel4.9beta system of rx2660

Comment 2 ShangLi 2010-12-24 09:42:25 UTC

Created attachment 470604
sysrq-w on rhel4.9beta system of rx2660

Comment 3 Eric Sandeen 2011-01-03 18:57:58 UTC

Is this a regression?

Comment 4 ShangLi 2011-01-04 08:59:53 UTC

Hi Eric
I can't confirm whether this is a regression. I started my testing with the RHEL4.9 Beta version and hit this issue there. The issue doesn't occur on RHEL4.8 GA.

Thanks,
-Li

Comment 5 Eric Sandeen 2011-01-04 16:16:20 UTC

If it occurs in 4.9 but not in 4.8, then it is a regression since 4.8.

Could you please test the two kernels here:

http://people.redhat.com/esandeen/.bz665521/

and tell me if the problem is present in one but not the other?

Thanks,
-Eric

Comment 6 ShangLi 2011-01-05 07:23:15 UTC

Hi Eric
This issue doesn't occur on kernel 2.6.9-89.44.EL, but it is present in kernel 2.6.9-89.45.EL.

Thanks,
-Li

Comment 7 Eric Sandeen 2011-01-05 15:02:45 UTC

Thank you.  This is likely a dup of bug #662839; when we have that built I'll alert you for another test...

Thanks,
-Eric

Comment 8 Li Zhang 2011-01-07 10:30:14 UTC

Could you please tell us which build this fix will be included in? Currently IO stress testing is blocked by this issue.

Comment 9 Eric Sandeen 2011-01-07 19:27:06 UTC

For reference these are the changes in 89.45:

* Fri Oct 15 2010 Vivek Goyal <vgoyal> [2.6.9-89.45]
-scsi: scsi_do_req submitted commands (tape) never complete when device goes (Rob Evers) [636289]
-scsi: log msg when getting unit attention (Mike Christie) [585430]
-jbd: fix panic in jbd when running bashmemory (Josef Bacik) [488611]
-qla2xxx: work around hypertransport sync flood error on sun x4200 with qla2xxx (Chad Dupuis) [621621]
-aio: implement request batching for better merging and throughput (Jeff Moyer) [508377]
-fs: a bunch of patches to fix various nfsd/iget() races (Alexander Viro) [189918]
-net: bonding: add debug module option (Jiri Pirko) [247116]
-fix fd leaks if pipe() is called with an invalid address (Amerigo Wang) [509627]

Comment 10 Eric Sandeen 2011-01-07 19:28:56 UTC

(In reply to comment #8)
> Could you please tell us which build this fix will be included in? Currently IO
> stress testing is blocked by this issue.

We're still discussing the fix, and will let you know.

Thanks,
-Eric

Comment 11 Eric Sandeen 2011-02-10 22:30:24 UTC

Can you please retest with the latest snapshots, I think this may be resolved
now.

Comment 12 Dawei Pang 2011-02-11 16:28:52 UTC

Hi Eric,
Currently I cannot find any update on RHN; the kernel version is still 2.6.9-92.EL.

Thanks,
-Dawei

(In reply to comment #11)
> Can you please retest with the latest snapshots, I think this may be resolved
> now.

Comment 13 Eric Sandeen 2011-02-11 16:35:35 UTC

I'm sorry; you can test a later snapshot by getting a kernel from the maintainer's URL at http://people.redhat.com/vgoyal/rhel4/RPMS.kernel/

-Eric

Comment 14 Dawei Pang 2011-02-12 07:45:40 UTC

Hi Eric,

I have run some tests using kernel-2.6.9-100.EL.ia64.rpm, which I downloaded from the URL you supplied.
ext2 works fine.

Thanks,
Dawei

Comment 15 Eric Sandeen 2011-02-15 21:58:22 UTC

Thank you for testing.  I'll dup this bug to the other.

*** This bug has been marked as a duplicate of bug 662839 ***
