Bug 121515

Summary: EXT-fs error causing filesystem corruption
Product: Red Hat Enterprise Linux 2.1
Reporter: Jude T. Cruz <cruz.jude>
Component: kernel
Assignee: Stephen Tweedie <sct>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 2.1
Hardware: i686   
OS: Linux   
Last Closed: 2004-10-15 00:20:28 UTC

Description Jude T. Cruz 2004-04-22 10:51:12 UTC
We are in the midst of setting up a two-node cluster using Red Hat Cluster Manager. The hardware summary is as follows.

Two HP ProLiant DL580 servers connected through Smart Array 532 SCSI HBAs to an HP MSA500.

The cluster initialization completed without any errors, and a failover test using Samba worked fine. We then stopped the cluster configuration and installed Oracle 9i RDBMS on one node and Oracle 10g Apps on the other. Node one is called ecos1 and the second node is ecos2.
  
The kernel version is linux-2.4.90e.38smp  
  
Both servers access different filesystems on the shared storage. To test the Oracle database, we shut down the database, followed by the server (ecos1) itself. We then started the second node from a powered-down state and mounted the Oracle database filesystems. They mounted cleanly, but there was a long delay when we tried to su to the oracle user. In the background I captured the following errors in /var/log/messages:
  
Apr 19 16:46:04 ecos2 syslogd 1.4.1: restart.
Apr 19 16:47:37 ecos2 kernel: kjournald starting.  Commit interval 5 seconds
Apr 19 16:47:37 ecos2 kernel: EXT3 FS 2.4-0.9.11, 3 Oct 2001 on cciss1(105,6), internal journal
Apr 19 16:47:37 ecos2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Apr 19 16:47:53 ecos2 kernel: kjournald starting.  Commit interval 5 seconds
Apr 19 16:47:53 ecos2 kernel: EXT3 FS 2.4-0.9.11, 3 Oct 2001 on cciss1(105,7), internal journal
Apr 19 16:47:53 ecos2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Apr 19 16:48:02 ecos2 kernel: kjournald starting.  Commit interval 5 seconds
Apr 19 16:48:02 ecos2 kernel: EXT3 FS 2.4-0.9.11, 3 Oct 2001 on cciss1(105,8), internal journal
Apr 19 16:48:02 ecos2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Apr 19 16:49:15 ecos2 kernel: st: Version 20010812, bufsize 32768, wrt 30720, max init. bufs 4, s/g segs 16
Apr 19 16:49:15 ecos2 kernel: Attached scsi tape st0 at scsi0, channel 0, id 0, lun 0
Apr 19 16:49:15 ecos2 kernel: st0: Block limits 1 - 16777215 bytes.
Apr 19 16:52:51 ecos2 su(pam_unix)[8905]: session opened for user oracle by root(uid=0)
Apr 19 16:53:00 ecos2 kernel: cciss: cmd f6960000 timedout
Apr 19 16:53:13 ecos2 last message repeated 2 times
Apr 19 16:58:14 ecos2 su(pam_unix)[8905]: session closed for user oracle
Apr 19 17:01:59 ecos2 kernel: cciss: cmd f6960000 timedout
Apr 19 17:01:59 ecos2 kernel: EXT3-fs error (device cciss1(105,6)): ext3_readdir: directory #1632391 contains a hole at offset 0
Apr 19 17:06:03 ecos2 PAM-securetty[1203]: Couldn't open /etc/securetty
Apr 19 17:06:05 ecos2 login(pam_unix)[1203]: session opened for user root by LOGIN(uid=0)
Apr 19 17:06:05 ecos2  -- root[1203]: ROOT LOGIN ON tty4
Apr 19 17:12:21 ecos2 su(pam_unix)[10127]: session opened for user oracle by root(uid=0)
Apr 19 17:14:16 ecos2 su(pam_unix)[10127]: session closed for user oracle
  
When we tried to run sqlplus, the executable was not found; the file had actually been renamed to sqlplusO. Several other files had an "O" or "0" appended to the end of their names.

I suspect this is caused by the filesystem error:
Apr 19 17:01:59 ecos2 kernel: EXT3-fs error (device cciss1(105,6)): ext3_readdir: directory #1632391 contains a hole at offset 0
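To check whether each EXT3-fs error is preceded by cciss command timeouts, the relevant kernel messages can be pulled out of /var/log/messages with their timestamps. A minimal sketch in Python, assuming the classic syslogd line format shown above ("Mon DD HH:MM:SS host kernel: message"); the script and the function name scan are illustrative only, not part of any Red Hat tool:

```python
import re
import sys

# Matches only the two kernel messages of interest: cciss command timeouts
# and EXT3-fs errors. Assumes the classic syslogd format
# "Mon DD HH:MM:SS host kernel: message" (an assumption, not a guarantee).
PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<msg>cciss: cmd \S+ timedout|EXT3-fs error.*)$"
)


def scan(lines):
    """Return (timestamp, message) pairs for cciss timeouts and EXT3-fs errors."""
    hits = []
    for line in lines:
        m = PATTERN.match(line.rstrip())  # drop trailing whitespace/newline
        if m:
            hits.append((m.group("ts"), m.group("msg")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py /var/log/messages
    with open(sys.argv[1]) as f:
        for ts, msg in scan(f):
            print(ts, msg)
```

Run against the log excerpt above, this would list the 16:53:00 and 17:01:59 timeouts followed by the ext3_readdir error, making the ordering easy to see. Note that "last message repeated" lines are not counted, so repeated timeouts appear only once.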

I have reformatted the shared filesystems and tried mounting them after starting both nodes in cluster mode. This generally works, but it intermittently fails with the following error message in /var/log/messages during any application install:

ecos2 kernel: cciss: cmd f6960000 timedout 

The cciss driver version is 2.4.49.

This problem can be reproduced. 

We desperately need to resolve this problem because of a project deadline.

Opened an RHN Customer Service Report: 318556

e-mail : jude.my

Comment 1 Jeff Needle 2004-04-23 00:07:07 UTC
Since you have reported this to our support folks, someone will be in
touch with you very shortly to work through this.

Comment 3 Stephen Tweedie 2004-10-15 00:20:28 UTC
Closing as NOTABUG for now; please reopen if there is a genuine bug
discovered which needs to be escalated.