Bug 182577

Summary: Oracle ASM produces lots of kernel errors. Ext3 works fine
Product: Red Hat Enterprise Linux 4 Reporter: Boris Mironov <bmironov>
Component: kernel Assignee: Larry Woodman <lwoodman>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0 CC: greg.marsden, jbaron, joel.becker, john.sobecki, tao
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-11-21 20:22:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 176344    

Description Boris Mironov 2006-02-23 14:41:47 UTC
Description of problem:


Version-Release number of selected component (if applicable):
2.6.9-11.ELsmp #1 SMP x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install Oracle 10g R2 according to Metalink Note 339367.1
2. Create ASM-based database
3.
  
Actual results:
Installer hangs during 'shutdown immediate' (74% of total progress)

Expected results:
Installation should be smooth

Additional info:
We use the following hardware:
IBM e326 server (2 * AMD64 CPU, 8Gb RAM)
Qlogic QLA2342 FC card
IBM TotalStorage DS400

Please note: this configuration also fails to create a database on raw devices, 
but it works fine on an ext3 filesystem.

We have to install a clustered database, so the only option for us is to have 
an ASM-based database running.

If it can help, I filed the same problem with Oracle as TAR #4855265.992

Thanks,
Boris

Comment 1 Jason Baron 2006-02-23 21:19:28 UTC
Unless you can prove otherwise, this looks like it should be filed with Oracle.
Please re-open it, if there is a Red Hat specific issue. thanks.

Comment 2 Boris Mironov 2006-02-24 02:00:32 UTC
What would you consider proof?

When I use Oracle on a raw device and it does not work, I can assume that the 
problem is somewhere between Oracle and Red Hat.

What should I do to reopen it? I just did it!

Comment 3 Boris Mironov 2006-02-28 13:00:09 UTC
Hello,

Could you please answer my question?

Regards,
Boris

Comment 4 Jason Baron 2006-03-01 02:18:18 UTC
Thanks for re-opening the bug. We are actively looking into it and will update
it with progress.

Comment 5 Joel Becker 2006-03-01 02:28:23 UTC
The "doesn't work on ext3" thing is something I wonder about.  With raw devices
mapping to chunks of the disk, I wonder if we are accessing parts of the disk
that an ext3 filesystem doesn't touch.  An ext3 would front-load the datafiles.

Comment 6 Boris Mironov 2006-03-14 16:41:25 UTC
Hello,

It seems to be working with kernel parameter "schedule=deadline".

Regards,
Boris

Comment 7 John Sobecki 2006-03-14 18:09:06 UTC
Dup of BZ 151368 per above testing.

Comment 8 Boris Mironov 2006-03-14 19:46:13 UTC
Hi John,

Could you please clue me in about the connection between /etc/dev.d/default 
(bugzilla bug# 151368) and the problem between Oracle and the IBM e326 in my case?

Thanks,
Boris

Comment 9 John Sobecki 2006-03-14 20:29:47 UTC
Sorry about that, it's 151368 in novell.bugzilla.com; the patched file
is cfq-iosched.c.  So switching from the cfq to the deadline I/O
scheduler avoids the code with the problem.  Regards, John
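
For readers reproducing this: the boot parameter that selects the I/O
scheduler on 2.6 kernels is spelled elevator= (for example
elevator=deadline), whereas this thread writes it as "schedule=deadline".
The sketch below is only an illustration and was not posted in the bug; the
function name elevator_from_cmdline and the sample command lines are
hypothetical. It reports which scheduler, if any, a kernel command line
actually requests.

```python
def elevator_from_cmdline(cmdline):
    """Return the value of the elevator= token in a kernel command line.

    Returns None when no elevator= token is present, in which case the
    kernel falls back to its compiled-in default scheduler.
    """
    for token in cmdline.split():
        if token.startswith("elevator="):
            return token.split("=", 1)[1]
    return None


# Hypothetical command lines, used only to exercise the function.
with_elevator = "ro root=/dev/sda1 elevator=deadline"
with_schedule = "ro root=/dev/sda1 schedule=deadline"

print(elevator_from_cmdline(with_elevator))
print(elevator_from_cmdline(with_schedule))
```

On a live system the string could be read from /proc/cmdline instead of the
hard-coded samples.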

Comment 10 Greg Marsden 2006-03-14 21:38:14 UTC
Yes, it's being tracked as issuetracker 88208 (private, unfortunately) but there
is no bugzilla open for it yet

Comment 11 Jason Baron 2006-03-16 21:51:41 UTC
John, when you mention the issue with the cfq scheduler, is this the issue in bug
#184535? Thanks.

Comment 12 Jason Baron 2006-03-16 22:09:37 UTC
I'll answer my own question in the affirmative. Marking this as a duplicate of
184535. Boris, you can try a test kernel from
http://people.redhat.com/~jbaron/rhel4/ which already has this fix, so that you
can use the default cfq elevator. Thanks.

*** This bug has been marked as a duplicate of 184535 ***

Comment 13 Boris Mironov 2006-03-17 14:33:42 UTC
Hi Jason,

Which kernel do you want me to try, SMP or SMP-DEVEL?
If the "devel" one, do I need to use special kernel parameters to collect more 
data?

Thanks,
Boris

Comment 14 Boris Mironov 2006-03-20 16:03:31 UTC
Hello,

I tried the new kernel with both the cfq and deadline schedulers, and the 
problem still exists.

Regards,
Boris

Comment 15 Jeff Layton 2006-03-21 18:31:52 UTC
Reopening bug since it doesn't sound like it was a duplicate of the other one
after all.

Boris, you mentioned trying the test kernel with both the CFQ and deadline
schedulers and the problem still exists. Does this mean that you've also seen
this when using the deadline scheduler?


Comment 17 Boris Mironov 2006-03-21 18:40:35 UTC
Hi Jeffrey,

Yes, I tried the 34.5smp kernel with both IO schedulers (one after another), and 
in both cases Oracle dbca got stuck at the same point (74%), when the script 
tries to shut down the database. At this point the alert.log files are slightly 
different, but in both cases the DBWR process was waiting for something. It is 
exactly the same behaviour as I saw under kernel 11smp (with the cfq scheduler), 
which was recommended by Oracle for x86_64 systems and Oracle 10g R2.

Thanks,
Boris

Comment 18 Jeff Layton 2006-03-21 19:01:26 UTC
Novell BZ 151368 seems to be a private case on their site.

John, can you provide some details about what that BZ is about?


Comment 20 Boris Mironov 2006-03-21 19:26:28 UTC
Sorry, forgot to mention that:
1) I was able to use Oracle and ASM with deadline IO scheduler under kernels 
11smp and 34smp (official releases)
2) 11smp 34.5smp kernel did not work for Oracle on raw devices with IO 
deadline scheduler

Thanks,
Boris



Comment 21 John Sobecki 2006-03-21 19:39:29 UTC
Oracle processes hang during I/O during database creation.  The sysrq-t stack
is:

Feb 11 05:42:56 gemini2 kernel: oracle  D 0000010103e05878  0  6528  

Call Trace:
 <ffffffff8023c2a5>{elv_next_request+238}
 <ffffffff802f87b4>{io_schedule+37}
 <ffffffff801933a2>{__blockdev_direct_IO+2899}
 <ffffffff801563fb>{__generic_file_aio_read+266}
 <ffffffff801953b3>{sys_io_getevents+685}
 <ffffffff80193f4d>{__aio_run_iocbs+491}
 <ffffffff801944a2>{timeout_func+0}
 <ffffffff801320ea>{default_wake_function+0}
 <ffffffff80172737>{sys_pread64+86}
 <ffffffff8011003e>{system_call+126}

Makes no sense that deadline is broken in 34.5.  Could you attach the patch
you used to create 34.5?  Thanks, John

Comment 22 Boris Mironov 2006-03-21 19:43:15 UTC
It might be of interest that in each test (kernels 11smp, 34smp) I saw kernel 
errors during the database restart phase. However, only ASM-based databases 
managed to restart. Raw-based databases just hung indefinitely.

Example of kernel errors.
"34smp / Oracle ASM" test:
Mar 21 14:27:07 gemini2 kernel: end_request: I/O error, dev sdb, sector 1552591
Mar 21 14:27:07 gemini2 kernel: end_request: I/O error, dev sdb, sector 1565391
Mar 21 14:27:07 gemini2 kernel: end_request: I/O error, dev sdb, sector 1569871

P.S. I asked Oracle to double check my config files for RAW test. Sorry if it 
is my mistake.

P.P.S. I cannot check the deadline scheduler with ASM at the moment because 
there is no official ASMlib release for the development kernel. I will try to 
build it from source code. At the moment the only thing I can check is the RAW 
device installation.

Thanks,
Boris
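
For anyone triaging similar logs, the end_request lines above can be grouped
by device to see whether the failing sectors cluster on one disk. This is a
minimal sketch, not part of the original report; the function name
io_errors_by_device is hypothetical, and the regular expression assumes the
message format exactly as it appears in this thread.

```python
import re
from collections import defaultdict

# Matches the kernel message format quoted in this thread (assumed stable).
PATTERN = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def io_errors_by_device(lines):
    """Map each device name to the list of failing sectors, in log order."""
    errors = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            errors[match.group(1)].append(int(match.group(2)))
    return dict(errors)


# The three messages from the "34smp / Oracle ASM" test.
sample = [
    "Mar 21 14:27:07 gemini2 kernel: end_request: I/O error, dev sdb, sector 1552591",
    "Mar 21 14:27:07 gemini2 kernel: end_request: I/O error, dev sdb, sector 1565391",
    "Mar 21 14:27:07 gemini2 kernel: end_request: I/O error, dev sdb, sector 1569871",
]

summary = io_errors_by_device(sample)
print(summary)
```

The same function could be fed the lines of /var/log/messages; lines that do
not match the pattern are ignored.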

Comment 23 Boris Mironov 2006-03-21 19:47:04 UTC
John, Please note that on Feb 11 I had default IO scheduler (cfq). Not 
deadline. I was told to try it on Mar 10 only.

Thanks,
Boris

Comment 24 Joel Becker 2006-03-21 20:21:28 UTC
Boris, please clarify.  I don't see why you cannot check deadline and cfq on the
same kernel, as they ship with every kernel.  I suspect I'm just not
understanding you.


Comment 25 Boris Mironov 2006-03-21 20:28:43 UTC
Hi Joel,

I just updated TAR #4855265.992 with a request to double check my config files 
for the "raw" test. I tested both IO schedulers under each kernel (11smp, 
34smp, 34.4smp, 34.5smp). Because the last two kernels are development kernels, 
there is no official ASMlib release for them. I am also concerned that I'm doing 
the "raw" test incorrectly. I also noticed that ANY test produces kernel 
"end_request" errors, but ASMlib recovers from them and "raw" does not.

Hope it sheds some light on my chaotic thoughts,
Boris
Boris

Comment 26 Joel Becker 2006-03-21 21:36:18 UTC
3) I used the following raw config file:
system=/dev/raw/raw1
sysaux=/dev/raw/raw2
users=/dev/raw/raw3
temp=/dev/raw/raw4

What do you mean "raw config file"?  The only raw config file I know is
/etc/sysconfig/raw, and it doesn't have the above format.

Comment 27 Boris Mironov 2006-03-22 02:09:34 UTC
Hi Joel,

Sorry for the confusion. This raw config file is used in dbca to create a new 
database. Its proper name is "Raw devices mapping file" and it is used on step 6 
of 12 (Storage options) of the Oracle 10g R2 dbca utility. It is NOT 
/etc/sysconfig/rawdevices.

Regards,
Boris

Comment 28 Boris Mironov 2006-03-24 20:08:11 UTC
Hello,

I tried kernel 2.6.9-34.7.ELsmp with "schedule=deadline" under the "Oracle raw" 
test. dbca stopped at the usual 74%, but /var/log/messages showed just a single 
kernel error instead of three:

Mar 24 14:06:11 gemini2 kernel: end_request: I/O error, dev sdc, sector 32136351

I also see the following error from time to time:

Mar 24 14:49:48 gemini2 kernel: warning: many lost ticks.
Mar 24 14:49:48 gemini2 kernel: Your time source seems to be instable or some driver is hogging interupts
Mar 24 14:49:48 gemini2 kernel: rip __do_softirq+0x4d/0xd0

This error existed in all kernels.

Thanks,
Boris

Comment 30 Daniel Riek 2006-11-21 20:13:43 UTC
This request is not planned for inclusion in the next update. The decision is
based on weighting the priority and number of requests for a component as well
as the impact on the Red Hat Enterprise Linux user-base: other components are
considered having higher priority and the number of changes we intend to include
in update cycles is limited.

Comment 31 RHEL Program Management 2006-11-21 20:22:35 UTC
Product Management has reviewed and declined this request.  You may appeal this
decision by reopening this request.