Bug 154055 - RHEL4 U1 Oracle 10G 10.0.3 aio hang running tpc-c
Summary: RHEL4 U1 Oracle 10G 10.0.3 aio hang running tpc-c
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: ---
: ---
Assignee: Jeff Moyer
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 156322
TreeView+ depends on / blocked
 
Reported: 2005-04-06 21:21 UTC by John Shakshober
Modified: 2007-11-30 22:07 UTC (History)
1 user (show)

Fixed In Version: RHSA-2005-514
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-10-05 12:56:55 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
has sysrq m and t (143.84 KB, text/plain)
2005-04-06 21:21 UTC, John Shakshober
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:514 0 qe-ready SHIPPED_LIVE Important: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 2 2005-10-05 04:00:00 UTC

Description John Shakshober 2005-04-06 21:21:20 UTC
Description of problem:

RHEL4 U1  Oracle 10G 10.0.3 aio hang running tpc-c.  This runs fine on RHEL3
U4/U5 and with Oracle 10G 10.0.3 with the 32-bit version of oracle.




Version-Release number of selected component (if applicable):
Fails only with 10G 64-bit and RHEL4 (day0, U1)
2.6.9-6.34.ELsmp libaio-0.3.103-2

After running tpc-C for more than 5 minutes, the oracle database writer (process
5221) will hang.  The jobs can NOT be killed, any other disk access hangs the
process that is connected in.  (ie df, ls, cat, more ... etc).

How reproducible:
every time

Steps to Reproduce:
1. build tpc-c to ext3 file system, set dio and aio in init.ora
2. run tpc-c with 10 users
3. fails within 5 minutes
  
Actual results:

[root@bigbaddell2 common]# uname -a
Linux bigbaddell2.lab.boston.redhat.com 2.6.9-6.34.ELsmp #1 SMP Thu Mar 24
18:37:26 EST 2005 x86_64 x86_64 x86_64 GNU/Linux

Starting up ORACLE RDBMS Version: 10.1.0.3.0 (for the Oracle X86_64 ONLY)
(Oracle 32-bit versions runs fine if linked on an x686 system).



[root@bigbaddell2 common]# top

top - 09:46:52 up 1 day, 1 min,  3 users,  load average: 17.30, 16.58, 26.14
Tasks: 125 total,   3 running, 121 sleeping,   0 stopped,   1 zombie
Cpu0  : 17.6% us, 12.6% sy,  0.0% ni, 52.3% id, 17.5% wa,  0.0% hi,  0.0% si
Cpu1  : 17.5% us, 12.6% sy,  0.0% ni, 53.3% id, 16.5% wa,  0.0% hi,  0.0% si
Cpu2  : 17.4% us, 12.7% sy,  0.0% ni, 37.2% id, 31.3% wa,  0.4% hi,  1.0% si
Cpu3  : 17.5% us, 12.7% sy,  0.0% ni, 36.3% id, 30.0% wa,  0.4% hi,  3.1% si
Mem:   4033796k total,  3442708k used,   591088k free,    19228k buffers
Swap:  4096588k total,      160k used,  4096428k free,  2946184k cached
                                                                               
                                   
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
5221 oracle    16   0 3605m 682m 678m R 99.9 17.3  29:49.90 oracle
5575 root      16   0  6148 1012  752 R  0.8  0.0   0:00.01 top
   1 root      16   0  4744  556  460 S  0.0  0.0   0:02.70 init
   2 root      RT   0     0    0    0 S  0.0  0.0   0:00.98 migration/0
   3 root      34  19     0    0    0 S  0.0  0.0   0:00.25 ksoftirqd/0
   4 root      RT   0     0    0    0 S  0.0  0.0   0:00.88 migration/1
   5 root      34  19     0    0    0 S  0.0  0.0   0:00.22 ksoftirqd/1
   6 root      RT   0     0    0    0 S  0.0  0.0   0:00.76 migration/2
   7 root      34  19     0    0    0 S  0.0  0.0   0:00.38 ksoftirqd/2
   8 root      RT   0     0    0    0 R  0.0  0.0   0:00.80 migration/3
   9 root      34  19     0    0    0 S  0.0  0.0   0:00.29 ksoftirqd/3
  10 root       5 -10     0    0    0 S  0.0  0.0   0:00.28 events/0
  11 root       5 -10     0    0    0 S  0.0  0.0   0:00.05 events/1
  12 root       5 -10     0    0    0 S  0.0  0.0   0:00.06 events/2
  13 root       5 -10     0    0    0 S  0.0  0.0   0:00.07 events/3
  14 root       5 -10     0    0    0 S  0.0  0.0   0:00.01 khelper
  15 root      15 -10     0    0    0 S  0.0  0.0   0:00.00 kacpid
  64 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
  65 root       5 -10     0    0    0 S  0.0  0.0   0:00.00 kblockd/1
[root@bigbaddell2 common]# vmstat 3 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
2  6    160 591272  19228 2946184    0    0     9  1493   23    78 18 14 45 24
2  6    160 591272  19228 2946184    0    0     0     0 1020    70  0 25 25 50
2  6    160 591272  19228 2946184    0    0     0     0 1013    61  0 25 25 50

[root@bigbaddell2 common]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Apr04 ?        00:00:02 init [5]
root         2     1  0 Apr04 ?        00:00:00 [migration/0]
root         3     1  0 Apr04 ?        00:00:00 [ksoftirqd/0]
root         4     1  0 Apr04 ?        00:00:00 [migration/1]
root         5     1  0 Apr04 ?        00:00:00 [ksoftirqd/1]
root         6     1  0 Apr04 ?        00:00:00 [migration/2]
root         7     1  0 Apr04 ?        00:00:00 [ksoftirqd/2]
root         8     1  0 Apr04 ?        00:00:00 [migration/3]
root         9     1  0 Apr04 ?        00:00:00 [ksoftirqd/3]
root        10     1  0 Apr04 ?        00:00:00 [events/0]
root        11     1  0 Apr04 ?        00:00:00 [events/1]
root        12     1  0 Apr04 ?        00:00:00 [events/2]
root        13     1  0 Apr04 ?        00:00:00 [events/3]
root        14    10  0 Apr04 ?        00:00:00 [khelper]
root        15    10  0 Apr04 ?        00:00:00 [kacpid]
root        64    10  0 Apr04 ?        00:00:00 [kblockd/0]
root        65    10  0 Apr04 ?        00:00:00 [kblockd/1]
root        66    10  0 Apr04 ?        00:00:00 [kblockd/2]
root        67    10  0 Apr04 ?        00:00:00 [kblockd/3]
root        83    10  0 Apr04 ?        00:00:15 [pdflush]
root        85    10  0 Apr04 ?        00:00:00 [aio/0]
root        86    10  0 Apr04 ?        00:00:00 [aio/1]
root        87    10  0 Apr04 ?        00:00:00 [aio/2]
root        88    10  0 Apr04 ?        00:00:00 [aio/3]
root        68     1  0 Apr04 ?        00:00:00 [khubd]
root        84     1  0 Apr04 ?        00:00:23 [kswapd0]
root       161     1  0 Apr04 ?        00:00:00 [kseriod]
root       232     1  0 Apr04 ?        00:00:00 [scsi_eh_0]
root       245     1  0 Apr04 ?        00:00:00 [scsi_eh_1]
root       259    10  0 Apr04 ?        00:00:00 [kmirrord/0]
root       260    12  0 Apr04 ?        00:00:00 [kmirrord/1]
root       261    10  0 Apr04 ?        00:00:00 [kmirrord/2]
root       262    12  0 Apr04 ?        00:00:00 [kmirrord/3]
root       271     1  0 Apr04 ?        00:00:01 [kjournald]
root      1310     1  0 Apr04 ?        00:00:00 udevd
root      1814     1  0 Apr04 ?        00:00:00 [kjournald]
root      1815     1  0 Apr04 ?        00:00:00 [kjournald]
root      1816     1  0 Apr04 ?        00:00:00 [kjournald]
root      2276     1  0 Apr04 ?        00:00:00 /sbin/dhclient -1 -q -lf
/var/lib/dhcp/dhclient-eth0.leases -pf /var/rroot      2401     1  0 Apr04 ?   
    00:00:00 syslogd -m 0
root      2405     1  0 Apr04 ?        00:00:00 klogd -x
root      2416     1  0 Apr04 ?        00:00:02 irqbalance
rpc       2427     1  0 Apr04 ?        00:00:00 portmap
rpcuser   2447     1  0 Apr04 ?        00:00:00 rpc.statd
root      2480     1  0 Apr04 ?        00:00:00 rpc.idmapd
root      2553     1  0 Apr04 ?        00:00:00 /usr/sbin/acpid
root      2616     1  0 Apr04 ?        00:00:00 /usr/sbin/sshd
root      2631     1  0 Apr04 ?        00:00:00 xinetd -stayalive -pidfile
/var/run/xinetd.pid
root      2650     1  0 Apr04 ?        00:00:00 rpc.rquotad
root      2654     1  0 Apr04 ?        00:00:00 [nfsd]
root      2655     1  0 Apr04 ?        00:00:00 [nfsd]
root      2656     1  0 Apr04 ?        00:00:00 [nfsd]
root      2657     1  0 Apr04 ?        00:00:00 [nfsd]
root      2658     1  0 Apr04 ?        00:00:00 [nfsd]
root      2659     1  0 Apr04 ?        00:00:00 [nfsd]
root      2660     1  0 Apr04 ?        00:00:00 [nfsd]
root      2661     1  0 Apr04 ?        00:00:00 [nfsd]
root      2662     1  0 Apr04 ?        00:00:00 [lockd]
root      2663     1  0 Apr04 ?        00:00:00 [rpciod]
root      2667     1  0 Apr04 ?        00:00:00 rpc.mountd
root      2703     1  0 Apr04 ?        00:00:00 sendmail: accepting connections
smmsp     2711     1  0 Apr04 ?        00:00:00 sendmail: Queue runner@01:00:00
for /var/spool/clientmqueue
root      2722     1  0 Apr04 ?        00:00:00 gpm -m /dev/input/mice -t imps2
htt       2752     1  0 Apr04 ?        00:00:00 /usr/sbin/htt -retryonerror 0
htt       2753  2752  0 Apr04 ?        00:00:00 htt_server -nodaemon
canna     2765     1  0 Apr04 ?        00:00:00 /usr/sbin/cannaserver -syslog -u
canna
root      2777     1  0 Apr04 ?        00:00:00 crond
xfs       2818     1  0 Apr04 ?        00:00:00 xfs -droppriv -daemon
daemon    2837     1  0 Apr04 ?        00:00:00 /usr/sbin/atd
dbus      2856     1  0 Apr04 ?        00:00:00 dbus-daemon-1 --system
root      2870     1  0 Apr04 ?        00:00:00 cups-config-daemon
root      2881     1  0 Apr04 ?        00:00:18 hald
root      2891     1  0 Apr04 tty1     00:00:00 /sbin/mingetty tty1
root      2892     1  0 Apr04 tty2     00:00:00 /sbin/mingetty tty2
root      2893     1  0 Apr04 tty3     00:00:00 /sbin/mingetty tty3
root      2895     1  0 Apr04 tty4     00:00:00 /sbin/mingetty tty4
root      2901     1  0 Apr04 tty5     00:00:00 /sbin/mingetty tty5
root      2902     1  0 Apr04 tty6     00:00:00 /sbin/mingetty tty6
root      2904     1  0 Apr04 ?        00:00:00 /usr/bin/gdm-binary -nodaemon
root      3338  2904  0 Apr04 ?        00:00:00 /usr/bin/gdm-binary -nodaemon
root      3341  3338  0 Apr04 ?        00:00:43 /usr/X11R6/bin/X :0 -audit 0
-auth /var/gdm/:0.Xauth -nolisten tcp vt7gdm       3679  3338  0 Apr04 ?       
00:00:13 /usr/bin/gdmgreeter
root      3947     1  0 Apr04 ?        00:00:00 cupsd
root      6753  2616  0 Apr04 ?        00:00:00 sshd: root@pts/1
root      6755  6753  0 Apr04 pts/1    00:00:00 -bash
root      6721    13  0 07:35 ?        00:00:00 [pdflush]
root      5145  2616  0 08:55 ?        00:00:00 sshd: root@pts/2
root      5147  5145  0 08:55 pts/2    00:00:00 -bash
root      5182  5147  0 08:56 pts/2    00:00:00 su - oracle
oracle    5183  5182  0 08:56 pts/2    00:00:00 -bash
oracle    5217     1  0 09:00 ?        00:00:00 ora_pmon_tpcc
oracle    5219     1  0 09:00 ?        00:00:00 ora_mman_tpcc
oracle    5221     1 65 09:00 ?        00:31:29 ora_dbw0_tpcc
oracle    5223     1  1 09:00 ?        00:00:37 ora_lgwr_tpcc
oracle    5225     1  0 09:00 ?        00:00:00 ora_ckpt_tpcc
oracle    5227     1  0 09:00 ?        00:00:00 ora_smon_tpcc
oracle    5229     1  0 09:00 ?        00:00:00 ora_reco_tpcc
oracle    5235     1  0 09:00 ?        00:00:00 ora_qmnc_tpcc
oracle    5237     1  0 09:00 ?        00:00:00 ora_mmon_tpcc
oracle    5239     1  0 09:00 ?        00:00:00 ora_mmnl_tpcc
oracle    5445  5183  0 09:12 pts/2    00:00:00 /bin/ksh ./nrunc.sh rhel4_u1 5
10 300
oracle    5456  5445  0 09:12 pts/2    00:00:00 /bin/sh ./ustat.sh 300 60 5 95 1
n y post
oracle    5461     1  0 09:12 pts/2    00:00:00 vmstat 30 12
oracle    5462     1  0 09:12 pts/2    00:00:00 mpstat -P ALL 30 12
oracle    5463     1  0 09:12 pts/2    00:00:00 iostat -x -k 30 12
oracle    5468  5445  0 09:12 pts/2    00:00:00 ../bin/runtpb.exe 5
../bin/tpcc.exe 95 300 5 SEQ:1 DELAY:5 1 1 0 0 0 0oracle    5469  5468  0 09:12
pts/2    00:00:08 ../bin/tpcc.exe 95 300 5 1 1112706759 1 1 0 0 0 0 0 255 1023
8191 x10
oracle    5471     1  2 09:12 ?        00:01:04 oracletpcc
(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    5472  5468  0 09:12 pts/2    00:00:08 ../bin/tpcc.exe 95 300 5 2
1112706759 1 1 0 0 0 0 0 255 1023 8191 x10
oracle    5474     1  2 09:12 ?        00:01:03 oracletpcc
(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    5475  5468  0 09:12 pts/2    00:00:08 ../bin/tpcc.exe 95 300 5 3
1112706759 1 1 0 0 0 0 0 255 1023 8191 x10
oracle    5477     1  2 09:12 ?        00:01:04 oracletpcc
(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    5478  5468  0 09:12 pts/2    00:00:08 ../bin/tpcc.exe 95 300 5 4
1112706759 1 1 0 0 0 0 0 255 1023 8191 x10
oracle    5480     1  2 09:12 ?        00:01:04 oracletpcc
(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    5481  5468  0 09:12 pts/2    00:00:08 ../bin/tpcc.exe 95 300 5 5
1112706759 1 1 0 0 0 0 0 255 1023 8191 x10
oracle    5483     1  2 09:12 ?        00:01:03 oracletpcc
(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
root      5491  2616  0 09:17 ?        00:00:00 sshd: root@pts/3
root      5493  5491  0 09:17 pts/3    00:00:01 -bash
oracle    5529  5456  0 09:18 pts/2    00:00:00 date
oracle    5531  5183  0 09:19 pts/2    00:00:00 ls --color=tty -ltr
root      5532  2616  0 09:19 ?        00:00:00 sshd: root [priv]
sshd      5533  5532  0 09:19 ?        00:00:00 [sshd] <defunct>
oracle    5537     1  0 09:21 ?        00:00:00 ora_qmnc_tpcc
root      5577  5493  0 09:48 pts/3    00:00:00 ps -ef

We can complete sysrq m and t and NO logs were written. strace -p 5221 ... hangs
... no output.

Expected results:

Should complete for upto 2 hour runs ...produce tpmC report, *stats -io, vm, mp
and Oracle statspack.


Additional info:
Oracle Alert_log shows CJQ0 process stopped?

Private_strands 0 at log switch
Thread 1 advanced to log sequence 83
 Current log# 1 seq# 83 mem# 0: /raid0_fs/tpcc_datafiles/log_1_1
Tue Apr  5 09:16:29 2005
Stopping background process CJQ0
/oracle/ora2/rdbms/log/tpcc_cjq0_5241.trc
Oracle Database 10g Enterprise Edition Release 10.1.0.3.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /oracle/ora2
System name:    Linux
Node name:      bigbaddell2.lab.boston.redhat.com
Release:        2.6.9-6.34.ELsmp
Version:        #1 SMP Thu Mar 24 18:37:26 EST 2005
Machine:        x86_64
Instance name: tpcc
Redo thread mounted by this instance: 1
Oracle process number: 13
Unix process pid: 5241, image: oracle.boston.redhat.com (CJQ0)

*** SERVICE NAME:(SYS$BACKGROUND) 2005-04-05 09:00:31.116
*** SESSION ID:(162.5) 2005-04-05 09:00:31.116
Before in kgllkdl(), hd=31ee1988.
Before in kgllkdl(), hd=31f00978.

Comment 1 John Shakshober 2005-04-06 21:21:20 UTC
Created attachment 112783 [details]
has sysrq m and t

Comment 3 Kiersten (Kerri) Anderson 2005-05-26 16:15:35 UTC
Adding to RHEL4 U2 Proposed List

Comment 5 Kiersten (Kerri) Anderson 2005-06-08 21:05:48 UTC
Development ACK

Comment 8 Jeff Moyer 2005-07-15 15:17:22 UTC
Shak, correct me if I'm wrong, but this problem has been addressed, right?  Can
we set this to modified?

Comment 9 John Shakshober 2005-07-16 00:30:44 UTC
Yes this has been resolved with your U2 fix from upstream.
Tested on x686, x86_64 (em64t), x86_64 (amd64) by Red Hat Perf (me)
and on IA64 by HP.  Results in 
http:/people.redhat.com/dshaks/.perf/results_oracle
Shak

Comment 11 Jeff Moyer 2005-08-16 18:21:15 UTC
The kernel commit which fixed this problem was this:

* Mon Jun 13 2005 Jason Baron <jbaron> [2.6.9-11.11]
- make up_read, up_write safe to call from interrupt context



Comment 17 Red Hat Bugzilla 2005-10-05 12:56:55 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-514.html



Note You need to log in before you can comment on or make changes to this bug.