Description of problem: RHEL4 U1 Oracle 10G 10.0.3 aio hang running tpc-c. This runs fine on RHEL3 U4/U5 and with Oracle 10G 10.0.3 with the 32-bit version of oracle. Version-Release number of selected component (if applicable): Fails only with 10G 64-bit and RHEL4 (day0, U1) 2.6.9-6.34.ELsmp libaio-0.3.103-2 After running tpc-C for more than 5 minutes, the oracle database writer (process 5221) will hang. The jobs can NOT be killed, any other disk access hangs the process that is connected in. (ie df, ls, cat, more ... etc). How reproducible: every time Steps to Reproduce: 1. build tpc-c to ext3 file system, set dio and aio in init.ora 2. run tpc-c with 10 users 3. fails within 5 minutes Actual results: [root@bigbaddell2 common]# uname -a Linux bigbaddell2.lab.boston.redhat.com 2.6.9-6.34.ELsmp #1 SMP Thu Mar 24 18:37:26 EST 2005 x86_64 x86_64 x86_64 GNU/Linux Starting up ORACLE RDBMS Version: 10.1.0.3.0 (for the Oracle X86_64 ONLY) (Oracle 32-bit versions runs fine if linked on an x686 system). [root@bigbaddell2 common]# top top - 09:46:52 up 1 day, 1 min, 3 users, load average: 17.30, 16.58, 26.14 Tasks: 125 total, 3 running, 121 sleeping, 0 stopped, 1 zombie Cpu0 : 17.6% us, 12.6% sy, 0.0% ni, 52.3% id, 17.5% wa, 0.0% hi, 0.0% si Cpu1 : 17.5% us, 12.6% sy, 0.0% ni, 53.3% id, 16.5% wa, 0.0% hi, 0.0% si Cpu2 : 17.4% us, 12.7% sy, 0.0% ni, 37.2% id, 31.3% wa, 0.4% hi, 1.0% si Cpu3 : 17.5% us, 12.7% sy, 0.0% ni, 36.3% id, 30.0% wa, 0.4% hi, 3.1% si Mem: 4033796k total, 3442708k used, 591088k free, 19228k buffers Swap: 4096588k total, 160k used, 4096428k free, 2946184k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5221 oracle 16 0 3605m 682m 678m R 99.9 17.3 29:49.90 oracle 5575 root 16 0 6148 1012 752 R 0.8 0.0 0:00.01 top 1 root 16 0 4744 556 460 S 0.0 0.0 0:02.70 init 2 root RT 0 0 0 0 S 0.0 0.0 0:00.98 migration/0 3 root 34 19 0 0 0 S 0.0 0.0 0:00.25 ksoftirqd/0 4 root RT 0 0 0 0 S 0.0 0.0 0:00.88 migration/1 5 root 34 19 0 0 0 S 0.0 0.0 0:00.22 ksoftirqd/1 6 root RT 0 0 0 0 S 0.0 0.0 0:00.76 migration/2 7 root 34 19 0 0 0 S 0.0 0.0 0:00.38 ksoftirqd/2 8 root RT 0 0 0 0 R 0.0 0.0 0:00.80 migration/3 9 root 34 19 0 0 0 S 0.0 0.0 0:00.29 ksoftirqd/3 10 root 5 -10 0 0 0 S 0.0 0.0 0:00.28 events/0 11 root 5 -10 0 0 0 S 0.0 0.0 0:00.05 events/1 12 root 5 -10 0 0 0 S 0.0 0.0 0:00.06 events/2 13 root 5 -10 0 0 0 S 0.0 0.0 0:00.07 events/3 14 root 5 -10 0 0 0 S 0.0 0.0 0:00.01 khelper 15 root 15 -10 0 0 0 S 0.0 0.0 0:00.00 kacpid 64 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 kblockd/0 65 root 5 -10 0 0 0 S 0.0 0.0 0:00.00 kblockd/1 [root@bigbaddell2 common]# vmstat 3 3 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 6 160 591272 19228 2946184 0 0 9 1493 23 78 18 14 45 24 2 6 160 591272 19228 2946184 0 0 0 0 1020 70 0 25 25 50 2 6 160 591272 19228 2946184 0 0 0 0 1013 61 0 25 25 50 [root@bigbaddell2 common]# ps -ef UID PID PPID C STIME TTY TIME CMD root 1 0 0 Apr04 ? 00:00:02 init [5] root 2 1 0 Apr04 ? 00:00:00 [migration/0] root 3 1 0 Apr04 ? 00:00:00 [ksoftirqd/0] root 4 1 0 Apr04 ? 00:00:00 [migration/1] root 5 1 0 Apr04 ? 00:00:00 [ksoftirqd/1] root 6 1 0 Apr04 ? 00:00:00 [migration/2] root 7 1 0 Apr04 ? 00:00:00 [ksoftirqd/2] root 8 1 0 Apr04 ? 00:00:00 [migration/3] root 9 1 0 Apr04 ? 00:00:00 [ksoftirqd/3] root 10 1 0 Apr04 ? 00:00:00 [events/0] root 11 1 0 Apr04 ? 00:00:00 [events/1] root 12 1 0 Apr04 ? 00:00:00 [events/2] root 13 1 0 Apr04 ? 00:00:00 [events/3] root 14 10 0 Apr04 ? 00:00:00 [khelper] root 15 10 0 Apr04 ? 00:00:00 [kacpid] root 64 10 0 Apr04 ? 00:00:00 [kblockd/0] root 65 10 0 Apr04 ? 00:00:00 [kblockd/1] root 66 10 0 Apr04 ? 00:00:00 [kblockd/2] root 67 10 0 Apr04 ? 00:00:00 [kblockd/3] root 83 10 0 Apr04 ? 00:00:15 [pdflush] root 85 10 0 Apr04 ? 00:00:00 [aio/0] root 86 10 0 Apr04 ? 00:00:00 [aio/1] root 87 10 0 Apr04 ? 00:00:00 [aio/2] root 88 10 0 Apr04 ? 00:00:00 [aio/3] root 68 1 0 Apr04 ? 00:00:00 [khubd] root 84 1 0 Apr04 ? 00:00:23 [kswapd0] root 161 1 0 Apr04 ? 00:00:00 [kseriod] root 232 1 0 Apr04 ? 00:00:00 [scsi_eh_0] root 245 1 0 Apr04 ? 00:00:00 [scsi_eh_1] root 259 10 0 Apr04 ? 00:00:00 [kmirrord/0] root 260 12 0 Apr04 ? 00:00:00 [kmirrord/1] root 261 10 0 Apr04 ? 00:00:00 [kmirrord/2] root 262 12 0 Apr04 ? 00:00:00 [kmirrord/3] root 271 1 0 Apr04 ? 00:00:01 [kjournald] root 1310 1 0 Apr04 ? 00:00:00 udevd root 1814 1 0 Apr04 ? 00:00:00 [kjournald] root 1815 1 0 Apr04 ? 00:00:00 [kjournald] root 1816 1 0 Apr04 ? 00:00:00 [kjournald] root 2276 1 0 Apr04 ? 00:00:00 /sbin/dhclient -1 -q -lf /var/lib/dhcp/dhclient-eth0.leases -pf /var/rroot 2401 1 0 Apr04 ? 00:00:00 syslogd -m 0 root 2405 1 0 Apr04 ? 00:00:00 klogd -x root 2416 1 0 Apr04 ? 00:00:02 irqbalance rpc 2427 1 0 Apr04 ? 00:00:00 portmap rpcuser 2447 1 0 Apr04 ? 00:00:00 rpc.statd root 2480 1 0 Apr04 ? 00:00:00 rpc.idmapd root 2553 1 0 Apr04 ? 00:00:00 /usr/sbin/acpid root 2616 1 0 Apr04 ? 00:00:00 /usr/sbin/sshd root 2631 1 0 Apr04 ? 00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid root 2650 1 0 Apr04 ? 00:00:00 rpc.rquotad root 2654 1 0 Apr04 ? 00:00:00 [nfsd] root 2655 1 0 Apr04 ? 00:00:00 [nfsd] root 2656 1 0 Apr04 ? 00:00:00 [nfsd] root 2657 1 0 Apr04 ? 00:00:00 [nfsd] root 2658 1 0 Apr04 ? 00:00:00 [nfsd] root 2659 1 0 Apr04 ? 00:00:00 [nfsd] root 2660 1 0 Apr04 ? 00:00:00 [nfsd] root 2661 1 0 Apr04 ? 00:00:00 [nfsd] root 2662 1 0 Apr04 ? 00:00:00 [lockd] root 2663 1 0 Apr04 ? 00:00:00 [rpciod] root 2667 1 0 Apr04 ? 00:00:00 rpc.mountd root 2703 1 0 Apr04 ? 00:00:00 sendmail: accepting connections smmsp 2711 1 0 Apr04 ? 00:00:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue root 2722 1 0 Apr04 ? 00:00:00 gpm -m /dev/input/mice -t imps2 htt 2752 1 0 Apr04 ? 00:00:00 /usr/sbin/htt -retryonerror 0 htt 2753 2752 0 Apr04 ? 00:00:00 htt_server -nodaemon canna 2765 1 0 Apr04 ? 00:00:00 /usr/sbin/cannaserver -syslog -u canna root 2777 1 0 Apr04 ? 00:00:00 crond xfs 2818 1 0 Apr04 ? 00:00:00 xfs -droppriv -daemon daemon 2837 1 0 Apr04 ? 00:00:00 /usr/sbin/atd dbus 2856 1 0 Apr04 ? 00:00:00 dbus-daemon-1 --system root 2870 1 0 Apr04 ? 00:00:00 cups-config-daemon root 2881 1 0 Apr04 ? 00:00:18 hald root 2891 1 0 Apr04 tty1 00:00:00 /sbin/mingetty tty1 root 2892 1 0 Apr04 tty2 00:00:00 /sbin/mingetty tty2 root 2893 1 0 Apr04 tty3 00:00:00 /sbin/mingetty tty3 root 2895 1 0 Apr04 tty4 00:00:00 /sbin/mingetty tty4 root 2901 1 0 Apr04 tty5 00:00:00 /sbin/mingetty tty5 root 2902 1 0 Apr04 tty6 00:00:00 /sbin/mingetty tty6 root 2904 1 0 Apr04 ? 00:00:00 /usr/bin/gdm-binary -nodaemon root 3338 2904 0 Apr04 ? 00:00:00 /usr/bin/gdm-binary -nodaemon root 3341 3338 0 Apr04 ? 00:00:43 /usr/X11R6/bin/X :0 -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp vt7gdm 3679 3338 0 Apr04 ? 00:00:13 /usr/bin/gdmgreeter root 3947 1 0 Apr04 ? 00:00:00 cupsd root 6753 2616 0 Apr04 ? 00:00:00 sshd: root@pts/1 root 6755 6753 0 Apr04 pts/1 00:00:00 -bash root 6721 13 0 07:35 ? 00:00:00 [pdflush] root 5145 2616 0 08:55 ? 00:00:00 sshd: root@pts/2 root 5147 5145 0 08:55 pts/2 00:00:00 -bash root 5182 5147 0 08:56 pts/2 00:00:00 su - oracle oracle 5183 5182 0 08:56 pts/2 00:00:00 -bash oracle 5217 1 0 09:00 ? 00:00:00 ora_pmon_tpcc oracle 5219 1 0 09:00 ? 00:00:00 ora_mman_tpcc oracle 5221 1 65 09:00 ? 00:31:29 ora_dbw0_tpcc oracle 5223 1 1 09:00 ? 00:00:37 ora_lgwr_tpcc oracle 5225 1 0 09:00 ? 00:00:00 ora_ckpt_tpcc oracle 5227 1 0 09:00 ? 00:00:00 ora_smon_tpcc oracle 5229 1 0 09:00 ? 00:00:00 ora_reco_tpcc oracle 5235 1 0 09:00 ? 00:00:00 ora_qmnc_tpcc oracle 5237 1 0 09:00 ? 00:00:00 ora_mmon_tpcc oracle 5239 1 0 09:00 ? 00:00:00 ora_mmnl_tpcc oracle 5445 5183 0 09:12 pts/2 00:00:00 /bin/ksh ./nrunc.sh rhel4_u1 5 10 300 oracle 5456 5445 0 09:12 pts/2 00:00:00 /bin/sh ./ustat.sh 300 60 5 95 1 n y post oracle 5461 1 0 09:12 pts/2 00:00:00 vmstat 30 12 oracle 5462 1 0 09:12 pts/2 00:00:00 mpstat -P ALL 30 12 oracle 5463 1 0 09:12 pts/2 00:00:00 iostat -x -k 30 12 oracle 5468 5445 0 09:12 pts/2 00:00:00 ../bin/runtpb.exe 5 ../bin/tpcc.exe 95 300 5 SEQ:1 DELAY:5 1 1 0 0 0 0oracle 5469 5468 0 09:12 pts/2 00:00:08 ../bin/tpcc.exe 95 300 5 1 1112706759 1 1 0 0 0 0 0 255 1023 8191 x10 oracle 5471 1 2 09:12 ? 00:01:04 oracletpcc (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) oracle 5472 5468 0 09:12 pts/2 00:00:08 ../bin/tpcc.exe 95 300 5 2 1112706759 1 1 0 0 0 0 0 255 1023 8191 x10 oracle 5474 1 2 09:12 ? 00:01:03 oracletpcc (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) oracle 5475 5468 0 09:12 pts/2 00:00:08 ../bin/tpcc.exe 95 300 5 3 1112706759 1 1 0 0 0 0 0 255 1023 8191 x10 oracle 5477 1 2 09:12 ? 00:01:04 oracletpcc (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) oracle 5478 5468 0 09:12 pts/2 00:00:08 ../bin/tpcc.exe 95 300 5 4 1112706759 1 1 0 0 0 0 0 255 1023 8191 x10 oracle 5480 1 2 09:12 ? 00:01:04 oracletpcc (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) oracle 5481 5468 0 09:12 pts/2 00:00:08 ../bin/tpcc.exe 95 300 5 5 1112706759 1 1 0 0 0 0 0 255 1023 8191 x10 oracle 5483 1 2 09:12 ? 00:01:03 oracletpcc (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))) root 5491 2616 0 09:17 ? 00:00:00 sshd: root@pts/3 root 5493 5491 0 09:17 pts/3 00:00:01 -bash oracle 5529 5456 0 09:18 pts/2 00:00:00 date oracle 5531 5183 0 09:19 pts/2 00:00:00 ls --color=tty -ltr root 5532 2616 0 09:19 ? 00:00:00 sshd: root [priv] sshd 5533 5532 0 09:19 ? 00:00:00 [sshd] <defunct> oracle 5537 1 0 09:21 ? 00:00:00 ora_qmnc_tpcc root 5577 5493 0 09:48 pts/3 00:00:00 ps -ef We can complete sysrq m and t and NO logs were written. strace -p 5221 ... hangs ... no output. Expected results: Should complete for upto 2 hour runs ...produce tpmC report, *stats -io, vm, mp and Oracle statspack. Additional info: Oracle Alert_log shows CJQ0 process stopped? Private_strands 0 at log switch Thread 1 advanced to log sequence 83 Current log# 1 seq# 83 mem# 0: /raid0_fs/tpcc_datafiles/log_1_1 Tue Apr 5 09:16:29 2005 Stopping background process CJQ0 /oracle/ora2/rdbms/log/tpcc_cjq0_5241.trc Oracle Database 10g Enterprise Edition Release 10.1.0.3.0 - 64bit Production With the Partitioning, OLAP and Data Mining options ORACLE_HOME = /oracle/ora2 System name: Linux Node name: bigbaddell2.lab.boston.redhat.com Release: 2.6.9-6.34.ELsmp Version: #1 SMP Thu Mar 24 18:37:26 EST 2005 Machine: x86_64 Instance name: tpcc Redo thread mounted by this instance: 1 Oracle process number: 13 Unix process pid: 5241, image: oracle.boston.redhat.com (CJQ0) *** SERVICE NAME:(SYS$BACKGROUND) 2005-04-05 09:00:31.116 *** SESSION ID:(162.5) 2005-04-05 09:00:31.116 Before in kgllkdl(), hd=31ee1988. Before in kgllkdl(), hd=31f00978.
Created attachment 112783 [details] has sysrq m and t
Adding to RHEL4 U2 Proposed List
Development ACK
Shak, correct me if I'm wrong, but this problem has been addressed, right? Can we set this to modified?
Yes this has been resolved with your U2 fix from upstream. Tested on x686, x86_64 (em64t), x86_64 (amd64) by Red Hat Perf (me) and on IA64 by HP. Results in http:/people.redhat.com/dshaks/.perf/results_oracle Shak
The kernel commit which fixed this problem was this: * Mon Jun 13 2005 Jason Baron <jbaron> [2.6.9-11.11] - make up_read, up_write safe to call from interrupt context
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-514.html