Bug 749121 - (Sigar issue:) CPU core numbers on virtual hosts on AIX cause unmonitored CPUs, JVM frequent crashes and agent.log warnings
Summary: (Sigar issue:) CPU core numbers on virtual hosts on AIX cause unmonitored CPU...
Keywords:
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent, Plugins
Version: unspecified
Hardware: ppc64
OS: Other
medium
medium vote
Target Milestone: ---
: ---
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-10-26 09:01 UTC by Andreas Dietrich
Modified: 2011-10-31 15:40 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:


Attachments (Terms of Use)
java core dump likely due to the sigar cpu numbering problem issue (451.15 KB, text/plain)
2011-10-26 09:02 UTC, Andreas Dietrich
no flags Details

Description Andreas Dietrich 2011-10-26 09:01:41 UTC
Hint:

Hi,

sorry to post this probably won't fix issue here, but I could not register on the sigar bugtracker (JIRA) (and it in fact also effects RHQ).
Maybe somebody else can put it there (I saw Ian there?).

Kind regards
Andreas :-)


Description of problem / Actual results / Expected results:

1. Only 2-cores of a 4-core virtualized host (the underlying physical host has 16 numbered cores) are monitored.
2. Agent logs warnings on invalid cpu percentage values.
3. Agent JVM crashes with core dumps after some hours.

(For now we disabled the measurement collection for the effected CPUs and the dependent effects 2. and 3. are solved for now)


Version-Release number of selected component (if applicable):

This problem has been identified as one of the Sigar subsystem.

Even the latest master sigar built fails (downloaded on 2011-10-26 10:00):

*.so files from http://hudson.hyperic.com/job/sigar-master-ppc-aix-5.3/ and http://hudson.hyperic.com/job/sigar-master-ppc64-aix-5.3/
sigar.jar from http://hudson.hyperic.com/view/sigar/job/sigar-master-ppc64-linux-2.6/

It also does not work with (the RHQ 3.0.1 GA delivered) sigar-1.6.5.132.jar, nor the offical stable 1.6.4 version.


How reproducible / Steps to Reproduce:

recognize the "N%" values of CPU 0 and 1

root@host:/opt/rhq-agent/lib> /usr/java5/bin/java -jar sigar.jar
sigar> cpuinfo
Vendor.........IBM
Model..........PowerPC Unknown
Mhz............4204
Total CPUs.....4
Cache size....0

CPU 0.........
User Time.....N%
Sys Time......N%
Idle Time.....N%
Wait Time.....N%
Nice Time.....N%
Combined......N%
Irq Time......N%

CPU 1.........
User Time.....N%
Sys Time......N%
Idle Time.....N%
Wait Time.....N%
Nice Time.....N%
Combined......N%
Irq Time......N%

CPU 2.........
User Time.....19.6%
Sys Time......9.8%
Idle Time.....31.3%
Wait Time.....39.2%
Nice Time.....0.0%
Combined......68.6%
Irq Time......0.0%

CPU 3.........
User Time.....3.9%
Sys Time......3.9%
Idle Time.....76.4%
Wait Time.....15.6%
Nice Time.....0.0%
Combined......23.5%
Irq Time......0.0%
Totals........
User Time.....19.1%
Sys Time......8.8%
Idle Time.....33.3%
Wait Time.....38.7%
Nice Time.....0.0%
Combined......66.6%
Irq Time......0.0%


Additional info:

recognize the CPU numbers not starting from 0 (because the host is virtualized)

> nmon  # afterwards press c

┌─topas_nmon──j=JFS──────────────Host=rz2-db1────────Refresh=2 secs───15:13.43───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CPU-Utilisation-Small-View ───────────EntitledCPU=  0.50 UsedCPU=  1.069───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────│
│Logical  CPUs              0----------25-----------50----------75----------100                                                                                                                                                                                              │
│CPU User%  Sys% Wait% Idle%|           |            |           |            |                                                                                                                                                                                              │
│  2  14.1  15.1  28.1  42.7|UUUUUUUsssssssWWWWWWWWWWWWWW      >              |                                                                                                                                                                                              │
│  3  41.0  11.5  14.0  33.5|UUUUUUUUUUUUUUUUUUUUsssssWWWWWWW>                |                                                                                                                                                                                              │
│  4  15.6  12.6  27.6  44.2|UUUUUUUssssssWWWWWWWWWWWWW                 >     |                                                                                                                                                                                              │
│  5  10.5  11.5  12.0  66.0|UUUUUsssssWWWWWW                  >              |                                                                                                                                                                                              │
│EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+                                                                                                                                                                                              │
│EC+  65.9  29.5   1.5   3.2|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUssssssssssssssi--|                                                                                                                                                                                              │
│ VP  35.2  15.8   0.8   1.7|UUUUUUUUUUUUUUUUUsssssss-------------------------|                                                                                                                                                                                              │
│EC= 213.9%  VP=  53.5%     +--No Cap---|------------|-----------100% VP=2 CPU+            


on host 2 that is basically the same machine all 4 CPUs do not show this problem:

> nmon  # afterwards press c

&┌─topas_nmon──b=Black&White──────Host=rz1-db1────────Refresh=2 secs───15:15.38───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CPU-Utilisation-Small-View ───────────EntitledCPU=  0.50 UsedCPU=  0.732───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────│
│Logical  CPUs              0----------25-----------50----------75----------100                                                                                                                                                                                              │
│CPU User%  Sys% Wait% Idle%|           |            |           |            |                                                                                                                                                                                              │
│  0  34.5  11.0  42.5  12.0|UUUUUUUUUUUUUUUUUsssssWWWWWWWWWWWWWWWWWWWWW    > |                                                                                                                                                                                              │
│  1  43.5   7.0  16.0  33.5|UUUUUUUUUUUUUUUUUUUUUsssWWWWWWWW           >     |                                                                                                                                                                                              │
│  2   2.0   0.0   2.0  96.0|UW                                    >          |                                                                                                                                                                                              │
│  3   1.5   2.5   3.0  93.0|sW                                              >|                                                                                                                                                                                              │
│EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+                                                                                                                                                                                              │
│EC+  78.6  18.1   1.8   1.6|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUsssssssss-|                                                                                                                                                                                              │
│ VP  28.7   6.6   0.6   0.6|UUUUUUUUUUUUUUsss--------------------------------|                                                                                                                                                                                              │
│EC= 146.4%  VP=  36.6%     +--No Cap---|------------|-----------100% VP=2 CPU+    


a lot of agent.log warnings (one line per CPU measurement) like this are logged every 10 min (meas cycle):

2011-10-25 15:14:00,072 WARN [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Numeric CpuPerc.user with id 1087086 is invalid, value was 'NaN'


the more interesting lines from the jvm core dump (see also the full attached dump):

1XHEXCPMODULE  Compiling method: java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V

...

0SECTION       LOCKS subcomponent dump routine
NULL           ===============================
NULL
1LKPOOLINFO    Monitor pool info:
2LKPOOLTOTAL     Current total number of monitors: 14
NULL
1LKMONPOOLDUMP Monitor Pool Dump (flat & inflated object-monitors):

...

2LKMONINUSE      sys_mon_t:0x313D33E0 infl_mon_t: 0x313D3408:
3LKMONOBJECT       java/lang/Object@E86704F8/E8670504: owner "RHQ Command Queue Throttle Thread" (0x31685300), entry count 1

...

0SECTION       THREADS subcomponent dump routine
NULL           =================================
NULL
1XMCURTHDINFO  Current Thread Details
NULL           ----------------------
3XMTHREADINFO      "JIT Compilation Thread" (TID:0x302BFF00, sys_thread_t:0x3000DB70, state:R, native ID:0x00286077) prio=11
NULL
1XMTHDINFO     All Thread Details
NULL           ------------------
NULL
2XMFULLTHDDUMP Full thread dump J9 VM (J2RE 5.0 IBM J9 2.3 AIX ppc-32 build 20080314_17962_bHdSMr, native threads):

...

3XMTHREADINFO      "JIT Compilation Thread" (TID:0x302BFF00, sys_thread_t:0x3000DB70, state:R, native ID:0x00286077) prio=11
3XMTHREADINFO      "Signal Dispatcher" (TID:0x302C0300, sys_thread_t:0x3000E0A8, state:R, native ID:0x0043E015) prio=5
4XESTACKTRACE          at com/ibm/misc/SignalDispatcher.waitForSignal(Native Method)
4XESTACKTRACE          at com/ibm/misc/SignalDispatcher.run(SignalDispatcher.java:84)

...

3XMTHREADINFO      "AcceptorThread#0:50176" (TID:0x31A43700, sys_thread_t:0x313C39E8, state:R, native ID:0x00404077) prio=5
4XESTACKTRACE          at java/net/PlainSocketImpl.socketAccept(Native Method)
4XESTACKTRACE          at java/net/PlainSocketImpl.accept(PlainSocketImpl.java:455)
4XESTACKTRACE          at java/net/ServerSocket.implAccept(ServerSocket.java:462)
4XESTACKTRACE          at java/net/ServerSocket.accept(ServerSocket.java:433)
4XESTACKTRACE          at org/jboss/remoting/transport/socket/SocketServerInvoker.run(SocketServerInvoker.java:520)
4XESTACKTRACE          at java/lang/Thread.run(Thread.java:810)


jvm version infos:

java version "1.5.0" 
Java(TM) 2 Runtime Environment, Standard Edition (build pap32dev-20080315 (SR7))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 AIX ppc-32 j9vmap3223-20080315 (JIT enabled)
J9VM - 20080314_17962_bHdSMr
JIT  - 20080130_0718ifx2_r8
GC   - 200802_08)
JCL  - 20080314

Comment 1 Andreas Dietrich 2011-10-26 09:02:55 UTC
Created attachment 530261 [details]
java core dump likely due to the sigar cpu numbering problem issue


Note You need to log in before you can comment on or make changes to this bug.