Bug 749121

Summary: (Sigar issue:) CPU core numbers on virtual hosts on AIX cause unmonitored CPUs, JVM frequent crashes and agent.log warnings
Product: [Other] RHQ Project Reporter: Andreas Dietrich <adi>
Component: Agent, PluginsAssignee: Nobody <nobody>
Status: NEW --- QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: unspecifiedCC: adi, hrupp
Target Milestone: ---   
Target Release: ---   
Hardware: ppc64   
OS: Other   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
java core dump likely due to the sigar cpu numbering problem issue none

Description Andreas Dietrich 2011-10-26 09:01:41 UTC
Hint:

Hi,

sorry to post this probably won't fix issue here, but I could not register on the sigar bugtracker (JIRA) (and it in fact also effects RHQ).
Maybe somebody else can put it there (I saw Ian there?).

Kind regards
Andreas :-)


Description of problem / Actual results / Expected results:

1. Only 2-cores of a 4-core virtualized host (the underlying physical host has 16 numbered cores) are monitored.
2. Agent logs warnings on invalid cpu percentage values.
3. Agent JVM crashes with core dumps after some hours.

(For now we disabled the measurement collection for the effected CPUs and the dependent effects 2. and 3. are solved for now)


Version-Release number of selected component (if applicable):

This problem has been identified as one of the Sigar subsystem.

Even the latest master sigar built fails (downloaded on 2011-10-26 10:00):

*.so files from http://hudson.hyperic.com/job/sigar-master-ppc-aix-5.3/ and http://hudson.hyperic.com/job/sigar-master-ppc64-aix-5.3/
sigar.jar from http://hudson.hyperic.com/view/sigar/job/sigar-master-ppc64-linux-2.6/

It also does not work with (the RHQ 3.0.1 GA delivered) sigar-1.6.5.132.jar, nor the offical stable 1.6.4 version.


How reproducible / Steps to Reproduce:

recognize the "N%" values of CPU 0 and 1

root@host:/opt/rhq-agent/lib> /usr/java5/bin/java -jar sigar.jar
sigar> cpuinfo
Vendor.........IBM
Model..........PowerPC Unknown
Mhz............4204
Total CPUs.....4
Cache size....0

CPU 0.........
User Time.....N%
Sys Time......N%
Idle Time.....N%
Wait Time.....N%
Nice Time.....N%
Combined......N%
Irq Time......N%

CPU 1.........
User Time.....N%
Sys Time......N%
Idle Time.....N%
Wait Time.....N%
Nice Time.....N%
Combined......N%
Irq Time......N%

CPU 2.........
User Time.....19.6%
Sys Time......9.8%
Idle Time.....31.3%
Wait Time.....39.2%
Nice Time.....0.0%
Combined......68.6%
Irq Time......0.0%

CPU 3.........
User Time.....3.9%
Sys Time......3.9%
Idle Time.....76.4%
Wait Time.....15.6%
Nice Time.....0.0%
Combined......23.5%
Irq Time......0.0%
Totals........
User Time.....19.1%
Sys Time......8.8%
Idle Time.....33.3%
Wait Time.....38.7%
Nice Time.....0.0%
Combined......66.6%
Irq Time......0.0%


Additional info:

recognize the CPU numbers not starting from 0 (because the host is virtualized)

> nmon  # afterwards press c

┌─topas_nmon──j=JFS──────────────Host=rz2-db1────────Refresh=2 secs───15:13.43───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CPU-Utilisation-Small-View ───────────EntitledCPU=  0.50 UsedCPU=  1.069───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────│
│Logical  CPUs              0----------25-----------50----------75----------100                                                                                                                                                                                              │
│CPU User%  Sys% Wait% Idle%|           |            |           |            |                                                                                                                                                                                              │
│  2  14.1  15.1  28.1  42.7|UUUUUUUsssssssWWWWWWWWWWWWWW      >              |                                                                                                                                                                                              │
│  3  41.0  11.5  14.0  33.5|UUUUUUUUUUUUUUUUUUUUsssssWWWWWWW>                |                                                                                                                                                                                              │
│  4  15.6  12.6  27.6  44.2|UUUUUUUssssssWWWWWWWWWWWWW                 >     |                                                                                                                                                                                              │
│  5  10.5  11.5  12.0  66.0|UUUUUsssssWWWWWW                  >              |                                                                                                                                                                                              │
│EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+                                                                                                                                                                                              │
│EC+  65.9  29.5   1.5   3.2|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUssssssssssssssi--|                                                                                                                                                                                              │
│ VP  35.2  15.8   0.8   1.7|UUUUUUUUUUUUUUUUUsssssss-------------------------|                                                                                                                                                                                              │
│EC= 213.9%  VP=  53.5%     +--No Cap---|------------|-----------100% VP=2 CPU+            


on host 2 that is basically the same machine all 4 CPUs do not show this problem:

> nmon  # afterwards press c

&┌─topas_nmon──b=Black&White──────Host=rz1-db1────────Refresh=2 secs───15:15.38───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CPU-Utilisation-Small-View ───────────EntitledCPU=  0.50 UsedCPU=  0.732───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────│
│Logical  CPUs              0----------25-----------50----------75----------100                                                                                                                                                                                              │
│CPU User%  Sys% Wait% Idle%|           |            |           |            |                                                                                                                                                                                              │
│  0  34.5  11.0  42.5  12.0|UUUUUUUUUUUUUUUUUsssssWWWWWWWWWWWWWWWWWWWWW    > |                                                                                                                                                                                              │
│  1  43.5   7.0  16.0  33.5|UUUUUUUUUUUUUUUUUUUUUsssWWWWWWWW           >     |                                                                                                                                                                                              │
│  2   2.0   0.0   2.0  96.0|UW                                    >          |                                                                                                                                                                                              │
│  3   1.5   2.5   3.0  93.0|sW                                              >|                                                                                                                                                                                              │
│EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+                                                                                                                                                                                              │
│EC+  78.6  18.1   1.8   1.6|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUsssssssss-|                                                                                                                                                                                              │
│ VP  28.7   6.6   0.6   0.6|UUUUUUUUUUUUUUsss--------------------------------|                                                                                                                                                                                              │
│EC= 146.4%  VP=  36.6%     +--No Cap---|------------|-----------100% VP=2 CPU+    


a lot of agent.log warnings (one line per CPU measurement) like this are logged every 10 min (meas cycle):

2011-10-25 15:14:00,072 WARN [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Numeric CpuPerc.user with id 1087086 is invalid, value was 'NaN'


the more interesting lines from the jvm core dump (see also the full attached dump):

1XHEXCPMODULE  Compiling method: java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V

...

0SECTION       LOCKS subcomponent dump routine
NULL           ===============================
NULL
1LKPOOLINFO    Monitor pool info:
2LKPOOLTOTAL     Current total number of monitors: 14
NULL
1LKMONPOOLDUMP Monitor Pool Dump (flat & inflated object-monitors):

...

2LKMONINUSE      sys_mon_t:0x313D33E0 infl_mon_t: 0x313D3408:
3LKMONOBJECT       java/lang/Object@E86704F8/E8670504: owner "RHQ Command Queue Throttle Thread" (0x31685300), entry count 1

...

0SECTION       THREADS subcomponent dump routine
NULL           =================================
NULL
1XMCURTHDINFO  Current Thread Details
NULL           ----------------------
3XMTHREADINFO      "JIT Compilation Thread" (TID:0x302BFF00, sys_thread_t:0x3000DB70, state:R, native ID:0x00286077) prio=11
NULL
1XMTHDINFO     All Thread Details
NULL           ------------------
NULL
2XMFULLTHDDUMP Full thread dump J9 VM (J2RE 5.0 IBM J9 2.3 AIX ppc-32 build 20080314_17962_bHdSMr, native threads):

...

3XMTHREADINFO      "JIT Compilation Thread" (TID:0x302BFF00, sys_thread_t:0x3000DB70, state:R, native ID:0x00286077) prio=11
3XMTHREADINFO      "Signal Dispatcher" (TID:0x302C0300, sys_thread_t:0x3000E0A8, state:R, native ID:0x0043E015) prio=5
4XESTACKTRACE          at com/ibm/misc/SignalDispatcher.waitForSignal(Native Method)
4XESTACKTRACE          at com/ibm/misc/SignalDispatcher.run(SignalDispatcher.java:84)

...

3XMTHREADINFO      "AcceptorThread#0:50176" (TID:0x31A43700, sys_thread_t:0x313C39E8, state:R, native ID:0x00404077) prio=5
4XESTACKTRACE          at java/net/PlainSocketImpl.socketAccept(Native Method)
4XESTACKTRACE          at java/net/PlainSocketImpl.accept(PlainSocketImpl.java:455)
4XESTACKTRACE          at java/net/ServerSocket.implAccept(ServerSocket.java:462)
4XESTACKTRACE          at java/net/ServerSocket.accept(ServerSocket.java:433)
4XESTACKTRACE          at org/jboss/remoting/transport/socket/SocketServerInvoker.run(SocketServerInvoker.java:520)
4XESTACKTRACE          at java/lang/Thread.run(Thread.java:810)


jvm version infos:

java version "1.5.0" 
Java(TM) 2 Runtime Environment, Standard Edition (build pap32dev-20080315 (SR7))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 AIX ppc-32 j9vmap3223-20080315 (JIT enabled)
J9VM - 20080314_17962_bHdSMr
JIT  - 20080130_0718ifx2_r8
GC   - 200802_08)
JCL  - 20080314

Comment 1 Andreas Dietrich 2011-10-26 09:02:55 UTC
Created attachment 530261 [details]
java core dump likely due to the sigar cpu numbering problem issue