Hint: Hi, sorry to post this probably won't fix issue here, but I could not register on the sigar bugtracker (JIRA) (and it in fact also effects RHQ). Maybe somebody else can put it there (I saw Ian there?). Kind regards Andreas :-) Description of problem / Actual results / Expected results: 1. Only 2-cores of a 4-core virtualized host (the underlying physical host has 16 numbered cores) are monitored. 2. Agent logs warnings on invalid cpu percentage values. 3. Agent JVM crashes with core dumps after some hours. (For now we disabled the measurement collection for the effected CPUs and the dependent effects 2. and 3. are solved for now) Version-Release number of selected component (if applicable): This problem has been identified as one of the Sigar subsystem. Even the latest master sigar built fails (downloaded on 2011-10-26 10:00): *.so files from http://hudson.hyperic.com/job/sigar-master-ppc-aix-5.3/ and http://hudson.hyperic.com/job/sigar-master-ppc64-aix-5.3/ sigar.jar from http://hudson.hyperic.com/view/sigar/job/sigar-master-ppc64-linux-2.6/ It also does not work with (the RHQ 3.0.1 GA delivered) sigar-1.6.5.132.jar, nor the offical stable 1.6.4 version. How reproducible / Steps to Reproduce: recognize the "N%" values of CPU 0 and 1 root@host:/opt/rhq-agent/lib> /usr/java5/bin/java -jar sigar.jar sigar> cpuinfo Vendor.........IBM Model..........PowerPC Unknown Mhz............4204 Total CPUs.....4 Cache size....0 CPU 0......... User Time.....N% Sys Time......N% Idle Time.....N% Wait Time.....N% Nice Time.....N% Combined......N% Irq Time......N% CPU 1......... User Time.....N% Sys Time......N% Idle Time.....N% Wait Time.....N% Nice Time.....N% Combined......N% Irq Time......N% CPU 2......... User Time.....19.6% Sys Time......9.8% Idle Time.....31.3% Wait Time.....39.2% Nice Time.....0.0% Combined......68.6% Irq Time......0.0% CPU 3......... User Time.....3.9% Sys Time......3.9% Idle Time.....76.4% Wait Time.....15.6% Nice Time.....0.0% Combined......23.5% Irq Time......0.0% Totals........ User Time.....19.1% Sys Time......8.8% Idle Time.....33.3% Wait Time.....38.7% Nice Time.....0.0% Combined......66.6% Irq Time......0.0% Additional info: recognize the CPU numbers not starting from 0 (because the host is virtualized) > nmon # afterwards press c ┌─topas_nmon──j=JFS──────────────Host=rz2-db1────────Refresh=2 secs───15:13.43───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ CPU-Utilisation-Small-View ───────────EntitledCPU= 0.50 UsedCPU= 1.069───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────│ │Logical CPUs 0----------25-----------50----------75----------100 │ │CPU User% Sys% Wait% Idle%| | | | | │ │ 2 14.1 15.1 28.1 42.7|UUUUUUUsssssssWWWWWWWWWWWWWW > | │ │ 3 41.0 11.5 14.0 33.5|UUUUUUUUUUUUUUUUUUUUsssssWWWWWWW> | │ │ 4 15.6 12.6 27.6 44.2|UUUUUUUssssssWWWWWWWWWWWWW > | │ │ 5 10.5 11.5 12.0 66.0|UUUUUsssssWWWWWW > | │ │EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+ │ │EC+ 65.9 29.5 1.5 3.2|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUssssssssssssssi--| │ │ VP 35.2 15.8 0.8 1.7|UUUUUUUUUUUUUUUUUsssssss-------------------------| │ │EC= 213.9% VP= 53.5% +--No Cap---|------------|-----------100% VP=2 CPU+ on host 2 that is basically the same machine all 4 CPUs do not show this problem: > nmon # afterwards press c &┌─topas_nmon──b=Black&White──────Host=rz1-db1────────Refresh=2 secs───15:15.38───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ CPU-Utilisation-Small-View ───────────EntitledCPU= 0.50 UsedCPU= 0.732───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────│ │Logical CPUs 0----------25-----------50----------75----------100 │ │CPU User% Sys% Wait% Idle%| | | | | │ │ 0 34.5 11.0 42.5 12.0|UUUUUUUUUUUUUUUUUsssssWWWWWWWWWWWWWWWWWWWWW > | │ │ 1 43.5 7.0 16.0 33.5|UUUUUUUUUUUUUUUUUUUUUsssWWWWWWWW > | │ │ 2 2.0 0.0 2.0 96.0|UW > | │ │ 3 1.5 2.5 3.0 93.0|sW >| │ │EntitleCapacity/VirtualCPU +-----------|------------|-----------|------------+ │ │EC+ 78.6 18.1 1.8 1.6|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUsssssssss-| │ │ VP 28.7 6.6 0.6 0.6|UUUUUUUUUUUUUUsss--------------------------------| │ │EC= 146.4% VP= 36.6% +--No Cap---|------------|-----------100% VP=2 CPU+ a lot of agent.log warnings (one line per CPU measurement) like this are logged every 10 min (meas cycle): 2011-10-25 15:14:00,072 WARN [MeasurementManager.sender-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Numeric CpuPerc.user with id 1087086 is invalid, value was 'NaN' the more interesting lines from the jvm core dump (see also the full attached dump): 1XHEXCPMODULE Compiling method: java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V ... 0SECTION LOCKS subcomponent dump routine NULL =============================== NULL 1LKPOOLINFO Monitor pool info: 2LKPOOLTOTAL Current total number of monitors: 14 NULL 1LKMONPOOLDUMP Monitor Pool Dump (flat & inflated object-monitors): ... 2LKMONINUSE sys_mon_t:0x313D33E0 infl_mon_t: 0x313D3408: 3LKMONOBJECT java/lang/Object@E86704F8/E8670504: owner "RHQ Command Queue Throttle Thread" (0x31685300), entry count 1 ... 0SECTION THREADS subcomponent dump routine NULL ================================= NULL 1XMCURTHDINFO Current Thread Details NULL ---------------------- 3XMTHREADINFO "JIT Compilation Thread" (TID:0x302BFF00, sys_thread_t:0x3000DB70, state:R, native ID:0x00286077) prio=11 NULL 1XMTHDINFO All Thread Details NULL ------------------ NULL 2XMFULLTHDDUMP Full thread dump J9 VM (J2RE 5.0 IBM J9 2.3 AIX ppc-32 build 20080314_17962_bHdSMr, native threads): ... 3XMTHREADINFO "JIT Compilation Thread" (TID:0x302BFF00, sys_thread_t:0x3000DB70, state:R, native ID:0x00286077) prio=11 3XMTHREADINFO "Signal Dispatcher" (TID:0x302C0300, sys_thread_t:0x3000E0A8, state:R, native ID:0x0043E015) prio=5 4XESTACKTRACE at com/ibm/misc/SignalDispatcher.waitForSignal(Native Method) 4XESTACKTRACE at com/ibm/misc/SignalDispatcher.run(SignalDispatcher.java:84) ... 3XMTHREADINFO "AcceptorThread#0:50176" (TID:0x31A43700, sys_thread_t:0x313C39E8, state:R, native ID:0x00404077) prio=5 4XESTACKTRACE at java/net/PlainSocketImpl.socketAccept(Native Method) 4XESTACKTRACE at java/net/PlainSocketImpl.accept(PlainSocketImpl.java:455) 4XESTACKTRACE at java/net/ServerSocket.implAccept(ServerSocket.java:462) 4XESTACKTRACE at java/net/ServerSocket.accept(ServerSocket.java:433) 4XESTACKTRACE at org/jboss/remoting/transport/socket/SocketServerInvoker.run(SocketServerInvoker.java:520) 4XESTACKTRACE at java/lang/Thread.run(Thread.java:810) jvm version infos: java version "1.5.0" Java(TM) 2 Runtime Environment, Standard Edition (build pap32dev-20080315 (SR7)) IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 AIX ppc-32 j9vmap3223-20080315 (JIT enabled) J9VM - 20080314_17962_bHdSMr JIT - 20080130_0718ifx2_r8 GC - 200802_08) JCL - 20080314
Created attachment 530261 [details] java core dump likely due to the sigar cpu numbering problem issue