Description of problem:
rhq-agent segfault on windows x64 with jdk 1.6.0
Version-Release number of selected component (if applicable):
agent for jopr 2.3.1
Steps to Reproduce:
1. Install an rhq-agent with JDK 1.6.0 and run the bat file.
2. If it succeeds, run a discovery from the command line.
Actual results:
Stack trace and crash of the agent.
Expected results:
It should run :-)
for more info on the problem.
I've created a SIGAR issue for this: http://jira.hyperic.com/browse/SIGAR-192
Created attachment 369063 [details]
the JVM crash dump file from the Agent crash
Targeting 1.3.1 for triage
Would you please try running the following on the Win64 box:
1) Copy junit-3.8.2.jar from http://repo1.maven.org/maven2/junit/junit/3.8.2/ into <rhq-agent-home>\lib\.
2) cd to rhq-agent\lib
3) run: java -jar sigar-18.104.22.168.jar test ProcEnv
And see if it also causes a JVM crash. If so, please attach the output from that command, as well as the crash dump file.
Tried to reproduce, but it worked on a Windows Server 2008 x64 SP2 environment:
JDK 1.6.0_13 - worked
JDK 1.6.0_17 - worked. See attachments.
Created attachment 369318 [details]
Testcase on Win Server 2008 x64 SP2
Keep in mind that Windows Firewall prevents the JON Agent from starting (it can't bind to its port, 16163).
Created attachment 369342 [details]
a batch file that sets 10 environment variables with large values
I've checked and Windows Firewall is off. I'm also quite sure it's not a network-related problem, because one of my previous tests was to run the rhq-agent with JDK 1.5.0 and it ran without problems.
Of course, since Jopr is on JDK 1.6.0 and JBoss is on JDK 1.6.0, the agent can't run on JDK 1.5.0 and fetch information, so I changed to JDK 1.6.0 and that's when the crash happened.
BTW, I forgot to add to the previous note: this is Windows 2003.
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
13/11/2009 10.46 120.640 junit-3.8.2.jar
D:\jboss-4.2.1.GA\rhq-agent\lib> java -jar sigar-22.214.171.124.jar test ProcEnv >error.txt
i have the output in error.txt and the dump, i'll attach here asap
Created attachment 369393 [details]
dump for running "java -jar sigar-126.96.36.199.jar test ProcEnv"
Created attachment 369394 [details]
output command of "java -jar sigar-188.8.131.52.jar test ProcEnv"
As you noticed, yesterday I tested two JDKs: 1.6.0_13 and 1.6.0_17.
When I ran the first test on 1.6.0_17, the agent had problems starting. I then checked and found I had forgotten to point JAVA_HOME at the new JDK.
Just to make sure, when you type "set" at the cmd prompt, what is JAVA_HOME set to?
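To rule out a mix-up between JDKs, it can help to have the JVM report on itself rather than relying on what the shell says. The following is only a minimal sketch: the class name JvmInfo is made up for illustration and is not part of RHQ or SIGAR. It uses nothing beyond standard system properties and is written so that it should compile on JDK 1.6 as well as newer JDKs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of RHQ: reports which JVM is running this class.
final class JvmInfo {
    /** Collects the standard properties that identify the running JVM. */
    static Map<String, String> describe() {
        Map<String, String> info = new LinkedHashMap<String, String>();
        String[] keys = {"java.version", "java.home", "java.vm.name", "os.name", "os.arch"};
        for (String key : keys) {
            info.put(key, System.getProperty(key));
        }
        // JAVA_HOME may be unset; say so explicitly instead of printing "null".
        String javaHome = System.getenv("JAVA_HOME");
        info.put("env.JAVA_HOME", javaHome == null ? "(not set)" : javaHome);
        return info;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : describe().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running it with the same java executable the agent is configured to use shows whether java.home and JAVA_HOME agree, and whether the VM is the 64-bit one.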
here there are :
In rhq-agent-env.bat I don't set JAVA_HOME; I just set AGENT_HOME and JAVA_EXE_PATH.
I tried setting JAVA_EXE_FILE_PATH to c:\windows\...java.exe (the default conf) but the problem is the same, so I kept c:\Programmi... etc.
Funny. I just set the JAVA_HOME env var and added JDK's bin to PATH and everything ran fine, without needing to edit the rhq-agent-env.bat
Would you mind setting these variables and trying again with a fresh rhq-agent download/install?
i just tried and added jdk
and rhq-agent.bat runs fine and the rhq-agent CLI comes up. However, if you type discovery at that prompt, the segfault happens again.
I'd like to see if there's something unusual in your system's environment variables that is causing SIGAR to choke. Would you please do the following:
1) Download Process Explorer from http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx and install it.
2) Start up a SIGAR process, e.g.: java -jar sigar-184.108.40.206.jar
3) Run Process Explorer and select the SIGAR process you just started. Go to the Environment tab under the process' Properties.
See if you notice any environment variables with super long values or funky characters in their names or values or anything else unusual. If you can figure out a way to export the environment variables to a text file (I was unable to), please do so and attach it.
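Since Process Explorer has no obvious export, a rough alternative is to scan an environment programmatically. This is a sketch under stated assumptions, not something from SIGAR or RHQ: the class name EnvScan and the 2048-character threshold are invented for illustration, and System.getenv() only shows the environment of the JVM running the sketch (inherited from the shell that launched it), not that of another process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of SIGAR or RHQ.
final class EnvScan {
    // Assumed threshold; nothing in this report says what length is "too long".
    static final int MAX_VALUE_LENGTH = 2048;

    /** Returns one line per variable whose name or value looks unusual. */
    static List<String> findSuspicious(Map<String, String> env) {
        List<String> findings = new ArrayList<String>();
        // Sort by name so the report is stable between runs.
        for (Map.Entry<String, String> e : new TreeMap<String, String>(env).entrySet()) {
            String name = e.getKey();
            String value = e.getValue();
            if (value.length() > MAX_VALUE_LENGTH) {
                findings.add(name + ": value is " + value.length() + " characters long");
            }
            if (!isPlainAscii(name) || !isPlainAscii(value)) {
                findings.add(name + ": contains control or non-ASCII characters");
            }
        }
        return findings;
    }

    /** True if every character is printable ASCII (space through tilde). */
    static boolean isPlainAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> findings = findSuspicious(System.getenv());
        for (String line : findings) {
            System.out.println(line);
        }
        System.out.println(findings.size() + " suspicious variable(s) found");
    }
}
```

Redirecting the output to a file (for example with > env-report.txt) gives something that can be attached here. Run it from the same shell and as the same user as the process under suspicion so the inherited environment is comparable.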
I did as you asked, but I'm not seeing any "fancy" characters in the env; the only variable that is a bit long (but not TOO long) is the path var.
I didn't find a way to export those variables either, but I made a screenshot so you can check that too. No worries, it's not a 40MB bmp file inside a doc, it's just a small png of the window :-)
Created attachment 381734 [details]
screenshot of windows env variables
I agree, I don't see anything too fancy. The parens in a couple of the variable names are a bit funky, but that doesn't cause SIGAR to crash on my XP box.
I actually misread the code from the SIGAR TestProcEnv class. It turns out it is not necessarily the SIGAR process itself that is causing the crash. TestProcEnv loops through all processes on the system, so it could be any process. Unfortunately, it doesn't print each process's pid before attempting to retrieve that process's environment and other info.
I'm attaching a jarfile named rhq-core-native-system-1.4.0-SNAPSHOT-tests.jar that contains a test class that will print the pids, so we can figure out exactly which process on your system is causing the crash. Please run it as follows:
1) Copy it to your RHQ Agent's lib dir.
2) cd to the Agent lib dir.
3) Run: java -cp rhq-core-native-system-1.4.0-SNAPSHOT-tests.jar;sigar-220.127.116.11.jar org.rhq.core.system.SigarTest
Do this as the same user you are using to run the Agent.
The last pid that the test class prints before the JVM crashes is the one that is causing the crash. Once you figure out which process is the culprit, send me screenshots of that process's Environment and other basic stats (process command line, etc.) from Process Explorer.
Thanks for helping to nail this one down!
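For reference, the idea behind printing the pids can be sketched as follows. This is not the code in the attached jar, just a minimal illustration: the PidTrace class and its Probe interface are made-up names, and the probe stands in for whatever native call retrieves a process's environment. The key point is that each pid is printed and flushed before the risky call, because a native crash takes the JVM down without running any Java exception handler.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the SigarTest class from the attached jar.
final class PidTrace {
    /** Stand-in for the per-process native call; not a SIGAR type. */
    interface Probe {
        void inspect(long pid) throws Exception;
    }

    /**
     * Prints and flushes each pid before probing it, so that if the JVM dies
     * inside the probe, the last pid line printed names the process involved.
     * Returns the pids whose probe completed normally.
     */
    static List<Long> probeAll(long[] pids, Probe probe, PrintStream out) {
        List<Long> completed = new ArrayList<Long>();
        for (long pid : pids) {
            out.println("pid " + pid);
            out.flush(); // make sure the line is written before the risky call
            try {
                probe.inspect(pid);
                completed.add(pid);
            } catch (Exception e) {
                // Ordinary failures (e.g. access denied) are reported and skipped;
                // a native crash never reaches this handler.
                out.println("  failed: " + e);
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        long[] pids = {4L, 1320L, 2876L}; // made-up pids for illustration
        Probe noop = new Probe() {
            public void inspect(long pid) {
                // A real run would query the process here, e.g. its environment.
            }
        };
        probeAll(pids, noop, System.out);
    }
}
```

When redirecting the output to a file, the explicit flush matters: without it, buffered lines can be lost when the JVM crashes and the log would point at the wrong process.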
Created attachment 382565 [details]
a jar containing the SigarTest class
Ian: I ran the test as you suggested and I'm attaching screenshots of the process environment, plus the log and the crash dump.
I ran it as the user I use to run the agent.
I ran some tests: the process responsible for the crash is always the same, and it's always... surprise surprise... JBoss :-) but its environment doesn't seem abnormal.
That server hosts a lot of JBoss instances for various services, and each of them is clustered, so I tried stopping the one responsible, and the test just crashed on the next one. I.e.:
suppose we have an environment like this :
When I kill app1_node1 the test crashes on app1_node2; if I kill app1_node2 the test crashes on app1_node3, and so on (of course, by the time I kill app1_node3 I've already restarted app1_node1 and app1_node2). Once I've killed and restarted all the app1 nodes, the test crashes on app2_node1. So apparently it crashes on the first JBoss it finds, but I could be dead wrong.
Created attachment 383203 [details]
Created attachment 383204 [details]
log of the sigar test
Created attachment 383205 [details]
command line of the crasher of the test
Created attachment 383206 [details]
env prop of the crasher of the test
qa -> gneelaka
Thanks for the info on the crash culprit. I added the info to the SIGAR bug (http://jira.hyperic.com/browse/SIGAR-192) and am still awaiting comment from the SIGAR developers.
Hi, good news: Doug from Hyperic was able to reproduce a similar crash on 64-bit Windows. However, he only sees the crash when running the SIGAR JVM from a Cygwin shell. Are you also using Cygwin when you see the crash?
fixing status -> ON_DEV
No, there is no sign of Cygwin on the server at all. Maybe it's an environment variable that Cygwin sets (and that is also present on my server) that causes the bug?
I've updated the title of this issue, since it appears from previous comments that the problem doesn't occur on Windows 2008.
a) determine which version of Windows Doug is seeing the crash on
b) check, very politely, if Doug has had any luck investigating this issue any further.
Relevant jira: http://jira.hyperic.com/browse/SIGAR-192
(put comment in wrong issue)
Based on Doug M's latest update to http://jira.hyperic.com/browse/SIGAR-192, it sounds like we may have a fix for the issue. Please try replacing your SIGAR Win64 DLL with the following updated version and see if the problem clears up:
Let me know how you make out.
i just tried the new sigar dll and i can confirm discovery is now fully working.
no more segfaults and the inventory is full as expected.
Bug closed. Thanks to all of you for your support.
Great. Thanks Samuele.
Ian, can you see if Doug will be able to get a new SIGAR release out in the next two weeks so we can get it into Sprint9?
I am closing this, since master has been upgraded to SIGAR 1.6.4, which includes the fix for http://jira.hyperic.com/browse/SIGAR-192, which has been verified by Samuele.
Mass-closure of verified bugs against JON.