Bug 536877 - rhq-agent segfault on windows 2003 x86_64
Summary: rhq-agent segfault on windows 2003 x86_64
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RHQ Project
Classification: Other
Component: Agent
Version: 1.3
Hardware: x86_64
OS: Windows
Priority: high
Severity: high
Target Milestone: ---
Assignee: Ian Springer
QA Contact: Satish Mohan
URL:
Whiteboard:
Depends On: 591552
Blocks: jon-sprint8-bugs jon-sprint9-bugs jon-sprint10-bugs
Reported: 2009-11-11 16:52 UTC by samu
Modified: 2013-08-06 00:36 UTC (History)

Fixed In Version: 2.4
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-08-12 16:50:00 UTC
Embargoed:


Attachments
the JVM crash dump file from the Agent crash (16.21 KB, text/plain)
2009-11-11 17:09 UTC, Ian Springer
Testcase on Win Server 2008 x64 SP2 (127.45 KB, application/zip)
2009-11-12 20:49 UTC, Rodrigo A B Freire
a batch file that sets 10 environment variables with large values (59.62 KB, application/octet-stream)
2009-11-12 23:11 UTC, Ian Springer
dump for running "java -jar sigar-1.6.3.82.jar test ProcEnv" (11.56 KB, application/octet-stream)
2009-11-13 10:04 UTC, samu
output command of "java -jar sigar-1.6.3.82.jar test ProcEnv" (3.52 KB, text/plain)
2009-11-13 10:04 UTC, samu
screenshot of windows env variables (22.95 KB, image/png)
2010-01-05 10:47 UTC, samu
a jar containing the SigarTest class (46.20 KB, application/x-java-archive)
2010-01-08 22:22 UTC, Ian Springer
java crash (8.84 KB, text/x-log)
2010-01-12 09:04 UTC, samu
log of the sigar test (69.94 KB, text/plain)
2010-01-12 09:05 UTC, samu
command line of the crasher of the test (15.10 KB, image/png)
2010-01-12 09:06 UTC, samu
env prop of the crasher of the test (31.79 KB, image/png)
2010-01-12 09:06 UTC, samu

Description samu 2009-11-11 16:52:47 UTC
Description of problem:

rhq-agent segfault on windows x64 with jdk 1.6.0


Version-Release number of selected component (if applicable):

agent for jopr 2.3.1
jdk 1.6.0_17


How reproducible:
Just install an rhq-agent with JDK 1.6.0 and run the bat file.
If it starts successfully, run a discovery from the agent command line.



Actual results:

stack trace and crash of the agent


Expected results:
it should run :-)


Additional info:
http://www.jboss.org/index.html?module=bb&op=viewtopic&t=163614
for more info on the problem.

Comment 1 Ian Springer 2009-11-11 17:07:18 UTC
I've created a SIGAR issue for this: http://jira.hyperic.com/browse/SIGAR-192

Comment 2 Ian Springer 2009-11-11 17:09:40 UTC
Created attachment 369063 [details]
the JVM crash dump file from the Agent crash

Comment 3 Charles Crouch 2009-11-11 17:54:54 UTC
Targeting 1.3.1 for triage

Comment 4 Ian Springer 2009-11-12 17:52:08 UTC
Hi Samu,

Would you please try running the following on the Win64 box:

1) Copy junit-3.8.2.jar from http://repo1.maven.org/maven2/junit/junit/3.8.2/ into <rhq-agent-home>\lib\.
2) cd to rhq-agent\lib
3) run: java -jar sigar-1.6.3.82.jar test ProcEnv

And see if it also causes a JVM crash. If so, please attach the output from that command, as well as the crash dump file.

Comment 5 Rodrigo A B Freire 2009-11-12 20:48:05 UTC
Tried to reproduce, but it worked on a Windows Server 2008 x64 SP2 environment.

Tried:

JDK 1.6.0_13 - worked
JDK 1.6.0_17 - worked. See attachments.

Comment 6 Rodrigo A B Freire 2009-11-12 20:49:06 UTC
Created attachment 369318 [details]
Testcase on Win Server 2008 x64 SP2

Results

Comment 7 Rodrigo A B Freire 2009-11-12 20:51:06 UTC
Samu, Ian,

Keep in mind that Windows Firewall prevents the JON Agent from starting (it can't bind to its port, 16163).

Comment 8 Ian Springer 2009-11-12 23:11:02 UTC
Created attachment 369342 [details]
a batch file that sets 10 environment variables with large values

Comment 9 samu 2009-11-13 10:02:22 UTC
Rodrigo,
I've checked and Windows Firewall is off. I'm also quite sure it's not a network-related problem, because one of my previous tests was to run the rhq-agent with JDK 1.5.0, and it ran without problems.
Of course, since Jopr is on JDK 1.6.0 and JBoss is on JDK 1.6.0, the agent can't run on JDK 1.5.0 and fetch information, so I changed to JDK 1.6.0, and that's when the crash happened.

BTW, I forgot to add in the previous note: this is Windows 2003.


for ian:

D:\jboss-4.2.1.GA>cd rhq-agent\lib

D:\jboss-4.2.1.GA\rhq-agent\lib>java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)

D:\jboss-4.2.1.GA\rhq-agent\lib>dir junit-3.8.2.jar
13/11/2009  10.46           120.640 junit-3.8.2.jar

D:\jboss-4.2.1.GA\rhq-agent\lib> java -jar sigar-1.6.3.82.jar test ProcEnv >error.txt

I have the output in error.txt and the dump; I'll attach them here ASAP.

Comment 10 samu 2009-11-13 10:04:07 UTC
Created attachment 369393 [details]
dump for running "java -jar sigar-1.6.3.82.jar test ProcEnv"

Comment 11 samu 2009-11-13 10:04:44 UTC
Created attachment 369394 [details]
output command of "java -jar sigar-1.6.3.82.jar test ProcEnv"

Comment 12 Rodrigo A B Freire 2009-11-13 11:29:20 UTC
Samu,

As you noticed, yesterday I tested two JDKs: 1.6.0_13 and 1.6.0_17.

When I ran a first test on 1.6.0_17, the agent had problems starting. I then checked and found I had forgotten to point JAVA_HOME at the new JDK.

Just to make sure, when you type "set" at the cmd prompt, what is your JAVA_HOME set to?

RF.

Comment 13 samu 2009-11-13 11:51:07 UTC
Here they are:

JAVA_HOME=C:\Programmi\Java\jdk1.6.0_13
JBOSS_HOME=D:\jboss-4.2.1.GA
JBOSS_SERVER_LIB=D:\jboss-4.2.1.GA\server\SIL_SERVER_LIB

In rhq-agent-env.bat I don't set JAVA_HOME; I just set RHQ_AGENT_HOME and RHQ_AGENT_JAVA_EXE_FILE_PATH:

set RHQ_AGENT_HOME=D:\jboss-4.2.1.GA\rhq-agent
set RHQ_AGENT_JAVA_EXE_FILE_PATH=C:\Programmi\Java\jdk1.6.0_13\bin\java.exe


I tried setting JAVA_EXE_FILE_PATH to c:\windows\...java.exe (the default configuration), but the problem is the same, so I kept c:\Programmi... etc.

Comment 14 Rodrigo A B Freire 2009-11-13 12:07:23 UTC
Samu,

Funny. I just set the JAVA_HOME env var and added the JDK's bin directory to PATH, and everything ran fine without needing to edit rhq-agent-env.bat.

Would you mind setting these variables and trying again with a fresh rhq-agent download/install?

Grazie,

- RF

Comment 15 samu 2009-11-13 13:36:42 UTC
Rodrigo:
I just tried and added the JDK, and rhq-agent.bat runs fine and the rhq-agent CLI comes up. However, if you type discovery at that prompt, the segfault happens again.

Comment 16 Ian Springer 2010-01-04 18:56:40 UTC
Hi Samu,

I'd like to see if there's something unusual in your system's environment variables that is causing SIGAR to choke. Would you please do the following:

1) Download Process Explorer from http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx and install it.
2) Start up a SIGAR process, e.g.: java -jar sigar-1.6.3.82.jar
3) Run Process Explorer and select the SIGAR process you just started. Go to the Environment tab under the process' Properties.

See if you notice any environment variables with super long values or funky characters in their names or values or anything else unusual. If you can figure out a way to export the environment variables to a text file (I was unable to), please do so and attach it.

Thanks,
Ian
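
On the export question above: one way to get an environment into a text file is to have a small Java program write out its own environment. The following is only a minimal sketch, assuming it is launched from the same command prompt and as the same user as the SIGAR process, so that it inherits the same variables. The class name EnvDump, the default file name env-dump.txt, the 1024-character threshold, and the [LONG]/[ODD-CHARS] markers are all made up for illustration. It can only show the environment of the JVM that runs it, not that of any other process on the box.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

public class EnvDump {
    // Hypothetical cut-off for what counts as a "super long" value; tune as needed.
    static final int LONG_VALUE_THRESHOLD = 1024;

    /** Returns markers for anything unusual about one variable, or "" if nothing stands out. */
    static String flag(String name, String value) {
        StringBuilder flags = new StringBuilder();
        if (value.length() > LONG_VALUE_THRESHOLD) {
            flags.append(" [LONG:").append(value.length()).append("]");
        }
        if (hasOddChars(name) || hasOddChars(value)) {
            flags.append(" [ODD-CHARS]");
        }
        return flags.toString();
    }

    /** True if the string contains control characters or anything outside printable ASCII. */
    static boolean hasOddChars(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c > 0x7e) {
                return true;
            }
        }
        return false;
    }

    /** Writes NAME=value lines, sorted by name, with a marker appended to unusual entries. */
    static void dump(Map<String, String> env, PrintWriter out) {
        for (Map.Entry<String, String> e : new TreeMap<String, String>(env).entrySet()) {
            out.println(e.getKey() + "=" + e.getValue() + flag(e.getKey(), e.getValue()));
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        File target = new File(args.length > 0 ? args[0] : "env-dump.txt");
        PrintWriter out = new PrintWriter(target, "UTF-8");
        try {
            // System.getenv() is the environment this JVM was started with.
            dump(System.getenv(), out);
        } finally {
            out.close();
        }
        System.out.println("Wrote " + target.getAbsolutePath());
    }
}
```

Compile it with javac EnvDump.java and run java EnvDump from the prompt used to start SIGAR; the resulting text file can then be attached here. The code sticks to java.io so that it also compiles on JDK 1.6.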

Comment 17 samu 2010-01-05 10:46:49 UTC
Hello Ian,

I did as you asked, but I'm not seeing any "fancy" characters in the environment. The only variable that is a bit long (but not TOO long) is the PATH variable.

I didn't find a way to export those variables either, but I made a screenshot so you can check it too. No worries, it's not a 40MB bmp file inside a doc, it's just a small png of the window :-)

Regards
Samuele

Comment 18 samu 2010-01-05 10:47:51 UTC
Created attachment 381734 [details]
screenshot of windows env variables

Comment 19 Ian Springer 2010-01-08 22:21:04 UTC
I agree, I don't see anything too fancy. The parens in a couple of the variable names are a bit funky, but that doesn't cause SIGAR to crash on my XP box.

I actually misread the code from the SIGAR TestProcEnv class. It turns out it is not necessarily the SIGAR process itself that is causing the crash. TestProcEnv loops through all processes on the system, so it could be any process. Unfortunately, it doesn't print each process's pid before attempting to retrieve that process's environment and other info.

I'm attaching a jarfile named rhq-core-native-system-1.4.0-SNAPSHOT-tests.jar that contains a test class that will print the pids, so we can figure out exactly which process on your system is causing the crash. Please run it as follows:

1) Copy it to your RHQ Agent's lib dir.
2) cd to the Agent lib dir.
3) Run: java -cp rhq-core-native-system-1.4.0-SNAPSHOT-tests.jar;sigar-1.6.3.82.jar org.rhq.core.system.SigarTest

Do this as the same user you are using to run the Agent.

The last pid that the test class prints before the JVM crashes is the one that is causing the crash. Once you figure out which process is the culprit, send me screenshots of that process's Environment and other basic stats (process command line, etc.) from Process Explorer.

Thanks for helping to nail this one down!
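
The "print the pid first" idea described above can be sketched roughly as follows. This is not the code of the attached SigarTest class; it is a hypothetical stand-alone illustration. The names PidTrace and ProcEnvReader are invented, and the per-process lookup is hidden behind an interface so the sketch runs without SIGAR on the classpath. In real use the reader would presumably wrap SIGAR's Sigar.getProcEnv(long), with the pids coming from Sigar.getProcList(); that wiring is an assumption and is not shown.

```java
import java.io.PrintStream;
import java.util.Collections;
import java.util.Map;

public class PidTrace {
    /** Stand-in for a native per-process lookup (assumed: something like SIGAR's getProcEnv). */
    public interface ProcEnvReader {
        Map<String, String> read(long pid) throws Exception;
    }

    /**
     * Prints each pid BEFORE querying it, so that if the native call takes the
     * whole JVM down, the last pid printed identifies the culprit process.
     * Returns the number of processes that were read successfully.
     */
    static int trace(long[] pids, ProcEnvReader reader, PrintStream out) {
        int ok = 0;
        for (long pid : pids) {
            out.println("pid " + pid + ": reading environment...");
            out.flush(); // get the line out before making the risky call
            try {
                Map<String, String> env = reader.read(pid);
                out.println("pid " + pid + ": " + env.size() + " variables");
                ok++;
            } catch (Exception e) {
                // Ordinary failures (e.g. access denied) land here; a native crash never does.
                out.println("pid " + pid + ": failed: " + e);
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // Demo with a fake reader; the pids and values below are made up.
        long[] pids = {4L, 1234L, 5678L};
        ProcEnvReader fake = new ProcEnvReader() {
            public Map<String, String> read(long pid) throws Exception {
                if (pid == 4L) {
                    throw new IllegalStateException("access denied");
                }
                return Collections.singletonMap("PATH", "C:\\Windows");
            }
        };
        trace(pids, fake, System.out);
    }
}
```

The flush before each lookup is the important design choice: a segfault in native code kills the JVM immediately, so anything still sitting in an output buffer at that moment would be lost.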

Comment 20 Ian Springer 2010-01-08 22:22:04 UTC
Created attachment 382565 [details]
a jar containing the SigarTest class

Comment 21 samu 2010-01-12 09:03:46 UTC
Hello all,
Ian: I ran the test as you suggested and I'm attaching screenshots of the process environment, the test log, and the crash dump.
I used the same user that I use to run the agent.
I ran some tests: the process responsible for the crash is always the same, and it's always... surprise, surprise... JBoss :-) but its environment doesn't seem abnormal.

That server hosts many JBoss instances for various services, each of them clustered. I tried stopping the instance responsible, and the test just crashed on the next one, i.e.:
suppose we have an environment like this:

app1_node1
app1_node2
app1_node3
app1_node4

app2_node1
app2_node2
app2_node3
app2_node4

When I kill app1_node1, the test crashes on app1_node2; if I kill app1_node2, the test crashes on app1_node3, and so on (of course, by the time I kill app1_node3, I've already restarted app1_node1 and app1_node2). Once I've killed/restarted all app1 nodes, the test crashes on app2_node1. So apparently it crashes on the first JBoss it finds, but I could be dead wrong.

Comment 22 samu 2010-01-12 09:04:37 UTC
Created attachment 383203 [details]
java crash

Comment 23 samu 2010-01-12 09:05:28 UTC
Created attachment 383204 [details]
log of the sigar test

Comment 24 samu 2010-01-12 09:06:00 UTC
Created attachment 383205 [details]
command line of the crasher of the test

Comment 25 samu 2010-01-12 09:06:35 UTC
Created attachment 383206 [details]
env prop of the crasher of the test

Comment 26 Corey Welton 2010-01-23 00:16:04 UTC
qa -> gneelaka

Comment 27 Ian Springer 2010-02-01 15:52:46 UTC
Thanks for the info on the crash culprit. I added the info to the SIGAR bug (http://jira.hyperic.com/browse/SIGAR-192) and am still awaiting comment from the SIGAR developers.

Comment 28 Ian Springer 2010-02-04 21:11:14 UTC
Hi, good news: Doug from Hyperic was able to reproduce a similar crash on 64-bit Windows. However, he only sees the crash when running the SIGAR JVM from a Cygwin shell. Are you also using Cygwin when you see the crash?

Comment 29 Corey Welton 2010-02-05 14:21:45 UTC
fixing status -> ON_DEV

Comment 30 samu 2010-02-08 09:33:03 UTC
Hello Ian,
No, there is no sign of Cygwin at all on the server. Maybe it's an environment variable that Cygwin sets (and that is present on my server too) that causes the bug?

Regards
Samuele

Comment 31 Charles Crouch 2010-04-01 21:30:27 UTC
I've updated the title on this issue since it appears from previous comments the problem doesn't occur on windows 2008

Next steps..
a) determine which version of Windows Doug is seeing the crash on
b) check, very politely, if Doug has had any luck investigating this issue any further.
Relevant jira: http://jira.hyperic.com/browse/SIGAR-192

Comment 32 Charles Crouch 2010-04-01 23:07:14 UTC
(put comment in wrong issue)

Comment 33 Ian Springer 2010-04-15 19:25:02 UTC
Hi Samuele,

Based on Doug M's latest update to http://jira.hyperic.com/browse/SIGAR-192, it sounds like we may have a fix for the issue. Please try replacing your SIGAR Win64 DLL with the following updated version and see if the problem clears up:

http://hudson.hyperic.com/job/sigar-1.6-amd64-winnt/lastSuccessfulBuild/artifact/bindings/java/sigar-bin/lib/sigar-amd64-winnt.dll

Let me know how you make out. 

Thanks,
Ian

Comment 34 samu 2010-04-27 07:22:11 UTC
Hello Ian,
I just tried the new SIGAR DLL and I can confirm discovery is now fully working.
No more segfaults, and the inventory is complete as expected.

Bug closed. Thanks to all of you for your support.

Cheers
Samuele

Comment 35 Charles Crouch 2010-04-27 14:27:39 UTC
Great. Thanks Samuele.

Ian, can you see if Doug will be able to get a new SIGAR release out in the next two weeks so we can get it into Sprint9?

Comment 36 Ian Springer 2010-05-12 16:55:46 UTC
I am closing this, since master has been upgraded to SIGAR 1.6.4, which includes the fix for http://jira.hyperic.com/browse/SIGAR-192, which has been verified by Samuele.

Comment 38 Corey Welton 2010-08-12 16:50:00 UTC
Mass-closure of verified bugs against JON.

