|
Description
Need Real Name
2011-03-23 18:11:26 UTC
Created attachment 487107 [details]
Unit test to see incorrect hotspot loop unrolling.
This is a test case to see incorrect loop unrolling by hotspot.
The matrices m1 & m2 inside runTest must be identical,
if at some point they are not - an exception is thrown
This is hotspot problem, generated .class files are OK From observing the structure of matrix differences (single element differ) - the most likely reason is that during a switch from one type of loop execution to another hotspot either executed first operator of the loop body twice or not executed it at all. I am still analysing the source, but given how close the numbers are, this looks more like a bug related to double handling than loop unrolling. Also, how long does it take to crash usually? I had it running for 10+ minutes and it didn't crash. Had to stop because laptop cpu got to 98C though :) Looking at more of the code, this does look like a double comparison issue, and not related to loop unrolling. Specifically, in the code where you do:
if(m1[i][j]!=m2[i][j])
{
iif=i;
jif=j;
break oloop;
}
Can you please change it to:
if(Double.compare(m1[i][j]!=m2[i][j]) != 0)
{
iif=i;
jif=j;
break oloop;
}
and try again? == is not an equivalence operator for double values. Please see Joe Darcy's blog for a more detailed description of this:
http://blogs.sun.com/darcy/entry/notions_of_floating_point_equality
(In reply to comment #5) > > if(Double.compare(m1[i][j]!=m2[i][j]) != 0) Sorry, that should be: if(Double.compare(m1[i][j], m2[i][j]) != 0) On further thought, this may be more than a rounding error. Nonetheless, can you please try the Double.compare() to eliminate that possibility and see if it still errors out? I've had the code (original with "!=") running on my desktop box (q-core i7 920 @ 3Ghz) for the past 3+ hours and so far it hasn't thrown any error. Created attachment 487241 [details]
updated test
I am attaching updated TF.java where the problem is
1. easier to reproduce
2. numbers difference is not small , for example
m1[1][1]=40377.57397780233 m2[1][1]=36281.57397780233
the code is started as
n=0 ; while [ $n -lt 20 ]; do java TF & n=$[ $n + 1 ] ; done
the attached test supersedes old TF.java
Exception in thread "main" java.lang.RuntimeException: TEST=29469 Elements [1][1] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=30044 Elements [1][1] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=31890 Elements [0][0] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.888586395702
m1[0][1]=5716.6127269548415 m2[0][1]=5716.5952705629015
m1[0][2]=5624.415186604344 m2[0][2]=5624.397987441957
m1[0][3]=5751.072547006512 m2[0][3]=5751.054968098871
m1[0][4]=6057.8359619079365 m2[0][4]=6057.817467518348
m1[1][0]=43190.181781667496 m2[1][0]=43190.040840718015
m1[1][1]=40377.57397780233 m2[1][1]=40377.45114596819
m1[1][2]=39371.04889321945 m2[1][2]=39370.93072927575
m1[1][3]=40845.525454463925 m2[1][3]=40845.40313214674
m1[1][4]=44402.20207354617 m2[1][4]=44402.06916947626
m1[2][0]=103991.08438810863 m2[2][0]=103990.76428588785
m1[2][1]=97000.16811187661 m2[2][1]=96999.8766918958
m1[2][2]=94491.77180358718 m2[2][2]=94491.4892293681
m1[2][3]=98165.6040876451 m2[2][3]=98165.31074992764
m1[2][4]=107029.22874182602 m2[2][4]=107028.90898282987
m1[3][0]=192961.576069273 m2[3][0]=192960.99331829284
m1[3][1]=179643.57896832644 m2[3][1]=179643.04135665382
m1[3][2]=174920.49439552703 m2[3][2]=174919.97190336738
m1[3][3]=181800.08676866323 m2[3][3]=181799.543903905
m1[3][4]=198405.9217295142 m2[3][4]=198405.32933814573
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=33184 Elements [3][3] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=165416.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=34764 Elements [0][0] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.888586395702
m1[0][1]=5716.6127269548415 m2[0][1]=5716.5952705629015
m1[0][2]=5624.415186604344 m2[0][2]=5624.397987441957
m1[0][3]=5751.072547006512 m2[0][3]=5751.054968098871
m1[0][4]=6057.8359619079365 m2[0][4]=6057.817467518348
m1[1][0]=43190.181781667496 m2[1][0]=43190.040840718015
m1[1][1]=40377.57397780233 m2[1][1]=40377.45114596819
m1[1][2]=39371.04889321945 m2[1][2]=39370.93072927575
m1[1][3]=40845.525454463925 m2[1][3]=40845.40313214674
m1[1][4]=44402.20207354617 m2[1][4]=44402.06916947626
m1[2][0]=103991.08438810863 m2[2][0]=103990.76428588785
m1[2][1]=97000.16811187661 m2[2][1]=96999.8766918958
m1[2][2]=94491.77180358718 m2[2][2]=94491.4892293681
m1[2][3]=98165.6040876451 m2[2][3]=98165.31074992764
m1[2][4]=107029.22874182602 m2[2][4]=107028.90898282987
m1[3][0]=192961.576069273 m2[3][0]=192960.99331829284
m1[3][1]=179643.57896832644 m2[3][1]=179643.04135665382
m1[3][2]=174920.49439552703 m2[3][2]=174919.97190336738
m1[3][3]=181800.08676866323 m2[3][3]=181799.543903905
m1[3][4]=198405.9217295142 m2[3][4]=198405.32933814573
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=35158 Elements [1][1] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=40373.57397780233
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=41942 Elements [0][1] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.608241494241
m1[0][2]=5624.415186604344 m2[0][2]=5624.414714771186
m1[0][3]=5751.072547006512 m2[0][3]=5751.07251441535
m1[0][4]=6057.8359619079365 m2[0][4]=6057.835960128156
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=40377.56976564972
m1[1][2]=39371.04889321945 m2[1][2]=39371.04845013602
m1[1][3]=40845.525454463925 m2[1][3]=40845.525423858606
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207187484
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16217862988
m1[2][2]=94491.77180358718 m2[2][2]=94491.77117945894
m1[2][3]=98165.6040876451 m2[2][3]=98165.60404453438
m1[2][4]=107029.22874182602 m2[2][4]=107029.22873947177
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.56782488193
m1[3][2]=174920.49439552703 m2[3][2]=174920.4932233293
m1[3][3]=181800.08676866323 m2[3][3]=181800.08668769544
m1[3][4]=198405.9217295142 m2[3][4]=198405.92172509263
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=44978 Elements [1][4] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20205538556
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=55166 Elements [1][1] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=63456 Elements [1][1] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=64653 Elements [0][2] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415
m1[0][2]=5624.415186604344 m2[0][2]=5496.415186604344
m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512
m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233
m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945
m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661
m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718
m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451
m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644
m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703
m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323
m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=76351 Elements [0][1] differ
m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842
m1[0][1]=5716.6127269548415 m2[0][1]=5716.608241494241
m1[0][2]=5624.415186604344 m2[0][2]=5624.414714771186
m1[0][3]=5751.072547006512 m2[0][3]=5751.07251441535
m1[0][4]=6057.8359619079365 m2[0][4]=6057.835960128156
m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496
m1[1][1]=40377.57397780233 m2[1][1]=40377.56976564972
m1[1][2]=39371.04889321945 m2[1][2]=39371.04845013602
m1[1][3]=40845.525454463925 m2[1][3]=40845.525423858606
m1[1][4]=44402.20207354617 m2[1][4]=44402.20207187484
m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863
m1[2][1]=97000.16811187661 m2[2][1]=97000.16217862988
m1[2][2]=94491.77180358718 m2[2][2]=94491.77117945894
m1[2][3]=98165.6040876451 m2[2][3]=98165.60404453438
m1[2][4]=107029.22874182602 m2[2][4]=107029.22873947177
m1[3][0]=192961.576069273 m2[3][0]=192961.576069273
m1[3][1]=179643.57896832644 m2[3][1]=179643.56782488193
m1[3][2]=174920.49439552703 m2[3][2]=174920.4932233293
m1[3][3]=181800.08676866323 m2[3][3]=181800.08668769544
m1[3][4]=198405.9217295142 m2[3][4]=198405.92172509263
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Exception in thread "main" java.lang.RuntimeException: TEST=25721 Elements [3][3] differ
m1[0][0]=7021.774830987034 m2[0][0]=7021.774830987034
m1[0][1]=6671.004497011981 m2[0][1]=6671.004497011981
m1[0][2]=6562.907481207491 m2[0][2]=6562.907481207491
m1[0][3]=6711.11205485301 m2[0][3]=6711.11205485301
m1[0][4]=7070.144759580944 m2[0][4]=7070.144759580944
m1[1][0]=50533.65411709927 m2[1][0]=50533.65411709927
m1[1][1]=47238.60662216227 m2[1][1]=47238.60662216227
m1[1][2]=46060.29654012994 m2[1][2]=46060.29654012994
m1[1][3]=47785.9864299652 m2[1][3]=47785.9864299652
m1[1][4]=51948.73401721851 m2[1][4]=51948.73401721851
m1[2][0]=121696.16159646476 m2[2][0]=121696.16159646476
m1[2][1]=113509.99686878055 m2[2][1]=113509.99686878055
m1[2][2]=110574.13409857947 m2[2][2]=110574.13409857947
m1[2][3]=114873.9804417375 m2[2][3]=114873.9804417375
m1[2][4]=125248.01961714865 m2[2][4]=125248.01961714865
m1[3][0]=225825.05076021465 m2[3][0]=225825.05076021465
m1[3][1]=210234.96043019896 m2[3][1]=210234.96043019896
m1[3][2]=204708.61835611783 m2[3][2]=204708.61835611783
m1[3][3]=212760.64466901263 m2[3][3]=196376.64466901263
m1[3][4]=232196.23583147067 m2[3][4]=232196.23583147067
at TF.checkm(TF.java:138)
at TF.runTest(TF.java:18)
at TF.multipleTests(TF.java:149)
at TF.main(TF.java:160)
Created attachment 487244 [details]
proper latest test case
proper latest test case
Created attachment 487246 [details]
Latest test
The test is started as
n=0 ; while [ $n -lt 20 ]; do java TF & n=$[ $n + 1 ] ; done
the results are in test.txt
the number differences are not small
m1[0][2]=5624.415186604344 m2[0][2]=5368.415186604342
comparison code - as you suggested.
As a side note: there was no error reporting for long time,
then, in another window I started the
find /
command - errors immediately came out
Created attachment 487247 [details]
test results
test results for a test started as
n=0 ; while [ $n -lt 20 ]; do java TF & n=$[ $n + 1 ] ; done
Created attachment 487295 [details]
Same problem does exist with long
This is a much wider & much stronger problem.
It does exist not only with double, but with long also.
(it takes much more time to trigger the problem with longs)
Test code & results are attached.
Created attachment 487304 [details]
the results with longs instead of doublea
the results with longs instead of double
Created attachment 487310 [details]
simplified test
This is TFsimple.java a much simplified TF.java
which nevertheless, show the same problem with incorrect calculations.
But it takes much longer to trigger.
the function calc is just this:
static double [][] calc(final int a,final int b,final F [] f,
final double xx,final double yy,
final double [][] mx)
{
final double [][] m=new double [a][b];
double wa=1;
for(int k=0;k<a;k++)
{
double wb=wa;
for(int l=0;l<b;l++)
{
m[k][l]+=wb*mx[k][l];
wb*=(l+0.7822332)*2;
}
wa*=(k+0.67221333);
}
return m;
}
I've run the code for over 12 hours now, and still no error. I am trying the latest test case now (so far no errors there either). What sort of hardware are you running on test on? And have you been able to reproduce this on other hardware? I am running Fedora on AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ stepping : 3 cpu MHz : 3007.708 cache size : 1024 KB 8Gb memory, 64 bit java last tested on java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) but various other versions are also tried. I just run memtest for about 2 hours - no problem found. I just started the TFsimple.java on a very different computer - 20 instances at once: n=0 ; while [ $n -lt 20 ]; do java TFsimple & n=$[ $n + 1 ] ; done >diag_1.txt 2>&1 Not that many for the computer: Debian, 8 way Xeon 32 Gb memory Linux forum 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux java version "1.6.0_21" Java(TM) SE Runtime Environment (build 1.6.0_21-b06) Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode) The system got destabilized & became non-responding, so I did power reset. diag1 show nothing special. I just started 2 instances instead of 20. I've been running the last version posted version of the test (with double) for over 3 hours now and it has still not error'd (8 instances). Any other machine you can try on to eliminate hardware as the problem? Another one of my co-workers (Denis Lila, on cc) tried it for 2.5+ hours on his machine too -- there were no errors. Created attachment 487401 [details] errors My computer (AMX athlon) got destabilized with 3 instances of TFsimple running after 1 hour. Debian is still running 3 instances. looks like TF.java - is more likely give just an error and TFsimple.java is more likely to give destabilization after an error. attached a trace for java -version java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) cat /tmp/diag_simple.txt [1] 2748 [2] 2749 [3] 2750 [4] 2751 [5] 2752 [6] 2753 [7] 2754 [8] 2755 [9] 2756 [10] 2757 [11] 2758 [12] 2759 [13] 2760 [14] 2761 [15] 2762 Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007f62a7d6d5ea, pid=2749, tid=140061692933888 # # JRE version: 6.0_24-b07 # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops) # Problematic frame: # V [libjvm.so+0x6295ea] # # An error report file with more information is saved as: # /home/mal/hs_err_pid2749.log # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # Exception in thread "main" java.lang.RuntimeException: TEST=62725 Elements [2][3] differ m1[0][0]=0.1700774336283186 m2[0][0]=0.1700774336283186 m1[0][1]=0.11166978133704736 m2[0][1]=0.11166978133704736 m1[0][2]=0.2518757394821027 m2[0][2]=0.2518757394821027 m1[0][3]=1.0251160550185991 m2[0][3]=1.0251160550185991 m1[0][4]=6.11267581491528 m2[0][4]=6.11267581491528 m1[1][0]=1.043826838481748 m2[1][0]=1.043826838481748 m1[1][1]=0.6853579121009847 m2[1][1]=0.6853579121009847 m1[1][2]=1.5458526814816618 m2[1][2]=1.5458526814816618 m1[1][3]=6.291508684952188 m2[1][3]=6.291508684952188 m1[1][4]=37.51570643105274 m2[1][4]=37.51570643105274 m1[2][0]=3.2998209696571212 m2[2][0]=3.2998209696571212 m1[2][1]=2.1666030482227314 m2[2][1]=2.1666030482227314 m1[2][2]=4.886861408711588 m2[2][2]=4.886861408711588 m1[2][3]=19.88917272867012 m2[2][3]=17.70913952378951 m1[2][4]=118.59736711957807 m2[2][4]=118.59736711957807 m1[3][0]=12.971299713493547 m2[3][0]=12.971299713493547 m1[3][1]=8.516721894032319 m2[3][1]=8.516721894032319 m1[3][2]=19.209813069734487 m2[3][2]=19.209813069734487 m1[3][3]=78.18255077754401 m2[3][3]=78.18255077754401 m1[3][4]=466.1955931200476 m2[3][4]=466.1955931200476 at TFsimple.checkm(TFsimple.java:108) at TFsimple.runTest(TFsimple.java:18) at TFsimple.multipleTests(TFsimple.java:119) at TFsimple.main(TFsimple.java:130) The SIGSEGV is a different issue as it implies a crash in the VM. Additionally it is in closed source code so we cannot do anything about it (although feel free to open a bug if you see it with OpenJDK). Is the error posted in comment 18 on the same AMD 6000+ box where the rest have been or a different AMD box? Hi.
Could you try modifying your test (preferably the long version) to keep track of the last F[] (just pass f as an argument to the checker and on failure do
for (F ff : f) {
out.printf("new F(%dL, %dL, %dL),%n", ff.x, ff.y, ff.z);
}
This will make it easy to initialize the failing array - just copy what's printed into TF.java)? The F[] is the only random part of the program, and if the specific numbers produced are a problem (perhaps they reveal a bad assumption about runtime ranges in hotspot...?) running the test over and over again with that same F[] should be enough to reproduce it. If the numbers aren't what's causing it, then removing the randomness shouldn't be a problem*
Also, try running with -XX:+PrintCompilation. I suspect OSR compilation may be to blame here, and that should tell us whether any of that is going on.
Another thing that would be very helpful would be if you ran with -XX:+PrintAssembly, but this won't just work. You'll have to build openjdk7 and then follow the instructions here: http://wikis.sun.com/display/HotSpotInternals/PrintAssembly
* this is very important: if you do remove the randomness and go with a constant F[] that crashed on a previous run, make sure to run the test with -XX:CompileCommand=exclude,TF,multipleTests
If you don't do that, hotspot will notice that F[] is constant, and will do a whole lot of constant propagation throughout the program, and it will almost certainly prevent the crash. Right now, I don't think it can propagate constants from interpreted methods to compiled methods.
Is there any place I can download the latest pre-build openjdk 7 with PrintAssembly compiled-in. preferrably in .tar.gz so I can try it on Debian. I think this would testing simplify a lot. Hi. Before you do any of what I suggested, could you please clarify whether you were able to reproduce the wrong computation on more than 1 machine? I don't think there are any pre built jdks with PrintAssembly, but PrintCompilation should be available on all of them. Till now I observed the problem on 1 computer with various jdks (openjdk, Oracle/Sun, various versions). I tested on 2 computers Fedora - observer error messages & crashes. Debian - in 10 minutes with just 15-20 TFsimple.java started (nothing for 8way 32Gb Debian) the computer become completely non-responding & requires remote reboot. No error messages observed, but they may be missed, the computer is remote. It is running OK with 3 instances of TFsimple.java One more Fedora 15 is available to me, but this is a production server, I cannot run experiments there. If the debian machine is rebootable, can you try with a Fedora Livedisk? It would be really nice to be able to eliminate hardware as the problem before attempting deeper and more cumbersome diagnosis. Ooops, scratch that. Just noticed that you mentioned that the computer is remote :( This is the results with TF.java (in contrast with TFsimle.java
the TF.java does not cause operating system destabilization).
(some messages from error format are from the code code inserted to checker
for(int i=0;i<f1.length;i++)
{
final F ff1=f1[i],ff2=f2[i];
System.out.printf("new F(%dL, %dL, %dL),%n", ff1.x, ff1.y, ff1.z);
System.out.printf("new F(%dL, %dL, %dL),%n", ff2.x, ff2.y,ff2.z);
}
cat /tmp/hh_A
Running single thread code until test failure
1% TF::calc @ 233 (359 bytes)
1 TF::calc (359 bytes)
--- n java.lang.Double::doubleToRawLongBits (static)
2 java.lang.Double::doubleToLongBits (33 bytes)
3 TF::checkm (418 bytes)
--- n java.lang.Object::clone
2% TF::runTest @ 26 (77 bytes)
4 TF::runTest (77 bytes)
--- n sun.misc.Unsafe::compareAndSwapLong
3 made not entrant TF::checkm (418 bytes)
new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double
at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999)
at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661)
at java.util.Formatter.format(Formatter.java:2433)
at java.io.PrintStream.format(PrintStream.java:920)
at java.io.PrintStream.printf(PrintStream.java:821)
at TF.checkm(TF.java:143)
at TF.runTest(TF.java:19)
at TF.multipleTests(TF.java:157)
at TF.main(TF.java:168)
cat /tmp/hh_C
Running single thread code until test failure
1% TF::calc @ 233 (359 bytes)
2% TF::calc @ 218 (359 bytes)
1 TF::calc (359 bytes)
2 java.lang.Double::doubleToLongBits (33 bytes)
--- n java.lang.Double::doubleToRawLongBits (static)
3 TF::checkm (418 bytes)
--- n java.lang.Object::clone
3% TF::runTest @ 26 (77 bytes)
4 TF::runTest (77 bytes)
3 made not entrant TF::checkm (418 bytes)
new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double
at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999)
at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661)
at java.util.Formatter.format(Formatter.java:2433)
at java.io.PrintStream.format(PrintStream.java:920)
at java.io.PrintStream.printf(PrintStream.java:821)
at TF.checkm(TF.java:143)
at TF.runTest(TF.java:19)
at TF.multipleTests(TF.java:157)
at TF.main(TF.java:168)
cat /tmp/hh_D
Running single thread code until test failure
1% TF::calc @ 233 (359 bytes)
1 TF::calc (359 bytes)
2% TF::calc @ 218 (359 bytes)
2 java.lang.Double::doubleToLongBits (33 bytes)
--- n java.lang.Double::doubleToRawLongBits (static)
3 TF::checkm (418 bytes)
--- n java.lang.Object::clone
3% TF::runTest @ 26 (77 bytes)
4 TF::runTest (77 bytes)
3 made not entrant TF::checkm (418 bytes)
new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double
at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999)
at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661)
at java.util.Formatter.format(Formatter.java:2433)
at java.io.PrintStream.format(PrintStream.java:920)
at java.io.PrintStream.printf(PrintStream.java:821)
at TF.checkm(TF.java:143)
at TF.runTest(TF.java:19)
at TF.multipleTests(TF.java:157)
at TF.main(TF.java:168)
cat /tmp/hh_B
Running single thread code until test failure
1% TF::calc @ 233 (359 bytes)
1 TF::calc (359 bytes)
2 java.lang.Double::doubleToLongBits (33 bytes)
--- n java.lang.Double::doubleToRawLongBits (static)
3 TF::checkm (418 bytes)
--- n java.lang.Object::clone
2% TF::runTest @ 26 (77 bytes)
4 TF::runTest (77 bytes)
3 made not entrant TF::checkm (418 bytes)
new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double
at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999)
at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661)
at java.util.Formatter.format(Formatter.java:2433)
at java.io.PrintStream.format(PrintStream.java:920)
at java.io.PrintStream.printf(PrintStream.java:821)
at TF.checkm(TF.java:143)
at TF.runTest(TF.java:19)
at TF.multipleTests(TF.java:157)
at TF.main(TF.java:168)
with commented System.out code: cat /tmp/hh_A Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 2% TF::calc @ 218 (359 bytes) 1 TF::calc (359 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 2 java.lang.Double::doubleToLongBits (33 bytes) 3 TF::checkm (326 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (326 bytes) Exception in thread "main" java.lang.RuntimeException: TEST=7162 Elements [1][1] differ m1[0][0]=7706.7278164451545 m2[0][0]=7706.7278164451545 m1[0][1]=7320.609551430136 m2[0][1]=7320.609551430136 m1[0][2]=7201.232432903037 m2[0][2]=7201.232432903037 m1[0][3]=7363.6815207085365 m2[0][3]=7363.6815207085365 m1[0][4]=7757.530582167053 m2[0][4]=7757.530582167053 m1[1][0]=55438.31850924297 m2[1][0]=55438.31850924297 m1[1][1]=51824.39619900315 m2[1][1]=47728.39619900315 m1[1][2]=50530.755773340454 m2[1][2]=50530.755773340454 m1[1][3]=52423.69762827249 m2[1][3]=52423.69762827249 m1[1][4]=56990.29847098225 m2[1][4]=56990.29847098225 m1[2][0]=133505.03622530942 m2[2][0]=133505.03622530942 m1[2][1]=124527.3917768878 m2[2][1]=124527.3917768878 m1[2][2]=121304.10831809939 m2[2][2]=121304.10831809939 m1[2][3]=126020.68723619048 m2[2][3]=126020.68723619048 m1[2][4]=137401.1744967179 m2[2][4]=137401.1744967179 m1[3][0]=247738.9864061055 m2[3][0]=247738.9864061055 m1[3][1]=230644.9616565764 m2[3][1]=230644.9616565764 m1[3][2]=224572.93003040878 m2[3][2]=224572.93003040878 m1[3][3]=233404.58633299996 m2[3][3]=233404.58633299996 m1[3][4]=254725.65094484514 m2[3][4]=254725.65094484514 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) cat /tmp/hh_B Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (326 bytes) --- n java.lang.Object::clone 2% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (326 bytes) Exception in thread "main" java.lang.RuntimeException: TEST=50827 Elements [0][0] differ m1[0][0]=7706.7278164451545 m2[0][0]=7706.716008346415 m1[0][1]=7320.609551430136 m2[0][1]=7320.598202052335 m1[0][2]=7201.232432903037 m2[0][2]=7201.221250764572 m1[0][3]=7363.6815207085365 m2[0][3]=7363.670091676419 m1[0][4]=7757.530582167053 m2[0][4]=7757.518557928863 m1[1][0]=55438.31850924297 m2[1][0]=55438.22686371967 m1[1][1]=51824.39619900315 m2[1][1]=51824.31721228213 m1[1][2]=50530.755773340454 m2[1][2]=50530.67911095596 m1[1][3]=52423.69762827249 m2[1][3]=52423.61812134418 m1[1][4]=56990.29847098225 m2[1][4]=56990.21206504163 m1[2][0]=133505.03622530942 m2[2][0]=133504.82810230204 m1[2][1]=124527.3917768878 m2[2][1]=124527.20303750761 m1[2][2]=121304.10831809939 m2[2][2]=121303.92473653423 m1[2][3]=126020.68723619048 m2[2][3]=126020.49653899744 m1[2][4]=137401.1744967179 m2[2][4]=137400.96660552503 m1[3][0]=247738.9864061055 m2[3][0]=247738.60752429755 m1[3][1]=230644.9616565764 m2[3][1]=230644.61273405756 m1[3][2]=224572.93003040878 m2[3][2]=224572.59044236693 m1[3][3]=233404.58633299996 m2[3][3]=233404.2334013237 m1[3][4]=254725.65094484514 m2[3][4]=254725.2657997465 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) cat /tmp/hh_C Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2% TF::calc @ 218 (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (326 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (326 bytes) Exception in thread "main" java.lang.RuntimeException: TEST=24262 Elements [3][3] differ m1[0][0]=7612.943180925791 m2[0][0]=7612.943180925791 m1[0][1]=7232.091931343198 m2[0][1]=7232.091931343198 m1[0][2]=7114.651836443036 m2[0][2]=7114.651836443036 m1[0][3]=7275.303415709812 m2[0][3]=7275.303415709812 m1[0][4]=7664.617050258952 m2[0][4]=7664.617050258952 m1[1][0]=54797.25164657019 m2[1][0]=54797.25164657019 m1[1][1]=51222.70137181502 m2[1][1]=51222.70137181502 m1[1][2]=49944.34143399005 m2[1][2]=49944.34143399005 m1[1][3]=51815.50758683226 m2[1][3]=51815.50758683226 m1[1][4]=56329.43295526017 m2[1][4]=56329.43295526017 m1[2][0]=131964.4910261504 m2[2][0]=131964.4910261504 m1[2][1]=123086.10360065899 m2[2][1]=123086.10360065899 m1[2][2]=119901.11262812998 m2[2][2]=119901.11262812998 m1[2][3]=124563.47117926647 m2[2][3]=124563.47117926647 m1[2][4]=135812.6917897872 m2[2][4]=135812.6917897872 m1[3][0]=244880.80619995148 m2[3][0]=244880.80619995148 m1[3][1]=227974.9084076055 m2[3][1]=227974.9084076055 m1[3][2]=221977.9207674185 m2[3][2]=221977.9207674185 m1[3][3]=230708.52176553395 m2[3][3]=214324.52176553395 m1[3][4]=251783.71798452607 m2[3][4]=251783.71798452607 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) Created attachment 487413 [details] New TF.java file Okay, let us first eliminate loop unrolling being the issue. Can you try running the attached file? It is the same code as in comment 10, but with some extra print statements that get printed if an error occurs. We know that the value at a given index is modified 640 times in total, in 5 sets of 128. If a loop is being skipped, the print statement won't have it. Furthermore, the debug statement can be analysed to see if there was an addition error. OK, i will run your code now. answering your question whether the failure occur on the same F [] ? no - I propagated the index k from static void multipleTests(final java.util.Random r) to checker - the failure occurs at different k below two failures at k=0 & k=3, the Fs would be different. cat /tmp/hh_C Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2% TF::calc @ 156 (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (336 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (79 bytes) 4 TF::runTest (79 bytes) 3 made not entrant TF::checkm (336 bytes) Exception in thread "main" java.lang.RuntimeException: F try=3 calculation try=29260 Elements [0][1] differ m1[0][0]=7021.774830987034 m2[0][0]=7021.774830987034 m1[0][1]=6671.004497011981 m2[0][1]=6671.002964197445 m1[0][2]=6562.907481207491 m2[0][2]=6562.9073157314315 m1[0][3]=6711.11205485301 m2[0][3]=6711.112043123023 m1[0][4]=7070.144759580944 m2[0][4]=7070.144758923587 m1[1][0]=50533.65411709927 m2[1][0]=50533.65411709927 m1[1][1]=47238.60662216227 m2[1][1]=47238.60533837808 m1[1][2]=46060.29654012994 m2[1][2]=46060.29640153812 m1[1][3]=47785.9864299652 m2[1][3]=47785.98642014094 m1[1][4]=51948.73401721851 m2[1][4]=51948.73401666795 m1[2][0]=121696.16159646476 m2[2][0]=121696.16159646476 m1[2][1]=113509.99686878055 m2[2][1]=113509.9952559613 m1[2][2]=110574.13409857947 m2[2][2]=110574.13392446646 m1[2][3]=114873.9804417375 m2[2][3]=114873.98042939528 m1[2][4]=125248.01961714865 m2[2][4]=125248.01961645696 m1[3][0]=225825.05076021465 m2[3][0]=225825.05076021465 m1[3][1]=210234.96043019896 m2[3][1]=210234.95772861727 m1[3][2]=204708.61835611783 m2[3][2]=204708.61806446672 m1[3][3]=212760.64466901263 m2[3][3]=212760.64464833855 m1[3][4]=232196.23583147067 m2[3][4]=232196.23583031207 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) cat /tmp/hh_A Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (336 bytes) --- n java.lang.Object::clone 2% TF::runTest @ 26 (79 bytes) 3 made not entrant TF::checkm (336 bytes) Exception in thread "main" java.lang.RuntimeException: F try=0 calculation try=75033 Elements [3][3] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=165416.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) > answering your question whether the failure occur on the same F [] ?
> no - I propagated the index k from
> static void multipleTests(final java.util.Random r)
> to checker - the failure occurs at different k
> below two failures at k=0 & k=3, the Fs would be different.
I didn't mean "same F within the F array". I meant "same F array". You create many random F arrays and I wanted to print every element of the one that fails and try using that array for all the tests.
1. Your code is running. no error yet. Hopefully there will be some by the morning. 2. Answering your question from 2011-03-24 15:53:26 EDT I meant exactly this. I propagated "k" from multipleTests to checker. Array elements depend on k. The failures occurred at k=0 & k=3, i.e. the failures occurred at different elements of the F[] , i.e. the failure does not depend on F[] elements specifics. Created attachment 487427 [details]
the failure with your version
This is the first failure report of your code.
other 4 instances are still running.
The value was changed 640 times according to the log, which means that loops are not being skipped, nor are extra ones being added. I am trying to verify the additions now. I modified your code to save debugStr corresponding to matrix m1. in case of an error two debugStr (corresponding to m1 & m2) should be now printed. This will simplify the comparison. Created attachment 487428 [details]
New TF.java file
Thanks for the quick responses! The calculations add up as well. Unfortunately that does not eliminate the possibility of hw error during arithmetic.
Can you please try the file attached with this post? In addition to printing the calculations done for the new array, it also prints the original set of calculations.
Doh! Please ignore comment 35.. I didn't see comment 34.. you are already thinking along the same lines as me :) The modified code, left to calculate for entire night lead the computer
non-responsive & power rest was required.
Debian box also also went non-responding to ssh, only to ping, exactly as it was last time. So no results produced.
What else can be the problem?
evidently memory allocator of axb array.
final double [][] m=new double [a][b];
This would explain:
1. Destabilisation of my computer & Debian.
2. Difficulty of reproduction on other people computers.
3. problem with long instead of double.
it is unlikely to have the same problem in both long & double arithmetic.
4. Random nature of the problem.
On my workstation and on Debian I modified TFSimple.java
to replace all
new double [a][b] by a call to:
static double [][] newdouble2(final int a, final int b)
{
final double [][] m=new double [a][];
for(int i=a;--i>=0;)
{
m[i]=new double [b];
}
return m;
}
and started 20 instances of this modified TFSimple1.java on my workstation.
previously 20 instances would destabilise my workstation
in less than 1 minute and
and a reboot would required.
Now for more than 40 minutes it is running OK.
but one TFsimle1.java failed anyway.
Exception in thread "main" java.lang.RuntimeException: TEST=26728 Elements [0][4] differ
m1[0][0]=0.1700774336283186 m2[0][0]=0.1700774336283186
m1[0][1]=0.11166978133704736 m2[0][1]=0.11166978133704736
m1[0][2]=0.2518757394821027 m2[0][2]=0.2518757394821027
m1[0][3]=1.0251160550185991 m2[0][3]=1.0251160550185991
m1[0][4]=6.11267581491528 m2[0][4]=5.654225128796633
m1[1][0]=1.043826838481748 m2[1][0]=1.043826838481748
m1[1][1]=0.6853579121009847 m2[1][1]=0.6853579121009847
m1[1][2]=1.5458526814816618 m2[1][2]=1.5458526814816618
m1[1][3]=6.291508684952188 m2[1][3]=6.291508684952188
m1[1][4]=37.51570643105274 m2[1][4]=37.51570643105274
m1[2][0]=3.2998209696571212 m2[2][0]=3.2998209696571212
m1[2][1]=2.1666030482227314 m2[2][1]=2.1666030482227314
m1[2][2]=4.886861408711588 m2[2][2]=4.886861408711588
m1[2][3]=19.88917272867012 m2[2][3]=19.88917272867012
m1[2][4]=118.59736711957807 m2[2][4]=118.59736711957807
m1[3][0]=12.971299713493547 m2[3][0]=12.971299713493547
m1[3][1]=8.516721894032319 m2[3][1]=8.516721894032319
m1[3][2]=19.209813069734487 m2[3][2]=19.209813069734487
m1[3][3]=78.18255077754401 m2[3][3]=78.18255077754401
m1[3][4]=466.1955931200476 m2[3][4]=466.1955931200476
at TFsimple1.checkm(TFsimple1.java:118)
at TFsimple1.runTest(TFsimple1.java:28)
at TFsimple1.multipleTests(TFsimple1.java:129)
at TFsimple1.main(TFsimple1.java:140)
[2] Exit 1 java TFsimple1
Created attachment 487461 [details]
updated TF.java
this is the TF.java I just started
Created attachment 487462 [details]
TFsimple1.java
this TFsimple1.java does not destabilize my workstation
(20 instances of the TFsimple.java would destabilize in less than 1 minute)
but this TFsimple1.java did produce one error anyway.
The difference between TFsimple.java & TFsimple1.java is in
new double [a][b]
Created attachment 487493 [details]
test with extra diag
the test with extra diag
Created attachment 487494 [details]
the failure with TFo.java
the failure results with TFo.java
Created attachment 487495 [details]
where the problem started
This is where the problem started:
m1: (original)
Loop 5 Added 95.7778766848038 at [2][1]. sa=2.036801632444434 (waq+1.1)=2.612425807401708 xx=0.101 yy=0.072 New value=60665.69720562991
m2: (repeated)
Loop 5 Added 95.77787662549497 at [2][1]. sa=2.0368016311831787 (waq+1.1)=2.612425807401708 xx=0.101 yy=0.072 New value=60665.6972055706
m2 is incorrect:
java T1
sa=2.036801632444434 RES=95.7778766848038
Created attachment 487503 [details]
repeated sa calculations
The code in T1.java does repeated calculation,
it was optimized away, I will look later on that.
But anyway correct SA=2.036801632444434,
which sometimes incorrectly calculated.
java -XX:+PrintCompilation T1
SA=2.036801632444434
1 T1::getsa (139 bytes)
2 java.util.Random::next (47 bytes)
--- n sun.misc.Unsafe::compareAndSwapLong
3 java.util.concurrent.atomic.AtomicLong::get (5 bytes)
4 java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
--- n java.lang.Double::doubleToRawLongBits (static)
5 java.lang.Double::doubleToLongBits (33 bytes)
--- n java.lang.Object::clone
6 java.lang.Double::compare (54 bytes)
1% T1::runTests @ 53 (131 bytes)
Ah, so it is an arithmetic issue. It is very improbable that a compiler optimization may cause such an error. This is looking more like an hw issue. Can you try running it again, this time disabling the JIT compiler with: -Djava.compiler=NONE That will make it a lot slower, but it will eliminate the compiler as being the issue. Once we know it can happen with interpreter only, we can perhaps retry with something like gij. I just started 5 instances with java -Djava.compiler=NONE -XX:+PrintCompilation TFo But I expect no problem when JIT is not enabled. I think this issue is very unlikely to be an arithmetic. Reasons - the issue is 1) random and strongly linked to loops 2) same problem does also happen with longs, not only with doubles. the value of "sa" is obtained as a sum over two loop : ks & ls (which are easily optimizable by jit), and most likely reason - jit compiler did something bad with loop unrolling/optimization, generating wrong machine code which in some instances give just wrong values result, in other may cause system destabilization via memory or CPU overuse (like in that tests on Debian). To see that I tried to see assembly, by trying to build openjdk-7-ea-src-b135-24_mar_2011.zip with -XX:+PrintAssembly but it gives me an error BUILD FAILED /rdata/o1/openjdk/jaxp/build-defs.xml:70: ERROR: Cannot find source for project jaxp. but jaxp the source is there in openjdk7 The reason I am disinclined to believe that this is a compiler issue is because we have been unable to reproduce this on any other hardware. I had the test running overnight last night (total of another 20+ hours on top of previous 12+) and I was still unable to trigger it. Of course that does not preclude the possibility that it is a bug introduced by the compiler when it optimizes specifically for the hw you have. That's why I'd like to see what happens with -Djava.compiler=NONE. The case -Djava.compiler=NONE is still running still no problem. I am compiling openjdk-7-ea-src-b135-24_mar_2011.zip now, trying to get -XX:+PrintAssembly option working Great! I take it you figured out the jaxp error? I was just looking into it -- you need to do 'make ALLOW_DOWNLOADS=true' so that it will download it. Created attachment 487902 [details]
failure 1 (assembler dump)
failure with assemble dump.
The problem, as we identified earlier, is incorrect calculation
of sa inside the loop:
double sa=0,wa=1;
for(int ks=k;ks>=0;ks--)
{
double sb=0,wb=1;
for(int ls=l;ls>=0;ls--)
{
sb+=mx[ks][ls]*wb;
wb*=xx;
wb*=ls+1;
}
sa+=sb*wa;
wa*=yy;
wa*=ks+1;
}
// HERE sa may be incorrect !!!
m[k][l]+=sa*(l+2)*(k+4)*(waq+1.1);
I am attaching the assembler dump.
Created attachment 487903 [details]
second failure
second failure
Created attachment 487906 [details]
no failure, java terminated by kill
no failure, java terminated by kill
Created attachment 487916 [details]
TFO.java
This is TFO.java & class files.
assembler dump is attached in previous posts.
The goal, as I see now, is to identify the likehood of hardware failure
against hotspot bug.
Same code with JIT disabled was running for >12 hour without a problem.
I attached assembler bug, we probably know where the problem is
(double loop calculating sa) now one familiar with the hotspot
need to check the asm dump.
Regarding hh_A.txt
the problem is with sa calculation:
CORRECT:--Loop 5 Added 1197.8708043516913 at [3][2]. sa=2.3885752977674133 (waq+1.1)=17.910718659530968 xx=0.101 yy=0.072 New value=27577.43591104506
vs
WRONG:-Loop 5 Added 1197.8708008520678 at [3][2]. sa=2.388575290789103 (waq+1.1)=17.910718659530968 xx=0.101 yy=0.072 New value=27577.435907545434
The correct sa=2.3885752977674133
this code is called many times inside other loops,
and once in 3-4 hours give wrong sa.
double sa=0,wa=1;
for(int ks=k;ks>=0;ks--)
{
double sb=0,wb=1;
for(int ls=l;ls>=0;ls--)
{
sb+=mx[ks][ls]*wb;
wb*=xx;
wb*=ls+1;
}
sa+=sb*wa;
wa*=yy;
wa*=ks+1;
}
the matrix mx does not depend on anything, fixed for all tests.
and the value of sa depend on outside loop indices k & l only,
sa does not depend on anything else.
The problem: sa sometimes is calculated INCORRECTLY.
in interpreter mode (no JIT) I did not catch a failure yet.
One more reason why this is unlikely hardware bug: I did run hundreds of unit tests for internal java code - no problem at all. The only issue arise when there are two loops one inside another & JIT is enabled. as in sa calculation above. No problem in any other situation. Forgot to tell: the last tests are run with: java -version openjdk version "1.7.0-internal" OpenJDK Runtime Environment (build 1.7.0-internal-root_2011_03_26_02_25-b00) OpenJDK 64-Bit Server VM (build 21.0-b05, mixed mode) as for x in A B C D E ; do java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly TFo >hh_$x 2>&1 & done and the openjdk source used: openjdk-7-ea-src-b135-24_mar_2011.zip looking at hh_A.txt output at the end - we see that the last run before failure (the data saved into debugStr printed at the end) the sa at[k=3][l=2] was calculated 128 times total. only the 27-th time the sa was calculated incorrectly to [3][2]. sa=2.388575290789103 all other times it was equal to correct value [3][2]. sa=2.3885752977674133 I expect there should be different asm used to calculate sa at k=3,k=2 at 27-th time, probably some loop optimization triggered, and the transition between loop execution strategies may be broken. By the way - the last JIT code generated related to function F.calc is asm inlining 4 loops at once: hh_A.txt:58122: ; - TFo::calc@87 (line 86) openjdk did loop inline, generated new asm, which did loop unrolling near hh_A.txt:58122 and this asm fails in some situation, not always. once this loop inline was done - the 27th calculation of sa at [k=3][l=2] became incorrect & the program failed, right after last loop inline JIT newest generated asm. Given the fact that loop unrolled code calculates sa incorrectly quite seldom and the problem is hard to reproduce, as well as that it popped up right after new asm unrolling was generated - I think the problem is that loop unrolling code is RACY in some sense. It probably has synchronization wrong and variable/registry got clobbered in the middle of asm generation or execution. By the way in other test results: hh_D.txt.gz exactly the same pattern - unroll loops TFo::calc@87 (line 86) and then failure. I do not think that just plain incorrect unrolling would not be caught by other people. This is a race almost for sure. locking is broken in either in 1. JIT generator itself. or 2. The machine code generated by JIT generator. Created attachment 487958 [details]
The results with TFsimple.java
Below are the results with TFsimple.java
(much simpler code (no strings used), much less to analyse, but less data collected).
Same pattern - loop unroll and then test failure.
Created attachment 488194 [details]
new reproducer (hopefully)
Hi.
Can you please try running the attached file?
It is the same as TFsimple.java with a few modifications. I removed the F array and the F class because they weren't being used, and I made mx a static double[][]. Also, when you run it, please add the -XX:CompileOnly=TF::runTest,TF::calc option.
It will be interesting to see whether this fails, and also the assembler printouts should be shorter.
I just started it about 30 minutes ago as for x in A B C D E F ; do java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileOnly=TFD1::runTest,TFD1::calc TFD1 >TFD1_$x 2>&1 & sleep 4 ; done (renamed to TFD1 to avoid confusion with other versions ). No problem yet. The TFsimple.java did give an error faster. Your code is very unlikely to give an error because JIT does not compile the function "calc" regulary (and asm output does not grow). In failed versions JIT did regular compilation of calc, eventually some of them did fail. and the asm output did grow permanently. I think this is a race almost for sure. It is unlikely that plain wrong JIT output would not be caught by other people. From my experience with linux - fixing a race is the organizational problem. One need to reach the maintainer, who is familiar with locking in the specific module. Any other method is just plain ineffective and may take weeks. This is why I never liked SUN - I never being able to reach the mantainer to fix the problem. Created attachment 488225 [details]
new reproducer results.
Unlikely to give an error because asm output does not grow.
In failed versions JIT did compile the calc() regularly,
what eventually produced an error. In this version no new asm generated
for long time.
I think if we tell hotspot JIT to be much more aggressive in loop unrolling e.g. force it to recompile & unroll every 10-th call of calc. In this case JIT compilations would go permanently and there are would be much higher chances to catch a race. Once we identified this is a race - for successful race fixing we need somehow to reach the maintainer/author of loop unrolling code at openjdk or oracle. Anything else is just plain ineffective in case of race fixing. (In reply to comment #64) > Created attachment 488225 [details] > new reproducer results. > > Unlikely to give an error because asm output does not grow. > In failed versions JIT did compile the calc() regularly, I'm not sure what you mean by "compiled regularly". I looked at every attachment you posted. In TFsimple.java, calc() is JIT compiled exactly once, and in the others it's compiled exactly twice. You say that when it crashes the asm output grows. How does it grow? Does it output a lot of asm early, then stop for a while, and then it outputs more near the crash? > what eventually produced an error. In this version no new asm generated > for long time. That's what I expected. Can you try running what I posted again, but this time without the -XX:CompileOnly... command? Also, your hypothesis that it's a race sounded very reasonable at first, but now I'm not so sure. As far as I understand this (and correct me if I'm wrong), the code is getting compiled, it's running for a long time correctly (1 hour+), and then it breaks. If there was some race in the compilation and wrong code was produced, we'd be seeing a mistake immediately after the compilation. And I don't see how there can be a race in generated code, since there's no shared data here. I just started with -XX:CompileOnly. take for example TFo.java, I posted logs above. >it's running for a long time correctly (1 hour+) 10-30 minutes correctly, then failure. right before the failure where was always loop unrolling: https://bugzilla.redhat.com/attachment.cgi?id=487902 file line 58122: ; - TFo::calc@87 (line 86) https://bugzilla.redhat.com/attachment.cgi?id=487903 file 57868: ; - TFo::calc@87 (line 86) after thing compilation a failure occur (probably because of a race), all other asm is fininshing the output. Most likely race I see - the code produced by JIT (compiled code) may be racy by itself. It breaks eventually. In TFo.java it was easy to catch, because the failure occured very soon after the JIT loop unrollng. But it did produce some number of correct result before the failure. for example first 26 calculations at [k=3][l-2]. give correct sa=2.3885752977674133 then the 27-th calculation at [k=3][l=2]. give wrong sa=2.388575290789103 Created attachment 488237 [details]
hotspot failure
One of TFD1 processes produced hotspot failure
in _multianewarray2_Java
But this may be unrelated to problem of different apparence of the race.
tack: [0x00007fb206c21000,0x00007fb206d22000], sp=0x00007fb206d206a0, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x7d75bf] typeArrayKlass::allocate(int, Thread*)+0x7f
V [libjvm.so+0x69664f] objArrayKlass::multi_allocate(int, int*, Thread*)+0x15f
V [libjvm.so+0x72c67a] OptoRuntime::multianewarray2_C(klassOopDesc*, int, int, JavaThread*)+0x3a
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
v ~RuntimeStub::_multianewarray2_Java
J TFD1.calc(IIDD[[DLTFD1$ComputationLog;)[[D
J TFD1.runTest(III)V
j TFD1.multipleTests()V+12
j TFD1.main([Ljava/lang/String;)V+4
v ~StubRoutines::call_stub
(In reply to comment #69) > Created attachment 488237 [details] > hotspot failure > > One of TFD1 processes produced hotspot failure > in _multianewarray2_Java > But this may be unrelated to problem of different apparence of the race. Now this is definitely a bug. You're probably right that it's unrelated. Anyway, some VM options that may be of interest to you: -XX:+CompileTheWorld -XX:+TraceIterativeGVN -XX:+DeoptimizeALot -XX:+PrintInlining -XX:-Inline // this will turn off inlining. If you can't reproduce it with this, that means something might be wrong with inlining. -XX:+VerifyLoopOptimizations Started TFo (most easy to reproduce) with -XX:-Inline no problem yet (15 minutes). Typically they fail in 5-20 minutes. I will leave TFo running until tomorrow. Our internal unit tests (where I originally found the problem) passed OK with options -XX:-Inline. options -XX:+CompileTheWorld -XX:+TraceIterativeGVN -XX:+DeoptimizeALot -XX:+PrintInlining -XX:+VerifyLoopOptimizations are not recognized by openjdk7 (In reply to comment #71) ... > options -XX:+CompileTheWorld -XX:+TraceIterativeGVN -XX:+DeoptimizeALot > -XX:+PrintInlining -XX:+VerifyLoopOptimizations > are not recognized by openjdk7 That's because you need a debug build of openjdk7. For that, just do "make debug" instead of "make". Also, -XX:+VerifyIterativeGVN is very good for spotting JIT mistakes. Created attachment 488276 [details]
an error with -XX:-Inline
A failure with:
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-Inline TFo
In this failure sa is always correct sa=0.1700774336283186 error is in waq calculated incorrect: -Loop 3 Added 8.597971821578138 at [0][0]. New value=15.580084180784024 correct: +Loop 3 Added 9.295708174139355 at [0][0]. New value=16.27782053334524 all other waq are correct: -Loop 3 Added 0.0025153796731071836 at [0][0]. New value=29.080013812172588 +Loop 3 Added 0.0025153796731071836 at [0][0]. New value=30.727116245989492 Strangely, the pattern is always: for loop and then two indices array += something
for(....)
{
m[k][l]+=wb;
also, is function inline & loop unroll ins the same -XX:-Inline
or different jdk options?
What is the corresponding source code for the output in comment 74? Also, can you try again, this time only supply -XX:LoopUnrollLimit=0 That should prevent unrolling TFO.java is Comment 52 2011-03-26 13:14:07 EDT https://bugzilla.redhat.com/attachment.cgi?id=487493 Created attachment 488415 [details]
a failure with LoopUnrollLimit=0
a failure with LoopUnrollLimit=0
My second guess about the problem is multiarray memory allocator. because: 1. there was a failure in ~RuntimeStub::_multianewarray2_Java ( comment 69 ) This may be unrelated, but may be a hint. 2. The problem started to show when the 2D & 3D arrays were added to the code. 3. All failing sums are the sums over a double array. Also note my comment 37 above which did not produce any changes in behavior. That is odd. Can you try again, this time printing the value of sa each time it is modified in any of the loops, and what components make it up? I'd do it myself if it was reproducible here :/
e.g.
double sa=0,wa=1;
for(int ks=k;ks>=0;ks--)
{
double sb=0,wb=1;
for(int ls=l;ls>=0;ls--)
{
sb+=mx[ks][ls]*wb;
wb*=xx;
wb*=ls+1;
}
sa+=sb*wa;
wa*=yy;
wa*=ks+1;
}
becomes:
double sa=0,wa=1;
for(int ks=k;ks>=0;ks--)
{
double sb=0,wb=1;
for(int ls=l;ls>=0;ls--)
{
sb+=mx[ks][ls]*wb;
wb*=xx;
wb*=ls+1;
}
debugStr[k][l].append("Incrementing sa=" + sa + " by " + sb*wa + " to be " + sa+sb*wa);
sa+=sb*wa;
wa*=yy;
wa*=ks+1;
}
This will tell us which specific part is incorrect, and then we can try to narrow it down from there by printing when sb, wa, etc. are modified. If you want, you can also print the subcomponents initially too.. it will make the log bigger, but then it wont have to be narrowed down later
I will do this tomorrow. One more thing: I just looked at that my old workstation (where failures occur) which for >4 hours was running 8 instances of java -Djava.compiler=NONE TFo Kernel crashed in one of java processes. uname -a Linux comp90 2.6.34.8-68.fc13.x86_64 #1 SMP Thu Feb 17 15:03:58 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux I do not have a serial port nulmodem cable handy to catch the trace. but anyway - can this problem be Linux smp race? this would explain strange behaviour. Interestingly - there was no single sign of problem with that old workstation, besides java. I think then the most straighforward way for me - trying to update it to F15, then try to reproduce java failure. By the way - Debian (8core 32Gb) uname -a Linux forum 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux java -version java version "1.6.0_21" Java(TM) SE Runtime Environment (build 1.6.0_21-b06) Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode) locks when trying to run 20 instances of TFSimple only pressing "reset" helps. I am not sure why the Debian box is locking up just on something like high load. It is unfortunate that the error cannot be verified there. As for the F15 update -- I am curious to see that too now. Hopefully it doesn't freeze, and it doesn't trigger there either. Is the box capable of KVM virtualization? You might want to give that a shot too. I was very busy with other things, so I did not at this much more. But I some kind managed to get some kind abnormal (but not exact failure) on a 4 cores, 24GB, Fedora (old, but this does not matter) Linux cc 2.6.27.41-170.2.117.fc10.x86_64 #1 SMP Thu Dec 10 10:36:29 EST 2009 x86_64 x86_64 x86_64 GNU/Linux java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) with TFsimple.java from Comment 14 2011-03-24 08:33:13 EDT The results: start 20 instances of TFsimple n=0 ; while [ $n -lt 20 ]; do java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly TFsimple >diag_$n & n=$[ $n + 1 ] ; done it works perfectly fine for about 10-15 minutes (just high load), then the computer goes out of memory freezes, messages about >120 sec in interrupt in console, out of memory, etc. No reaction on the keyboard, only pressing "reset" button helps. The question is: why after about 10 minutes java processes start consuming that much memory on a such a non-memory intensive code as TFsimple.java Created attachment 489908 [details]
TFSsimple.java
Code simplification on the same computer where all other tests were run,
the OS is not updated yet.
[][] arrays are removed, only [] left.
The test was run with hotspot disabled
java -XX:+PrintAssembly -Djava.compiler=NONE
The result (after few hours)
OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output
VM option '+PrintAssembly'
Running single thread code until test failure
Exception in thread "main" java.lang.RuntimeException: TEST=38917 Elements [3] differ
m1[0]=287.5015945415851 m2[0]=287.5015945415851
m1[1]=193.26240424710878 m2[1]=193.26240424710878
m1[2]=323.1759685698639 m2[2]=323.1759685698639
m1[3]=863.5951311480515 m2[3]=863.5951311480515
at TFSsimple.checkm(TFSsimple.java:102)
at TFSsimple.runTest(TFSsimple.java:19)
at TFSsimple.multipleTests(TFSsimple.java:113)
at TFSsimple.main(TFSsimple.java:124)
strange thing - m1[3] & m2[3] are identical.
they were checked at
line 87: if(Double.compare(m1[i], m2[i]) != 0)
for some reason the value of array element were was taken wrong (most probably)
at line 87, but it was OK on lines 99 & 100.
Does this make any sense?
Created attachment 489909 [details]
proper TFSsimple.java
proper file TFSsimple.java
Hi.
I think what you have above is definitely a case of 2 different doubles seeming identical but having the same string representation. To confirm this, please run the test again, but replace
s.append("m1["+i+"]="+m1[i]+"\t");
s.append("m2["+i+"]="+m2[i]+"\n");
with
s.append("m1["+i+"]=" + Double.doubleToRawLongBits(m1[i]) + "\t");
s.append("m2["+i+"]=" + Double.doubleToRawLongBits(m2[i]) + "\n");
If I'm right and this inconsistency is due to simply the imprecise nature of finite space binary->decimal rational number conversion, then this last test conclusively rules out the optimizer or code generator as the problem. Since you can reproduce it with both the compiler and interpreter, and you can only reproduce it on one machine, the problem is almost certainly your hardware.
Perhaps you could try running some other computationally intensive task, like prime95 for a few hours?
This message is a notice that Fedora 15 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 15. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '15' have been closed as WONTFIX. (Please note: Our normal process is to give advanced warning of this occurring, but we forgot to do that. A thousand apologies.) Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version. Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 15 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping |