Hide Forgot
JDK loop unrolling is wrong on all JDKs I tried, including , but not limited to java-1.6.0-openjdk-devel-1.6.0.0-50.1.8.7.fc13.x86_64 java version "1.6.0_23" Java(TM) SE Runtime Environment (build 1.6.0_23-b05) Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode) java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) ava version "1.7.0-ea" Java(TM) SE Runtime Environment (build 1.7.0-ea-b134) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b04, mixed mode) start the unit test attached TF.java start several instances (there was no failures with 1 instance) shell command to run: for x in A B C D E F G H I J K L ; do java TF & done the code would eventually randomly fail: Exception in thread "main" java.lang.RuntimeException: TEST=3080 Elements [3][3] differ m1[0][0]=70.80493904142715 m2[0][0]=70.80493904142715 m1[0][1]=5.505800716979796 m2[0][1]=5.505800716979796 m1[0][2]=1.6437479313186212 m2[0][2]=1.6437479313186212 m1[0][3]=1.4019663061342367 m2[0][3]=1.4019663061342367 m1[0][4]=1.534046915882465 m2[0][4]=1.534046915882465 m1[1][0]=67.71546800405181 m2[1][0]=67.71546800405181 m1[1][1]=18.46031213801821 m2[1][1]=18.46031213801821 m1[1][2]=15.461729052867444 m2[1][2]=15.461729052867444 m1[1][3]=16.116727575946555 m2[1][3]=16.116727575946555 m1[1][4]=17.944598246769843 m2[1][4]=17.944598246769843 m1[2][0]=139.25078042329778 m2[2][0]=139.25078042329778 m1[2][1]=46.3761808781519 m2[2][1]=46.3761808781519 m1[2][2]=40.648023996975034 m2[2][2]=40.648023996975034 m1[2][3]=42.57028104530857 m2[2][3]=42.57028104530857 m1[2][4]=47.41650732453458 m2[2][4]=47.41650732453458 m1[3][0]=342.5093615611222 m2[3][0]=342.5093615611222 m1[3][1]=96.6730861908069 m2[3][1]=96.6730861908069 m1[3][2]=81.46054748408591 m2[3][2]=81.46054748408591 m1[3][3]=84.98153472770015 m2[3][3]=84.89837472770016 m1[3][4]=94.62787551705485 m2[3][4]=94.62787551705485 at TF.checkm(TF.java:99) at TF.runTest(TF.java:119) at TF.main(TF.java:131) Exception in thread "main" java.lang.RuntimeException: TEST=25440 Elements [1][0] differ m1[0][0]=70.80493904142715 m2[0][0]=70.80493904142715 m1[0][1]=5.505800716979796 m2[0][1]=5.505800716979796 m1[0][2]=1.6437479313186212 m2[0][2]=1.6437479313186212 m1[0][3]=1.4019663061342367 m2[0][3]=1.4019663061342367 m1[0][4]=1.534046915882465 m2[0][4]=1.534046915882465 m1[1][0]=67.71546800405181 m2[1][0]=67.70929080334516 m1[1][1]=18.46031213801821 m2[1][1]=18.459956042302434 m1[1][2]=15.461729052867444 m2[1][2]=15.461711457635538 m1[1][3]=16.116727575946555 m2[1][3]=16.11672681521524 m1[1][4]=17.944598246769843 m2[1][4]=17.94459821753404 m1[2][0]=139.25078042329778 m2[2][0]=139.25155700955827 m1[2][1]=46.3761808781519 m2[2][1]=46.376225645847406 m1[2][2]=40.648023996975034 m2[2][2]=40.64802620901508 m1[2][3]=42.57028104530857 m2[2][3]=42.5702811409463 m1[2][4]=47.41650732453458 m2[2][4]=47.416507328210045 m1[3][0]=342.5093615611222 m2[3][0]=342.50926279905303 m1[3][1]=96.6730861908069 m2[3][1]=96.67308049749185 m1[3][2]=81.46054748408591 m2[3][2]=81.46054720277053 m1[3][3]=84.98153472770015 m2[3][3]=84.98153471553745 m1[3][4]=94.62787551705485 m2[3][4]=94.62787551658742 at TF.checkm(TF.java:99) at TF.runTest(TF.java:119) at TF.main(TF.java:131) Exception in thread "main" java.lang.RuntimeException: TEST=77816 Elements [1][0] differ m1[0][0]=70.80493904142715 m2[0][0]=70.80493904142715 m1[0][1]=5.505800716979796 m2[0][1]=5.505800716979796 m1[0][2]=1.6437479313186212 m2[0][2]=1.6437479313186212 m1[0][3]=1.4019663061342367 m2[0][3]=1.4019663061342367 m1[0][4]=1.534046915882465 m2[0][4]=1.534046915882465 m1[1][0]=67.71546800405181 m2[1][0]=67.70929080334516 m1[1][1]=18.46031213801821 m2[1][1]=18.459956042302434 m1[1][2]=15.461729052867444 m2[1][2]=15.461711457635538 m1[1][3]=16.116727575946555 m2[1][3]=16.11672681521524 m1[1][4]=17.944598246769843 m2[1][4]=17.94459821753404 m1[2][0]=139.25078042329778 m2[2][0]=139.25155700955827 m1[2][1]=46.3761808781519 m2[2][1]=46.376225645847406 m1[2][2]=40.648023996975034 m2[2][2]=40.64802620901508 m1[2][3]=42.57028104530857 m2[2][3]=42.5702811409463 m1[2][4]=47.41650732453458 m2[2][4]=47.416507328210045 m1[3][0]=342.5093615611222 m2[3][0]=342.50926279905303 m1[3][1]=96.6730861908069 m2[3][1]=96.67308049749185 m1[3][2]=81.46054748408591 m2[3][2]=81.46054720277053 m1[3][3]=84.98153472770015 m2[3][3]=84.98153471553745 m1[3][4]=94.62787551705485 m2[3][4]=94.62787551658742 at TF.checkm(TF.java:99) at TF.runTest(TF.java:119) at TF.main(TF.java:131) [10] Exit 1 java TF [12] Exit 1 java TF [15] Exit 1 java TF
Created attachment 487107 [details] Unit test to see incorrect hotspot loop unrolling. This is a test case to see incorrect loop unrolling by hotspot. The matrices m1 & m2 inside runTest must be identical, if at some point they are not - an exception is thrown
This is hotspot problem, generated .class files are OK
From observing the structure of matrix differences (single element differ) - the most likely reason is that during a switch from one type of loop execution to another hotspot either executed first operator of the loop body twice or not executed it at all.
I am still analysing the source, but given how close the numbers are, this looks more like a bug related to double handling than loop unrolling. Also, how long does it take to crash usually? I had it running for 10+ minutes and it didn't crash. Had to stop because laptop cpu got to 98C though :)
Looking at more of the code, this does look like a double comparison issue, and not related to loop unrolling. Specifically, in the code where you do: if(m1[i][j]!=m2[i][j]) { iif=i; jif=j; break oloop; } Can you please change it to: if(Double.compare(m1[i][j]!=m2[i][j]) != 0) { iif=i; jif=j; break oloop; } and try again? == is not an equivalence operator for double values. Please see Joe Darcy's blog for a more detailed description of this: http://blogs.sun.com/darcy/entry/notions_of_floating_point_equality
(In reply to comment #5) > > if(Double.compare(m1[i][j]!=m2[i][j]) != 0) Sorry, that should be: if(Double.compare(m1[i][j], m2[i][j]) != 0)
On further thought, this may be more than a rounding error. Nonetheless, can you please try the Double.compare() to eliminate that possibility and see if it still errors out? I've had the code (original with "!=") running on my desktop box (q-core i7 920 @ 3Ghz) for the past 3+ hours and so far it hasn't thrown any error.
Created attachment 487241 [details] updated test I am attaching updated TF.java where the problem is 1. easier to reproduce 2. numbers difference is not small , for example m1[1][1]=40377.57397780233 m2[1][1]=36281.57397780233 the code is started as n=0 ; while [ $n -lt 20 ]; do java TF & n=$[ $n + 1 ] ; done the attached test supersedes old TF.java Exception in thread "main" java.lang.RuntimeException: TEST=29469 Elements [1][1] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=30044 Elements [1][1] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=31890 Elements [0][0] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.888586395702 m1[0][1]=5716.6127269548415 m2[0][1]=5716.5952705629015 m1[0][2]=5624.415186604344 m2[0][2]=5624.397987441957 m1[0][3]=5751.072547006512 m2[0][3]=5751.054968098871 m1[0][4]=6057.8359619079365 m2[0][4]=6057.817467518348 m1[1][0]=43190.181781667496 m2[1][0]=43190.040840718015 m1[1][1]=40377.57397780233 m2[1][1]=40377.45114596819 m1[1][2]=39371.04889321945 m2[1][2]=39370.93072927575 m1[1][3]=40845.525454463925 m2[1][3]=40845.40313214674 m1[1][4]=44402.20207354617 m2[1][4]=44402.06916947626 m1[2][0]=103991.08438810863 m2[2][0]=103990.76428588785 m1[2][1]=97000.16811187661 m2[2][1]=96999.8766918958 m1[2][2]=94491.77180358718 m2[2][2]=94491.4892293681 m1[2][3]=98165.6040876451 m2[2][3]=98165.31074992764 m1[2][4]=107029.22874182602 m2[2][4]=107028.90898282987 m1[3][0]=192961.576069273 m2[3][0]=192960.99331829284 m1[3][1]=179643.57896832644 m2[3][1]=179643.04135665382 m1[3][2]=174920.49439552703 m2[3][2]=174919.97190336738 m1[3][3]=181800.08676866323 m2[3][3]=181799.543903905 m1[3][4]=198405.9217295142 m2[3][4]=198405.32933814573 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=33184 Elements [3][3] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=165416.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=34764 Elements [0][0] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.888586395702 m1[0][1]=5716.6127269548415 m2[0][1]=5716.5952705629015 m1[0][2]=5624.415186604344 m2[0][2]=5624.397987441957 m1[0][3]=5751.072547006512 m2[0][3]=5751.054968098871 m1[0][4]=6057.8359619079365 m2[0][4]=6057.817467518348 m1[1][0]=43190.181781667496 m2[1][0]=43190.040840718015 m1[1][1]=40377.57397780233 m2[1][1]=40377.45114596819 m1[1][2]=39371.04889321945 m2[1][2]=39370.93072927575 m1[1][3]=40845.525454463925 m2[1][3]=40845.40313214674 m1[1][4]=44402.20207354617 m2[1][4]=44402.06916947626 m1[2][0]=103991.08438810863 m2[2][0]=103990.76428588785 m1[2][1]=97000.16811187661 m2[2][1]=96999.8766918958 m1[2][2]=94491.77180358718 m2[2][2]=94491.4892293681 m1[2][3]=98165.6040876451 m2[2][3]=98165.31074992764 m1[2][4]=107029.22874182602 m2[2][4]=107028.90898282987 m1[3][0]=192961.576069273 m2[3][0]=192960.99331829284 m1[3][1]=179643.57896832644 m2[3][1]=179643.04135665382 m1[3][2]=174920.49439552703 m2[3][2]=174919.97190336738 m1[3][3]=181800.08676866323 m2[3][3]=181799.543903905 m1[3][4]=198405.9217295142 m2[3][4]=198405.32933814573 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=35158 Elements [1][1] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40373.57397780233 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=41942 Elements [0][1] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.608241494241 m1[0][2]=5624.415186604344 m2[0][2]=5624.414714771186 m1[0][3]=5751.072547006512 m2[0][3]=5751.07251441535 m1[0][4]=6057.8359619079365 m2[0][4]=6057.835960128156 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40377.56976564972 m1[1][2]=39371.04889321945 m2[1][2]=39371.04845013602 m1[1][3]=40845.525454463925 m2[1][3]=40845.525423858606 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207187484 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16217862988 m1[2][2]=94491.77180358718 m2[2][2]=94491.77117945894 m1[2][3]=98165.6040876451 m2[2][3]=98165.60404453438 m1[2][4]=107029.22874182602 m2[2][4]=107029.22873947177 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.56782488193 m1[3][2]=174920.49439552703 m2[3][2]=174920.4932233293 m1[3][3]=181800.08676866323 m2[3][3]=181800.08668769544 m1[3][4]=198405.9217295142 m2[3][4]=198405.92172509263 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=44978 Elements [1][4] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20205538556 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=55166 Elements [1][1] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=63456 Elements [1][1] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=38329.57397780235 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=64653 Elements [0][2] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5496.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=181800.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=76351 Elements [0][1] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.608241494241 m1[0][2]=5624.415186604344 m2[0][2]=5624.414714771186 m1[0][3]=5751.072547006512 m2[0][3]=5751.07251441535 m1[0][4]=6057.8359619079365 m2[0][4]=6057.835960128156 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40377.56976564972 m1[1][2]=39371.04889321945 m2[1][2]=39371.04845013602 m1[1][3]=40845.525454463925 m2[1][3]=40845.525423858606 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207187484 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16217862988 m1[2][2]=94491.77180358718 m2[2][2]=94491.77117945894 m1[2][3]=98165.6040876451 m2[2][3]=98165.60404453438 m1[2][4]=107029.22874182602 m2[2][4]=107029.22873947177 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.56782488193 m1[3][2]=174920.49439552703 m2[3][2]=174920.4932233293 m1[3][3]=181800.08676866323 m2[3][3]=181800.08668769544 m1[3][4]=198405.9217295142 m2[3][4]=198405.92172509263 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160) Exception in thread "main" java.lang.RuntimeException: TEST=25721 Elements [3][3] differ m1[0][0]=7021.774830987034 m2[0][0]=7021.774830987034 m1[0][1]=6671.004497011981 m2[0][1]=6671.004497011981 m1[0][2]=6562.907481207491 m2[0][2]=6562.907481207491 m1[0][3]=6711.11205485301 m2[0][3]=6711.11205485301 m1[0][4]=7070.144759580944 m2[0][4]=7070.144759580944 m1[1][0]=50533.65411709927 m2[1][0]=50533.65411709927 m1[1][1]=47238.60662216227 m2[1][1]=47238.60662216227 m1[1][2]=46060.29654012994 m2[1][2]=46060.29654012994 m1[1][3]=47785.9864299652 m2[1][3]=47785.9864299652 m1[1][4]=51948.73401721851 m2[1][4]=51948.73401721851 m1[2][0]=121696.16159646476 m2[2][0]=121696.16159646476 m1[2][1]=113509.99686878055 m2[2][1]=113509.99686878055 m1[2][2]=110574.13409857947 m2[2][2]=110574.13409857947 m1[2][3]=114873.9804417375 m2[2][3]=114873.9804417375 m1[2][4]=125248.01961714865 m2[2][4]=125248.01961714865 m1[3][0]=225825.05076021465 m2[3][0]=225825.05076021465 m1[3][1]=210234.96043019896 m2[3][1]=210234.96043019896 m1[3][2]=204708.61835611783 m2[3][2]=204708.61835611783 m1[3][3]=212760.64466901263 m2[3][3]=196376.64466901263 m1[3][4]=232196.23583147067 m2[3][4]=232196.23583147067 at TF.checkm(TF.java:138) at TF.runTest(TF.java:18) at TF.multipleTests(TF.java:149) at TF.main(TF.java:160)
Created attachment 487244 [details] proper latest test case proper latest test case
Created attachment 487246 [details] Latest test The test is started as n=0 ; while [ $n -lt 20 ]; do java TF & n=$[ $n + 1 ] ; done the results are in test.txt the number differences are not small m1[0][2]=5624.415186604344 m2[0][2]=5368.415186604342 comparison code - as you suggested. As a side note: there was no error reporting for long time, then, in another window I started the find / command - errors immediately came out
Created attachment 487247 [details] test results test results for a test started as n=0 ; while [ $n -lt 20 ]; do java TF & n=$[ $n + 1 ] ; done
Created attachment 487295 [details] Same problem does exist with long This is a much wider & much stronger problem. It does exist not only with double, but with long also. (it takes much more time to trigger the problem with longs) Test code & results are attached.
Created attachment 487304 [details] the results with longs instead of doublea the results with longs instead of double
Created attachment 487310 [details] simplified test This is TFsimple.java a much simplified TF.java which nevertheless, show the same problem with incorrect calculations. But it takes much longer to trigger. the function calc is just this: static double [][] calc(final int a,final int b,final F [] f, final double xx,final double yy, final double [][] mx) { final double [][] m=new double [a][b]; double wa=1; for(int k=0;k<a;k++) { double wb=wa; for(int l=0;l<b;l++) { m[k][l]+=wb*mx[k][l]; wb*=(l+0.7822332)*2; } wa*=(k+0.67221333); } return m; }
I've run the code for over 12 hours now, and still no error. I am trying the latest test case now (so far no errors there either). What sort of hardware are you running on test on? And have you been able to reproduce this on other hardware?
I am running Fedora on AMD Athlon(tm) 64 X2 Dual Core Processor 6000+ stepping : 3 cpu MHz : 3007.708 cache size : 1024 KB 8Gb memory, 64 bit java last tested on java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) but various other versions are also tried. I just run memtest for about 2 hours - no problem found. I just started the TFsimple.java on a very different computer - 20 instances at once: n=0 ; while [ $n -lt 20 ]; do java TFsimple & n=$[ $n + 1 ] ; done >diag_1.txt 2>&1 Not that many for the computer: Debian, 8 way Xeon 32 Gb memory Linux forum 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux java version "1.6.0_21" Java(TM) SE Runtime Environment (build 1.6.0_21-b06) Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode) The system got destabilized & became non-responding, so I did power reset. diag1 show nothing special. I just started 2 instances instead of 20.
I've been running the last version posted version of the test (with double) for over 3 hours now and it has still not error'd (8 instances). Any other machine you can try on to eliminate hardware as the problem? Another one of my co-workers (Denis Lila, on cc) tried it for 2.5+ hours on his machine too -- there were no errors.
Created attachment 487401 [details] errors My computer (AMX athlon) got destabilized with 3 instances of TFsimple running after 1 hour. Debian is still running 3 instances. looks like TF.java - is more likely give just an error and TFsimple.java is more likely to give destabilization after an error. attached a trace for java -version java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) cat /tmp/diag_simple.txt [1] 2748 [2] 2749 [3] 2750 [4] 2751 [5] 2752 [6] 2753 [7] 2754 [8] 2755 [9] 2756 [10] 2757 [11] 2758 [12] 2759 [13] 2760 [14] 2761 [15] 2762 Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure Running single thread code until test failure # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007f62a7d6d5ea, pid=2749, tid=140061692933888 # # JRE version: 6.0_24-b07 # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops) # Problematic frame: # V [libjvm.so+0x6295ea] # # An error report file with more information is saved as: # /home/mal/hs_err_pid2749.log # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # Exception in thread "main" java.lang.RuntimeException: TEST=62725 Elements [2][3] differ m1[0][0]=0.1700774336283186 m2[0][0]=0.1700774336283186 m1[0][1]=0.11166978133704736 m2[0][1]=0.11166978133704736 m1[0][2]=0.2518757394821027 m2[0][2]=0.2518757394821027 m1[0][3]=1.0251160550185991 m2[0][3]=1.0251160550185991 m1[0][4]=6.11267581491528 m2[0][4]=6.11267581491528 m1[1][0]=1.043826838481748 m2[1][0]=1.043826838481748 m1[1][1]=0.6853579121009847 m2[1][1]=0.6853579121009847 m1[1][2]=1.5458526814816618 m2[1][2]=1.5458526814816618 m1[1][3]=6.291508684952188 m2[1][3]=6.291508684952188 m1[1][4]=37.51570643105274 m2[1][4]=37.51570643105274 m1[2][0]=3.2998209696571212 m2[2][0]=3.2998209696571212 m1[2][1]=2.1666030482227314 m2[2][1]=2.1666030482227314 m1[2][2]=4.886861408711588 m2[2][2]=4.886861408711588 m1[2][3]=19.88917272867012 m2[2][3]=17.70913952378951 m1[2][4]=118.59736711957807 m2[2][4]=118.59736711957807 m1[3][0]=12.971299713493547 m2[3][0]=12.971299713493547 m1[3][1]=8.516721894032319 m2[3][1]=8.516721894032319 m1[3][2]=19.209813069734487 m2[3][2]=19.209813069734487 m1[3][3]=78.18255077754401 m2[3][3]=78.18255077754401 m1[3][4]=466.1955931200476 m2[3][4]=466.1955931200476 at TFsimple.checkm(TFsimple.java:108) at TFsimple.runTest(TFsimple.java:18) at TFsimple.multipleTests(TFsimple.java:119) at TFsimple.main(TFsimple.java:130)
The SIGSEGV is a different issue as it implies a crash in the VM. Additionally it is in closed source code so we cannot do anything about it (although feel free to open a bug if you see it with OpenJDK). Is the error posted in comment 18 on the same AMD 6000+ box where the rest have been or a different AMD box?
Hi. Could you try modifying your test (preferably the long version) to keep track of the last F[] (just pass f as an argument to the checker and on failure do for (F ff : f) { out.printf("new F(%dL, %dL, %dL),%n", ff.x, ff.y, ff.z); } This will make it easy to initialize the failing array - just copy what's printed into TF.java)? The F[] is the only random part of the program, and if the specific numbers produced are a problem (perhaps they reveal a bad assumption about runtime ranges in hotspot...?) running the test over and over again with that same F[] should be enough to reproduce it. If the numbers aren't what's causing it, then removing the randomness shouldn't be a problem* Also, try running with -XX:+PrintCompilation. I suspect OSR compilation may be to blame here, and that should tell us whether any of that is going on. Another thing that would be very helpful would be if you ran with -XX:+PrintAssembly, but this won't just work. You'll have to build openjdk7 and then follow the instructions here: http://wikis.sun.com/display/HotSpotInternals/PrintAssembly * this is very important: if you do remove the randomness and go with a constant F[] that crashed on a previous run, make sure to run the test with -XX:CompileCommand=exclude,TF,multipleTests If you don't do that, hotspot will notice that F[] is constant, and will do a whole lot of constant propagation throughout the program, and it will almost certainly prevent the crash. Right now, I don't think it can propagate constants from interpreted methods to compiled methods.
Is there any place I can download the latest pre-build openjdk 7 with PrintAssembly compiled-in. preferrably in .tar.gz so I can try it on Debian. I think this would testing simplify a lot.
Hi. Before you do any of what I suggested, could you please clarify whether you were able to reproduce the wrong computation on more than 1 machine? I don't think there are any pre built jdks with PrintAssembly, but PrintCompilation should be available on all of them.
Till now I observed the problem on 1 computer with various jdks (openjdk, Oracle/Sun, various versions). I tested on 2 computers Fedora - observer error messages & crashes. Debian - in 10 minutes with just 15-20 TFsimple.java started (nothing for 8way 32Gb Debian) the computer become completely non-responding & requires remote reboot. No error messages observed, but they may be missed, the computer is remote. It is running OK with 3 instances of TFsimple.java One more Fedora 15 is available to me, but this is a production server, I cannot run experiments there.
If the debian machine is rebootable, can you try with a Fedora Livedisk? It would be really nice to be able to eliminate hardware as the problem before attempting deeper and more cumbersome diagnosis.
Ooops, scratch that. Just noticed that you mentioned that the computer is remote :(
This is the results with TF.java (in contrast with TFsimle.java the TF.java does not cause operating system destabilization). (some messages from error format are from the code code inserted to checker for(int i=0;i<f1.length;i++) { final F ff1=f1[i],ff2=f2[i]; System.out.printf("new F(%dL, %dL, %dL),%n", ff1.x, ff1.y, ff1.z); System.out.printf("new F(%dL, %dL, %dL),%n", ff2.x, ff2.y,ff2.z); } cat /tmp/hh_A Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 2 java.lang.Double::doubleToLongBits (33 bytes) 3 TF::checkm (418 bytes) --- n java.lang.Object::clone 2% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) --- n sun.misc.Unsafe::compareAndSwapLong 3 made not entrant TF::checkm (418 bytes) new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999) at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709) at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661) at java.util.Formatter.format(Formatter.java:2433) at java.io.PrintStream.format(PrintStream.java:920) at java.io.PrintStream.printf(PrintStream.java:821) at TF.checkm(TF.java:143) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:157) at TF.main(TF.java:168) cat /tmp/hh_C Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 2% TF::calc @ 218 (359 bytes) 1 TF::calc (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (418 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (418 bytes) new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999) at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709) at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661) at java.util.Formatter.format(Formatter.java:2433) at java.io.PrintStream.format(PrintStream.java:920) at java.io.PrintStream.printf(PrintStream.java:821) at TF.checkm(TF.java:143) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:157) at TF.main(TF.java:168) cat /tmp/hh_D Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2% TF::calc @ 218 (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (418 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (418 bytes) new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999) at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709) at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661) at java.util.Formatter.format(Formatter.java:2433) at java.io.PrintStream.format(PrintStream.java:920) at java.io.PrintStream.printf(PrintStream.java:821) at TF.checkm(TF.java:143) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:157) at TF.main(TF.java:168) cat /tmp/hh_B Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (418 bytes) --- n java.lang.Object::clone 2% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (418 bytes) new F(Exception in thread "main" java.util.IllegalFormatConversionException: d != java.lang.Double at java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:3999) at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2709) at java.util.Formatter$FormatSpecifier.print(Formatter.java:2661) at java.util.Formatter.format(Formatter.java:2433) at java.io.PrintStream.format(PrintStream.java:920) at java.io.PrintStream.printf(PrintStream.java:821) at TF.checkm(TF.java:143) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:157) at TF.main(TF.java:168)
with commented System.out code: cat /tmp/hh_A Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 2% TF::calc @ 218 (359 bytes) 1 TF::calc (359 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 2 java.lang.Double::doubleToLongBits (33 bytes) 3 TF::checkm (326 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (326 bytes) Exception in thread "main" java.lang.RuntimeException: TEST=7162 Elements [1][1] differ m1[0][0]=7706.7278164451545 m2[0][0]=7706.7278164451545 m1[0][1]=7320.609551430136 m2[0][1]=7320.609551430136 m1[0][2]=7201.232432903037 m2[0][2]=7201.232432903037 m1[0][3]=7363.6815207085365 m2[0][3]=7363.6815207085365 m1[0][4]=7757.530582167053 m2[0][4]=7757.530582167053 m1[1][0]=55438.31850924297 m2[1][0]=55438.31850924297 m1[1][1]=51824.39619900315 m2[1][1]=47728.39619900315 m1[1][2]=50530.755773340454 m2[1][2]=50530.755773340454 m1[1][3]=52423.69762827249 m2[1][3]=52423.69762827249 m1[1][4]=56990.29847098225 m2[1][4]=56990.29847098225 m1[2][0]=133505.03622530942 m2[2][0]=133505.03622530942 m1[2][1]=124527.3917768878 m2[2][1]=124527.3917768878 m1[2][2]=121304.10831809939 m2[2][2]=121304.10831809939 m1[2][3]=126020.68723619048 m2[2][3]=126020.68723619048 m1[2][4]=137401.1744967179 m2[2][4]=137401.1744967179 m1[3][0]=247738.9864061055 m2[3][0]=247738.9864061055 m1[3][1]=230644.9616565764 m2[3][1]=230644.9616565764 m1[3][2]=224572.93003040878 m2[3][2]=224572.93003040878 m1[3][3]=233404.58633299996 m2[3][3]=233404.58633299996 m1[3][4]=254725.65094484514 m2[3][4]=254725.65094484514 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) cat /tmp/hh_B Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (326 bytes) --- n java.lang.Object::clone 2% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (326 bytes) Exception in thread "main" java.lang.RuntimeException: TEST=50827 Elements [0][0] differ m1[0][0]=7706.7278164451545 m2[0][0]=7706.716008346415 m1[0][1]=7320.609551430136 m2[0][1]=7320.598202052335 m1[0][2]=7201.232432903037 m2[0][2]=7201.221250764572 m1[0][3]=7363.6815207085365 m2[0][3]=7363.670091676419 m1[0][4]=7757.530582167053 m2[0][4]=7757.518557928863 m1[1][0]=55438.31850924297 m2[1][0]=55438.22686371967 m1[1][1]=51824.39619900315 m2[1][1]=51824.31721228213 m1[1][2]=50530.755773340454 m2[1][2]=50530.67911095596 m1[1][3]=52423.69762827249 m2[1][3]=52423.61812134418 m1[1][4]=56990.29847098225 m2[1][4]=56990.21206504163 m1[2][0]=133505.03622530942 m2[2][0]=133504.82810230204 m1[2][1]=124527.3917768878 m2[2][1]=124527.20303750761 m1[2][2]=121304.10831809939 m2[2][2]=121303.92473653423 m1[2][3]=126020.68723619048 m2[2][3]=126020.49653899744 m1[2][4]=137401.1744967179 m2[2][4]=137400.96660552503 m1[3][0]=247738.9864061055 m2[3][0]=247738.60752429755 m1[3][1]=230644.9616565764 m2[3][1]=230644.61273405756 m1[3][2]=224572.93003040878 m2[3][2]=224572.59044236693 m1[3][3]=233404.58633299996 m2[3][3]=233404.2334013237 m1[3][4]=254725.65094484514 m2[3][4]=254725.2657997465 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) cat /tmp/hh_C Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2% TF::calc @ 218 (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (326 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (77 bytes) 4 TF::runTest (77 bytes) 3 made not entrant TF::checkm (326 bytes) Exception in thread "main" java.lang.RuntimeException: TEST=24262 Elements [3][3] differ m1[0][0]=7612.943180925791 m2[0][0]=7612.943180925791 m1[0][1]=7232.091931343198 m2[0][1]=7232.091931343198 m1[0][2]=7114.651836443036 m2[0][2]=7114.651836443036 m1[0][3]=7275.303415709812 m2[0][3]=7275.303415709812 m1[0][4]=7664.617050258952 m2[0][4]=7664.617050258952 m1[1][0]=54797.25164657019 m2[1][0]=54797.25164657019 m1[1][1]=51222.70137181502 m2[1][1]=51222.70137181502 m1[1][2]=49944.34143399005 m2[1][2]=49944.34143399005 m1[1][3]=51815.50758683226 m2[1][3]=51815.50758683226 m1[1][4]=56329.43295526017 m2[1][4]=56329.43295526017 m1[2][0]=131964.4910261504 m2[2][0]=131964.4910261504 m1[2][1]=123086.10360065899 m2[2][1]=123086.10360065899 m1[2][2]=119901.11262812998 m2[2][2]=119901.11262812998 m1[2][3]=124563.47117926647 m2[2][3]=124563.47117926647 m1[2][4]=135812.6917897872 m2[2][4]=135812.6917897872 m1[3][0]=244880.80619995148 m2[3][0]=244880.80619995148 m1[3][1]=227974.9084076055 m2[3][1]=227974.9084076055 m1[3][2]=221977.9207674185 m2[3][2]=221977.9207674185 m1[3][3]=230708.52176553395 m2[3][3]=214324.52176553395 m1[3][4]=251783.71798452607 m2[3][4]=251783.71798452607 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169)
Created attachment 487413 [details] New TF.java file Okay, let us first eliminate loop unrolling being the issue. Can you try running the attached file? It is the same code as in comment 10, but with some extra print statements that get printed if an error occurs. We know that the value at a given index is modified 640 times in total, in 5 sets of 128. If a loop is being skipped, the print statement won't have it. Furthermore, the debug statement can be analysed to see if there was an addition error.
OK, i will run your code now. answering your question whether the failure occur on the same F [] ? no - I propagated the index k from static void multipleTests(final java.util.Random r) to checker - the failure occurs at different k below two failures at k=0 & k=3, the Fs would be different. cat /tmp/hh_C Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2% TF::calc @ 156 (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (336 bytes) --- n java.lang.Object::clone 3% TF::runTest @ 26 (79 bytes) 4 TF::runTest (79 bytes) 3 made not entrant TF::checkm (336 bytes) Exception in thread "main" java.lang.RuntimeException: F try=3 calculation try=29260 Elements [0][1] differ m1[0][0]=7021.774830987034 m2[0][0]=7021.774830987034 m1[0][1]=6671.004497011981 m2[0][1]=6671.002964197445 m1[0][2]=6562.907481207491 m2[0][2]=6562.9073157314315 m1[0][3]=6711.11205485301 m2[0][3]=6711.112043123023 m1[0][4]=7070.144759580944 m2[0][4]=7070.144758923587 m1[1][0]=50533.65411709927 m2[1][0]=50533.65411709927 m1[1][1]=47238.60662216227 m2[1][1]=47238.60533837808 m1[1][2]=46060.29654012994 m2[1][2]=46060.29640153812 m1[1][3]=47785.9864299652 m2[1][3]=47785.98642014094 m1[1][4]=51948.73401721851 m2[1][4]=51948.73401666795 m1[2][0]=121696.16159646476 m2[2][0]=121696.16159646476 m1[2][1]=113509.99686878055 m2[2][1]=113509.9952559613 m1[2][2]=110574.13409857947 m2[2][2]=110574.13392446646 m1[2][3]=114873.9804417375 m2[2][3]=114873.98042939528 m1[2][4]=125248.01961714865 m2[2][4]=125248.01961645696 m1[3][0]=225825.05076021465 m2[3][0]=225825.05076021465 m1[3][1]=210234.96043019896 m2[3][1]=210234.95772861727 m1[3][2]=204708.61835611783 m2[3][2]=204708.61806446672 m1[3][3]=212760.64466901263 m2[3][3]=212760.64464833855 m1[3][4]=232196.23583147067 m2[3][4]=232196.23583031207 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169) cat /tmp/hh_A Running single thread code until test failure 1% TF::calc @ 233 (359 bytes) 1 TF::calc (359 bytes) 2 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 3 TF::checkm (336 bytes) --- n java.lang.Object::clone 2% TF::runTest @ 26 (79 bytes) 3 made not entrant TF::checkm (336 bytes) Exception in thread "main" java.lang.RuntimeException: F try=0 calculation try=75033 Elements [3][3] differ m1[0][0]=6015.906748342842 m2[0][0]=6015.906748342842 m1[0][1]=5716.6127269548415 m2[0][1]=5716.6127269548415 m1[0][2]=5624.415186604344 m2[0][2]=5624.415186604344 m1[0][3]=5751.072547006512 m2[0][3]=5751.072547006512 m1[0][4]=6057.8359619079365 m2[0][4]=6057.8359619079365 m1[1][0]=43190.181781667496 m2[1][0]=43190.181781667496 m1[1][1]=40377.57397780233 m2[1][1]=40377.57397780233 m1[1][2]=39371.04889321945 m2[1][2]=39371.04889321945 m1[1][3]=40845.525454463925 m2[1][3]=40845.525454463925 m1[1][4]=44402.20207354617 m2[1][4]=44402.20207354617 m1[2][0]=103991.08438810863 m2[2][0]=103991.08438810863 m1[2][1]=97000.16811187661 m2[2][1]=97000.16811187661 m1[2][2]=94491.77180358718 m2[2][2]=94491.77180358718 m1[2][3]=98165.6040876451 m2[2][3]=98165.6040876451 m1[2][4]=107029.22874182602 m2[2][4]=107029.22874182602 m1[3][0]=192961.576069273 m2[3][0]=192961.576069273 m1[3][1]=179643.57896832644 m2[3][1]=179643.57896832644 m1[3][2]=174920.49439552703 m2[3][2]=174920.49439552703 m1[3][3]=181800.08676866323 m2[3][3]=165416.08676866323 m1[3][4]=198405.9217295142 m2[3][4]=198405.9217295142 at TF.checkm(TF.java:147) at TF.runTest(TF.java:19) at TF.multipleTests(TF.java:158) at TF.main(TF.java:169)
> answering your question whether the failure occur on the same F [] ? > no - I propagated the index k from > static void multipleTests(final java.util.Random r) > to checker - the failure occurs at different k > below two failures at k=0 & k=3, the Fs would be different. I didn't mean "same F within the F array". I meant "same F array". You create many random F arrays and I wanted to print every element of the one that fails and try using that array for all the tests.
1. Your code is running. no error yet. Hopefully there will be some by the morning. 2. Answering your question from 2011-03-24 15:53:26 EDT I meant exactly this. I propagated "k" from multipleTests to checker. Array elements depend on k. The failures occurred at k=0 & k=3, i.e. the failures occurred at different elements of the F[] , i.e. the failure does not depend on F[] elements specifics.
Created attachment 487427 [details] the failure with your version This is the first failure report of your code. other 4 instances are still running.
The value was changed 640 times according to the log, which means that loops are not being skipped, nor are extra ones being added. I am trying to verify the additions now.
I modified your code to save debugStr corresponding to matrix m1. in case of an error two debugStr (corresponding to m1 & m2) should be now printed. This will simplify the comparison.
Created attachment 487428 [details] New TF.java file Thanks for the quick responses! The calculations add up as well. Unfortunately that does not eliminate the possibility of hw error during arithmetic. Can you please try the file attached with this post? In addition to printing the calculations done for the new array, it also prints the original set of calculations.
Doh! Please ignore comment 35.. I didn't see comment 34.. you are already thinking along the same lines as me :)
The modified code, left to calculate for entire night lead the computer non-responsive & power rest was required. Debian box also also went non-responding to ssh, only to ping, exactly as it was last time. So no results produced. What else can be the problem? evidently memory allocator of axb array. final double [][] m=new double [a][b]; This would explain: 1. Destabilisation of my computer & Debian. 2. Difficulty of reproduction on other people computers. 3. problem with long instead of double. it is unlikely to have the same problem in both long & double arithmetic. 4. Random nature of the problem. On my workstation and on Debian I modified TFSimple.java to replace all new double [a][b] by a call to: static double [][] newdouble2(final int a, final int b) { final double [][] m=new double [a][]; for(int i=a;--i>=0;) { m[i]=new double [b]; } return m; } and started 20 instances of this modified TFSimple1.java on my workstation. previously 20 instances would destabilise my workstation in less than 1 minute and and a reboot would required. Now for more than 40 minutes it is running OK. but one TFsimle1.java failed anyway. Exception in thread "main" java.lang.RuntimeException: TEST=26728 Elements [0][4] differ m1[0][0]=0.1700774336283186 m2[0][0]=0.1700774336283186 m1[0][1]=0.11166978133704736 m2[0][1]=0.11166978133704736 m1[0][2]=0.2518757394821027 m2[0][2]=0.2518757394821027 m1[0][3]=1.0251160550185991 m2[0][3]=1.0251160550185991 m1[0][4]=6.11267581491528 m2[0][4]=5.654225128796633 m1[1][0]=1.043826838481748 m2[1][0]=1.043826838481748 m1[1][1]=0.6853579121009847 m2[1][1]=0.6853579121009847 m1[1][2]=1.5458526814816618 m2[1][2]=1.5458526814816618 m1[1][3]=6.291508684952188 m2[1][3]=6.291508684952188 m1[1][4]=37.51570643105274 m2[1][4]=37.51570643105274 m1[2][0]=3.2998209696571212 m2[2][0]=3.2998209696571212 m1[2][1]=2.1666030482227314 m2[2][1]=2.1666030482227314 m1[2][2]=4.886861408711588 m2[2][2]=4.886861408711588 m1[2][3]=19.88917272867012 m2[2][3]=19.88917272867012 m1[2][4]=118.59736711957807 m2[2][4]=118.59736711957807 m1[3][0]=12.971299713493547 m2[3][0]=12.971299713493547 m1[3][1]=8.516721894032319 m2[3][1]=8.516721894032319 m1[3][2]=19.209813069734487 m2[3][2]=19.209813069734487 m1[3][3]=78.18255077754401 m2[3][3]=78.18255077754401 m1[3][4]=466.1955931200476 m2[3][4]=466.1955931200476 at TFsimple1.checkm(TFsimple1.java:118) at TFsimple1.runTest(TFsimple1.java:28) at TFsimple1.multipleTests(TFsimple1.java:129) at TFsimple1.main(TFsimple1.java:140) [2] Exit 1 java TFsimple1
Created attachment 487461 [details] updated TF.java this is the TF.java I just started
Created attachment 487462 [details] TFsimple1.java this TFsimple1.java does not destabilize my workstation (20 instances of the TFsimple.java would destabilize in less than 1 minute) but this TFsimple1.java did produce one error anyway. The difference between TFsimple.java & TFsimple1.java is in new double [a][b]
Created attachment 487493 [details] test with extra diag the test with extra diag
Created attachment 487494 [details] the failure with TFo.java the failure results with TFo.java
Created attachment 487495 [details] where the problem started This is where the problem started: m1: (original) Loop 5 Added 95.7778766848038 at [2][1]. sa=2.036801632444434 (waq+1.1)=2.612425807401708 xx=0.101 yy=0.072 New value=60665.69720562991 m2: (repeated) Loop 5 Added 95.77787662549497 at [2][1]. sa=2.0368016311831787 (waq+1.1)=2.612425807401708 xx=0.101 yy=0.072 New value=60665.6972055706 m2 is incorrect: java T1 sa=2.036801632444434 RES=95.7778766848038
Created attachment 487503 [details] repeated sa calculations The code in T1.java does repeated calculation, it was optimized away, I will look later on that. But anyway correct SA=2.036801632444434, which sometimes incorrectly calculated. java -XX:+PrintCompilation T1 SA=2.036801632444434 1 T1::getsa (139 bytes) 2 java.util.Random::next (47 bytes) --- n sun.misc.Unsafe::compareAndSwapLong 3 java.util.concurrent.atomic.AtomicLong::get (5 bytes) 4 java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes) --- n java.lang.Double::doubleToRawLongBits (static) 5 java.lang.Double::doubleToLongBits (33 bytes) --- n java.lang.Object::clone 6 java.lang.Double::compare (54 bytes) 1% T1::runTests @ 53 (131 bytes)
Ah, so it is an arithmetic issue. It is very improbable that a compiler optimization may cause such an error. This is looking more like an hw issue. Can you try running it again, this time disabling the JIT compiler with: -Djava.compiler=NONE That will make it a lot slower, but it will eliminate the compiler as being the issue. Once we know it can happen with interpreter only, we can perhaps retry with something like gij.
I just started 5 instances with java -Djava.compiler=NONE -XX:+PrintCompilation TFo But I expect no problem when JIT is not enabled. I think this issue is very unlikely to be an arithmetic. Reasons - the issue is 1) random and strongly linked to loops 2) same problem does also happen with longs, not only with doubles. the value of "sa" is obtained as a sum over two loop : ks & ls (which are easily optimizable by jit), and most likely reason - jit compiler did something bad with loop unrolling/optimization, generating wrong machine code which in some instances give just wrong values result, in other may cause system destabilization via memory or CPU overuse (like in that tests on Debian). To see that I tried to see assembly, by trying to build openjdk-7-ea-src-b135-24_mar_2011.zip with -XX:+PrintAssembly but it gives me an error BUILD FAILED /rdata/o1/openjdk/jaxp/build-defs.xml:70: ERROR: Cannot find source for project jaxp. but jaxp the source is there in openjdk7
The reason I am disinclined to believe that this is a compiler issue is because we have been unable to reproduce this on any other hardware. I had the test running overnight last night (total of another 20+ hours on top of previous 12+) and I was still unable to trigger it. Of course that does not preclude the possibility that it is a bug introduced by the compiler when it optimizes specifically for the hw you have. That's why I'd like to see what happens with -Djava.compiler=NONE.
The case -Djava.compiler=NONE is still running still no problem. I am compiling openjdk-7-ea-src-b135-24_mar_2011.zip now, trying to get -XX:+PrintAssembly option working
Great! I take it you figured out the jaxp error? I was just looking into it -- you need to do 'make ALLOW_DOWNLOADS=true' so that it will download it.
Created attachment 487902 [details] failure 1 (assembler dump) failure with assemble dump. The problem, as we identified earlier, is incorrect calculation of sa inside the loop: double sa=0,wa=1; for(int ks=k;ks>=0;ks--) { double sb=0,wb=1; for(int ls=l;ls>=0;ls--) { sb+=mx[ks][ls]*wb; wb*=xx; wb*=ls+1; } sa+=sb*wa; wa*=yy; wa*=ks+1; } // HERE sa may be incorrect !!! m[k][l]+=sa*(l+2)*(k+4)*(waq+1.1); I am attaching the assembler dump.
Created attachment 487903 [details] second failure second failure
Created attachment 487906 [details] no failure, java terminated by kill no failure, java terminated by kill
Created attachment 487916 [details] TFO.java This is TFO.java & class files. assembler dump is attached in previous posts. The goal, as I see now, is to identify the likehood of hardware failure against hotspot bug. Same code with JIT disabled was running for >12 hour without a problem. I attached assembler bug, we probably know where the problem is (double loop calculating sa) now one familiar with the hotspot need to check the asm dump.
Regarding hh_A.txt the problem is with sa calculation: CORRECT:--Loop 5 Added 1197.8708043516913 at [3][2]. sa=2.3885752977674133 (waq+1.1)=17.910718659530968 xx=0.101 yy=0.072 New value=27577.43591104506 vs WRONG:-Loop 5 Added 1197.8708008520678 at [3][2]. sa=2.388575290789103 (waq+1.1)=17.910718659530968 xx=0.101 yy=0.072 New value=27577.435907545434 The correct sa=2.3885752977674133 this code is called many times inside other loops, and once in 3-4 hours give wrong sa. double sa=0,wa=1; for(int ks=k;ks>=0;ks--) { double sb=0,wb=1; for(int ls=l;ls>=0;ls--) { sb+=mx[ks][ls]*wb; wb*=xx; wb*=ls+1; } sa+=sb*wa; wa*=yy; wa*=ks+1; } the matrix mx does not depend on anything, fixed for all tests. and the value of sa depend on outside loop indices k & l only, sa does not depend on anything else. The problem: sa sometimes is calculated INCORRECTLY. in interpreter mode (no JIT) I did not catch a failure yet.
One more reason why this is unlikely hardware bug: I did run hundreds of unit tests for internal java code - no problem at all. The only issue arise when there are two loops one inside another & JIT is enabled. as in sa calculation above. No problem in any other situation.
Forgot to tell: the last tests are run with: java -version openjdk version "1.7.0-internal" OpenJDK Runtime Environment (build 1.7.0-internal-root_2011_03_26_02_25-b00) OpenJDK 64-Bit Server VM (build 21.0-b05, mixed mode) as for x in A B C D E ; do java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly TFo >hh_$x 2>&1 & done
and the openjdk source used: openjdk-7-ea-src-b135-24_mar_2011.zip
looking at hh_A.txt output at the end - we see that the last run before failure (the data saved into debugStr printed at the end) the sa at[k=3][l=2] was calculated 128 times total. only the 27-th time the sa was calculated incorrectly to [3][2]. sa=2.388575290789103 all other times it was equal to correct value [3][2]. sa=2.3885752977674133 I expect there should be different asm used to calculate sa at k=3,k=2 at 27-th time, probably some loop optimization triggered, and the transition between loop execution strategies may be broken.
By the way - the last JIT code generated related to function F.calc is asm inlining 4 loops at once: hh_A.txt:58122: ; - TFo::calc@87 (line 86) openjdk did loop inline, generated new asm, which did loop unrolling near hh_A.txt:58122 and this asm fails in some situation, not always. once this loop inline was done - the 27th calculation of sa at [k=3][l=2] became incorrect & the program failed, right after last loop inline JIT newest generated asm.
Given the fact that loop unrolled code calculates sa incorrectly quite seldom and the problem is hard to reproduce, as well as that it popped up right after new asm unrolling was generated - I think the problem is that loop unrolling code is RACY in some sense. It probably has synchronization wrong and variable/registry got clobbered in the middle of asm generation or execution.
By the way in other test results: hh_D.txt.gz exactly the same pattern - unroll loops TFo::calc@87 (line 86) and then failure. I do not think that just plain incorrect unrolling would not be caught by other people. This is a race almost for sure. locking is broken in either in 1. JIT generator itself. or 2. The machine code generated by JIT generator.
Created attachment 487958 [details] The results with TFsimple.java Below are the results with TFsimple.java (much simpler code (no strings used), much less to analyse, but less data collected). Same pattern - loop unroll and then test failure.
Created attachment 488194 [details] new reproducer (hopefully) Hi. Can you please try running the attached file? It is the same as TFsimple.java with a few modifications. I removed the F array and the F class because they weren't being used, and I made mx a static double[][]. Also, when you run it, please add the -XX:CompileOnly=TF::runTest,TF::calc option. It will be interesting to see whether this fails, and also the assembler printouts should be shorter.
I just started it about 30 minutes ago as for x in A B C D E F ; do java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileOnly=TFD1::runTest,TFD1::calc TFD1 >TFD1_$x 2>&1 & sleep 4 ; done (renamed to TFD1 to avoid confusion with other versions ). No problem yet. The TFsimple.java did give an error faster. Your code is very unlikely to give an error because JIT does not compile the function "calc" regulary (and asm output does not grow). In failed versions JIT did regular compilation of calc, eventually some of them did fail. and the asm output did grow permanently. I think this is a race almost for sure. It is unlikely that plain wrong JIT output would not be caught by other people. From my experience with linux - fixing a race is the organizational problem. One need to reach the maintainer, who is familiar with locking in the specific module. Any other method is just plain ineffective and may take weeks. This is why I never liked SUN - I never being able to reach the mantainer to fix the problem.
Created attachment 488225 [details] new reproducer results. Unlikely to give an error because asm output does not grow. In failed versions JIT did compile the calc() regularly, what eventually produced an error. In this version no new asm generated for long time.
I think if we tell hotspot JIT to be much more aggressive in loop unrolling e.g. force it to recompile & unroll every 10-th call of calc. In this case JIT compilations would go permanently and there are would be much higher chances to catch a race. Once we identified this is a race - for successful race fixing we need somehow to reach the maintainer/author of loop unrolling code at openjdk or oracle. Anything else is just plain ineffective in case of race fixing.
(In reply to comment #64) > Created attachment 488225 [details] > new reproducer results. > > Unlikely to give an error because asm output does not grow. > In failed versions JIT did compile the calc() regularly, I'm not sure what you mean by "compiled regularly". I looked at every attachment you posted. In TFsimple.java, calc() is JIT compiled exactly once, and in the others it's compiled exactly twice. You say that when it crashes the asm output grows. How does it grow? Does it output a lot of asm early, then stop for a while, and then it outputs more near the crash? > what eventually produced an error. In this version no new asm generated > for long time. That's what I expected. Can you try running what I posted again, but this time without the -XX:CompileOnly... command? Also, your hypothesis that it's a race sounded very reasonable at first, but now I'm not so sure. As far as I understand this (and correct me if I'm wrong), the code is getting compiled, it's running for a long time correctly (1 hour+), and then it breaks. If there was some race in the compilation and wrong code was produced, we'd be seeing a mistake immediately after the compilation. And I don't see how there can be a race in generated code, since there's no shared data here.
I just started with -XX:CompileOnly. take for example TFo.java, I posted logs above. >it's running for a long time correctly (1 hour+) 10-30 minutes correctly, then failure. right before the failure where was always loop unrolling: https://bugzilla.redhat.com/attachment.cgi?id=487902 file line 58122: ; - TFo::calc@87 (line 86) https://bugzilla.redhat.com/attachment.cgi?id=487903 file 57868: ; - TFo::calc@87 (line 86) after thing compilation a failure occur (probably because of a race), all other asm is fininshing the output.
Most likely race I see - the code produced by JIT (compiled code) may be racy by itself. It breaks eventually. In TFo.java it was easy to catch, because the failure occured very soon after the JIT loop unrollng. But it did produce some number of correct result before the failure. for example first 26 calculations at [k=3][l-2]. give correct sa=2.3885752977674133 then the 27-th calculation at [k=3][l=2]. give wrong sa=2.388575290789103
Created attachment 488237 [details] hotspot failure One of TFD1 processes produced hotspot failure in _multianewarray2_Java But this may be unrelated to problem of different apparence of the race. tack: [0x00007fb206c21000,0x00007fb206d22000], sp=0x00007fb206d206a0, free space=1021k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x7d75bf] typeArrayKlass::allocate(int, Thread*)+0x7f V [libjvm.so+0x69664f] objArrayKlass::multi_allocate(int, int*, Thread*)+0x15f V [libjvm.so+0x72c67a] OptoRuntime::multianewarray2_C(klassOopDesc*, int, int, JavaThread*)+0x3a Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) v ~RuntimeStub::_multianewarray2_Java J TFD1.calc(IIDD[[DLTFD1$ComputationLog;)[[D J TFD1.runTest(III)V j TFD1.multipleTests()V+12 j TFD1.main([Ljava/lang/String;)V+4 v ~StubRoutines::call_stub
(In reply to comment #69) > Created attachment 488237 [details] > hotspot failure > > One of TFD1 processes produced hotspot failure > in _multianewarray2_Java > But this may be unrelated to problem of different apparence of the race. Now this is definitely a bug. You're probably right that it's unrelated. Anyway, some VM options that may be of interest to you: -XX:+CompileTheWorld -XX:+TraceIterativeGVN -XX:+DeoptimizeALot -XX:+PrintInlining -XX:-Inline // this will turn off inlining. If you can't reproduce it with this, that means something might be wrong with inlining. -XX:+VerifyLoopOptimizations
Started TFo (most easy to reproduce) with -XX:-Inline no problem yet (15 minutes). Typically they fail in 5-20 minutes. I will leave TFo running until tomorrow. Our internal unit tests (where I originally found the problem) passed OK with options -XX:-Inline. options -XX:+CompileTheWorld -XX:+TraceIterativeGVN -XX:+DeoptimizeALot -XX:+PrintInlining -XX:+VerifyLoopOptimizations are not recognized by openjdk7
(In reply to comment #71) ... > options -XX:+CompileTheWorld -XX:+TraceIterativeGVN -XX:+DeoptimizeALot > -XX:+PrintInlining -XX:+VerifyLoopOptimizations > are not recognized by openjdk7 That's because you need a debug build of openjdk7. For that, just do "make debug" instead of "make".
Also, -XX:+VerifyIterativeGVN is very good for spotting JIT mistakes.
Created attachment 488276 [details] an error with -XX:-Inline A failure with: java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:-Inline TFo
In this failure sa is always correct sa=0.1700774336283186 error is in waq calculated incorrect: -Loop 3 Added 8.597971821578138 at [0][0]. New value=15.580084180784024 correct: +Loop 3 Added 9.295708174139355 at [0][0]. New value=16.27782053334524 all other waq are correct: -Loop 3 Added 0.0025153796731071836 at [0][0]. New value=29.080013812172588 +Loop 3 Added 0.0025153796731071836 at [0][0]. New value=30.727116245989492
Strangely, the pattern is always: for loop and then two indices array += something for(....) { m[k][l]+=wb; also, is function inline & loop unroll ins the same -XX:-Inline or different jdk options?
What is the corresponding source code for the output in comment 74? Also, can you try again, this time only supply -XX:LoopUnrollLimit=0 That should prevent unrolling
TFO.java is Comment 52 2011-03-26 13:14:07 EDT https://bugzilla.redhat.com/attachment.cgi?id=487493
Created attachment 488415 [details] a failure with LoopUnrollLimit=0 a failure with LoopUnrollLimit=0
My second guess about the problem is multiarray memory allocator. because: 1. there was a failure in ~RuntimeStub::_multianewarray2_Java ( comment 69 ) This may be unrelated, but may be a hint. 2. The problem started to show when the 2D & 3D arrays were added to the code. 3. All failing sums are the sums over a double array. Also note my comment 37 above which did not produce any changes in behavior.
That is odd. Can you try again, this time printing the value of sa each time it is modified in any of the loops, and what components make it up? I'd do it myself if it was reproducible here :/ e.g. double sa=0,wa=1; for(int ks=k;ks>=0;ks--) { double sb=0,wb=1; for(int ls=l;ls>=0;ls--) { sb+=mx[ks][ls]*wb; wb*=xx; wb*=ls+1; } sa+=sb*wa; wa*=yy; wa*=ks+1; } becomes: double sa=0,wa=1; for(int ks=k;ks>=0;ks--) { double sb=0,wb=1; for(int ls=l;ls>=0;ls--) { sb+=mx[ks][ls]*wb; wb*=xx; wb*=ls+1; } debugStr[k][l].append("Incrementing sa=" + sa + " by " + sb*wa + " to be " + sa+sb*wa); sa+=sb*wa; wa*=yy; wa*=ks+1; } This will tell us which specific part is incorrect, and then we can try to narrow it down from there by printing when sb, wa, etc. are modified. If you want, you can also print the subcomponents initially too.. it will make the log bigger, but then it wont have to be narrowed down later
I will do this tomorrow.
One more thing: I just looked at that my old workstation (where failures occur) which for >4 hours was running 8 instances of java -Djava.compiler=NONE TFo Kernel crashed in one of java processes. uname -a Linux comp90 2.6.34.8-68.fc13.x86_64 #1 SMP Thu Feb 17 15:03:58 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux I do not have a serial port nulmodem cable handy to catch the trace. but anyway - can this problem be Linux smp race? this would explain strange behaviour. Interestingly - there was no single sign of problem with that old workstation, besides java. I think then the most straighforward way for me - trying to update it to F15, then try to reproduce java failure.
By the way - Debian (8core 32Gb) uname -a Linux forum 2.6.26-2-amd64 #1 SMP Tue Jan 25 05:59:43 UTC 2011 x86_64 GNU/Linux java -version java version "1.6.0_21" Java(TM) SE Runtime Environment (build 1.6.0_21-b06) Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode) locks when trying to run 20 instances of TFSimple only pressing "reset" helps.
I am not sure why the Debian box is locking up just on something like high load. It is unfortunate that the error cannot be verified there. As for the F15 update -- I am curious to see that too now. Hopefully it doesn't freeze, and it doesn't trigger there either. Is the box capable of KVM virtualization? You might want to give that a shot too.
I was very busy with other things, so I did not at this much more. But I some kind managed to get some kind abnormal (but not exact failure) on a 4 cores, 24GB, Fedora (old, but this does not matter) Linux cc 2.6.27.41-170.2.117.fc10.x86_64 #1 SMP Thu Dec 10 10:36:29 EST 2009 x86_64 x86_64 x86_64 GNU/Linux java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) with TFsimple.java from Comment 14 2011-03-24 08:33:13 EDT The results: start 20 instances of TFsimple n=0 ; while [ $n -lt 20 ]; do java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly TFsimple >diag_$n & n=$[ $n + 1 ] ; done it works perfectly fine for about 10-15 minutes (just high load), then the computer goes out of memory freezes, messages about >120 sec in interrupt in console, out of memory, etc. No reaction on the keyboard, only pressing "reset" button helps. The question is: why after about 10 minutes java processes start consuming that much memory on a such a non-memory intensive code as TFsimple.java
Created attachment 489908 [details] TFSsimple.java Code simplification on the same computer where all other tests were run, the OS is not updated yet. [][] arrays are removed, only [] left. The test was run with hotspot disabled java -XX:+PrintAssembly -Djava.compiler=NONE The result (after few hours) OpenJDK 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output VM option '+PrintAssembly' Running single thread code until test failure Exception in thread "main" java.lang.RuntimeException: TEST=38917 Elements [3] differ m1[0]=287.5015945415851 m2[0]=287.5015945415851 m1[1]=193.26240424710878 m2[1]=193.26240424710878 m1[2]=323.1759685698639 m2[2]=323.1759685698639 m1[3]=863.5951311480515 m2[3]=863.5951311480515 at TFSsimple.checkm(TFSsimple.java:102) at TFSsimple.runTest(TFSsimple.java:19) at TFSsimple.multipleTests(TFSsimple.java:113) at TFSsimple.main(TFSsimple.java:124) strange thing - m1[3] & m2[3] are identical. they were checked at line 87: if(Double.compare(m1[i], m2[i]) != 0) for some reason the value of array element were was taken wrong (most probably) at line 87, but it was OK on lines 99 & 100. Does this make any sense?
Created attachment 489909 [details] proper TFSsimple.java proper file TFSsimple.java
Hi. I think what you have above is definitely a case of 2 different doubles seeming identical but having the same string representation. To confirm this, please run the test again, but replace s.append("m1["+i+"]="+m1[i]+"\t"); s.append("m2["+i+"]="+m2[i]+"\n"); with s.append("m1["+i+"]=" + Double.doubleToRawLongBits(m1[i]) + "\t"); s.append("m2["+i+"]=" + Double.doubleToRawLongBits(m2[i]) + "\n"); If I'm right and this inconsistency is due to simply the imprecise nature of finite space binary->decimal rational number conversion, then this last test conclusively rules out the optimizer or code generator as the problem. Since you can reproduce it with both the compiler and interpreter, and you can only reproduce it on one machine, the problem is almost certainly your hardware. Perhaps you could try running some other computationally intensive task, like prime95 for a few hours?
This message is a notice that Fedora 15 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 15. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '15' have been closed as WONTFIX. (Please note: Our normal process is to give advanced warning of this occurring, but we forgot to do that. A thousand apologies.) Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version. Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 15 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping