Bug 197109
Summary: | octave panic: Illegal instruction | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | P Chang <pcyc.uk> |
Component: | octave | Assignee: | Quentin Spencer <qspencer> |
Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4 | CC: | alex, dasergatskov, extras-qa |
Hardware: | x86_64 | ||
OS: | Linux | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2006-07-07 16:33:54 UTC | Type: | --- |
Description
P Chang
2006-06-28 15:59:14 UTC
Unfortunately, I don't have the hardware to debug this directly by myself, but I solicited some input on the octave mailing lists and one user was able to verify that on FC5 this problem doesn't happen with one change in the spec file. Change line 62 from
%define enable64 yes
to
%define enable64 no
This change is already planned for the next octave release, because enabling the 64-bit features was causing problems with libraries that were not compiled with the same assumptions. If you don't mind recompiling octave, could you tell me whether this change fixes the problem?

My rpms (recompiled without "--enable-64") that do not show this problem on Athlon64 are available at:
ftp://coffee.phys.unm.edu/pub/dima/incoming/octave/
I bumped up the version number to distinguish them from the official release. Hope that helps.
Dmitri.

I've compiled the 2.9.5-1 src.rpm with the suggested change to the spec file. (I had to add "export F77=gfortran" to avoid g77 getting picked up during the configuration step; g77 doesn't have the -mtune=nocona switch, which causes the Fortran compilation to fail.) It still bombs out with the toeplitz command. So no change with gcc-4.0.2-8.

I had a quick browse of the octave-bugs archive and saw a mention of the fact that there are two versions of toeplitz. It seems like the core one works but the octave-forge-2006.03.17-3.fc4 version bombs out. So something in the vectorized version is causing the crash.

Further investigation shows that in the octave-forge version, the index magic line causes the crash, i.e.
retval = c ( [1:nr]' * ones (1, nc) + ones (nr, 1) * [nc-1:-1:0] );
Checking this out shows that both
[1:180]'*ones(1,180);
and
ones(180,1)*[179:-1:0];
provoke crashes.

I've compiled and installed the debuginfo rpm. Running octave under gdb gives the trace below. It seems to crash in the atlas library in the ATL_dupKBmm1_1_1_b0() function.

$ gdb octave
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".

(gdb) r
Starting program: /usr/bin/octave
(no debugging symbols found)
[Thread debugging using libthread_db enabled]
[New Thread 46912496335136 (LWP 27909)]
GNU Octave, version 2.9.5 (x86_64-redhat-linux-gnu).
Copyright (C) 2006 John W. Eaton.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTIBILITY or FITNESS FOR A PARTICULAR PURPOSE. For details, type `warranty'.

Additional information about Octave is available at http://www.octave.org.

Please contribute if you find this software useful.
For more information, visit http://www.octave.org/help-wanted.html

Report bugs to <bug> (but first, please read http://www.octave.org/bugs.html to learn how to write a helpful report).

octave:1> [1:180]'*ones(1,180)

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 46912496335136 (LWP 27909)]
0x000000377175b0bb in ATL_dupKBmm1_1_1_b0 () from /usr/lib64/atlas/libblas.so.3
(gdb) bt
#0 0x000000377175b0bb in ATL_dupKBmm1_1_1_b0 () from /usr/lib64/atlas/libblas.so.3
#1 0x00000037718c055f in ATL_dpKBmm_b0 () from /usr/lib64/atlas/libblas.so.3
#2 0x00000037718c06c8 in ATL_dpKBmm () from /usr/lib64/atlas/libblas.so.3
#3 0x0000003771707b55 in ATL_dmmJIK2 () from /usr/lib64/atlas/libblas.so.3
#4 0x0000003771708a14 in ATL_dmmJIK () from /usr/lib64/atlas/libblas.so.3
#5 0x00000037716cf098 in ATL_dgecopy () from /usr/lib64/atlas/libblas.so.3
#6 0x00000037716cfc3e in ATL_dgemm () from /usr/lib64/atlas/libblas.so.3
#7 0x000000377155c242 in atl_f77wrap_dgemm_ () from /usr/lib64/atlas/libblas.so.3
#8 0x0000003771c8524b in dgemm_ () from /usr/lib64/atlas/libblas.so.3
#9 0x00000037af6c0365 in operator* (m=Variable "m" is not available. ) at dMatrix.cc:2569
#10 0x00000037aee0228f in oct_binop_mul (a1=Variable "a1" is not available. ) at ./OPERATORS/op-m-m.cc:64
#11 0x00000037aecd5b44 in do_binary_op (op=Variable "op" is not available. ) at ov.cc:1653
#12 0x00000037aedc47c9 in tree_binary_expression::rvalue (this=Variable "this" is not available. ) at pt-binop.cc:75
#13 0x00000037aedc2f20 in tree_binary_expression::rvalue (this=Variable "this" is not available. ) at pt-binop.cc:46
#14 0x00000037aedec714 in tree_statement::eval (this=Variable "this" is not available. ) at pt-stmt.cc:133
#15 0x00000037aedecce8 in tree_statement_list::eval (this=Variable "this" is not available. ) at pt-stmt.cc:168
#16 0x00000037aec39c4b in main_loop () at toplev.cc:149
#17 0x00000037aebcf3a5 in octave_main (argc=Variable "argc" is not available. ) at octave.cc:739
#18 0x000000376df1c40f in __libc_start_main () from /lib64/libc.so.6
#19 0x0000000000400789 in _start ()
#20 0x00007fffff980b98 in ?? ()
#21 0x0000000000000000 in ?? ()
(gdb) list
739 int retval = main_loop ();
740
741 if (retval == 1 && !
error_state)
742 retval = 0;
743
744 clean_up_and_exit (retval);
745
746 return 0;
747 }
748

Thanks for digging into this. I haven't had time to look any further, and it doesn't help that I lack the hardware. As a short-term solution you should be able to get things working (but with a performance penalty) by removing atlas and using just blas and lapack instead. I am also the maintainer of atlas, so I will look into this. I recently found a similar problem running atlas on an old Pentium-MMX CPU. I think the problem has to do with getting atlas to respect the CPU flags passed to it when building the RPM.

Another possible workaround is to use a third-party lapack/blas library, e.g. Intel's MKL:
LD_PRELOAD=/opt/intel/mkl/8.0.2/lib/em64t/libmkl.so octave
It appears not to be as fast as ATLAS, but it is still faster than generic lapack:
octave:1> a=rand(3000);
octave:2> tic; inv(a)*a; toc
Elapsed time is 42.155060 seconds.
(I get about 34 sec with ATLAS.)
AMD has the ACML library (which also includes optimized lapack/blas), but I cannot get it to work (it appears that it does not have all symbols resolved). Again, I am testing this on AMD64 / FC5.
Sincerely, Dmitri.

Thanks for the heads-up about Intel's MKL. Unfortunately, its non-commercial license doesn't allow me to use it. Anyway, removing the atlas library enables my code to work.

It might take me a little time to get to looking at atlas, and not having hardware on which it fails won't help either. In the meantime, another suggestion if you want the improved performance of atlas is to custom compile your own. If the problem is indeed caused by compilation on a different processor with the wrong compiler flags, this might solve it for you, as well as give better performance. Instructions on building a customized atlas rpm are in the README.Fedora file that is packaged with the documentation in the atlas RPM.
Because rebuilding using the customized method enables all of the compile-time optimizations of atlas, be warned that the process can take several hours. If this fixes your problems please inform me here as that will be helpful in trying to fix the package.

I've compiled a custom version of ATLAS using the src.rpm (which took 13.5 mins). Apart from the use of -DATL_ARCH_HAMMER64, the compilation looks fine. Does ATLAS 3.6.0 actually work on Xeon (or P4) EM64T chips? Alas, installing this doesn't help - I get the same illegal instruction error.

I think Atlas should work on EM64T chips--it's just a question of getting the spec file to compile it correctly. With regards to your recompilation, 13 minutes--even on high-end hardware--sounds more like what I would expect from a standard compilation. The command for custom compilation is:
rpmbuild -D "enable_custom_atlas 1" --rebuild atlas-3.6.0-10.src.rpm
The result should be an RPM called atlas-custom.
I'm not well versed in 64-bit processors. Is this an IA64 architecture? I don't know how much that differs from amd64, but the Debian atlas package has a separate version for each of these. I'm using a modified version of the Debian packaging system for my RPMs, and I somewhat arbitrarily chose amd64 as the model for my x86_64 rpms. If you want to try ia64 instead, try changing line 139 of the spec file from
%define archt amd64
to
%define archt ia64

That is the command I used to recompile atlas-3.6.0-9.fc4.src.rpm. I got two atlas-custom rpms as a result. I can attach the build output if you want to see the details. Is version 10 much different?
No, this is not IA64. Xeons are derived from Pentium 4s (as Opterons are from Athlons) and EM64T is Intel's equivalent of AMD's 64-bit extensions to the x86 ISA - it is not an Itanium (ia64). I note that the ATLAS release notes (https://sourceforge.net/project/shownotes.php?group_id=23725&release_id=350637) mention that support for EM64T was added only in version 3.7.10.
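
For anyone who wants to check which kind of 64-bit CPU a machine actually has, here is a minimal shell sketch. It assumes a Linux system that exposes /proc/cpuinfo. The names cpu_vendor, cpu_has_flag and the CPUINFO override variable are made up for this illustration and are not part of atlas, octave or any Fedora package.

```shell
#!/bin/sh
# Sketch only: helper names and the CPUINFO variable are hypothetical.
# CPUINFO defaults to /proc/cpuinfo; point it at a saved copy to inspect
# another machine's data.
CPUINFO="${CPUINFO:-/proc/cpuinfo}"

# Print the vendor_id of the first CPU listed
# (for example GenuineIntel or AuthenticAMD).
cpu_vendor() {
    sed -n 's/^vendor_id[[:space:]]*:[[:space:]]*//p' "$CPUINFO" | head -n 1
}

# Succeed (exit status 0) if the first "flags" line lists the flag in $1.
cpu_has_flag() {
    sed -n 's/^flags[[:space:]]*:[[:space:]]*//p' "$CPUINFO" \
        | head -n 1 \
        | tr ' ' '\n' \
        | grep -qx -- "$1"
}
```

On an EM64T Xeon, cpu_vendor should print GenuineIntel and "cpu_has_flag lm" should succeed, since lm marks 64-bit long mode on x86. Note that "uname -m" reports x86_64 on both EM64T and AMD64 machines; an Itanium would report ia64 instead.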
OK, since we have determined (1) that the bug is in atlas and not octave, and (2) that the problem with atlas is that the current version doesn't support the architecture in question, I'm going to close this bug if there are no objections. According to the atlas mailing lists, the author is currently working on a new stable release (3.8.0) which will hopefully be released soon. I'll work on a Fedora release of it when it is available, which should fix this problem. Unfortunately I don't know what kind of time frame that will be.

I'm not so sure whether atlas works or not. Are there unit tests for atlas (or even lapack) that you can run to show correctness? Nonetheless, I'm happy for you to close this bug and open one against atlas.

This same error seems to have cropped up as part of building the octave-forge package; see bug #510841, comment #16 onwards.
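
On the question above about tests: short of a real test suite, re-running the failing command and checking whether it died from SIGILL gives a crude smoke test for a given BLAS build. The sketch below is only an illustration. The function name classify_exit is hypothetical, and it relies on two assumptions: that the shell reports a process killed by signal N as exit status 128+N, and that SIGILL is signal 4 (as it is on Linux).

```shell
#!/bin/sh
# Sketch only: classify_exit is a hypothetical helper, not part of any package.
# Runs the given command with its output discarded and prints one word
# describing how it ended.
classify_exit() {
    "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -eq 132 ]; then
        # 132 = 128 + 4, i.e. the command was killed by SIGILL
        echo "illegal-instruction"
    else
        echo "failed($status)"
    fi
}
```

For example, with the failing expression [1:180]'*ones(1,180); saved in a file (here a hypothetical repro.m), running "classify_exit octave -q repro.m" would be expected to print illegal-instruction on an affected machine and ok once a working BLAS is installed. This only detects the crash; it says nothing about numerical correctness.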