Bug 197109 - octave panic: Illegal instruction
Summary: octave panic: Illegal instruction
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: octave
Version: 4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Quentin Spencer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2006-06-28 15:59 UTC by P Chang
Modified: 2009-08-20 23:45 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-07-07 16:33:54 UTC
Type: ---
Embargoed:


Description P Chang 2006-06-28 15:59:14 UTC
Description of problem:

Using toeplitz() crashes octave on a dual Xeon 3.4GHz (cpu family: 15,
model: 4, stepping: 1) Dell box running in EM64T mode.

Version-Release number of selected component (if applicable):
octave-2.9.5-1.fc4

How reproducible:
Always

Steps to Reproduce:
1. start octave
2. enter toeplitz(ones(180,1),zeros(1,180));
3. watch it crash
  
Actual results:
$ octave
GNU Octave, version 2.9.5 (x86_64-redhat-linux-gnu).
Copyright (C) 2006 John W. Eaton.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTIBILITY or
FITNESS FOR A PARTICULAR PURPOSE.  For details, type `warranty'.

Additional information about Octave is available at http://www.octave.org.

Please contribute if you find this software useful.
For more information, visit http://www.octave.org/help-wanted.html

Report bugs to <bug> (but first, please read
http://www.octave.org/bugs.html to learn how to write a helpful report).

octave:1> toeplitz(ones(180,1),zeros(1,180));
panic: Illegal instruction -- stopping myself...
attempting to save variables to `octave-core'...
save to `octave-core' complete
Illegal instruction

Expected results:
No crash!

Additional info:
Methinks it may be a 64-bit, gcc, or stack protection issue... Seems to work
when using 179 rather than 180.

Comment 1 Quentin Spencer 2006-06-28 17:51:26 UTC
Unfortunately, I don't have the hardware to debug this directly by myself, but I
solicited some input on the octave mailing lists and one user was able to verify
that on FC5 this problem doesn't happen with one change in the spec file:

Change line 62 from
%define enable64 yes
to 
%define enable64 no

This change is already planned for the next octave release, because enabling the
64-bit features was causing problems with libraries that were not compiled with
the same assumptions. If you don't mind recompiling octave, could you tell me
whether this change fixes the problem?


Comment 2 Dmitri A. Sergatskov 2006-06-28 18:51:51 UTC
My rpms (recompiled without "--enable-64") that do not show this
problem on Athlon64 are available at:

ftp://coffee.phys.unm.edu/pub/dima/incoming/octave/

I bumped up the version number to distinguish them from the
official release.

Hope that helps.

Dmitri.


Comment 3 P Chang 2006-06-28 19:05:50 UTC
I've compiled the 2.9.5-1 src.rpm with the suggested change to the spec file.
(I had to add "export F77=gfortran" to stop g77 from being picked up during the
configuration step: g77 doesn't support the -mtune=nocona switch, so the Fortran
compilation fails.)

It still bombs out with the toeplitz command, so no change with gcc-4.0.2-8.


Comment 4 P Chang 2006-06-30 00:34:26 UTC
Had a quick browse of the octave-bugs archive and saw a mention of the fact that
there are two versions of toeplitz.

It seems like the core one works but the octave-forge-2006.03.17-3.fc4 version
bombs out. So something in the vectorized version is causing the crash.

Comment 5 P Chang 2006-06-30 01:13:12 UTC
Further investigation shows that in the octave-forge version, the
index magic line causes the crash, i.e.,
 retval = c ( [1:nr]' * ones (1, nc) + ones (nr, 1) * [nc-1:-1:0] );

Checking this out shows that both [1:180]'*ones(1,180); and
ones(180,1)*[179:-1:0]; provoke crashes.
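
The index trick in the line above can be sketched outside Octave. The following is a minimal pure-Python illustration, not the octave-forge source: the helper names `outer` and `toeplitz_by_index` are hypothetical, and the lookup vector `v` (the row reversed without its first element, followed by the column) is an assumption, since the thread does not show how `c` is built before it is indexed. It shows that the index matrix is the sum of two outer products, which matches the two expressions found to crash on their own.

```python
def outer(col, row):
    """Outer product of a column vector and a row vector, as nested lists."""
    return [[a * b for b in row] for a in col]


def toeplitz_by_index(c, r):
    """Build a Toeplitz matrix with first column c and first row r via index arithmetic."""
    nr, nc = len(c), len(r)
    # Assumed lookup vector: r(nc), ..., r(2), then c(1), ..., c(nr).
    v = list(reversed(r[1:])) + list(c)
    # [1:nr]' * ones(1, nc): entry (i, j) holds the 1-based row number.
    row_part = outer(range(1, nr + 1), [1] * nc)
    # ones(nr, 1) * [nc-1:-1:0]: entry (i, j) holds nc - j for 1-based j.
    col_part = outer([1] * nr, range(nc - 1, -1, -1))
    # Sum the two parts and convert the 1-based index to Python's 0-based one.
    return [[v[row_part[i][j] + col_part[i][j] - 1] for j in range(nc)]
            for i in range(nr)]


print(toeplitz_by_index([1, 2, 3], [1, 5, 6]))
# [[1, 5, 6], [2, 1, 5], [3, 2, 1]]
```

In Octave each of the two outer products is a full matrix multiplication, so the Toeplitz logic itself is probably not what fails.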

Comment 6 P Chang 2006-06-30 15:17:41 UTC
I've compiled and installed the debuginfo rpm. Running octave under gdb gives
the trace below. It seems to crash in the atlas library in the
ATL_dupKBmm1_1_1_b0() function.

$ gdb octave
GNU gdb Red Hat Linux (6.3.0.0-1.84rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".

(gdb) r
Starting program: /usr/bin/octave 
(no debugging symbols found)

[Thread debugging using libthread_db enabled]
[New Thread 46912496335136 (LWP 27909)]
GNU Octave, version 2.9.5 (x86_64-redhat-linux-gnu).
Copyright (C) 2006 John W. Eaton.
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; not even for MERCHANTIBILITY or
FITNESS FOR A PARTICULAR PURPOSE.  For details, type `warranty'.

Additional information about Octave is available at http://www.octave.org.

Please contribute if you find this software useful.
For more information, visit http://www.octave.org/help-wanted.html

Report bugs to <bug> (but first, please read
http://www.octave.org/bugs.html to learn how to write a helpful report).

octave:1> [1:180]'*ones(1,180)

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 46912496335136 (LWP 27909)]
0x000000377175b0bb in ATL_dupKBmm1_1_1_b0 () from /usr/lib64/atlas/libblas.so.3
(gdb) bt
#0  0x000000377175b0bb in ATL_dupKBmm1_1_1_b0 () from /usr/lib64/atlas/libblas.so.3
#1  0x00000037718c055f in ATL_dpKBmm_b0 () from /usr/lib64/atlas/libblas.so.3
#2  0x00000037718c06c8 in ATL_dpKBmm () from /usr/lib64/atlas/libblas.so.3
#3  0x0000003771707b55 in ATL_dmmJIK2 () from /usr/lib64/atlas/libblas.so.3
#4  0x0000003771708a14 in ATL_dmmJIK () from /usr/lib64/atlas/libblas.so.3
#5  0x00000037716cf098 in ATL_dgecopy () from /usr/lib64/atlas/libblas.so.3
#6  0x00000037716cfc3e in ATL_dgemm () from /usr/lib64/atlas/libblas.so.3
#7  0x000000377155c242 in atl_f77wrap_dgemm_ () from /usr/lib64/atlas/libblas.so.3
#8  0x0000003771c8524b in dgemm_ () from /usr/lib64/atlas/libblas.so.3
#9  0x00000037af6c0365 in operator* (m=Variable "m" is not available.
) at dMatrix.cc:2569
#10 0x00000037aee0228f in oct_binop_mul (a1=Variable "a1" is not available.
) at ./OPERATORS/op-m-m.cc:64
#11 0x00000037aecd5b44 in do_binary_op (op=Variable "op" is not available.
) at ov.cc:1653
#12 0x00000037aedc47c9 in tree_binary_expression::rvalue (this=Variable "this"
is not available.
) at pt-binop.cc:75
#13 0x00000037aedc2f20 in tree_binary_expression::rvalue (this=Variable "this"
is not available.
) at pt-binop.cc:46
#14 0x00000037aedec714 in tree_statement::eval (this=Variable "this" is not
available.
) at pt-stmt.cc:133
#15 0x00000037aedecce8 in tree_statement_list::eval (this=Variable "this" is not
available.
) at pt-stmt.cc:168
#16 0x00000037aec39c4b in main_loop () at toplev.cc:149
#17 0x00000037aebcf3a5 in octave_main (argc=Variable "argc" is not available.
) at octave.cc:739
#18 0x000000376df1c40f in __libc_start_main () from /lib64/libc.so.6
#19 0x0000000000400789 in _start ()
#20 0x00007fffff980b98 in ?? ()
#21 0x0000000000000000 in ?? ()
(gdb) list
739       int retval = main_loop ();
740
741       if (retval == 1 && ! error_state)
742         retval = 0;
743
744       clean_up_and_exit (retval);
745
746       return 0;
747     }
748
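
As a rough aid for reading traces like the one above, the frames that gdb attributes to a shared library can be pulled out mechanically. This is only a sketch: `FRAME_RE` and `library_frames` are hypothetical names, the pattern assumes the "#N 0x... in func (...) from lib" layout seen in this trace, and frames with source locations ("at file:line") are deliberately skipped.

```python
import re

# Matches frames that gdb attributes to a shared library, e.g.
# "#0  0x... in ATL_dupKBmm1_1_1_b0 () from /usr/lib64/atlas/libblas.so.3"
FRAME_RE = re.compile(
    r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(.*\) from (\S+)$", re.MULTILINE
)


def library_frames(backtrace):
    """Return (frame number, function, library path) for each matching frame."""
    return [(int(n), func, lib) for n, func, lib in FRAME_RE.findall(backtrace)]


sample = """\
#0  0x000000377175b0bb in ATL_dupKBmm1_1_1_b0 () from /usr/lib64/atlas/libblas.so.3
#8  0x0000003771c8524b in dgemm_ () from /usr/lib64/atlas/libblas.so.3
#16 0x00000037aec39c4b in main_loop () at toplev.cc:149
#18 0x000000376df1c40f in __libc_start_main () from /lib64/libc.so.6
"""

frames = library_frames(sample)
print(frames[0])
# (0, 'ATL_dupKBmm1_1_1_b0', '/usr/lib64/atlas/libblas.so.3')
```

The innermost frame (number 0) names the library in which the signal was raised, here the atlas build of libblas.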


Comment 7 Quentin Spencer 2006-06-30 15:37:18 UTC
Thanks for digging into this. I haven't had time to look any further, and it
doesn't help that I lack the hardware. As a short-term solution you should be
able to get things working (but with a performance penalty) by removing atlas
and using just blas and lapack instead.

I am also the maintainer of atlas, so I will look into this. I recently found a
similar problem running atlas on an old Pentium-MMX CPU. I think the problem has
to do with getting atlas to respect the CPU flags passed to it when building the
RPM.
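
One way to probe the CPU-flag theory on an affected machine is to compare the instruction-set flags the kernel reports with those a binary was built to assume. The sketch below is a hypothetical helper (`cpu_flags` and `missing_flags` are not from any package), and the flags it tests are only examples: nothing in this report establishes which instruction actually raised SIGILL.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Hypothetical excerpt for illustration; on a real machine, read /proc/cpuinfo.
sample = "model name\t: Example CPU\nflags\t\t: fpu sse sse2 lm\n"
print(missing_flags(sample, ["sse2", "3dnow"]))
# ['3dnow']
```

A non-empty result for the flags a build relies on would be consistent with an illegal-instruction crash, though it would not prove the cause.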

Comment 8 Dmitri A. Sergatskov 2006-06-30 19:58:10 UTC
Another possible workaround is to use a third-party lapack/blas library, e.g.
Intel's MKL:

LD_PRELOAD=/opt/intel/mkl/8.0.2/lib/em64t/libmkl.so octave

It appears not to be as fast as ATLAS, but it is still faster than generic lapack:

octave:1> a=rand(3000);
octave:2> tic; inv(a)*a; toc
Elapsed time is 42.155060 seconds.

(I get about 34 sec with ATLAS)

AMD has the ACML library (which also includes optimized lapack/blas),
but I cannot get it to work (it appears that
not all of its symbols are resolved).

Again I am testing it on AMD64 / FC5.

Sincerely,
Dmitri.


Comment 9 P Chang 2006-07-01 16:43:21 UTC
Thanks for the heads-up about Intel's MKL. Unfortunately, its non-commercial
license doesn't allow me to use it.

Anyway, removing the atlas library enables my code to work.


Comment 10 Quentin Spencer 2006-07-03 17:27:29 UTC
It might take me a little time to get to looking at atlas, and not having
hardware on which it fails won't help either. In the meantime, if you want the
improved performance of atlas, another suggestion is to custom compile your own.
If the problem is indeed caused by compilation on a different processor with the
wrong compiler flags, this might solve it for you, as well as give better
performance. Instructions on building a customized atlas rpm are in the
README.Fedora file that is packaged with the documentation in the atlas RPM.
Because rebuilding using the customized method enables all of the compile-time
optimizations of atlas, be warned that the process can take several hours. If
this fixes your problems, please inform me here, as that will be helpful in
trying to fix the package.

Comment 11 P Chang 2006-07-05 14:59:51 UTC
I've compiled a custom version of ATLAS using the src.rpm (which took 13.5 mins).
Apart from the use of -DATL_ARCH_HAMMER64, the compilation looks fine. Does
ATLAS 3.6.0 actually work on Xeon (or P4) EM64T chips? Alas, installing this
doesn't help - I get the same illegal instruction error.


Comment 12 Quentin Spencer 2006-07-05 15:27:28 UTC
I think Atlas should work on EM64T chips--it's just a question of getting the
spec file to compile it correctly. With regard to your recompilation, 13
minutes--even on high-end hardware--sounds more like what I would expect from a
standard compilation. The command for custom compilation is:

rpmbuild -D "enable_custom_atlas 1" --rebuild atlas-3.6.0-10.src.rpm

The result should be an RPM called atlas-custom.

I'm not well versed in 64-bit processors. Is this an IA64 architecture? I don't
know how much that differs from amd64, but the Debian atlas package has a
separate version for each of these. I'm using a modified version of the Debian
packaging system for my RPMS, and I somewhat arbitrarily chose amd64 as the
model for my x86_64 rpms. If you want to try ia64 instead, try changing line 139
of the spec file from

%define archt amd64
to
%define archt ia64


Comment 13 P Chang 2006-07-05 15:53:55 UTC
That is the command I used to recompile atlas-3.6.0-9.fc4.src.rpm. I got two
atlas-custom rpms as a result. I can attach the build output if you want to see
the details. Is version 10 much different?

No, Xeons are derived from Pentium 4s (as Opterons are from Athlons), and EM64T
is Intel's equivalent of AMD's 64-bit extensions to the x86 ISA. It is not an
Itanium (ia64).
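
The naming distinction can be made concrete with a small lookup. This is an illustrative sketch only: `FAMILIES` and `arch_family` are hypothetical names, and the table covers just the machine strings relevant to this discussion.

```python
import platform

# Common machine strings grouped by architecture family (not exhaustive).
FAMILIES = {
    "x86_64": "x86-64",   # AMD64 and Intel EM64T share this instruction set
    "amd64": "x86-64",
    "ia64": "itanium",    # Itanium, a different instruction set
}


def arch_family(machine):
    """Map a machine string, such as platform.machine() returns, to a family name."""
    return FAMILIES.get(machine.lower(), "other")


print(arch_family("x86_64"), arch_family("ia64"))
# x86-64 itanium
print(arch_family(platform.machine()))  # depends on the machine running this
```

On this reading, a package built for ia64 would target the wrong instruction set for an EM64T Xeon, whereas amd64 names the same family.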

I note that the ATLAS release notes at
https://sourceforge.net/project/shownotes.php?group_id=23725&release_id=350637
mention that support for EM64T was added only in version 3.7.10.

Comment 14 Quentin Spencer 2006-07-07 16:33:54 UTC
OK, since we have determined (1) that the bug is in atlas and not octave, and
(2) that the problem with atlas is that the current version doesn't support the
architecture in question, I'm going to close this bug if there are no objections.

According to the atlas mailing lists, the author is currently working on a new
stable release (3.8.0) which will hopefully be released soon. I'll work on a
Fedora release of it when it is available, which should fix this problem.
Unfortunately I don't know what kind of time frame that will be.

Comment 15 P Chang 2006-07-07 19:22:16 UTC
I'm not so sure if atlas works or not. Are there unit tests for atlas (or even
lapack) that you can run to show correctness?

Nonetheless, I'm happy for you to close this bug and open one against atlas.

Comment 16 Alex Lancaster 2009-08-20 23:45:26 UTC
This same error seems to have cropped up as part of building the octave-forge package, see bug #510841 comment #16 onwards.

