Bug 169399

Summary: memory allocation issues on calling DGESDD routine of LAPACK
Product: [Fedora] Fedora Reporter: Ranjan Maitra <itsme_410>
Component: lapackAssignee: Tom "spot" Callaway <tcallawa>
Status: CLOSED WORKSFORME QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: medium    
Version: 4CC: extras-qa
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-04-07 14:09:52 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
file svdd-demo.c
none
header file for svdd-demo.c
none
test.dat (data file for svdd-demo.c) none

Description Ranjan Maitra 2005-09-27 21:13:17 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7

Description of problem:
The attached program shows the problem. In this program code, we are calculating the SVD of a m x n-dimensional matrix where m=5 and n=2. The program allocates space according to the minimal requirements of LAPACK User's manual.

However, the problem goes away when m<=4 above. Most likely, it is an issue when m is sufficiently larger than n. 

The attached code is svdd-demo.c. It uses the attached header file (array.h).The datafile is given in test.dat.

Upon installing LAPACK manually, this is not a problem. The RPMs are the issue: I have verified that every release from 3.0.25 through 3.0.30 presents this problem.

Version-Release number of selected component (if applicable):
lapack-3.0-31.fc4, lapack-devel-3.0-31.fc4

How reproducible:
Always

Steps to Reproduce:
1. Use the following to compile: (I use the tcsh shell)

%gcc -ansi -Wall -pedantic svdd-demo.c -lm -llapack
 
2. 

./a.out

3.
  

Actual Results:  The program always ends as: 
reading in matrix A from test.dat:
    2.117079     2.992091 
    2.106365     1.970807 
    2.131384     3.130390 
    2.344235     2.667219 
    1.838804     3.243629 


 ** On entry to DGESDD parameter number 12 had an illegal value
STOP 0



 
for m>=5 in the program (n=2).

However, using the statistical package R as well as a compiled LAPACK provides the correct result.

Expected Results:  reading in matrix A from test.dat:
    2.117079     2.992091 
    2.106365     1.970807 
    2.131384     3.130390 
    2.344235     2.667219 
    1.838804     3.243629 


svdd returned 0

singular values are:
    7.875120     0.769370 

The left multiplier matrix, u, is:
   -0.465324     0.102357    -0.525449    -0.358025    -0.607221 
   -0.360284    -0.676273     0.043269    -0.489893     0.413500 
   -0.480519     0.194368     0.811234    -0.064990    -0.262676 
   -0.449331    -0.386186    -0.133477     0.791199    -0.071766 
   -0.469971     0.587587    -0.214743     0.040039     0.621410 

The right multiplier matrix, v, is:
   -0.595001    -0.803725 
   -0.803725     0.595001 



Additional info:

My hunch is that this is a problem because of some funky optimization that has been used in the RPMS. Also, the statistical package R uses blas, but gives the correct result: hence my guess is that it is the lapack RPM that is the problem.

This is a pretty SEVERE problem which also happens for RHEL4.

Comment 1 Ranjan Maitra 2005-09-27 21:16:35 UTC
Created attachment 119323 [details]
file svdd-demo.c

Comment 2 Ranjan Maitra 2005-09-27 21:17:10 UTC
Created attachment 119324 [details]
header file for svdd-demo.c

Comment 3 Ranjan Maitra 2005-09-27 21:18:08 UTC
Created attachment 119325 [details]
test.dat (data file for svdd-demo.c)

Comment 4 Tom "spot" Callaway 2005-09-28 17:26:54 UTC
Let me look into this. :/

Comment 5 Tom "spot" Callaway 2005-09-28 22:02:24 UTC
Removing optimization from the rpm does not resolve the issue, thus, it doesn't
seem to be the problem.

Comment 6 Tom "spot" Callaway 2005-09-28 22:10:32 UTC
OK, its this patch (lapack-20010525.patch). I inherited that patch with the rest
of lapack when I took over maintainership in Fedora Extras. Time to figure out
why it causes this failure.

Comment 7 Tom "spot" Callaway 2005-09-28 22:53:26 UTC
Alright, figured this one out.

Once upon a time (Fri May 25 2001), Trond Eivind Glomsrød (aka, the pickled fish
eater), decided that for reasons unknown (or at least undocumented), he would
apply all of the patches that upstream had pending for lapack/blas, but had not
yet rolled into a formal release. Given that lapack hadn't had a major release
since May 31, 2000, he was probably trying to patch bugs or just be better
prepared for a future release.

Upstream had made revised versions of specific files in the lapack/blas tree and
put them here:

http://www.netlib.org/lapack/release_notes.html

Trond took those files, and made a patch diff from the original lapack to these
new updated source files and named it lapack-20010525.patch.

One of these changes is described by upstream as:
10/5/2000 	Corrected error with LQUERY and setting of WORK(1)

This touches (amongst other files): LAPACK/SRC/dgesdd.f

When we look at the diff of that specific file, we see the following:

diff -uNr LAPACK.orig/SRC/dgesdd.f LAPACK/SRC/dgesdd.f
--- LAPACK.orig/SRC/dgesdd.f    Thu Nov 11 20:32:31 1999
+++ LAPACK/SRC/dgesdd.f Fri May 25 16:07:58 2001
@@ -4,7 +4,8 @@
 *  -- LAPACK driver routine (version 3.0) --
 *     Univ. of Tennessee, Univ. of California Berkeley, NAG Ltd.,
 *     Courant Institute, Argonne National Lab, and Rice University
-*     October 31, 1999
+*     June 30, 1999
+*     8-15-00:  Improve consistency of WS calculations (eca)
 *
 *     .. Scalar Arguments ..
       CHARACTER          JOBZ
@@ -116,16 +117,20 @@
 *  LWORK   (input) INTEGER
 *          The dimension of the array WORK. LWORK >= 1.
 *          If JOBZ = 'N',
-*            LWORK >= 3*min(M,N) + max(max(M,N),6*min(M,N)).
+*            LWORK >= max(14*min(M,N)+4, 10*min(M,N)+2+
+*                     SMLSIZ*(SMLSIZ+8)) + max(M,N)
+*          where SMLSIZ is returned by ILAENV and is equal to the
+*          maximum size of the subproblems at the bottom of the
+*          computation tree (usually about 25).
 *          If JOBZ = 'O',
-*            LWORK >= 3*min(M,N)*min(M,N) +
-*                     max(max(M,N),5*min(M,N)*min(M,N)+4*min(M,N)).
+*            LWORK >= 5*min(M,N)*min(M,N) + max(M,N) + 9*min(M,N).
 *          If JOBZ = 'S' or 'A'
-*            LWORK >= 3*min(M,N)*min(M,N) +
-*                     max(max(M,N),4*min(M,N)*min(M,N)+4*min(M,N)).
+*            LWORK >= 4*min(M,N)*min(M,N) + max(M,N) + 9*min(M,N).
 *          For good performance, LWORK should generally be larger.
-*          If LWORK < 0 but other input arguments are legal, WORK(1)
-*          returns the optimal LWORK.
+*
+*          If LWORK = -1, a workspace query is assumed.  The optimal
+*          size for the WORK array is calculated and stored in WORK(1),
+*          and no other work except argument checking is performed.
 *
 *  IWORK   (workspace) INTEGER array, dimension (8*min(M,N))
 *

It goes on, but the notable change is in those comments. It changes the expected
definition of LWORK, supposedly to improve consistency of WC calculations.

So, I went to your test case, and made the same change:
--- svdd-demo.c.orig    2005-09-28 17:30:11.000000000 -0500
+++ svdd-demo.c 2005-09-28 17:31:15.000000000 -0500
@@ -23,8 +23,7 @@ void dgesdd_( char *jobz,
 int svdd(double **a, int m, int n, double *d, double **u, double **v)
 {
        double *A, *U, *VT;
-       int lwork =3*MIN(m,n)*MIN(m,n)
-                             + MAX(MAX(m,n),4*MIN(m,n)*MIN(m,n)+4*MIN(m,n));
+       int lwork = 4*MIN(m,n)*MIN(m,n) + MAX(m,n) + 9*MIN(m,n);
        int liwork = 8*MIN(m,n);
        char jobz = 'A';
        double *work;

With that change, your test case starts working again:

[spot@swoop ~]$ ./svdd-demo
reading in matrix A from test.dat:
    2.117079     2.992091
    2.106365     1.970807
    2.131384     3.130390
    2.344235     2.667219
    1.838804     3.243629


svdd returned 0

singular values are:
    7.875120     0.769370

The left multiplier matrix, u, is:
   -0.465324     0.102357    -0.525449    -0.358025    -0.607221
   -0.360284    -0.676273     0.043269    -0.489893     0.413500
   -0.480519     0.194368     0.811234    -0.064990    -0.262676
   -0.449331    -0.386186    -0.133477     0.791199    -0.071766
   -0.469971     0.587587    -0.214743     0.040039     0.621410

The right multiplier matrix, v, is:
   -0.595001    -0.803725
   -0.803725     0.595001

So, this answers the question of why your test case fails, but it doesn't really
make it clear what we should do.

Unfortunately, whoever wrote that improvement upstream didn't sign their name,
so we can't ask him or her whether its worthwhile to keep it. It doesn't look
likely that lapack will ever have another major release (it hasn't had any touch
since 2001). This has been the Red Hat shipped behavior for lapack since 2001,
and no one has hit this before. It is possible that people are relying on this
patched value for LWORK, and to revert it would break their existing configuration.

So, we can keep this patch and correct the user manual, or we can revert it, and
hope that it doesn't break existing users or make the WC calculations less
consistent. Your feedback here would be very much encouraged. :)


Comment 8 Ranjan Maitra 2005-09-29 13:51:49 UTC
Thanks very much! I have verified that it works. I have also verified that the
querying for optimal lwork which is the ideal way of doing things also works. 

I think the patch should remain, since it corrects for real errors. If possible,
the manpages (user manual could be upated with all the corrections in this patch).

http://www.netlib.org/lapack/html/errata.lug
http://www.netlib.org/lapack/release_notes.html

After digging around, I came up with the above websites. Maybe, if possible, the
errata could also be included in the manpages in a good way so that we can use
lapack correctly?

Thanks a lot for looking into this, and so quickly!! I am so proud of Fedora
Core as a user!!

Comment 9 Tom "spot" Callaway 2006-04-07 14:09:52 UTC
Closing out this bug.