Bug 510225 - Segfault/Infinite loop in TLS double access
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.4
Hardware: All Linux
Priority: low, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Paolo Bonzini
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 526775 526946
Reported: 2009-07-08 08:32 EDT by Michal Nowak
Modified: 2013-03-07 21:06 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Last Closed: 2010-03-30 03:45:20 EDT
Attachments
patch to fix the bug (4.77 KB, patch)
2009-07-24 12:32 EDT, Paolo Bonzini
patch matching what was applied upstream (4.96 KB, patch)
2009-07-30 10:09 EDT, Paolo Bonzini
Description Michal Nowak 2009-07-08 08:32:42 EDT
Description of problem:

gcc44-gfortran-4.4.0-6.el5

Reproducible with kernel-xen-2.6.18-157.el5 (Linux athlon5.rhts.bos.redhat.com 2.6.18-157.el5xen #1 SMP Mon Jul 6 18:26:42 EDT 2009 i686 athlon i386 GNU/Linux):

* gfortran44, kernel-xen, *no* -mno-tls-direct-seg-refs:

[root@athlon5 445666-OpenMP-segv]# ./reproducer44-kernel-xen 
           1
Segmentation fault

* gfortran44, kernel-xen, *with* -mno-tls-direct-seg-refs:

[root@athlon5 445666-OpenMP-segv]# gfortran44 -o reproducer reproducer.f90 -mno-tls-direct-seg-refs
[root@athlon5 445666-OpenMP-segv]# ./reproducer
           1
           2


* without OpenMP it works fine

[root@athlon5 445666-OpenMP-segv]# gfortran44 -o reproducer-mnowak reproducer.f90 -O1 
[root@athlon5 445666-OpenMP-segv]# ./reproducer-mnowak 
           1
           2

[root@athlon5 445666-OpenMP-segv]# cat reproducer.f90
program foo
        implicit none
        common /bobcom/ bob(2)
!$omp threadprivate (/bobcom/)

        integer i
        real*8 bob

        do i=1,2
        write(*,*) i
        bob(i)=0.0d0
        enddo

        end program




Testcase: /tools/gcc/Regression/OpenMP/445666-OpenMP-segv

The same failure occurs with kernel-xen-2.6.18-128.1.18.el5 and with 2.6.18-126.el5xen.
Comment 1 Chris Lalancette 2009-07-08 08:41:49 EDT
Actually, to really reproduce the bug, you need to compile the above program with:

gfortran -g -fopenmp -o chris rep.f90

(the important bit is the -fopenmp).

Also, it's important to note that this also fails for a -128 kernel, as far as I can tell, so it's not a regression.

Chris Lalancette
Comment 2 Chris Lalancette 2009-07-09 07:30:13 EDT
I've been able to reduce the test case a bit further:

program foo
        implicit none
        common /bobcom/ bob(2)
!$omp threadprivate (/bobcom/)
        real*8 bob

        write(*,*) 1
        bob(1)=0.0d0

        end program

Interestingly, the important piece is the OpenMP threadprivate directive, which seems to generate this bit of assembly:

        bob(1)=0.0d0
 804865b:	d9 ee                	fldz   
 804865d:	65 dd 1d f0 ff ff ff 	fstpl  %gs:0xfffffff0
 8048664:	c9                   	leave  
 8048665:	c3                   	ret    
 8048666:	90                   	nop    
 8048667:	90                   	nop    
 8048668:	90                   	nop    
 8048669:	90                   	nop    
 804866a:	90                   	nop    
 804866b:	90                   	nop    
 804866c:	90                   	nop    
 804866d:	90                   	nop    
 804866e:	90                   	nop    
 804866f:	90                   	nop    

I'm guessing that we aren't properly emulating the fldz and/or fstpl instructions, and that is what is causing the failure.  Indeed, while the upstream hypervisor emulates those instructions, we do not.  However, after some brief debugging, it doesn't seem that we are entering the emulator at all.  This needs further investigation.

Chris Lalancette
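
For readers following the disassembly in this comment: the operand %gs:0xfffffff0 is a negative offset from the thread pointer, since on i386 Linux the static TLS block sits below the address held at %gs:0. The short Python sketch below (helper name to_signed32 is made up for illustration) just does the arithmetic, showing that the offset is -16, which is exactly the size of the two real*8 elements of bob.

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit unsigned value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


REAL8_SIZE = 8  # bytes in a Fortran real*8 (same as a C double)

# Offset copied from the disassembly: fstpl %gs:0xfffffff0
offset = to_signed32(0xFFFFFFF0)
print(offset)                 # signed offset from the thread pointer
print(-offset // REAL8_SIZE)  # how many real*8 values fit in that space
```

This is also why -mno-tls-direct-seg-refs changes the behaviour: with that flag GCC adds the thread base pointer explicitly instead of using an offset from the segment register directly.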
Comment 3 Paolo Bonzini 2009-07-09 12:48:24 EDT
The problem is the fstpl.

With the following reduced C test case (which does not require -fopenmp BTW):

  __thread double x;
  double y;
  int main()
  {
    x = y * 0.0;
  }

I get a segmentation fault.  With just "x = y" (which does not use fstpl), I get the Xen warning.
Comment 4 Paolo Bonzini 2009-07-09 14:33:45 EDT
Actually, if I compile the C code without -O2, I get an infinite loop instead.

The encoding of the problematic instruction is 65 dd 1d f8 ff ff ff.
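
As an aid to reading that encoding, here is a minimal Python sketch that splits the seven bytes into prefix, opcode, ModRM fields and displacement. The function name decode_gs_x87_store is hypothetical, and the sketch only handles this one instruction shape (segment prefix, one-byte opcode, ModRM, disp32); it is not a general x86 decoder.

```python
import struct


def decode_gs_x87_store(encoding: str) -> dict:
    """Split a 'prefix opcode modrm disp32' byte string into its fields."""
    raw = bytes.fromhex(encoding)
    prefix, opcode, modrm = raw[0], raw[1], raw[2]
    fields = {
        "gs_override": prefix == 0x65,  # 0x65 is the GS segment-override prefix
        "opcode": opcode,               # 0xDD is one of the x87 escape opcodes
        "mod": modrm >> 6,              # addressing mode
        "reg": (modrm >> 3) & 7,        # opcode extension; DD /3 is FSTP m64fp
        "rm": modrm & 7,                # with mod=0, rm=5 means "disp32 only"
        "disp32": None,
    }
    if fields["mod"] == 0 and fields["rm"] == 5:
        # Little-endian signed 32-bit displacement follows the ModRM byte.
        (fields["disp32"],) = struct.unpack_from("<i", raw, 3)
    return fields


print(decode_gs_x87_store("65 dd 1d f8 ff ff ff"))
```

Read this way, the instruction is a 64-bit x87 store-and-pop to an absolute address in the GS segment, with a displacement of -8 for the single thread-local double x.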
Comment 5 Paolo Bonzini 2009-07-10 10:38:05 EDT
Reproducible with Xen 3.2 on RHEL kernel, but not with Xen 3.2 on XenLiveCD kernel 2.6.26.
Comment 6 Paolo Bonzini 2009-07-16 10:13:43 EDT
Actually, I was wrong: it is still reproducible with the XenLiveCD's kernel 2.6.26.
Comment 7 Paolo Bonzini 2009-07-20 09:04:47 EDT
And also with the upstream hypervisor, even though a lot of changes went in for x87 emulation (16859 16860 17120 17175 17180 17183 17474 17475 17924).
Comment 8 Paolo Bonzini 2009-07-20 12:29:08 EDT
And also with upstream hypervisor _and_ kernel.
Comment 9 Paolo Bonzini 2009-07-24 12:32:55 EDT
Created attachment 355065
patch to fix the bug

Aha, I confused instruction emulation with TLS segment fixup.

The patch is trivial since the segment fixup code cares only about the operands of the instruction, not about its semantics.  I'm submitting the patch upstream.
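
To illustrate what "cares only about the operands" can mean, here is a toy Python model of a table-driven operand locator. It is NOT the Xen code and does not claim to mirror the attached patch: the table contents, the names KNOWN_MODRM_OPCODES, X87_ESCAPES and find_gs_operand, and the restriction to a single addressing form are all assumptions made for the example.

```python
# Hypothetical table of opcodes the toy decoder knows carry a ModRM byte.
# 0x89 is MOV r/m32, r32 and 0x8B is MOV r32, r/m32.
KNOWN_MODRM_OPCODES = {0x89, 0x8B}

# The x87 escape opcodes 0xD8..0xDF are always followed by a ModRM byte.
X87_ESCAPES = set(range(0xD8, 0xE0))


def find_gs_operand(raw: bytes, known: set):
    """Return the signed disp32 of a %gs-relative absolute operand.

    Returns None when the toy decoder gives up. What the instruction
    does with the operand (load, store, x87 pop) is never examined.
    """
    if len(raw) < 7 or raw[0] != 0x65:  # need GS prefix + opcode + ModRM + disp32
        return None
    opcode, modrm = raw[1], raw[2]
    if opcode not in known:             # unknown opcode: operand cannot be located
        return None
    if modrm >> 6 != 0 or modrm & 7 != 5:
        return None                     # only the "disp32 only" form is handled
    return int.from_bytes(raw[3:7], "little", signed=True)


fstpl = bytes.fromhex("65 dd 1d f8 ff ff ff")
print(find_gs_operand(fstpl, KNOWN_MODRM_OPCODES))                # gives up
print(find_gs_operand(fstpl, KNOWN_MODRM_OPCODES | X87_ESCAPES))  # finds the offset
```

In this toy model, teaching the decoder about a new class of instructions is just a table extension, because only the operand layout matters and not the instruction's semantics.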
Comment 10 Paolo Bonzini 2009-07-30 05:41:13 EDT
Committed upstream at http://xenbits.xensource.com/xen-unstable.hg?rev/19985
Comment 11 Paolo Bonzini 2009-07-30 10:09:55 EDT
Created attachment 355680
patch matching what was applied upstream
Comment 12 Chris Lalancette 2009-08-25 06:00:29 EDT
I've uploaded a test kernel that should have a fix for this problem here:

http://people.redhat.com/clalance/virttest/

Can the reporters who are having problems please download and try out this test kernel?

Thanks,
Chris Lalancette
Comment 13 Michal Nowak 2009-08-25 07:02:35 EDT
I'll have a look with 5.4 + your kernel.
Comment 14 Michal Nowak 2009-08-25 08:27:08 EDT
Kernel version: 2.6.18-164.el5xen
=================================

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: [gfortran44] Testing the executable
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   PASS   ] :: [gfortran44] Compile the testcase
           1
/usr/lib/beakerlib//testing.sh: line 575:  2636 Segmentation fault      ./reproducer
:: [   FAIL   ] :: [gfortran44] Checking we have a working executable (Expected 0, got 139)
1
/tools/gcc/Regression/OpenMP/445666-OpenMP-segv/-gfortran44-Testing-the-executable result: FAIL


Kernel version: 2.6.18-164.el5virttest17xen
===========================================

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: [gfortran44] Testing the executable
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   PASS   ] :: [gfortran44] Compile the testcase
           1
           2
:: [   PASS   ] :: [gfortran44] Checking we have a working executable
0
/tools/gcc/Regression/OpenMP/445666-OpenMP-segv/-gfortran44-Testing-the-executable result: PASS



FIXED.

Paolo, Chris, thanks!
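
A note on the "Expected 0, got 139" line in the failing run: by shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 139 corresponds to signal 11, which matches the "Segmentation fault" message. A minimal Python sketch of that mapping follows; the helper name describe_exit_status is made up for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if status > 128:
        try:
            # Shells report "128 + signal number" for signal deaths.
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exited with status {status}"


print(describe_exit_status(139))
print(describe_exit_status(0))
```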
Comment 15 Don Zickus 2009-10-21 15:12:13 EDT
in kernel-2.6.18-170.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 21 errata-xmlrpc 2010-03-30 03:45:20 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
