Bug 510225

Summary: Segfault/Infinite loop in TLS double access
Product: Red Hat Enterprise Linux 5 Reporter: Michal Nowak <mnowak>
Component: kernel-xen Assignee: Paolo Bonzini <pbonzini>
Status: CLOSED ERRATA QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 5.4 CC: clalance, cward, dzickus, ohudlick, pbonzini, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-03-30 07:45:20 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 526775, 526946    
Attachments:
Description                              Flags
patch to fix the bug                     none
patch matching what was applied upstream none

Description Michal Nowak 2009-07-08 12:32:42 UTC
Description of problem:

gcc44-gfortran-4.4.0-6.el5

Reproducible with kernel-xen-2.6.18-157.el5 (Linux
athlon5.rhts.bos.redhat.com 2.6.18-157.el5xen #1 SMP Mon Jul 6 18:26:42 EDT
2009 i686 athlon i386 GNU/Linux):

* gfortran44, kernel-xen, *no* -mno-tls-direct-seg-refs:

[root@athlon5 445666-OpenMP-segv]# ./reproducer44-kernel-xen 
           1
Segmentation fault

* gfortran44, kernel-xen, *with* -mno-tls-direct-seg-refs:

[root@athlon5 445666-OpenMP-segv]# gfortran44 -o reproducer reproducer.f90 -mno-tls-direct-seg-refs
[root@athlon5 445666-OpenMP-segv]# ./reproducer
           1
           2


* without OpenMP it works fine

[root@athlon5 445666-OpenMP-segv]# gfortran44 -o reproducer-mnowak reproducer.f90 -O1 
[root@athlon5 445666-OpenMP-segv]# ./reproducer-mnowak 
           1
           2

[root@athlon5 445666-OpenMP-segv]# cat reproducer.f90
program foo
        implicit none
        common /bobcom/ bob(2)
!$omp threadprivate (/bobcom/)

        integer i
        real*8 bob

        do i=1,2
        write(*,*) i
        bob(i)=0.0d0
        enddo

        end program




Testcase: /tools/gcc/Regression/OpenMP/445666-OpenMP-segv

kernel-xen-2.6.18-128.1.18.el5 behaves the same way, as does 2.6.18-126.el5xen.

Comment 1 Chris Lalancette 2009-07-08 12:41:49 UTC
Actually, to really reproduce the bug, you need to use the above program but compile it with:

gfortran -g -fopenmp -o chris rep.f90

(the important bit is the -fopenmp).

Note that this also fails on a -128 kernel, as far as I can tell, so it's not a regression.

Chris Lalancette

Comment 2 Chris Lalancette 2009-07-09 11:30:13 UTC
I've been able to reduce the test case a bit further:

program foo
        implicit none
        common /bobcom/ bob(2)
!$omp threadprivate (/bobcom/)
        real*8 bob

        write(*,*) 1
        bob(1)=0.0d0

        end program

Interestingly, the important piece is the threadprivate directive for OpenMP.  That seems to generate this bit of assembly:

        bob(1)=0.0d0
 804865b:	d9 ee                	fldz   
 804865d:	65 dd 1d f0 ff ff ff 	fstpl  %gs:0xfffffff0
 8048664:	c9                   	leave  
 8048665:	c3                   	ret    
 8048666:	90                   	nop    
 8048667:	90                   	nop    
 8048668:	90                   	nop    
 8048669:	90                   	nop    
 804866a:	90                   	nop    
 804866b:	90                   	nop    
 804866c:	90                   	nop    
 804866d:	90                   	nop    
 804866e:	90                   	nop    
 804866f:	90                   	nop    

I'm guessing that we aren't properly emulating the fldz and/or fstpl instructions, and that is what is causing the failure.  Indeed, while the upstream hypervisor emulates those instructions, we do not.  However, after some brief debugging, it doesn't seem we are entering the emulator properly at all.  It will need more looking at.

Chris Lalancette

Comment 3 Paolo Bonzini 2009-07-09 16:48:24 UTC
The problem is the fstpl.

With the following reduced C test case (which does not require -fopenmp BTW):

  __thread double x;
  double y;
  int main()
  {
    x = y * 0.0;
  }

I get a segmentation fault; with just "x = y" (which does not use fstpl), I get the Xen warning.

Comment 4 Paolo Bonzini 2009-07-09 18:33:45 UTC
Actually, if I compile the C code without -O2, I get an infinite loop instead.

The encoding of the problematic instruction is 65 dd 1d f8 ff ff ff.
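
For reference, the encoding can be taken apart by hand. The following is a minimal sketch in C, not code from Xen or the kernel; the struct and function names are made up for illustration. It handles only the shape seen here: an optional 0x65 prefix, one opcode byte, and a ModRM byte that selects a bare 32-bit displacement (mod = 0, rm = 5 in 32-bit addressing).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a decoded "[0x65] opcode ModRM disp32" instruction.
 * Hypothetical type, for illustration only. */
struct decoded {
    int     gs_override;   /* 1 if the 0x65 (%gs) segment prefix is present */
    uint8_t opcode;        /* primary opcode byte */
    uint8_t mod, reg, rm;  /* the three ModRM fields */
    int32_t disp;          /* signed 32-bit displacement */
};

/* Decode only the one shape described above; anything else trips an assert. */
static struct decoded decode_simple(const uint8_t *p, size_t len)
{
    struct decoded d = {0};
    size_t i = 0;
    uint32_t raw;

    assert(p != NULL && len >= 6);
    if (p[i] == 0x65) {            /* %gs segment override prefix */
        d.gs_override = 1;
        i++;
    }
    assert(len - i >= 6);          /* opcode + ModRM + 4 displacement bytes */

    d.opcode = p[i++];
    d.mod = p[i] >> 6;
    d.reg = (p[i] >> 3) & 7;
    d.rm  = p[i] & 7;
    i++;
    assert(d.mod == 0 && d.rm == 5);   /* disp32 only, no base register */

    /* The displacement is stored little-endian. */
    raw = (uint32_t)p[i] | (uint32_t)p[i + 1] << 8 |
          (uint32_t)p[i + 2] << 16 | (uint32_t)p[i + 3] << 24;

    /* Convert two's complement to a signed value without relying on
     * implementation-defined casts. */
    d.disp = (raw >= 0x80000000u) ? -(int32_t)(~raw) - 1 : (int32_t)raw;
    return d;
}
```

For the bytes 65 dd 1d f8 ff ff ff this gives gs_override = 1, opcode 0xdd with reg = 3 (the DD /3 form, which is FSTP m64fp) and disp = -8, so the instruction stores a double to %gs:-8. The instruction in comment 2 has the same layout with displacement f0 ff ff ff, which decodes to %gs:-16.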

Comment 5 Paolo Bonzini 2009-07-10 14:38:05 UTC
Reproducible with Xen 3.2 on RHEL kernel, but not with Xen 3.2 on XenLiveCD kernel 2.6.26.

Comment 6 Paolo Bonzini 2009-07-16 14:13:43 UTC
Actually I was wrong, it is still reproducible with the XenLiveCD's kernel 2.6.26.

Comment 7 Paolo Bonzini 2009-07-20 13:04:47 UTC
And also with the upstream hypervisor, despite the many changes that went in for x87 emulation (16859 16860 17120 17175 17180 17183 17474 17475 17924).

Comment 8 Paolo Bonzini 2009-07-20 16:29:08 UTC
And also with upstream hypervisor _and_ kernel.

Comment 9 Paolo Bonzini 2009-07-24 16:32:55 UTC
Created attachment 355065 [details]
patch to fix the bug

Aha, I confused instruction emulation with TLS segment fixup.

The patch is trivial since the segment fixup code cares only about the operands of the instruction, not about its semantics.  I'm submitting the patch upstream.
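
As a rough illustration of why operand-only decoding is enough: to find a %gs-relative memory operand, a decoder only has to know whether the opcode is followed by a ModRM byte, not what the instruction computes. The sketch below is hypothetical and is not the Xen fixup code or the attached patch; has_modrm and gs_negative_disp are invented names, and the opcode list is deliberately tiny. The x87 escape opcodes 0xd8 to 0xdf are all followed by a ModRM byte, so a single range check covers them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: is this one-byte opcode followed by a ModRM byte?
 * Only a few opcodes are listed; a real decoder needs the full opcode map. */
static int has_modrm(uint8_t opcode)
{
    if (opcode >= 0xd8 && opcode <= 0xdf)   /* x87 escape opcodes */
        return 1;
    switch (opcode) {
    case 0x88: case 0x89: case 0x8a: case 0x8b:   /* MOV between r/m and reg */
        return 1;
    default:
        return 0;   /* "unknown" in this sketch, so no operand is located */
    }
}

/* Return 1 and store the displacement if the bytes are a %gs-relative access
 * of the form "65 opcode ModRM(mod=0, rm=5) disp32" with a negative offset.
 * Return 0 for anything else. The opcode's meaning is never inspected. */
static int gs_negative_disp(const uint8_t *p, size_t len, int32_t *disp_out)
{
    uint32_t raw;

    assert(p != NULL && disp_out != NULL);
    if (len < 7 || p[0] != 0x65 || !has_modrm(p[1]))
        return 0;
    if ((p[2] & 0xc7) != 0x05)     /* require mod == 0 and rm == 5 */
        return 0;

    /* Little-endian 32-bit displacement. */
    raw = (uint32_t)p[3] | (uint32_t)p[4] << 8 |
          (uint32_t)p[5] << 16 | (uint32_t)p[6] << 24;
    if (raw < 0x80000000u)         /* non-negative offset: nothing to do */
        return 0;

    *disp_out = -(int32_t)(~raw) - 1;
    return 1;
}
```

With this table, the fstpl encoding 65 dd 1d f8 ff ff ff is recognized and yields a displacement of -8 through exactly the same path as a MOV with the same ModRM byte. If the 0xd8 to 0xdf range check were removed, the same bytes would be rejected even though the operand layout is identical.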

Comment 10 Paolo Bonzini 2009-07-30 09:41:13 UTC
Committed upstream at http://xenbits.xensource.com/xen-unstable.hg?rev/19985

Comment 11 Paolo Bonzini 2009-07-30 14:09:55 UTC
Created attachment 355680 [details]
patch matching what was applied upstream

Comment 12 Chris Lalancette 2009-08-25 10:00:29 UTC
I've uploaded a test kernel that should have a fix for this problem here:

http://people.redhat.com/clalance/virttest/

Can the reporters who are having problems please download and try out this test kernel?

Thanks,
Chris Lalancette

Comment 13 Michal Nowak 2009-08-25 11:02:35 UTC
I'll have a look with 5.4 + your kernel.

Comment 14 Michal Nowak 2009-08-25 12:27:08 UTC
Kernel version: 2.6.18-164.el5xen
=================================

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: [gfortran44] Testing the executable
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   PASS   ] :: [gfortran44] Compile the testcase
           1
/usr/lib/beakerlib//testing.sh: line 575:  2636 Segmentation fault      ./reproducer
:: [   FAIL   ] :: [gfortran44] Checking we have a working executable (Expected 0, got 139)
1
/tools/gcc/Regression/OpenMP/445666-OpenMP-segv/-gfortran44-Testing-the-executable result: FAIL


Kernel version: 2.6.18-164.el5virttest17xen
===========================================

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: [gfortran44] Testing the executable
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   PASS   ] :: [gfortran44] Compile the testcase
           1
           2
:: [   PASS   ] :: [gfortran44] Checking we have a working executable
0
/tools/gcc/Regression/OpenMP/445666-OpenMP-segv/-gfortran44-Testing-the-executable result: PASS



FIXED.

Paolo, Chris, thanks!

Comment 15 Don Zickus 2009-10-21 19:12:13 UTC
in kernel-2.6.18-170.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 21 errata-xmlrpc 2010-03-30 07:45:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html