Bug 1641044 - Crash of xorg-x11-server-Xorg on T580 laptop
Summary: Crash of xorg-x11-server-Xorg on T580 laptop
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libpciaccess
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Dave Airlie
QA Contact: Desktop QE
Docs Contact: Marie Dolezelova
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-19 13:30 UTC by Jiri Hladky
Modified: 2018-10-30 10:44 UTC

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
X.org X11 crashes on Lenovo T580

Due to a bug in the `libpciaccess` library, the X.org X11 server terminates unexpectedly on Lenovo T580 laptops.
Clone Of:
Environment:
Last Closed:


Attachments
Crash report for xorg-x11-server-Xorg-1.20.1-5.el7 (502.73 KB, text/plain)
2018-10-19 13:30 UTC, Jiri Hladky
scanpci log (2.41 KB, text/plain)
2018-10-23 11:46 UTC, Jiri Hladky


Links
System ID Priority Status Summary Last Updated
FreeDesktop.org 81678 None None None 2018-10-23 12:27:51 UTC

Description Jiri Hladky 2018-10-19 13:30:03 UTC
Created attachment 1495634
Crash report for xorg-x11-server-Xorg-1.20.1-5.el7

Description of problem:

xorg-x11-server-Xorg crashes on a T580 laptop:

https://beaker.cluster-qe.lab.eng.brq.redhat.com/bkr/view/t580.tpb.lab.eng.brq.redhat.com#details


We are reporting a crash of xorg-x11-server-Xorg on a T580 laptop. We see the problem with both:
xorg-x11-server-Xorg-1.20.1-5.el7
xorg-x11-server-Xorg-1.20.1-3.el7

RHEL-7.6-20181014.n.0 + xorg-x11-server-Xorg-1.20.1-5.el7
https://beaker.cluster-qe.lab.eng.brq.redhat.com/bkr/jobs/88458

RHEL-7.6-20181010.0
https://beaker.cluster-qe.lab.eng.brq.redhat.com/bkr/jobs/88241

I will upload the crash report.

Comment 5 Jiri Hladky 2018-10-23 11:45:24 UTC
Hi Olivier,

I have tested scanpci from attachment 1496677 and it works just fine. I will upload the log:

./scanpci  2>&1 > scanpci_$(uname -r).log

I will also e-mail you the access details for the t580 so that you can poke around.

Thanks
Jirka

Comment 6 Jiri Hladky 2018-10-23 11:46:38 UTC
Created attachment 1496681
scanpci log

scanpci works fine. I'm attaching the log file:

./scanpci  2>&1 > scanpci_$(uname -r).log

Comment 8 Jiri Hladky 2018-10-23 12:30:28 UTC
Hi Olivier,

I have run the test in Beaker twice and it has failed both times. Each run executes the following command 3 times in a row:

gnome-shell-perf-tool --perf-iters=2 --replace

Based on 
$ grep Segmentation /var/log/Xorg.*
/var/log/Xorg.1.log:[   881.168] (EE) Segmentation fault at address 0x124
/var/log/Xorg.1.log:[   881.168] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.2.log:[   881.292] (EE) Segmentation fault at address 0x124
/var/log/Xorg.2.log:[   881.292] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.3.log:[   881.417] (EE) Segmentation fault at address 0x124
/var/log/Xorg.3.log:[   881.417] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.4.log:[   881.521] (EE) Segmentation fault at address 0x124
/var/log/Xorg.4.log:[   881.521] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.5.log:[   881.625] (EE) Segmentation fault at address 0x124
/var/log/Xorg.5.log:[   881.625] (EE) Caught signal 11 (Segmentation fault). Server aborting

it seems it has crashed 5 times out of 6 (=2*3)

I have tried to run the test manually several times now but I'm not able to reproduce the crash. I'm not sure what has changed. 

$ { time dogtail-run-headless-next "turbostat -out turbostat_$(date "+%Y-%b-%d_%Hh%Mm%Ss").log gnome-shell-perf-tool --perf-iters=2 --replace";  } >gnome-shell-perf-tool_$(date "+%Y-%b-%d_%Hh%Mm%Ss").log 2>&1

Jirka

Comment 9 Olivier Fourdan 2018-10-23 13:54:08 UTC
Right, so if I understand correctly, this is dogtail running multiple gnome-shell-perf-tool launching different Xservers simultaneously.

So it could be that these multiple X servers fail to access/parse "/dev/vga_arbiter", maybe a race condition, that could explain why it fails sometimes, not always.

Comment 10 Jiri Hladky 2018-10-23 14:48:29 UTC
There is always only one gnome-shell-perf-tool running. We run them one after another, not in parallel. 

I have restarted the job to get a clear status. I will update you in an hour or so about the outcome. 

Jirka

Comment 11 Adam Jackson 2018-10-23 16:04:43 UTC
(In reply to Olivier Fourdan from comment #9)
> Right, so if I understand correctly, this is dogtail running multiple
> gnome-shell-perf-tool launching different Xservers simultaneously.
> 
> So it could be that these multiple X servers fail to access/parse
> "/dev/vga_arbiter", maybe a race condition, that could explain why it fails
> sometimes, not always.

I don't think that can be it. Reads from /dev/vga_arbiter are atomic (they take the lock around the vga arb state). libpciaccess uses the result to construct arguments for pci_device_find_by_slot(), which just walks the in-memory list of PCI devices looking for a match (a list we've already built, which is how we know to try vgaarb init at all). The only way that can really crash, afaict, would be if the PCI device list was getting corrupted.
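
A minimal sketch of that kind of lookup, for illustration only. The types `slot_entry` and `device_table` and the function `find_by_slot()` are simplified, hypothetical stand-ins, not the real libpciaccess definitions; the point is only that the loop trusts the device count and the device array to agree with each other:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one entry of the in-memory PCI device list. */
struct slot_entry {
    unsigned domain, bus, dev, func;
};

/* Hypothetical, simplified stand-in for the structure that owns the list. */
struct device_table {
    size_t num_devices;
    struct slot_entry *devices;
};

/* Walk the table and return the first entry matching the slot, or NULL if none matches. */
struct slot_entry *find_by_slot(const struct device_table *t,
                                unsigned domain, unsigned bus,
                                unsigned dev, unsigned func)
{
    assert(t != NULL);

    for (size_t i = 0; i < t->num_devices; i++) {
        /* The loop relies on devices holding at least num_devices entries. */
        struct slot_entry *e = &t->devices[i];

        if (e->domain == domain && e->bus == bus &&
            e->dev == dev && e->func == func)
            return e;
    }
    return NULL;
}
```

With a well-formed table the walk either returns a matching entry or NULL; it can only fault if the count and the array no longer describe the same list.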

Comment 12 Jiri Hladky 2018-10-23 18:42:11 UTC
The last Beaker run shows the crash again (as did all previous Beaker runs).

I see the crash in all Xorg.* log files:
Xorg.0.log:[   898.241] (EE) Segmentation fault at address 0x124 
Xorg.0.log:[   898.241] (EE) Caught signal 11 (Segmentation fault). Server aborting 

I have sent you the login information by e-mail. Feel free to poke around on the notebook.

Thanks!
Jirka

$ abrt-cli list 
id 36e908c9a27eca18428c22d51e22a0a4d474eec6 
reason:         Xorg server crashed 
time:           Tue 23 Oct 2018 08:30:21 PM CEST 
package:        xorg-x11-server-Xorg-1.20.1-5.el7 
uid:            0 (root) 
count:          1 
Directory:      /var/spool/abrt/xorg-2018-10-23-20:30:21-5435-1 
Run 'abrt-cli report /var/spool/abrt/xorg-2018-10-23-20:30:21-5435-1' for creating a case in Red Hat Customer Portal 

id d9d9ebc4a2ab25bb81829b615183d56fd459dab5 
reason:         Xorg killed by SIGABRT 
time:           Tue 23 Oct 2018 05:06:40 PM CEST 
cmdline:        /usr/bin/X :0 -background none -noreset -audit 4 -verbose -auth /run/gdm/auth-for-gdm-6x260x/database -seat seat0 vt1 
package:        xorg-x11-server-Xorg-1.20.1-5.el7 
uid:            0 (root) 
count:          1 
Directory:      /var/spool/abrt/ccpp-2018-10-23-17:06:40-21651 
Reported:       http://faf.lab.eng.brq.redhat.com/faf/reports/bthash/9868bf85b19cf1f4bbf4ab1765f5a3e5573b27a1 
               http://faf.lab.eng.brq.redhat.com/faf/reports/9028/ 
Run 'abrt-cli report /var/spool/abrt/ccpp-2018-10-23-17:06:40-21651' for creating a case in Red Hat Customer Portal 

The Autoreporting feature is disabled. Please consider enabling it by issuing 
'abrt-auto-reporting enabled' as a user with root privileges

Comment 13 Olivier Fourdan 2018-10-24 07:55:12 UTC
So I found a core file in one of the abrt folders, installed the debuginfo packages locally on a VM and ran gdb, so at least we have a backtrace with symbols.


(gdb) bt
#0  0x00007f387398a207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f387398b8f8 in __GI_abort () at abort.c:90
#2  0x00005585a9643bda in OsAbort () at utils.c:1350
#3  0x00005585a9649773 in AbortServer () at log.c:877
#4  0x00005585a964a5bd in FatalError (
    f=f@entry=0x5585a967a930 "Caught signal %d (%s). Server aborting\n") at log.c:1015
#5  0x00005585a9640e49 in OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>)
    at osinit.c:156
#6  <signal handler called>
#7  0x00007f38757a5cd0 in pci_device_next (iter=iter@entry=0x7ffcf02643d0) at common_iterator.c:182
#8  0x00007f38757a5d6b in pci_device_find_by_slot (domain=<optimized out>, bus=<optimized out>, 
    dev=<optimized out>, func=<optimized out>) at common_iterator.c:233
#9  0x00007f38757a7a46 in pci_device_vgaarb_init () at common_vgaarb.c:149
#10 0x00005585a9540c49 in xf86VGAarbiterInit () at xf86VGAarbiter.c:72
#11 0x00005585a951ba10 in xf86BusConfig () at xf86Bus.c:158
#12 0x00005585a9528eec in InitOutput (pScreenInfo=pScreenInfo@entry=0x5585a98f0a20 <screenInfo>, 
    argc=argc@entry=13, argv=argv@entry=0x7ffcf02646e8) at xf86Init.c:503
#13 0x00005585a94ec1b0 in dix_main (argc=13, argv=0x7ffcf02646e8, envp=<optimized out>) at main.c:193
#14 0x00007f38739763d5 in __libc_start_main (main=0x5585a94d64a0 <main>, argc=13, argv=0x7ffcf02646e8, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffcf02646d8)
    at ../csu/libc-start.c:266
#15 0x00005585a94d64ce in _start ()

(gdb) p *iter
$3 = {next_index = 1, mode = match_slot, match = {slot = {domain = 0, bus = 0, dev = 2, func = 0, 
      match_data = 139880467972592}, id = {vendor_id = 0, device_id = 0, subvendor_id = 2, 
      subdevice_id = 0, device_class = 1973076464, device_class_mask = 32568, 
      match_data = 139880467972592}}}
(gdb) list
177		while ( iter->next_index < pci_sys->num_devices ) {
178		    struct pci_device_private * const temp =
179		      & pci_sys->devices[ iter->next_index ];
180	
181		    iter->next_index++;
182		    if ( PCI_ID_COMPARE( iter->match.slot.domain, temp->base.domain )
183			 && PCI_ID_COMPARE( iter->match.slot.bus, temp->base.bus )
184			 && PCI_ID_COMPARE( iter->match.slot.dev, temp->base.dev )
185			 && PCI_ID_COMPARE( iter->match.slot.func, temp->base.func ) ) {
186			d = temp;
(gdb) p *temp
Cannot access memory at address 0x0
(gdb) p temp
$4 = (struct pci_device_private * const) 0x0

Comment 14 Olivier Fourdan 2018-10-24 07:56:59 UTC
that's weird...:

(gdb) p *pci_sys
$5 = {methods = 0x7f38759abd40 <linux_sysfs_methods>, num_devices = 25, devices = 0x0, mtrr_fd = 12, vgaarb_fd = 14, vga_count = 1, vga_target = 0x0, vga_default_dev = 0x0}

Comment 16 Olivier Fourdan 2018-10-24 08:10:34 UTC
Oh wait!

In `populate_entries()` from linux_sysfs.c, in case of errors, we do clean the content but leave the number of entries unchanged:

int
populate_entries( struct pci_system * p )
{
    struct dirent ** devices = NULL;
    int n;
    int i;
    int err = 0;


    n = scandir( SYS_BUS_PCI, & devices, scan_sys_pci_filter, alphasort );
    if ( n > 0 ) {
        p->num_devices = n;
        p->devices = calloc( n, sizeof( struct pci_device_private ) );

    ...
        }
        else {
            err = ENOMEM;
        }
    }

    for (i = 0; i < n; i++)
        free(devices[i]);
    free(devices);

    if (err) {
        free(p->devices);
        p->devices = NULL;
    }

    return err;
}

Looks like this is exactly what happens here: we have p->devices = NULL but p->num_devices = 25!

I think we should just reset p->num_devices to 0 in case of errors, and the issue wouldn't occur...
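
A minimal sketch of that change, for illustration only and not the actual patch. It assumes a simplified `struct pci_system` with just the two relevant fields, and `finish_populate()` is a hypothetical helper standing in for the error path at the end of populate_entries():

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Simplified stand-ins; the real libpciaccess structs carry more fields. */
struct pci_device_private {
    int domain, bus, dev, func;
};

struct pci_system {
    size_t num_devices;
    struct pci_device_private *devices;
};

/*
 * Hypothetical helper mirroring the error path of populate_entries():
 * on error, drop the partially filled list AND reset the count, so that
 * later iteration sees an empty list instead of 25 entries behind a NULL
 * pointer.
 */
int finish_populate(struct pci_system *p, int err)
{
    assert(p != NULL);

    if (err) {
        free(p->devices);
        p->devices = NULL;
        p->num_devices = 0;   /* the proposed addition */
    }
    return err;
}
```

With the count reset, a loop of the form `while (next_index < num_devices)` simply finds no devices after a failed populate, rather than dereferencing a NULL-based address.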

Comment 18 Olivier Fourdan 2018-10-24 08:40:09 UTC
I've run a scratch build of libpciaccess with that patch added:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18906350

Could you update libpciaccess with that scratch build and try to reproduce again, to see if that solves the issue?

Comment 19 Jiri Hladky 2018-10-24 12:16:29 UTC
Thank you! I have scheduled the Beaker job and I will let you know the results later today. 

Jirka

Comment 20 Jiri Hladky 2018-10-24 15:04:58 UTC
Hi Olivier,

it has indeed fixed the problem. I have added the following to the Beaker XML:

      <task name="/distribution/command" role="STANDALONE">
        <params>
          <param name="CMDS_TO_RUN" value="yum -y localinstall http://perf-desktop.brq.redhat.com/Kernel/2018-Oct-24_14h06m36s_libpciaccess-0.14-1.1test.el7.repo/libpciaccess-0.14-1.1test.el7.x86_64.rpm http://perf-desktop.brq.redhat.com/Kernel/2018-Oct-24_14h06m36s_libpciaccess-0.14-1.1test.el7.repo/libpciaccess-devel-0.14-1.1test.el7.x86_64.rpm"/>
        </params>
      </task>

and it has worked as expected, there was no crash.

For reference, here is the Beaker job:
https://beaker.cluster-qe.lab.eng.brq.redhat.com/bkr/jobs/89240

Thanks!
Jirka

Comment 21 Olivier Fourdan 2018-10-24 15:23:00 UTC
Woohoo brilliant, thanks for testing!

Comment 25 Jiri Hladky 2018-10-26 08:24:46 UTC
Hi,

I will try an initial draft.

Cause : Bug in the code. It's known to be triggered on the Lenovo T580 laptop.
Consequence : Crash in xorg-x11-server-Xorg
Workaround : None, AFAIK
Result : ? 


@Olivier - could you review it and add more details?

Thanks!
Jirka

Comment 26 Olivier Fourdan 2018-10-26 08:46:07 UTC
I suspect this bug is triggered by beaker starting multiple instances of Xorg simultaneously; I think it would be much harder to trigger in a real-life scenario where only one Xserver gets started at once.

Also, I'm not sure why it would show up on the T580 in particular, could be a kernel bug or something that causes either the parsing of a sysfs entry to fail or a sysfs_read to fail.

Either way, such a failure would cause a discrepancy in the list of PCI devices in libpciaccess: the number of entries is left unchanged while the actual entry data is reset to NULL when an error is detected while populating the device entries.

As a result, when iterating over the list of entries later on in libpciaccess, the code could try to access device data that was previously reset to NULL in the error handler, hence causing the NULL pointer dereference and the crash of the Xserver, which uses libpciaccess.

The fix consists of resetting the number of devices to 0 in case of error so that there is no discrepancy between the number of devices and the actual device list.

Comment 27 Jiri Hladky 2018-10-26 09:11:08 UTC
> Also, I'm not sure why it would show up on the T580 in particular, could be a
> kernel bug or something that causes either the parsing of a sysfs entry to
> fail or a sysfs_read to fail.

We run the same test on a number of other laptops (Lenovo: x240, w541, t470s, t450s, t460p; Dell Elitebook 8470p), and the t580 is the only one where we have experienced this issue.

Comment 29 Jiri Hladky 2018-10-29 14:14:53 UTC
Hi Marie,

the current description looks good to me. 

========================================================================
X.org X11 crashes on Lenovo T580

Due to a bug in the `libpciaccess` library, the X.org X11 server terminates unexpectedly on Lenovo T580 laptops.
========================================================================

Thanks
Jirka

