Bug 1641044
| Summary: | Crash of xorg-x11-server-Xorg on T580 laptop | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jiri Hladky <jhladky> | ||||||
| Component: | libpciaccess | Assignee: | Dave Airlie <airlied> | ||||||
| Status: | CLOSED WONTFIX | QA Contact: | Desktop QE <desktop-qa-list> | ||||||
| Severity: | urgent | Docs Contact: | Marek Suchánek <msuchane> | ||||||
| Priority: | unspecified | ||||||||
| Version: | 7.6 | CC: | airlied, jhladky, jkoten, jvozar, kkolakow, mboisver, ofourdan, pasik, pstourac, tpelka | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| See Also: | https://gitlab.freedesktop.org/xorg/lib/libpciaccess/issues/5 | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Known Issue | |||||||
| Doc Text: |
X.org X11 crashes on Lenovo T580
Due to a bug in the `libpciaccess` library, the X.org X11 server terminates unexpectedly on Lenovo T580 laptops.
|
Story Points: | --- | ||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2021-02-15 07:43:39 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
Description
Jiri Hladky
2018-10-19 13:30:03 UTC
Hi Olivier,
I have tested scanpci from attachment 1496677 and it works just fine. I will upload the log:
./scanpci > scanpci_$(uname -r).log 2>&1
I will also send you access details for the T580 by e-mail so that you can poke around.
Thanks
Jirka
Created attachment 1496681
scanpci log
scanpci works fine. I'm attaching the log file:
./scanpci > scanpci_$(uname -r).log 2>&1
Hi Olivier,
I have run the test in Beaker twice and it has failed both times. The test runs the following command 3 times in a row:
gnome-shell-perf-tool --perf-iters=2 --replace
Based on
$ grep Segmentation /var/log/Xorg.*
/var/log/Xorg.1.log:[ 881.168] (EE) Segmentation fault at address 0x124
/var/log/Xorg.1.log:[ 881.168] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.2.log:[ 881.292] (EE) Segmentation fault at address 0x124
/var/log/Xorg.2.log:[ 881.292] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.3.log:[ 881.417] (EE) Segmentation fault at address 0x124
/var/log/Xorg.3.log:[ 881.417] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.4.log:[ 881.521] (EE) Segmentation fault at address 0x124
/var/log/Xorg.4.log:[ 881.521] (EE) Caught signal 11 (Segmentation fault). Server aborting
/var/log/Xorg.5.log:[ 881.625] (EE) Segmentation fault at address 0x124
/var/log/Xorg.5.log:[ 881.625] (EE) Caught signal 11 (Segmentation fault). Server aborting
it seems it has crashed 5 times out of 6 (=2*3)
I have tried to run the test manually several times now but I'm not able to reproduce the crash. I'm not sure what has changed.
$ { time dogtail-run-headless-next "turbostat -out turbostat_$(date "+%Y-%b-%d_%Hh%Mm%Ss").log gnome-shell-perf-tool --perf-iters=2 --replace"; } >gnome-shell-perf-tool_$(date "+%Y-%b-%d_%Hh%Mm%Ss").log 2>&1
Jirka
Right, so if I understand correctly, this is dogtail running multiple gnome-shell-perf-tool instances, launching different X servers simultaneously.
So it could be that these multiple X servers fail to access/parse "/dev/vga_arbiter", maybe a race condition, which could explain why it fails sometimes, not always.

There is always only one gnome-shell-perf-tool running. We run them one after another, not in parallel.
I have restarted the job to get a clear status. I will update you in an hour or so about the outcome.
Jirka

(In reply to Olivier Fourdan from comment #9)
> Right, so if I understand correctly, this is dogtail running multiple
> gnome-shell-perf-tool launching different Xservers simultaneously.
>
> So it could be that these multiple X servers fail to access/parse
> "/dev/vga_arbiter", maybe a race condition, that could explain why it fails
> sometimes, not always.
I don't think that can be it. Reads from /dev/vga_arbiter are atomic (they take the lock around the vga arb state). libpciaccess uses that to construct arguments for pci_device_find_by_slot(), which just walks the in-memory list of PCI devices looking for a match (which we've already built, which is how we know to try vgaarb init at all). The only way that can really crash, afaict, would be if the PCI device list was getting corrupted.

The last Beaker run shows the crash again (as did all previous Beaker runs). I see the crash in all Xorg.* log files:
Xorg.0.log:[ 898.241] (EE) Segmentation fault at address 0x124
Xorg.0.log:[ 898.241] (EE) Caught signal 11 (Segmentation fault). Server aborting
I have sent you login information by e-mail. Feel free to poke around on the notebook.
Thanks!
Jirka

$ abrt-cli list
id 36e908c9a27eca18428c22d51e22a0a4d474eec6
reason: Xorg server crashed
time: Tue 23 Oct 2018 08:30:21 PM CEST
package: xorg-x11-server-Xorg-1.20.1-5.el7
uid: 0 (root)
count: 1
Directory: /var/spool/abrt/xorg-2018-10-23-20:30:21-5435-1
Run 'abrt-cli report /var/spool/abrt/xorg-2018-10-23-20:30:21-5435-1' for creating a case in Red Hat Customer Portal
id d9d9ebc4a2ab25bb81829b615183d56fd459dab5
reason: Xorg killed by SIGABRT
time: Tue 23 Oct 2018 05:06:40 PM CEST
cmdline: /usr/bin/X :0 -background none -noreset -audit 4 -verbose -auth /run/gdm/auth-for-gdm-6x260x/database -seat seat0 vt1
package: xorg-x11-server-Xorg-1.20.1-5.el7
uid: 0 (root)
count: 1
Directory: /var/spool/abrt/ccpp-2018-10-23-17:06:40-21651
Reported: http://faf.lab.eng.brq.redhat.com/faf/reports/bthash/9868bf85b19cf1f4bbf4ab1765f5a3e5573b27a1
http://faf.lab.eng.brq.redhat.com/faf/reports/9028/
Run 'abrt-cli report /var/spool/abrt/ccpp-2018-10-23-17:06:40-21651' for creating a case in Red Hat Customer Portal
The Autoreporting feature is disabled. Please consider enabling it by issuing
'abrt-auto-reporting enabled' as a user with root privileges

So I found a core file in one of the abrt folders, installed the debuginfo packages locally on a VM and ran gdb, so at least we have a backtrace with symbols.
(gdb) bt
#0 0x00007f387398a207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f387398b8f8 in __GI_abort () at abort.c:90
#2 0x00005585a9643bda in OsAbort () at utils.c:1350
#3 0x00005585a9649773 in AbortServer () at log.c:877
#4 0x00005585a964a5bd in FatalError (
f=f@entry=0x5585a967a930 "Caught signal %d (%s). Server aborting\n") at log.c:1015
#5 0x00005585a9640e49 in OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>)
at osinit.c:156
#6 <signal handler called>
#7 0x00007f38757a5cd0 in pci_device_next (iter=iter@entry=0x7ffcf02643d0) at common_iterator.c:182
#8 0x00007f38757a5d6b in pci_device_find_by_slot (domain=<optimized out>, bus=<optimized out>,
dev=<optimized out>, func=<optimized out>) at common_iterator.c:233
#9 0x00007f38757a7a46 in pci_device_vgaarb_init () at common_vgaarb.c:149
#10 0x00005585a9540c49 in xf86VGAarbiterInit () at xf86VGAarbiter.c:72
#11 0x00005585a951ba10 in xf86BusConfig () at xf86Bus.c:158
#12 0x00005585a9528eec in InitOutput (pScreenInfo=pScreenInfo@entry=0x5585a98f0a20 <screenInfo>,
argc=argc@entry=13, argv=argv@entry=0x7ffcf02646e8) at xf86Init.c:503
#13 0x00005585a94ec1b0 in dix_main (argc=13, argv=0x7ffcf02646e8, envp=<optimized out>) at main.c:193
#14 0x00007f38739763d5 in __libc_start_main (main=0x5585a94d64a0 <main>, argc=13, argv=0x7ffcf02646e8,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffcf02646d8)
at ../csu/libc-start.c:266
#15 0x00005585a94d64ce in _start ()
(gdb) p *iter
$3 = {next_index = 1, mode = match_slot, match = {slot = {domain = 0, bus = 0, dev = 2, func = 0,
match_data = 139880467972592}, id = {vendor_id = 0, device_id = 0, subvendor_id = 2,
subdevice_id = 0, device_class = 1973076464, device_class_mask = 32568,
match_data = 139880467972592}}}
(gdb) list
177 while ( iter->next_index < pci_sys->num_devices ) {
178 struct pci_device_private * const temp =
179 & pci_sys->devices[ iter->next_index ];
180
181 iter->next_index++;
182 if ( PCI_ID_COMPARE( iter->match.slot.domain, temp->base.domain )
183 && PCI_ID_COMPARE( iter->match.slot.bus, temp->base.bus )
184 && PCI_ID_COMPARE( iter->match.slot.dev, temp->base.dev )
185 && PCI_ID_COMPARE( iter->match.slot.func, temp->base.func ) ) {
186 d = temp;
(gdb) p *temp
Cannot access memory at address 0x0
(gdb) p temp
$4 = (struct pci_device_private * const) 0x0
that's weird...:
(gdb) p *pci_sys
$5 = {methods = 0x7f38759abd40 <linux_sysfs_methods>, num_devices = 25, devices = 0x0, mtrr_fd = 12, vgaarb_fd = 14, vga_count = 1, vga_target = 0x0, vga_default_dev = 0x0}
Oh wait!
In `populate_entries()` from linux_sysfs.c, in case of errors, we do clean the content but leave the number of entries unchanged:
int
populate_entries( struct pci_system * p )
{
    struct dirent ** devices = NULL;
    int n;
    int i;
    int err = 0;


    n = scandir( SYS_BUS_PCI, & devices, scan_sys_pci_filter, alphasort );
    if ( n > 0 ) {
        p->num_devices = n;
        p->devices = calloc( n, sizeof( struct pci_device_private ) );

...
        }
        else {
            err = ENOMEM;
        }
    }

    for (i = 0; i < n; i++)
        free(devices[i]);
    free(devices);

    if (err) {
        free(p->devices);
        p->devices = NULL;
    }

    return err;
}
Looks like this is exactly what happens here: we have p->devices = NULL but p->num_devices = 25!
I think we should just reset p->num_devices to 0 in case of errors, and the issue wouldn't occur...
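
To illustrate the idea, here is a minimal, self-contained sketch. It is not the actual libpciaccess code or the patch from the scratch build: the `demo_*` names, the simplified structures and the `fail_at` parameter are made up for this example. It only shows how keeping the count and the array in sync in the error path makes a later walk over the list safe.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the libpciaccess structures. */
struct demo_device {
    unsigned domain, bus, dev, func;
};

struct demo_system {
    size_t num_devices;
    struct demo_device *devices;
};

/*
 * Fill the table with n entries (n is assumed to be > 0).
 * fail_at simulates an error, such as a failed sysfs read, while
 * filling entry number fail_at; pass a negative value for no failure.
 */
int demo_populate(struct demo_system *p, size_t n, int fail_at)
{
    int err = 0;

    assert(p != NULL);
    p->num_devices = n;
    p->devices = calloc(n, sizeof(*p->devices));
    if (p->devices == NULL) {
        err = ENOMEM;
    } else {
        for (size_t i = 0; i < n && !err; i++) {
            if (fail_at >= 0 && i == (size_t)fail_at)
                err = EIO;
            else
                p->devices[i].dev = (unsigned)i;
        }
    }

    if (err) {
        free(p->devices);
        p->devices = NULL;
        p->num_devices = 0; /* proposed change: count matches the (now empty) list */
    }
    return err;
}

/*
 * Walk the table the way a slot lookup would. The loop never touches
 * p->devices when it is NULL, because num_devices is 0 in that case.
 */
const struct demo_device *demo_find_by_dev(const struct demo_system *p,
                                           unsigned dev)
{
    for (size_t i = 0; i < p->num_devices; i++) {
        if (p->devices[i].dev == dev)
            return &p->devices[i];
    }
    return NULL;
}
```

With the reset in place, calling `demo_populate(&s, 25, 3)` leaves `s.num_devices` at 0, so `demo_find_by_dev(&s, 2)` simply returns NULL instead of dereferencing a NULL array. Without that one line, the lookup would loop 25 times over a NULL pointer, which matches what the gdb session above shows.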
I've run a scratch build of libpciaccess with that patch added:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18906350
Could you update libpciaccess with that scratch build and try to reproduce again, to see if that solves the issue?
Thank you!

I have scheduled the Beaker job and I will let you know the results later today.
Jirka

Hi Olivier,
it has indeed fixed the problem. I have added the following to the Beaker XML:
<task name="/distribution/command" role="STANDALONE">
<params>
<param name="CMDS_TO_RUN" value="yum -y localinstall http://perf-desktop.brq.redhat.com/Kernel/2018-Oct-24_14h06m36s_libpciaccess-0.14-1.1test.el7.repo/libpciaccess-0.14-1.1test.el7.x86_64.rpm http://perf-desktop.brq.redhat.com/Kernel/2018-Oct-24_14h06m36s_libpciaccess-0.14-1.1test.el7.repo/libpciaccess-devel-0.14-1.1test.el7.x86_64.rpm"/>
</params>
</task>
and it has worked as expected, there was no crash.
For reference, here is the Beaker job:
https://beaker.cluster-qe.lab.eng.brq.redhat.com/bkr/jobs/89240
Thanks!
Jirka
Woohoo brilliant, thanks for testing!

Hi,
I will try an initial draft.
Cause: Bug in the code. It's known to get triggered on the Lenovo T580 laptop.
Consequence: Crash in xorg-x11-server-Xorg
Workaround: None, AFAIK
Result: ?
@Olivier - could you review it and add more details?
Thanks!
Jirka

I suspect this bug is triggered by Beaker starting multiple instances of Xorg simultaneously; I think it would be much harder to trigger in a real-life scenario where only one X server gets started at once.
Also, I'm not sure why it would show up on the T580 in particular; it could be a kernel bug or something that causes either the parsing of a sysfs entry or a sysfs_read to fail.
Either way, such a failure would cause a discrepancy in the list of PCI devices in libpciaccess: the number of entries is left unchanged while the actual entry data is reset to NULL when an error is detected while populating the device entries.
As a result, when iterating over the list of entries later on in libpciaccess, the code could try to access device data that was previously reset to NULL in the error handler, hence the NULL pointer dereference and the crash of the X server, which uses libpciaccess.
The fix consists of resetting the number of devices to 0 in case of error so that there is no discrepancy between the number of devices and the actual device list.

> Also, I'm not sure why it would show up on the T580 in particular, could be a
> kernel bug or something that causes either the parsing of a sysfs entry to
> fail or a sysfs_read to fail.
We run the same test on a number of other laptops (Lenovo: x240, w541, t470s, t450s, t460p; Dell Elitebook8470p) and the T580 is the only one where we have experienced this issue.
Hi Marie,
the current description looks good to me.
========================================================================
X.org X11 crashes on Lenovo T580
Due to a bug in the `libpciaccess` library, the X.org X11 server terminates unexpectedly on Lenovo T580 laptops.
========================================================================
Thanks
Jirka

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.