Bug 965749 - X's smart scheduler is crashy on ppc64
Summary: X's smart scheduler is crashy on ppc64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-server
Version: 19
Hardware: ppc64
OS: Unspecified
Target Milestone: ---
Assignee: X/OpenGL Maintenance List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker AcceptedFreezeException
Depends On:
Blocks: F19-accepted, F19FinalFreezeException
Reported: 2013-05-21 16:18 UTC by Adam Jackson
Modified: 2013-06-06 02:22 UTC

Fixed In Version: xorg-x11-server-1.14.1-3.fc19
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-06-06 02:22:00 UTC
Type: Bug
Embargoed:


Attachments
core dump (30.31 MB, application/octet-stream)
2013-05-23 13:44 UTC, Adam Jackson

Description Adam Jackson 2013-05-21 16:18:23 UTC
The X server has two algorithms for scheduling among clients.  The dumb scheduler is little more than a round-robin.  The smart scheduler arms a timer with setitimer() when entering a new client, and uses the SIGALRM it generates to estimate whether a particular client is being too greedy with X's CPU time.
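
As a rough illustration of the mechanism described above, the following minimal sketch arms ITIMER_REAL and lets the SIGALRM handler advance a clock that the main loop consults to decide when a client has used up its slice. This is not the X server's code: sched_time_ms, on_alarm, set_timer_ms and run_greedy_client are hypothetical names, and the 20 ms interval is an assumption taken from the strace later in this report.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define INTERVAL_MS 20                      /* assumed tick length */

/* Hypothetical scheduler clock; only the signal handler advances it. */
static volatile sig_atomic_t sched_time_ms;

static void on_alarm(int sig)
{
    (void)sig;
    sched_time_ms += INTERVAL_MS;
}

/* Arm a periodic real-time timer (ms > 0) or disarm it (ms == 0). */
static int set_timer_ms(int ms)
{
    struct itimerval tv;

    memset(&tv, 0, sizeof tv);
    tv.it_interval.tv_usec = ms * 1000;
    tv.it_value.tv_usec = ms * 1000;
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/*
 * Spin as a greedy client would until slice_ms of timer time has been
 * charged to it. Returns the time charged in ms, or -1 on setup failure.
 */
static int run_greedy_client(int slice_ms)
{
    struct sigaction sa;
    int start;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0 || set_timer_ms(INTERVAL_MS) != 0)
        return -1;

    start = sched_time_ms;
    while (sched_time_ms - start < slice_ms) {
        /* a real server would dispatch one request from the client here */
    }
    set_timer_ms(0);                        /* disarm before returning */
    return sched_time_ms - start;
}
```

Calling run_greedy_client(40) returns once at least two ticks have been delivered. The point of the sketch is only that the handler runs asynchronously, on top of whatever the main loop was doing when the signal arrived.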

For some reason (which I'm going to assume is related to glibc's implementation of signal delivery) the smart scheduler is entirely too crashy to use on ppc64.  Running 'x11perf -shmput{10,100,500}' against 'X -ac -noreset :0' will reliably crash; adding -dumbSched to the X command line makes it not crash.  The backtrace visible in gdb for this crash does not correspond to any real call sequence in X:

(gdb) bt
#0  0x000000001005661c in .GetExtensionEntry ()
#1  0x000000001003e6c4 in 00000046.plt_call.pixman_transform_init_translate ()
#2  0x0000000010027ecc in 00000046.plt_call.pixman_transform_init_translate ()
#3  0x0000008014b8432c in generic_start_main (
    main=@0x102a0568: 0x100279a0 <00000046.plt_call.pixman_transform_init_translate+12348>, argc=<optimized out>, ubp_av=0x3fffe5d06848, 
    auxvec=0x3fffe5d06920, init=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=<optimized out>, fini=<optimized out>) at ../csu/libc-start.c:258
#4  0x0000008014b84554 in __libc_start_main (argc=<optimized out>, 
    ubp_av=<optimized out>, ubp_ev=<optimized out>, auxvec=<optimized out>, 
    rtld_fini=<optimized out>, stinfo=<optimized out>, 
    stack_on_entry=<optimized out>)
    at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:91
#5  0x0000000000000000 in ?? ()

pixman_transform_init_translate() does not call any function other than memset, so this is clearly bogus.

Comment 1 Jeff Law 2013-05-21 17:36:10 UTC
I've learned to never trust the backtrace when there's PLT call stubs :-)

It'd probably help everyone considerably if you attached a core file and a list of the RPMs so that we can investigate more thoroughly.

Comment 2 Carlos O'Donell 2013-05-21 19:54:47 UTC
(In reply to Jeff Law from comment #1)
> I've learned to never trust the backtrace when there's PLT call stubs :-)
> 
> It'd probably help everyone considerably if you attached a core file and a
> list of the RPMs so that we can investigate more thoroughly.

An strace of the server would also be very useful.

The SIGALRM handler in xserver/os/utils.c is pretty simple:
~~~
static void
SmartScheduleTimer(int sig)
{
    SmartScheduleTime += SmartScheduleInterval;
}
~~~
It should compile down to an ld/std sequence in which each access is atomic (the type is a 64-bit long).

The setitimer and getitimer routines in glibc are wrappers around the kernel so glibc doesn't implement anything really, just a shim to the syscall.
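
One hedged way to take the glibc shim out of the picture while bisecting is to issue the system call directly and compare behaviour. The sketch below assumes Linux, that SYS_setitimer is defined in <sys/syscall.h> for the target, and that libc's struct itimerval has the same layout as the kernel's (true on 64-bit Linux). raw_setitimer and raw_roundtrip_usec are hypothetical helpers, not part of glibc or the X server.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

/* Call the kernel's setitimer directly, bypassing the glibc wrapper. */
static long raw_setitimer(int which, const struct itimerval *new_value,
                          struct itimerval *old_value)
{
    return syscall(SYS_setitimer, which, new_value, old_value);
}

/*
 * Arm ITIMER_REAL through the raw syscall, read it back through glibc's
 * getitimer(), then disarm. Returns the interval read back in
 * microseconds, or -1 on failure. Pick usec large enough that the timer
 * cannot fire before it is disarmed.
 */
static long raw_roundtrip_usec(long usec)
{
    struct itimerval set = { { 0, usec }, { 0, usec } };
    struct itimerval off = { { 0, 0 }, { 0, 0 } };
    struct itimerval got;

    if (raw_setitimer(ITIMER_REAL, &set, NULL) != 0)
        return -1;
    if (getitimer(ITIMER_REAL, &got) != 0)
        return -1;
    if (raw_setitimer(ITIMER_REAL, &off, NULL) != 0)
        return -1;
    return (long)got.it_interval.tv_usec;
}
```

If the raw call and the wrapper behave identically, as expected for a thin shim, the wrapper is an unlikely culprit.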

Until someone identifies a more concrete issue, this could be a glibc wrapper, kernel itimer, or gcc miscompilation issue.

Comment 3 Adam Jackson 2013-05-23 13:44:05 UTC
Created attachment 752235
core dump

The relevant rpms are here:

http://ppc.koji.fedoraproject.org/packages/xorg-x11-server/1.14.1/2.fc19/ppc64/

The strace of the server right before the crash isn't especially interesting:

---
read(12, 0x3fffa8f40010, 4096)          = -1 EAGAIN (Resource temporarily unavailable)
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
ioctl(10, 0xc020645d, 0x3ffff09df370)   = 0
munmap(0x3fffa8dc0000, 1536000)         = 0
ioctl(10, 0x80086409, 0x3ffff09df248)   = 0
ioctl(10, 0xc020645e, 0x3ffff09df3a0)   = 0
mmap(NULL, 1536000, PROT_READ|PROT_WRITE, MAP_SHARED, 10, 0x100920000) = 0x3fffa8dc0000
ioctl(10, 0xc0086464, 0x3ffff09df318)   = 0
ioctl(10, 0xc020645d, 0x3ffff09def60)   = 0
ioctl(10, 0xc020645e, 0x3ffff09def90)   = 0
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_SHARED, 10, 0x1007a0000) = 0x3fffa9330000
ioctl(10, 0xc0086464, 0x3ffff09def08)   = 0
ioctl(10, VIDIOC_INT_RESET, 0x1000cf753d8) = 0
ioctl(10, 0xc008646a, 0x3ffff09df728)   = 0
ioctl(10, 0xc008646a, 0x3ffff09df728)   = 0
ioctl(10, 0xc008646a, 0x3ffff09df728)   = 0
select(256, [1 3 4 5 7 10 11 12], NULL, NULL, NULL) = 1 (in [12])
setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0
read(12, "b\0\0\4\0\7\0\1MIT-SHM\0", 4096) = 16
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x100000001} ---
+++ killed by SIGSEGV (core dumped) +++
---

As an English translation: we run out of requests to read from our one client, disarm the timer, run the BlockHandler callback chain to prepare to block in select(), wake up on the same client, arm the timer, read the request, and die.
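
The sequence in the paragraph above can be sketched as follows. This is a simplified model under stated assumptions, not the server's dispatch loop: a pipe stands in for the client socket, the BlockHandler chain is omitted, and ignore_tick, set_timer_usec, wait_and_read and demo_one_request are hypothetical names.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

static void ignore_tick(int sig) { (void)sig; }

/* Arm a periodic timer (usec > 0) or disarm it (usec == 0). */
static void set_timer_usec(long usec)
{
    struct itimerval tv;

    memset(&tv, 0, sizeof tv);
    tv.it_interval.tv_usec = usec;
    tv.it_value.tv_usec = usec;
    setitimer(ITIMER_REAL, &tv, NULL);
}

/* Disarm, block until fd is readable, arm, then read one request. */
static ssize_t wait_and_read(int fd, void *buf, size_t len)
{
    fd_set rfds;

    set_timer_usec(0);                  /* no work pending: stop the clock */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
        return -1;
    set_timer_usec(20000);              /* client is ready: start timing it */
    return read(fd, buf, len);
}

/* Hypothetical driver. Returns the number of bytes read, or -1. */
static ssize_t demo_one_request(void)
{
    struct sigaction sa;
    int fds[2];
    char buf[64];
    ssize_t n = -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ignore_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;           /* restart read() if a tick lands in it */
    if (sigaction(SIGALRM, &sa, NULL) != 0 || pipe(fds) != 0)
        return -1;

    if (write(fds[1], "MIT-SHM", 7) == 7)
        n = wait_and_read(fds[0], buf, sizeof buf);
    set_timer_usec(0);                  /* leave no timer armed behind us */
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

In this model the window between arming the timer and finishing the request is the only place a SIGALRM can land, which matches where the strace shows the crash.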

gdb seems to cope slightly better with the core file than with a live crash:

---
Core was generated by `X -core -ac -terminate -retro :0 '.
Program terminated with signal 11, Segmentation fault.
#0  FindExtension (extname=0x3fffa8f40018 "MIT-SHM", len=7) at extension.c:173
173             if ((strlen(extensions[i]->name) == len) &&
(gdb) bt
#0  FindExtension (extname=0x3fffa8f40018 "MIT-SHM", len=7) at extension.c:173
#1  0x00000000100568d8 in ProcQueryExtension (client=0x1000d0ab830)
    at extension.c:265
#2  0x000000001003e75c in Dispatch () at dispatch.c:432
#3  0x0000000010027ecc in main (argc=<optimized out>, argv=0x3ffff09e03f8, 
    envp=<optimized out>) at main.c:298
(gdb) p i
$1 = 0
(gdb) p extensions[0]
Cannot access memory at address 0x100000001
---

Which certainly looks like memory or register corruption.  But given that this _never_ happens with -dumbSched, I'm not inclined to blame X proper for scribbling on that variable.
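
To make the failure mode concrete, here is a simplified model of the lookup at extension.c:173. It is a sketch, not the server's source: ExtensionEntry is cut down to a single field and find_extension is a hypothetical stand-in for FindExtension. Since gdb reports i == 0 and fails to read extensions[0] at 0x100000001, the same address as si_addr in the strace, the base pointer of the table itself appears to hold the bad value.

```c
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for the server's extension record. */
typedef struct {
    const char *name;
} ExtensionEntry;

/* Hypothetical table of registered extensions and its length. */
static ExtensionEntry **extensions;
static int NumExtensions;

/* Return the index of the extension named extname, or -1 if absent. */
static int find_extension(const char *extname, size_t len)
{
    int i;

    for (i = 0; i < NumExtensions; i++) {
        /*
         * extensions[i] is loaded through the base pointer, so a
         * corrupted base faults on the very first iteration (i == 0).
         */
        if (strlen(extensions[i]->name) == len &&
            strncmp(extname, extensions[i]->name, len) == 0)
            return i;
    }
    return -1;
}
```

Nothing in this loop writes to the table, which is consistent with the view that the corruption comes from elsewhere.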

Comment 4 Carlos O'Donell 2013-05-23 19:51:44 UTC
(In reply to Adam Jackson from comment #3)
> Which certainly looks like memory or register corruption.  But given that
> this _never_ happens with -dumbSched, I'm not inclined to blame X proper for
> scribbling on that variable.

Adam,

I don't have any further comments.

This doesn't look like a glibc issue.

I'm setting component to xorg-x11-server since that's the component with the fault. I'll stay on the CC to comment as required.

My first and best approach to problems like this is to compile single object files with -O0 or -O1 and see if the error goes away, trying to determine if there is a miscompilation leading to the corruption.
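
A finer-grained variant of the same bisection, swapped in here for illustration, is to disable optimization for one suspect function at a time instead of a whole object file. The sketch assumes GCC's optimize function attribute (or clang's optnone); NO_OPT and name_matches are hypothetical names, and GCC documents this attribute as suitable for debugging only.

```c
#include <stddef.h>
#include <string.h>

/* Disable optimization for a single function while bisecting. */
#if defined(__clang__)
#define NO_OPT __attribute__((optnone))
#elif defined(__GNUC__)
#define NO_OPT __attribute__((optimize("O0")))
#else
#define NO_OPT
#endif

/* Hypothetical suspect: returns 1 if a is exactly the len-byte name b. */
NO_OPT static int name_matches(const char *a, const char *b, size_t len)
{
    return strlen(a) == len && strncmp(a, b, len) == 0;
}
```

If the crash disappears when one particular function is marked this way, that function's generated code is the place to look.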

You could put a SystemTap probe on FindExtension and print out the extensions array when it triggers; that way you can see what the state of the array was just before each call.

Cheers,
Carlos.

Comment 5 Dave Airlie 2013-05-24 05:05:21 UTC
Is valgrind any help?

Comment 6 Dennis Gilmore 2013-05-31 15:15:16 UTC
As another data point, this also seems to be true on ARM.

#0  0x48d2cae4 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00068aa8 in WaitForSomething (pClientsReady=0x1cb000,
    pClientsReady@entry=0x1d236c <dispatchException>) at WaitFor.c:221
#2  0x00038cb4 in Dispatch () at dispatch.c:361
#3  0x00028280 in main (argc=9, argv=0x28280 <main+1084>, envp=<optimized out>)
    at main.c:298

The above is a backtrace I got from the running X. I've tested running X with -dumbSched and so far there has been no crash.

Comment 7 Fedora Update System 2013-06-04 18:11:37 UTC
xorg-x11-server-1.14.1-3.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/xorg-x11-server-1.14.1-3.fc19

Comment 8 Adam Williamson 2013-06-04 18:23:39 UTC
Secondary arches do not block releases. You can nominate this as an FE issue, though.

Comment 9 Fedora Update System 2013-06-05 02:30:54 UTC
Package xorg-x11-server-1.14.1-3.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing xorg-x11-server-1.14.1-3.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-10008/xorg-x11-server-1.14.1-3.fc19
then log in and leave karma (feedback).

Comment 10 Adam Williamson 2013-06-05 17:33:59 UTC
Discussed at 2013-06-05 blocker review meeting: http://meetbot.fedoraproject.org/fedora-blocker-review/2013-06-05/f19final-blocker-review-3.2013-06-05-16.05.log.txt . Rejected as a blocker but accepted as a freeze exception issue: by policy, issues affecting only secondary arches that would be blockers for primary arches can usually be considered FE issues, but are not blockers.

Comment 11 Fedora Update System 2013-06-06 02:22:00 UTC
xorg-x11-server-1.14.1-3.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

