Description of problem:
An X application crashes after many hours of drawing.
The customer has an application that makes many XDrawString and XDrawLine calls.
After several hours the application exits with an XIOError.
ANALYSIS AND RESEARCH
The XIOError is raised in libX11, in the function _XReply in the file xcb_io.c.
_XReply does not get a response from xcb_wait_for_reply.
libxcb 1.5 is fine, libxcb 1.8.1 is not.
Bisecting libxcb points to this commit:
Author: Jamey Sharp <firstname.lastname@example.org>
Date: Sat Oct 9 17:13:45 2010 -0700
xcb_in: Use 64-bit sequence numbers internally everywhere.
Widen sequence numbers on entry to those public APIs that still take
32-bit sequence numbers.
Signed-off-by: Jamey Sharp <email@example.com>
Reverting it on top of 1.8.1 helps.
After adding traces to libxcb, the customer found that the last request numbers
passed to xcb_wait_for_reply were 4294900463 and 4294965487 (two calls in the
while loop of the _XReply function) and, half a second later, 63215 (after
which XIOError is called).
The widen_request value is also 63215; I would have expected 63215+2^32.
It therefore seems that the request is not widened correctly.
The commit above also changed the compares in poll_for_reply from
XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE.
Maybe the widening never worked correctly, but this went unnoticed because
only the lower 32 bits were compared.
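To illustrate the suspected failure mode, here is a minimal C sketch. It is not the libxcb source: the names widen_seq, after_32, after_64 and replay_trace are hypothetical, and the sketch assumes that widening takes the upper 32 bits from the last 64-bit request number sent. It replays the numbers from the trace above, once with a 64-bit counter that has passed 2^32 and once with a counter that has itself wrapped at 32 bits.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (an assumption, not the libxcb implementation):
 * widen a 32-bit sequence number to 64 bits by borrowing the upper half
 * of the last 64-bit request number that was sent. */
uint64_t widen_seq(uint64_t last_sent, uint32_t seq)
{
    uint64_t widened = (last_sent & UINT64_C(0xffffffff00000000)) | seq;
    /* A request cannot be newer than the last one sent, so step back
     * one 32-bit epoch if the guess overshoots. */
    if (widened > last_sent)
        widened -= UINT64_C(1) << 32;
    return widened;
}

/* Wrap-aware compare on the low 32 bits only: is a newer than b? */
int after_32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Compare on the full 64-bit value: is a newer than b? */
int after_64(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) > 0;
}

/* Replays the request numbers reported in the trace. */
void replay_trace(void)
{
    const uint64_t prev = UINT64_C(4294965487); /* last request before the wrap */
    const uint32_t next = 63215;                /* first request after the wrap */

    /* Counter kept in 64 bits: the upper half is already 1. */
    uint64_t expected = widen_seq((UINT64_C(1) << 32) + next, next);
    /* Counter that wrapped at 32 bits: the upper half is still 0. */
    uint64_t observed = widen_seq(next, next);

    assert(expected == (UINT64_C(1) << 32) + 63215);
    assert(observed == 63215);

    printf("expected=%" PRIu64 " newer than prev (64-bit): %d\n",
           expected, after_64(expected, prev));
    printf("observed=%" PRIu64 " newer than prev (64-bit): %d, (32-bit): %d\n",
           observed, after_64(observed, prev),
           after_32((uint32_t)observed, (uint32_t)prev));
}
```

Calling replay_trace() from a main function shows the difference: with a wrapped counter the widened value stays at 63215, which a 32-bit compare still treats as newer than 4294965487 but a 64-bit compare treats as older. That would be consistent with the problem going unnoticed while only the lower 32 bits were compared.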
The bug is also opened on freedesktop.org:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Download the testcase from https://bugs.freedesktop.org/attachment.cgi?id=88996
2. Compile the testcase
3. Run it
ERROR Received a X IO error on display=8073008.
backtrace() returned 10 addresses
- bug: .......... https://bugs.freedesktop.org/show_bug.cgi?id=71338
- testcase: ..... https://bugs.freedesktop.org/attachment.cgi?id=88996
- proposed patch: https://bugs.freedesktop.org/attachment.cgi?id=89001
- discussion: http://lists.x.org/archives/xorg-devel/2013-October/038370.html.
Created attachment 841403
Patch of fixing libX11 uint_64
According to the xorg-devel discussion, the attached patch should fix the issue. Could you merge it into the latest RHEL?
Could you give me a response on this issue? If the patch does not fix it, do you have any advice?
Defect still present in RHEL6u6 and RHEL7u0 (libxcb 1.9-5). Also occurs on apps compiled as 32 bits on 64 bit systems. Can see the failure in less than 5 minutes of run time, under 'best case' conditions.
(In reply to dave.kinsell from comment #11)
> Defect still present in RHEL6u6 and RHEL7u0 (libxcb 1.9-5). Also occurs on
> apps compiled as 32 bits on 64 bit systems. Can see the failure in less
> than 5 minutes of run time, under 'best case' conditions.
A patch to address this issue is currently under review upstream.
Note that X IO errors can have multiple causes, including bugs in the program itself. Reaching the failure in less than 5 minutes sounds surprising: it means you reach the 32-bit sequence number limit in less than 5 minutes. To give you an idea, it takes me roughly 5 hours to reach that limit in a VM using the reproducer program, which draws a line continuously.
Thank you Olivier, so nice to hear this may get a patch from upstream.
The 5 minutes to failure is done with rapid XNoOp() calls, as discussed in https://bugs.freedesktop.org/show_bug.cgi?id=71338
I used my own counter to make sure it was making 2^32 calls before failing. With a realistic program that we support, it can fail in about 28 hours.
I wanted to clarify this happens with any 32 bit executable, not just on 32 bit systems, because the number of people affected is much larger.
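The run times quoted in this thread can be turned into average request rates. The C sketch below assumes that the failure occurs after exactly 2^32 requests issued at a constant rate; wrap_rate and print_rates are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: average requests per second needed to issue
 * 2^32 requests within the given number of seconds. */
double wrap_rate(double seconds)
{
    assert(seconds > 0.0);
    return 4294967296.0 / seconds; /* 2^32 requests */
}

/* Prints the rates implied by the three run times mentioned in the thread. */
void print_rates(void)
{
    printf("5 minutes: %.0f requests/s\n", wrap_rate(5.0 * 60.0));
    printf("5 hours:   %.0f requests/s\n", wrap_rate(5.0 * 3600.0));
    printf("28 hours:  %.0f requests/s\n", wrap_rate(28.0 * 3600.0));
}
```

Under these assumptions, failing in 5 minutes needs roughly 14.3 million requests per second, 5 hours about 239 thousand per second, and 28 hours about 43 thousand per second.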
(In reply to dave.kinsell from comment #14)
> The 5 minutes to failure is done with rapid XNoOp() calls, as discussed in
> I used my own counter to make sure it was making 2^32 calls before failing.
> With a realistic program that we support, it can fail in about 28 hours.
OK, thanks for clarifying.
> I wanted to clarify this happens with any 32 bit executable, not just on 32
> bit systems, because the number of people affected is much larger.
Yes, correct: 32-bit apps link against 32-bit libs and are therefore equally affected, even on a 64-bit system. This is why I cloned these bugs for el7 as well.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.