Created attachment 1525205
X server patch which fixes the problem — written and tested by people who aren't X server devs.
We recently upgraded to RHEL 7.6 (from 7.5) and have begun seeing X server crashes. This appears to be caused by a bug in the X Server, as shown in the URL for the external bug tracker. This is also being tracked in the CentOS bug tracker (see other link).
The problem affects the xorg-x11-server-Xorg-1.20.1-5.2 package. However, our locally-built version does not suffer from the same problem, thanks to the patch posted to the X Server GitLab issue by Sokov V.M. I think it is fair to claim that neither I nor Sokov V.M. understands all of the possible implications of this patch, *but it works*!
The problem occurs 100% of the time, but the procedure is very specific to our location. Other people have encountered the same problem (see the linked bug reports) using other procedures, but this is the only one that works 100% reliably for us.
Steps to Reproduce:
1. Log into RHEL 7.6 graphically — both GNOME and KDE are affected equally, so it doesn't matter which is chosen
2. Launch the X2Go remote desktop client (the latest version from EPEL)
3. Connect to a particular SuSE 12.3 server and launch IceWM
4. Run ncview: an application which uses a legacy X11 toolkit (X Athena Widgets or Xaw) rather than a more modern toolkit like GTK+ or Qt
5. Click a certain button in the ncview application
At this point, the entire X11 server crashes, dropping the user to a login screen. Interestingly, due to the session-resume capabilities of X2Go, the ncview application and IceWM desktop continue to run on the remote machine. Obviously, the expectation is that the application, remote DE, and local DE would all have remained alive and functional when the button was pressed.
Can you help me and the others affected by this to understand the problem and the apparent solution — i.e., the patch? And is this something that Red Hat can help to move upstream?
I'm having a bit of trouble using Bugzilla to add the link to the X Server issue on freedesktop.org's GitLab, so here it is:
I don't think I can take that patch as-is, I think it ends up being an ABI break. Can you try this build instead?
The above build is against 7.6, but the content (other than one of the patches) is currently queued for 7.7. If you would prefer to test just this change in isolation, extract 0001-composite-Fix-some-backing-store-bugs-1677719.patch from the linked SRPM and apply it to your own build.
(In reply to Adam Jackson from comment #5)
> I don't think I can take that patch as-is, I think it ends up being an ABI break. Can you try this build instead?
> The above build is against 7.6, but the content (other than one of the patches) is currently queued for 7.7. If you would prefer to test just this change in isolation, extract 0001-composite-Fix-some-backing-store-bugs-1677719.patch from the linked SRPM and apply it to your own build.
Any progress here?
Thank you for the suggestion and sorry for the delayed response. I have gotten a chance to apply Adam Jackson's patch to our RHEL 7.6 systems. I did this twice: once by rebuilding our existing xorg-x11-server-1.20.1-5.3.el7 (RHEL 7.6) packages with the 0001-composite-Fix-some-backing-store-bugs-1677719.patch, and once by replacing those with the supplied xorg-x11-server-1.20.4-5.el7_6 packages (containing RHEL 7.7 content).
Both tests produced the same results. The bad news is that my test case still crashes the X server, whereas Sokov V.M.'s ABI-breaking patch fixes this problem. The good news is that this crash occurs with a different stacktrace, which suggests that your fixes may have worked for one part of the problem. Here is the new stacktrace (with line #s derived from xorg-x11-server-debuginfo-1.20.4-5.el7_6.x86_64.rpm):
#0 0x00007fd1233592c7 in raise () at /lib64/libc.so.6
#1 0x00007fd12335a9b8 in abort () at /lib64/libc.so.6
#2 0x000055769ea7fe5a in OsAbort () at utils.c:1351
#3 0x000055769ea859f3 in AbortServer () at log.c:879
#4 0x000055769ea8683d in FatalError (f=f@entry=0x55769eab6c90 "Caught signal %d (%s). Server aborting\n") at log.c:1017
#5 0x000055769ea7d0c9 in OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>) at osinit.c:156
#6 0x00007fd1236ff5d0 in <signal handler called> () at /lib64/libpthread.so.0
#7 0x000055769ea6a7d3 in miComputeClips (pParent=pParent@entry=0x5576a21ca910, pScreen=pScreen@entry=0x5576a08a5650, universe=universe@entry=0x7ffe815af350, kind=kind@entry=VTUnmap, exposed=exposed@entry=0x7ffe815af370) at mivaltree.c:291
#8 0x000055769ea6b267 in miValidateTree (pParent=0x5576a1ed2880, pChild=<optimized out>, kind=VTUnmap) at mivaltree.c:687
#9 0x000055769e9502f2 in UnmapWindow (pWin=0x5576a21ca910, fromConfigure=fromConfigure@entry=0) at window.c:2881
#10 0x000055769e91e9a4 in ProcUnmapWindow (client=<optimized out>) at dispatch.c:879
#11 0x000055769e92444b in Dispatch () at dispatch.c:478
#12 0x000055769e92849a in dix_main (argc=18, argv=0x7ffe815af5d8, envp=<optimized out>) at main.c:276
#13 0x00007fd123345495 in __libc_start_main () at /lib64/libc.so.6
#14 0x000055769e91258e in _start ()
mivaltree.c:291 looks like this:
dx = pParent->drawable.x - pParent->valdata->before.oldAbsCorner.x;
Of particular interest is the fact that this code accesses an instance of the _Validate union (pParent->valdata->before). I believe that I correctly attributed the root cause of the original issue to inconsistent handling of the _Validate union: code was accessing the ->after "side" of the union despite data having been stored in the ->before "side." Indeed, Sokov V.M.'s approach is an extreme solution to this problem. Could this be another example of this problem?
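To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the general pattern. It is not the X server's code: the type and function names (TaggedValidate, ValPhase, parent_dx, mark_after) and the simplified Point and Box structs are hypothetical stand-ins I made up for illustration. The idea is only that a union with a "before" side and an "after" side is safe to read when something records which side was last written, and that a read of the other side should be refused rather than trusted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real X server definitions. */
typedef struct { short x, y; } Point;
typedef struct { short x1, y1, x2, y2; } Box;

/* Tag recording which member of the union was written last. */
typedef enum { VAL_BEFORE, VAL_AFTER } ValPhase;

typedef struct {
    ValPhase phase;
    union {
        struct { Point oldAbsCorner; bool resized; } before;
        struct { Box exposed; Box borderExposed; } after;
    } u;
} TaggedValidate;

/*
 * Compute the parent's x offset, analogous in spirit to the expression at
 * mivaltree.c:291. Returns false (leaving *dx_out untouched) when there is
 * no validation data or when the "before" side is not the live member.
 */
bool parent_dx(const TaggedValidate *v, short parent_x, short *dx_out)
{
    if (v == NULL || v->phase != VAL_BEFORE)
        return false;
    *dx_out = (short)(parent_x - v->u.before.oldAbsCorner.x);
    return true;
}

/* Switch the record to its "after" side; "before" data is now invalid. */
void mark_after(TaggedValidate *v, Box exposed)
{
    assert(v != NULL);
    v->phase = VAL_AFTER;
    v->u.after.exposed = exposed;
    v->u.after.borderExposed = (Box){0, 0, 0, 0};
}
```

In this sketch, a caller that asks for the "before" offset after mark_after() has run gets a clean false instead of garbage or a crash. Adding a tag field like this to a public structure would itself change its layout, so I am offering it only as an illustration of the bug class, not as a proposed fix.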
I am certainly willing to test additional builds of this. I will try to be quicker to reply in the future.
I've uploaded a test build for 7.7 here:
Please test and report any problems.
Your link shows me an empty page at people.redhat.com.
How can we access the test build?
I'm also struggling with vinagre, connecting via RDP.
I think it's based on freerdp-libs.
It might be related to this bug here.
(In reply to Fredy Paquet from comment #12)
> Hello Adam,
> Your link shows me an empty page at people.redhat.com
> How can we access the test build?
Apologies, that was a typo on my part: