Bug 1795931 - remote-viewer: Reproducible crash over ssh X11 forwarding
Summary: remote-viewer: Reproducible crash over ssh X11 forwarding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: libX11
Version: 8.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: rc
: 8.0
Assignee: Adam Jackson
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-29 09:33 UTC by Frediano Ziglio
Modified: 2020-12-20 07:51 UTC
16 users

Fixed In Version: libX11-1.6.8-3.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1758384
Environment:
Last Closed: 2020-04-28 15:41:41 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Product Errata RHBA-2020:1633 (last updated 2020-04-28 15:41:59 UTC)

Description Frediano Ziglio 2020-01-29 09:33:05 UTC
+++ This bug was initially created as a clone of Bug #1758384 +++

Description of problem: remote-viewer crashes a bit more than once per day.  When it happens, it's always when I'm moving my host mouse cursor across the remote-viewer window border.



Version-Release number of selected component (if applicable):

virt-viewer 8.0.0, libxcb 1.13.1, qemu 4.1.0, spice 0.14.2, and spice-gtk 0.37.  no qemu guest agent.

When this started a few months ago, I was on virt-viewer 8.0.0, libxcb 1.13 (not .1), qemu 4.0.0, spice 0.14.0, and spice-gtk 0.37.

8.0.0 was the first version I installed on the headless machine, but I tried downgrading to 7.0 and was still able to replicate the crash.



How reproducible: 100% (within 1-60 seconds of trying to aggravate it by repeatedly going in and out of the remote-viewer window.)  Otherwise about once a day.



Steps to Reproduce:

1. ssh to other/headless machine sitting 2 feet away with X11 forwarding enabled
2. run remote-viewer to connect via spice to QEMU vm (remote-viewer spice+unix:///<socket file>)
3. wait for about a day, and be unlucky when going in or out of the remote-viewer window; or go in and out as quickly as I can for 1-60 seconds



Actual results: 

remote-viewer crashes showing:

[xcb] Unknown sequence number while processing queue
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
remote-viewer: xcb_io.c:263: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)

journalctl shows core was dumped, visible here: http://ix.io/1KpK



Expected results: remote-viewer to not crash



Additional info:

I am reporting this against virt-viewer because I haven't been able to replicate it with other GUI programs over ssh with X11 forwarding.

This happens whether the guest vm is running in tty without mouse, tty with gpm (text mode mouse), or with an x server.

After writing a script to repeatedly move the mouse up (xdotool mousemove_relative -- -2 -10) 30 times and down (xdotool mousemove_relative 2 10) 30 times, I verified it would still reproduce the crash.  Then, I tried letting it run for 10 minutes with another window being active but not even covering the remote-viewer window, but it failed to crash.

The guest vm isn't using the qemu guest agent.

I'm thinking it's probably related to 

The actual system I'm physically using and the headless one I'm ssh'ing into are 2 feet away.  They have a direct InfiniBand connection (no switches involved) which provides ultra low latency and high bandwidth.  That's usually what I connect through (using IP over Infiniband.)  But, I've replicated this connecting via ssh over gigabit ethernet.

I run most of my VM's on the headless system with remote-viewer over ssh with X11 trusted forwarding.  But, I created a VM on the system I'm physically using, and am unable to replicate the crash.

I temporarily installed an X server on the typically headless system and ran remote-viewer directly on it, but was unable to replicate the crash.  So, it definitely seems limited to being run over ssh with X11 trusted forwarding.  I'll mention the two systems are identical (hardware- and software-version-wise) except the headless machine only has its onboard video rather than a Radeon Vega 64.

To be clear, running remote-viewer on the system I'm physically using to a VM on it does not crash.

It's just the specific remote-viewer that crashes, not the VM itself, and not other remote-viewers running within the same ssh session.

I tried using xtrace (the one that sets up a proxy x server to trace client/server communications, not the one included with glibc) but even over X11 forwarding, it prevents being able to replicate the crash.  Guessing it introduces a timing change that prevents it.

--- Additional comment from  on 2019-10-04 04:42:47 UTC ---

I went back and let the xdotool-based mouse movement run for an extremely long time, and actually reproduced a crash under it.  Perhaps it just changes the timings enough to make it less likely.

The entire xtrace log is 172MB and is available upon request, but the last 1,000 lines (315K) is viewable here: http://ix.io/1XxD

If you search for "xcb" in the linked partial log, you'll come to line 926/1000, which is the original error I posted.

----------

Also, in case it helps, qemu command line I'm using is:

/usr/bin/qemu-system-x86_64 \
   -name crash,process=qemu:crash \
   -no-user-config \
   -nodefaults \
   -nographic \
   -uuid 4a30c830-80d5-4b88-a79f-e6ad4e44a7fe \
   -pidfile /tmp/vm_crash.pid \
   -machine q35,accel=kvm,vmport=off,dump-guest-core=off \
   -cpu SandyBridge-IBRS \
   -smp cpus=4,cores=2,threads=1,sockets=2 \
   -m 4G \
   -drive if=pflash,format=raw,readonly,file=/usr/share/ovmf/x64/OVMF_CODE.fd \
   -drive if=pflash,format=raw,readonly,file=/var/qemu/efivars/vm_crash.fd \
   -monitor telnet:localhost:8000,server,nowait,nodelay \
   -spice unix,addr=/tmp/spice.crash.sock,disable-ticketing \
   -device ioh3420,id=pcie.1,bus=pcie.0,slot=0 \
   -device qxl-vga,bus=pcie.0,addr=2,ram_size_mb=64,vram_size_mb=8,vgamem_mb=16,max_outputs=1 \
   -usbdevice tablet \
   -netdev bridge,id=network0,br=br0 \
   -device virtio-net-pci,netdev=network0,mac=94:f9:7b:a9:15:a4,bus=pcie.0,addr=3 \
   -device virtio-scsi-pci,id=scsi1 \
   -drive driver=raw,node-name=hd0,file=/dev/newLvm/vm_crash,if=none,discard=unmap,cache=none,aio=threads \
   -device scsi-hd,drive=hd0,bootindex=1

--- Additional comment from Cole Robinson on 2020-01-21 21:09:32 UTC ---

Found this bug by googling 'spice-gtk xcb_xlib_threads_sequence_lost'. Thanks for the detailed report!

FWIW we have had a lot of crashes in virt-manager too that may be related, see all the dupes here:

  https://bugzilla.redhat.com/show_bug.cgi?id=1756065

And here is another bug, from the virt-manager perspective, where the crash always happens on VM window interaction; I am actively working with the reporter to try to identify it:

  https://bugzilla.redhat.com/show_bug.cgi?id=1792576

FWIW virt-viewer doesn't have any native usage of threads AFAICT, so if the app is crashing like this then it is something lower level than virt-viewer; spice-gtk is my guess.

Can you still reproduce easily with your current setup?
Can you verify what distro and package versions you can still reproduce with?

It would help to get a gdb backtrace. If you are on Fedora, you can do:

  sudo dnf debuginfo-install virt-viewer
  gdb --eval-command=run --args virt-viewer [insert your args]

Reproduce the crash (virt-viewer will freeze), then go back to the gdb terminal and run

  thread apply all bt

And attach the whole result here

--- Additional comment from Cole Robinson on 2020-01-21 21:14:10 UTC ---

Note, I see now that you have the backtrace in journalctl, but the lack of debuginfo makes it not too useful, so if you can reproduce in gdb that will help. gdb could make it hard to reproduce, though, so maybe just installing the debuginfo packages and reproducing as normal will give a more useful backtrace in the logs.

--- Additional comment from Frediano Ziglio on 2020-01-25 08:57:54 UTC ---

log that was at http://ix.io/1XxD, just to have everything in one place

--- Additional comment from Frediano Ziglio on 2020-01-26 09:42:10 UTC ---

A gdb or stack trace may not be very interesting if the other (non-main) thread does its job and the main one (as almost all reports show) is the one reporting the unknown sequence.
From the partial xtrace attached here it seems that some focus in/out events were happening, and also that some images were being freed, which may suggest the mouse movements were close to the viewer border.

--- Additional comment from Frediano Ziglio on 2020-01-27 10:11:45 UTC ---

I finally found an easy way to reproduce.
After compiling xtrace from source (the Debian sources) and installing it, simply run

$ xtrace -o trace -n strace -f remote-viewer spice://localhost:5900

and move the mouse back and forth between the area above the remote-viewer window and the inside of it (the guest part). After repeating this for a while (maybe 20-30 times at most) the crash happens.
Just before the xcb error I can see another thread doing something. The thread is doing something with dbus, but I cannot understand how this affects xcb (it uses a different file descriptor). I just picked a random VM (currently Windows 7) on a Fedora 30 client/host. I can confirm this happens with both stock and master spice-gtk. I'm not saying this will be reliable for everyone; I'm probably just lucky that it happens so reliably on my machine (and I'm trying to understand it before a restart).
I tried compiling out the spice integration code in spice-gtk (the only part that uses dbus directly), but it didn't help.

--- Additional comment from Frediano Ziglio on 2020-01-28 23:23:55 UTC ---

It's a bug in libX11, specifically in poll_for_response (xcb_io.c). The code is:

static xcb_generic_reply_t *poll_for_response(Display *dpy)
{
        void *response;
        xcb_generic_error_t *error;
        PendingRequest *req;
        while(!(response = poll_for_event(dpy, False)) &&
              (req = dpy->xcb->pending_requests) &&
              !req->reply_waiter)
        {
                uint64_t request;

                if(!xcb_poll_for_reply64(dpy->xcb->connection, req->sequence,
                                         &response, &error)) {
                        /* xcb_poll_for_reply64 may have read events even if
                         * there is no reply. */
                        response = poll_for_event(dpy, True);
                        break;
                }

                request = X_DPY_GET_REQUEST(dpy);
                if(XLIB_SEQUENCE_COMPARE(req->sequence, >, request))
                {
                        throw_thread_fail_assert("Unknown sequence number "
                                                 "while awaiting reply",
                                                 xcb_xlib_threads_sequence_lost);
                }
                X_DPY_SET_LAST_REQUEST_READ(dpy, req->sequence);
                if(response)
                        break;
                dequeue_pending_request(dpy, req);
                if(error)
                        return (xcb_generic_reply_t *) error;
        }
        return response;
}

It is possible that when poll_for_event is called there are no events yet (still to arrive), but there are pending_requests (these are added when requests are sent without waiting for a reply from the server). Then, when xcb_poll_for_reply64 is called, both replies and events are read, so last_request_read is set to the request's sequence number, which is higher than the sequence of the event skipped in this function.
Afterwards, when poll_for_event is called again, the event is fetched and its sequence number is passed to the widen function, which adds 0x100000000 (on 64 bit); this causes XLIB_SEQUENCE_COMPARE(event_sequence, >, request) to trigger.
So this has nothing to do with threads.
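
To make the arithmetic concrete, here is a minimal standalone sketch (not libX11's actual xcb_io.c code; the widen logic and the example sequence numbers 300/256 are simplified stand-ins): once last_request_read has already been advanced past the sequence of an event that is still queued, widening that older 32-bit sequence pushes it a full 0x100000000 ahead, so it compares as greater than the current request and the assertion fires.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the 32->64-bit sequence widening done in xcb_io.c
 * (illustrative only, not the real libX11 code). */
static uint64_t widen(uint64_t last_read, uint32_t narrow)
{
        uint64_t wide = (last_read & ~(uint64_t)0xFFFFFFFF) | narrow;
        if (wide < last_read)       /* assumes the 32-bit counter wrapped */
                wide += (uint64_t)1 << 32;
        return wide;
}

int main(void)
{
        uint64_t request = 300;            /* newest request sent by the client        */
        uint64_t last_request_read = 300;  /* already advanced by the reply path       */
        uint32_t queued_event_seq = 256;   /* older event still waiting in the queue   */

        uint64_t widened = widen(last_request_read, queued_event_seq);
        printf("widened event sequence = %" PRIu64 "\n", widened);
        if (widened > request)
                printf("event appears to come from the future -> "
                       "xcb_xlib_threads_sequence_lost assertion\n");
        return 0;
}

Compiled and run, the widened value comes out at 4294967552, far above the real request count, which matches the "Unknown sequence number" abort seen in the logs above.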

Too late today to post a proper bug to libX11.

--- Additional comment from Frediano Ziglio on 2020-01-29 09:11:16 UTC ---



--- Additional comment from Frediano Ziglio on 2020-01-29 09:14:05 UTC ---

See https://gitlab.freedesktop.org/xorg/lib/libx11/merge_requests/34

Comment 5 errata-xmlrpc 2020-04-28 15:41:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1633

