Bug 1753606 - VM VNC Console doesn't display content
Summary: VM VNC Console doesn't display content
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 2.2.0
Assignee: Roman Mohr
QA Contact: Radim Hrazdil
URL:
Whiteboard:
: 1751244 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-09-19 12:08 UTC by Radim Hrazdil
Modified: 2020-01-30 16:27 UTC (History)
14 users (show)

Fixed In Version: hyperconverged-cluster-operator-container-v2.2.0-3 virt-operator-container-v2.2.0-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-30 16:27:13 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
logs.tar.bz2 (1.53 MB, application/x-bzip)
2019-09-19 14:13 UTC, Marek Libra
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2020:0307 0 None None None 2020-01-30 16:27:23 UTC

Description Radim Hrazdil 2019-09-19 12:08:46 UTC
Description of problem:
When user navigates to VM Console tab, it sometimes doesn't display the content and remains empty. 


Version-Release number of selected component (if applicable):
HCO v2.1.0-47

How reproducible:
Not always, 75%

Steps to Reproduce:
1. Create a VM (i.e. example VM from yaml)
2. Navigate to Consoles tab
3.

Actual results:
Content not displayed

Expected results:
Should be displayed

Additional info:

Comment 1 Marek Libra 2019-09-19 14:11:42 UTC
As reported, the issue is not reproducible in 100% cases, so the UI part seems to be working.

I am attaching full logs from relevant pods.

Following can be found there:

*** kube-apiserver-working:
I0919 13:48:43.542140       1 log.go:172] http: response.WriteHeader on hijacked connection from github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.WithPanicRecovery.func1 (wrap.go:31)
I0919 13:48:43.542197       1 log.go:172] http: response.Write on hijacked connection from fmt.Fprintln (print.go:266)
E0919 13:48:43.542215       1 wrap.go:32] apiserver panic'd on GET /apis/subresources.kubevirt.io/v1alpha3/namespaces/default/virtualmachineinstances/vm-cirros/vnc
I0919 13:48:43.542341       1 log.go:172] http: panic serving 192.168.126.1:34598: killing connection/stream because serving request timed out and response had been started
goroutine 11497709 [running]:
net/http.(*conn).serve.func1(0xc042400780)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/net/http/server.go:1769 +0x139
panic(0x539c7a0, 0xc000354150)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc057687ba0, 0x1, 0x1)
        /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x539c7a0, 0xc000354150)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0290921c0, 0xc02e54b5f0)



*** virt-api:
{"component":"virt-api","kind":"VirtualMachineInstance","level":"error","msg":"error encountered reading from client stream","name":"vm-cirros","namespace":"default","pos":"subresource.go:150","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:16.233749Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-api","kind":"VirtualMachineInstance","level":"error","msg":"Error in websocket proxy","name":"vm-cirros","namespace":"default","pos":"subresource.go:158","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:16.233919Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-api","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pos":"server.go:2979","timestamp":"2019-09-19T13:50:16.233989Z"}
{"component":"virt-api","kind":"VirtualMachineInstance","level":"error","msg":"error encountered reading from virt-handler stream","name":"vm-cirros","namespace":"default","pos":"subresource.go:144","reason":"read tcp 10.129.6.135:42428-\u003e10.131.6.115:8186: use of closed network connection","timestamp":"2019-09-19T13:50:16.234180Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}


*** virt-handler
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from client (virt-api) websocket","name":"vm-cirros","namespace":"default","pos":"console.go:190","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:17.518111Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"error","msg":"Error in proxing websocket and unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:199","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:17.518260Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pos":"server.go:2979","timestamp":"2019-09-19T13:50:17.518320Z"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc: use of closed network connection","timestamp":"2019-09-19T13:50:17.518399Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"info","msg":"Websocket connection upgraded","name":"vm-cirros","namespace":"default","pos":"console.go:168","timestamp":"2019-09-19T13:50:30.135983Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"info","msg":"Connecting to proc/64140/root/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc","name":"vm-cirros","namespace":"default","pos":"console.go:169","timestamp":"2019-09-19T13:50:30.136080Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"info","msg":"Connected to proc/64140/root/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc","name":"vm-cirros","namespace":"default","pos":"console.go:179","timestamp":"2019-09-19T13:50:30.136189Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from client (virt-api) websocket","name":"vm-cirros","namespace":"default","pos":"console.go:190","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:32.116680Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"error","msg":"Error in proxing websocket and unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:199","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:32.117011Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pos":"server.go:2979","timestamp":"2019-09-19T13:50:32.117079Z"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc: use of closed network connection","timestamp":"2019-09-19T13:50:32.117155Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}

Comment 2 Marek Libra 2019-09-19 14:13:45 UTC
Created attachment 1616749 [details]
logs.tar.bz2

Comment 3 Fabian Deutsch 2019-09-20 07:03:31 UTC
Roman, please take a look

Comment 4 Roman Mohr 2019-09-20 07:29:07 UTC
On it.

Comment 5 Yurii Prokulevych 2019-09-20 11:10:01 UTC
Got same issue with cirros vm. Excerpt from virt-handler* pod

{"component":"virt-handler","kind":"","level":"info","msg":"Connecting to proc/71065/root/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc","name":"yp-test","namespace":"default","pos":"console.go:169","timestamp":"2019-09-20T08:01:11.527941Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}
{"component":"virt-handler","kind":"","level":"info","msg":"Connected to proc/71065/root/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc","name":"yp-test","namespace":"default","pos":"console.go:179","timestamp":"2019-09-20T08:01:11.528047Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"yp-test","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc: use of closed network connection","timestamp":"2019-09-20T08:01:12.594031Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}

Comment 7 Roman Mohr 2019-09-23 11:23:28 UTC
Can I see the UI network traffic logs? Is it then still possible to re-connect to VNC (via kubectl or the reloading the UI)?

Comment 8 Roman Mohr 2019-09-23 11:48:12 UTC
Can you pleas also provide the versions of: openshift, kubevirt and the UI?

Comment 9 Radim Hrazdil 2019-09-23 12:30:21 UTC
provided the environment over private channel

Comment 10 Roman Mohr 2019-09-23 12:39:41 UTC
Ok, here the build info:

```
OpenShift Version
4.2.0-0.nightly-2019-09-17-232025
Kubernetes Version
v1.14.6+e353665

```

a) So here my observations from the UI only:

 1) Starting a VMI, and visit the `console` screen in the UI. I see sometimes (as described) just a grey/black box and nothing else. It claims to be connected. The logs confirm that a connection was established.
 2) I can reload the page multiple times, nothing changes.
 3) I focus with a mouse-click on the VNC screen and hit an arbitrary key. As a result the connection is immediately terminated by the UI.
 4) I now reload the page again and the VNC conection works from now on.
 5) The log entries in virt-handler are not nice enough but they only indicate that the connection was closed by the UI

b) Here my observations when mixing UI and `virtctl vnc`:

 1) I ensure that I see the grey box again in the UI.
 2) I run `virtctl vnc myvm`.
 3) The UI shows that the connection was terminated (which it should, since we only allow one connection).
 4) I see the VNC screen correctly in `virtctl vnc`.
 5) I can run `virtctl vnc` multiple times, also in parallel. All reconnects work as expected and I always see the correct screen.
 6) I reload the UI. I still see the grey box.
 7) I repeat the `virtctl` and UI switch multiple times. Always `virtctl` works, the UI does not
 8) I focus in the UI on the box and hit a button. The connection is terminated
 9) I reload the UI and now see there the screen too.
 10) Reloading and reconnecting from the UI and virtctl now shows everywhere the screen.

Looks to me like there might be an issue with novnc in the UI.

Marek could you have another look?

Comment 11 Roman Mohr 2019-09-23 12:43:33 UTC
Regarding to this error:


```
I0919 13:48:43.542341       1 log.go:172] http: panic serving 192.168.126.1:34598: killing connection/stream because serving request timed out and response had been started
goroutine 11497709 [running]:
net/http.(*conn).serve.func1(0xc042400780)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/net/http/server.go:1769 +0x139
panic(0x539c7a0, 0xc000354150)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc057687ba0, 0x1, 0x1)
        /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x539c7a0, 0xc000354150)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0290921c0, 0xc02e54b5f0)
```

this indicates that no traffic was sent for a timeout period, which lead to the termination of the connection. It looks like the VNC connection is not fully established through the webui, although the websocket connection is open. After one minute the connection is closed by the apiserver.

Comment 12 Fabian Deutsch 2019-09-23 13:53:37 UTC
Roman, what do other servers do? Do they have a longer timeout?
Could the UI change?

Comment 13 Roman Mohr 2019-09-23 14:27:07 UTC
This is an apiserver timeout (openshift apiserver) there exist commands which rightfully may not send something for minutes (like e.g. "log"). They are extra whitelisted in the sourcecode. I am not yet sure if aggregated apiservers are in general also whitelisted (don't think so, so this could bite us e.g. on the "console" endpoint). In this case the timeout is not the issue. There is  simply nothing sent back to the UI, even if we would wait for one hour. There is still the rest of a chance that there is a backend bug, but it seems like some UI issues.

Comment 14 Fabian Deutsch 2019-09-23 14:33:43 UTC
I was thinking if there could be silent keepalives or if it's something to discuss with the OCP API Team.

Comment 15 Roman Mohr 2019-09-23 14:43:05 UTC
For VNC we don't need it, since it permanently sends the latest screen. Yes for serial console we have two options:
 * send keep-alives as part of the sub-protocol
 * create an issue regarding to allow registering long-running requests on the aggregated apiserver registration

Comment 16 Marek Libra 2019-09-24 10:42:46 UTC
I debugged on the "bridge" process (which is the openshift/console backend, acts like a proxy to k8s API to workaround the same-origin-policy).

** The flow in the UI:
- (1) VNC console is connected (I can see login prompt there)
- page refresh or repeated changing of tabs does not lead to the "grey" screen (means connect/disconnect succeed there repeatedly)
- (2) in-activity for 1-5 minutes (varies during attempts) usually leads to the "grey" VNC screen (which looks like guest's screensaver!)
- (3) based on browser's logs, connection is closed (but not exactly - see bellow the bridge description)
- (4) page refresh leads to guest's screensaver screen. Based on logs, socket is open but the handshake (RFB) is not fully finished.
- (5) a key-press (like "enter") leads to closing the socket (so the UI gets into "disconnected" state)
- (6) another page refresh successfully opens the connection, the login prompt is visible again


** On the "bridge":
From the messages going back and forth, I can see following for the scenario described above:
- (1) connection to API is established, messages are being interchanged (all kinds, nothing weird)
- (2) keep-alive ping/pongs from the "bridge" to browser are being successfully sent/received every 30 secs 
- (3) after a while, just the ping/pongs are being transmitted. Once the UI gets "grey" (screensaver?), other messages seize to be transmitted.
      If a key is pressed at this point, the screensaver is gone and bi-directional transmission continues intensively (as expected).
      After another round of in-activity, the reading from API's websocket ends up with 1006 (abnormal closure) error (code per RFC 6455, section 11.7) after 60seconds of in-activity between API and the bridge.
- (4) when the page is refreshed, both connections in the bridge are established again (to API, to browser). But except a few initial messages and repeating ping/pongs, no additional messages are being transmitted.
      Based on observation in the UI, the guest VM is still screensaved now.
- (5) When a key is pressed in the UI, writing to API socket ends up with 1005 (no status) error (code per RFC 6455)

It looks like the guest's screensaver is related to the issue.
If the guest VM is screensaved, the socket is not operational - not closed, but muted (no data are coming from). Writing to such a socket leads to 1005.


** Summary:
I think sending of ping packets from API (means our kubevirt's subresource) to client (means the "bridge") will solve the issue at point (3).

I am not sure about solution for the point (5) - receovery from previous error. 
It needs further investigation why writing to a newly established WebSocket ends up with 1005 (while the guest is screensaved, previous connection ended up with error, most probably time-outed).

Just ideas: Is it because of
  - improper re-establishing VNC connection in kubevirt's subresource.go after an error?
  - improper "garbage collection" for connections/web-sockets in the API?
  - QEMU-VNC issue?

Comment 17 Roman Mohr 2019-09-24 11:04:24 UTC
Marek I can tell you that I don't see any errors neither in the backend, nor in virt-viewer, when I use virtctl. Also I can tell you that this does not only happen when the screensaver occurs. On the setup from Radim this very often happens on the initial connection already, that it is grey, whereas at the same time I can happily connect via virt-viewer.

Comment 18 Roman Mohr 2019-09-24 11:08:53 UTC
When looking at the logs, it looks to me like qemu itself closes the connection. This line from virt-handler indicates the the connectio was closed on the socket: 

{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"yp-test","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc: use of closed network connection","timestamp":"2019-09-20T08:01:12.594031Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}

Comment 19 Roman Mohr 2019-09-24 12:13:15 UTC
Arik, could it be that we somehow now don't send all data over the sockets? Something small which virt-viewer can deal with, but novnc not?

Comment 20 Arik 2019-09-25 08:25:36 UTC
(In reply to Roman Mohr from comment #19)
> Arik, could it be that we somehow now don't send all data over the sockets?
> Something small which virt-viewer can deal with, but novnc not?

Unlikely

Comment 21 Roman Mohr 2019-09-25 08:32:53 UTC
Marek can you try an older version of the UI, which we know worked for sure, and also try it on openshift 3? That would provide more context.

Comment 25 Roman Mohr 2019-09-27 09:55:20 UTC
Fabian FYI regarding to supporting long-running requests through the apiserver, here the result after some slack discussions so far: https://github.com/kubernetes/kubernetes/issues/83200#issuecomment-535670932.

Comment 26 Roman Mohr 2019-09-27 09:55:35 UTC
Marek, any results so far?

Comment 29 Marek Libra 2019-09-30 13:48:05 UTC
Having mid-april version of the web-ui containing the older novnc 1.0.0 (recently 1.1.0), I got the grey screen as well (VM: https://github.com/kubevirt/kubevirt/blob/master/examples/vm-cirros.yaml).

Just once during multiple attempts, but as the original BZ description says, the steps to reproduce are non-deterministic.

OCP: 4.2.0-0.nightly-2019-09-27-191818

Comment 30 Fabian Deutsch 2019-09-30 14:04:41 UTC
Thus, what is the conclusion?

Comment 31 Marek Libra 2019-09-30 14:15:09 UTC
Per summary stated above, I would keep searching on the backend.

Comment 32 Roman Mohr 2019-09-30 14:22:29 UTC
> Per summary stated above, I would keep searching on the backend.

Makes sense. I will investigate further.

Comment 33 sgott 2019-09-30 20:23:25 UTC
Roman, after re-reading the comments in this thread, I have a couple of thoughts, which might or might not be helpful.

Prior to Comment #18, there was a running theory that the screensaver might have something to do with the dropped connection. What if the guest has literally terminated signal to the display? e.g. power save mode?

During bootup, some guests change their display resolution.

Comment 35 Tomas Jelinek 2019-10-02 10:49:46 UTC
*** Bug 1751244 has been marked as a duplicate of this bug. ***

Comment 36 Roman Mohr 2019-11-06 12:53:32 UTC
Finally here the fix: https://github.com/kubevirt/kubevirt/pull/2872

Will backport it to 2.2 once it is merged, since it is already acked.

Comment 38 Ruth Netser 2019-12-19 09:00:15 UTC
Verified on 
$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.2.0   Container-native virtualization Operator   2.2.0     kubevirt-hyperconverged-operator.v2.1.0   Succeeded



Windows, Fedora WS and RHEL - browse to Console tab. VNC console displays the screen.
Navigated to other tabs / windows and back - console is still displayed.
Rebooted the VM - VNC console is back once the VM is booted.

Comment 40 errata-xmlrpc 2020-01-30 16:27:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0307


Note You need to log in before you can comment on or make changes to this bug.