Description of problem:
When the user navigates to the VM Console tab, it sometimes doesn't display any content and remains empty.

Version-Release number of selected component (if applicable):
HCO v2.1.0-47

How reproducible:
Not always, ~75%

Steps to Reproduce:
1. Create a VM (i.e. the example VM from yaml)
2. Navigate to the Consoles tab

Actual results:
Content is not displayed

Expected results:
Content should be displayed

Additional info:
As reported, the issue is not reproducible in 100% of cases, so the UI part seems to be working. I am attaching full logs from the relevant pods. The following can be found there:

*** kube-apiserver-working:

I0919 13:48:43.542140 1 log.go:172] http: response.WriteHeader on hijacked connection from github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.WithPanicRecovery.func1 (wrap.go:31)
I0919 13:48:43.542197 1 log.go:172] http: response.Write on hijacked connection from fmt.Fprintln (print.go:266)
E0919 13:48:43.542215 1 wrap.go:32] apiserver panic'd on GET /apis/subresources.kubevirt.io/v1alpha3/namespaces/default/virtualmachineinstances/vm-cirros/vnc
I0919 13:48:43.542341 1 log.go:172] http: panic serving 192.168.126.1:34598: killing connection/stream because serving request timed out and response had been started
goroutine 11497709 [running]:
net/http.(*conn).serve.func1(0xc042400780)
    /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/net/http/server.go:1769 +0x139
panic(0x539c7a0, 0xc000354150)
    /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc057687ba0, 0x1, 0x1)
    /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x539c7a0, 0xc000354150)
    /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0290921c0, 0xc02e54b5f0)

*** virt-api:

{"component":"virt-api","kind":"VirtualMachineInstance","level":"error","msg":"error encountered reading from client stream","name":"vm-cirros","namespace":"default","pos":"subresource.go:150","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:16.233749Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-api","kind":"VirtualMachineInstance","level":"error","msg":"Error in websocket proxy","name":"vm-cirros","namespace":"default","pos":"subresource.go:158","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:16.233919Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-api","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pos":"server.go:2979","timestamp":"2019-09-19T13:50:16.233989Z"}
{"component":"virt-api","kind":"VirtualMachineInstance","level":"error","msg":"error encountered reading from virt-handler stream","name":"vm-cirros","namespace":"default","pos":"subresource.go:144","reason":"read tcp 10.129.6.135:42428-\u003e10.131.6.115:8186: use of closed network connection","timestamp":"2019-09-19T13:50:16.234180Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}

*** virt-handler:

{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from client (virt-api) websocket","name":"vm-cirros","namespace":"default","pos":"console.go:190","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:17.518111Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"error","msg":"Error in proxing websocket and unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:199","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:17.518260Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pos":"server.go:2979","timestamp":"2019-09-19T13:50:17.518320Z"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc: use of closed network connection","timestamp":"2019-09-19T13:50:17.518399Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"info","msg":"Websocket connection upgraded","name":"vm-cirros","namespace":"default","pos":"console.go:168","timestamp":"2019-09-19T13:50:30.135983Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"info","msg":"Connecting to proc/64140/root/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc","name":"vm-cirros","namespace":"default","pos":"console.go:169","timestamp":"2019-09-19T13:50:30.136080Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"info","msg":"Connected to proc/64140/root/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc","name":"vm-cirros","namespace":"default","pos":"console.go:179","timestamp":"2019-09-19T13:50:30.136189Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from client (virt-api) websocket","name":"vm-cirros","namespace":"default","pos":"console.go:190","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:32.116680Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","kind":"","level":"error","msg":"Error in proxing websocket and unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:199","reason":"websocket: close 1006 (abnormal closure): unexpected EOF","timestamp":"2019-09-19T13:50:32.117011Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
{"component":"virt-handler","level":"info","msg":"http: response.WriteHeader on hijacked connection\n","pos":"server.go:2979","timestamp":"2019-09-19T13:50:32.117079Z"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"vm-cirros","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/eded8e48-dae3-11e9-984b-5254008ede1c/virt-vnc: use of closed network connection","timestamp":"2019-09-19T13:50:32.117155Z","uid":"eded8e48-dae3-11e9-984b-5254008ede1c"}
Created attachment 1616749 [details] logs.tar.bz2
Roman, please take a look
On it.
Got the same issue with a cirros VM. Excerpt from the virt-handler* pod:

{"component":"virt-handler","kind":"","level":"info","msg":"Connecting to proc/71065/root/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc","name":"yp-test","namespace":"default","pos":"console.go:169","timestamp":"2019-09-20T08:01:11.527941Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}
{"component":"virt-handler","kind":"","level":"info","msg":"Connected to proc/71065/root/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc","name":"yp-test","namespace":"default","pos":"console.go:179","timestamp":"2019-09-20T08:01:11.528047Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}
{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"yp-test","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc: use of closed network connection","timestamp":"2019-09-20T08:01:12.594031Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}
Can I see the UI network traffic logs? Is it then still possible to re-connect to VNC (via kubectl or by reloading the UI)?
Can you please also provide the versions of: openshift, kubevirt and the UI?
Provided the environment over a private channel.
Ok, here is the build info:
```
OpenShift Version 4.2.0-0.nightly-2019-09-17-232025
Kubernetes Version v1.14.6+e353665
```

a) My observations from the UI only:
1) Start a VMI and visit the `console` screen in the UI. Sometimes (as described) I see just a grey/black box and nothing else. It claims to be connected, and the logs confirm that a connection was established.
2) I can reload the page multiple times; nothing changes.
3) I focus the VNC screen with a mouse click and hit an arbitrary key. As a result the connection is immediately terminated by the UI.
4) I now reload the page again and the VNC connection works from then on.
5) The log entries in virt-handler are not pretty, but they only indicate that the connection was closed by the UI.

b) My observations when mixing the UI and `virtctl vnc`:
1) I make sure that I see the grey box again in the UI.
2) I run `virtctl vnc myvm`.
3) The UI shows that the connection was terminated (which it should, since we only allow one connection).
4) I see the VNC screen correctly in `virtctl vnc`.
5) I can run `virtctl vnc` multiple times, also in parallel. All reconnects work as expected and I always see the correct screen.
6) I reload the UI. I still see the grey box.
7) I repeat the switch between `virtctl` and the UI multiple times. `virtctl` always works, the UI does not.
8) I focus the box in the UI and hit a button. The connection is terminated.
9) I reload the UI and now see the screen there too.
10) Reloading and reconnecting from the UI and from virtctl now shows the screen everywhere.

Looks to me like there might be an issue with noVNC in the UI. Marek, could you have another look?
Regarding this error:
```
I0919 13:48:43.542341 1 log.go:172] http: panic serving 192.168.126.1:34598: killing connection/stream because serving request timed out and response had been started
goroutine 11497709 [running]:
net/http.(*conn).serve.func1(0xc042400780)
    /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/net/http/server.go:1769 +0x139
panic(0x539c7a0, 0xc000354150)
    /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc057687ba0, 0x1, 0x1)
    /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x539c7a0, 0xc000354150)
    /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0290921c0, 0xc02e54b5f0)
```
This indicates that no traffic was sent for the timeout period, which led to the termination of the connection. It looks like the VNC connection is not fully established through the web UI, even though the websocket connection is open. After one minute the connection is closed by the apiserver.
Roman, what do other servers do? Do they have a longer timeout? Could the UI be changed?
This is an apiserver timeout (the OpenShift apiserver). There exist commands which rightfully may not send anything for minutes (like e.g. "log"); they are explicitly whitelisted in the source code. I am not yet sure whether aggregated apiservers are in general also whitelisted (I don't think so, so this could bite us e.g. on the "console" endpoint). In this case, however, the timeout is not the issue: there is simply nothing sent back to the UI, even if we waited for an hour. There is still a small chance that there is a backend bug, but it looks like a UI issue.
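For reference, here is a minimal sketch (in Go, against the k8s.io/apiserver packages) of how that whitelist works. The exact verb/subresource sets and the way OpenShift wires them into the timeout filter are assumptions here, not taken from this cluster; the point is only that a "get" on a kubevirt subresource like "vnc" is not on the default list, while "log" is.

```
// Sketch only: how the apiserver's long-running whitelist decides whether the
// request timeout applies. The sets below are the commonly used defaults; whether
// OpenShift uses exactly these values is an assumption.
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"k8s.io/apimachinery/pkg/util/sets"
	apirequest "k8s.io/apiserver/pkg/endpoints/request"
	genericfilters "k8s.io/apiserver/pkg/server/filters"
)

func main() {
	// A request is considered long-running (exempt from the timeout) if its verb
	// or its subresource is whitelisted.
	isLongRunning := genericfilters.BasicLongRunningRequestCheck(
		sets.NewString("watch"),
		sets.NewString("proxy", "log", "exec", "attach", "portforward"),
	)

	req := &http.Request{URL: &url.URL{
		Path: "/apis/subresources.kubevirt.io/v1alpha3/namespaces/default/virtualmachineinstances/vm-cirros/vnc",
	}}

	// The kubevirt "vnc" subresource is not whitelisted -> normal timeout applies.
	vncInfo := &apirequest.RequestInfo{Verb: "get", Subresource: "vnc"}
	fmt.Println("vnc long-running:", isLongRunning(req, vncInfo)) // false

	// A pod "log" request is whitelisted and may stay silent for minutes.
	logInfo := &apirequest.RequestInfo{Verb: "get", Subresource: "log"}
	fmt.Println("log long-running:", isLongRunning(req, logInfo)) // true
}
```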
I was wondering whether there could be silent keepalives, or whether this is something to discuss with the OCP API team.
For VNC we don't need it, since it permanently sends the latest screen. For the serial console, yes, we have two options (a sketch of the first one follows below):
* send keep-alives as part of the sub-protocol
* create an issue to allow registering long-running requests on the aggregated apiserver registration
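To illustrate the first option, here is a minimal sketch assuming a gorilla/websocket-based handler. The handler name, route and intervals are illustrative and not kubevirt's actual subresource code; the idea is simply that the server pings an otherwise idle serial-console stream so that timeout filters on the path never see a silent connection.

```
// Sketch of a server-side keep-alive loop for an idle websocket stream.
package main

import (
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

const (
	pingInterval = 30 * time.Second // how often to ping an idle connection
	pongWait     = 70 * time.Second // drop the client if no pong arrives in time
	writeWait    = 10 * time.Second // deadline for writing a single control frame
)

var upgrader = websocket.Upgrader{}

func serveSerialConsole(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	// Every pong extends the read deadline, so a vanished client eventually
	// surfaces as a read error instead of a connection that hangs forever.
	conn.SetReadDeadline(time.Now().Add(pongWait))
	conn.SetPongHandler(func(string) error {
		return conn.SetReadDeadline(time.Now().Add(pongWait))
	})

	done := make(chan struct{})
	defer close(done)
	go func() {
		ticker := time.NewTicker(pingInterval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// A control frame keeps the stream "active" even when the
				// serial console has nothing to say for minutes.
				if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(writeWait)); err != nil {
					return
				}
			case <-done:
				return
			}
		}
	}()

	// Placeholder read loop; the real handler would proxy console traffic here.
	for {
		if _, _, err := conn.ReadMessage(); err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/console", serveSerialConsole)
	http.ListenAndServe(":8080", nil)
}
```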
I debugged on the "bridge" process (the openshift/console backend; it acts like a proxy to the k8s API to work around the same-origin policy).

** The flow in the UI:
- (1) The VNC console is connected (I can see the login prompt there). A page refresh or repeated switching of tabs does not lead to the "grey" screen (i.e. connect/disconnect succeed there repeatedly).
- (2) Inactivity for 1-5 minutes (varies between attempts) usually leads to the "grey" VNC screen (which looks like the guest's screensaver!).
- (3) Based on the browser's logs, the connection is closed (but not exactly - see the bridge description below).
- (4) A page refresh leads to the guest's screensaver screen. Based on the logs, the socket is open but the handshake (RFB) is not fully finished.
- (5) A key press (like "enter") leads to closing the socket (so the UI gets into the "disconnected" state).
- (6) Another page refresh successfully opens the connection; the login prompt is visible again.

** On the "bridge":
From the messages going back and forth, I can see the following for the scenario described above:
- (1) The connection to the API is established, messages are being interchanged (all kinds, nothing weird).
- (2) Keep-alive ping/pongs from the "bridge" to the browser are successfully sent/received every 30 seconds.
- (3) After a while, only the ping/pongs are being transmitted. Once the UI gets "grey" (screensaver?), other messages cease to be transmitted. If a key is pressed at this point, the screensaver is gone and bi-directional transmission continues intensively (as expected). After another round of inactivity, reading from the API's websocket ends with a 1006 (abnormal closure) error (code per RFC 6455, section 11.7) after 60 seconds of inactivity between the API and the bridge.
- (4) When the page is refreshed, both connections in the bridge are established again (to the API, to the browser). But except for a few initial messages and the repeating ping/pongs, no additional messages are transmitted. Based on observation in the UI, the guest VM is still screensaved at this point.
- (5) When a key is pressed in the UI, writing to the API socket ends with a 1005 (no status) error (code per RFC 6455).

It looks like the guest's screensaver is related to the issue. If the guest VM is screensaved, the socket is not operational - not closed, but muted (no data come through). Writing to such a socket leads to 1005.

** Summary:
I think sending ping packets from the API (meaning our kubevirt subresource) to the client (meaning the "bridge") will solve the issue at point (3). I am not sure about a solution for point (5) - recovery from the previous error. It needs further investigation why writing to a newly established websocket ends with 1005 (while the guest is screensaved; the previous connection ended with an error, most probably timed out). Just ideas - is it because of:
- improper re-establishing of the VNC connection in kubevirt's subresource.go after an error?
- improper "garbage collection" of connections/websockets in the API?
- a QEMU VNC issue?

A small sketch of how these two close codes can be distinguished on the proxy side follows below.
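A minimal sketch, assuming gorilla/websocket on the bridge/proxy side; the function and variable names are illustrative and this is not the actual openshift/console bridge code. It shows how the 1006 (abnormal closure) and 1005 (no status) errors seen above surface when pumping frames from the API connection to the browser connection.

```
// Sketch only: distinguishing the websocket close codes observed in this bug.
package bridge

import (
	"log"

	"github.com/gorilla/websocket"
)

// pumpAPIToBrowser copies frames from the upstream (API) websocket to the browser
// websocket until one side fails, and logs which failure mode occurred.
func pumpAPIToBrowser(apiConn, browserConn *websocket.Conn) {
	for {
		msgType, payload, err := apiConn.ReadMessage()
		if err != nil {
			switch {
			case websocket.IsCloseError(err, websocket.CloseAbnormalClosure): // 1006
				// The upstream dropped the idle connection without a close
				// frame, e.g. after the apiserver request timeout.
				log.Printf("upstream closed abnormally: %v", err)
			case websocket.IsCloseError(err, websocket.CloseNoStatusReceived): // 1005
				// The peer closed without sending a status code.
				log.Printf("peer closed without status: %v", err)
			default:
				log.Printf("read error: %v", err)
			}
			return
		}
		if err := browserConn.WriteMessage(msgType, payload); err != nil {
			log.Printf("write to browser failed: %v", err)
			return
		}
	}
}
```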
Marek, I can tell you that I don't see any errors in the backend or in virt-viewer when I use virtctl. I can also tell you that this does not only happen when the screensaver kicks in. On the setup from Radim this very often happens already on the initial connection: it is grey, whereas at the same time I can happily connect via virt-viewer.
When looking at the logs, it looks to me like qemu itself closes the connection. This line from virt-handler indicates that the connection was closed on the socket:

{"component":"virt-handler","kind":"","level":"error","msg":"error encountered reading from unix socket","name":"yp-test","namespace":"default","pos":"console.go:184","reason":"read unix @-\u003e/var/run/kubevirt-private/c0e75cff-db7c-11e9-80e9-52540049c4b8/virt-vnc: use of closed network connection","timestamp":"2019-09-20T08:01:12.594031Z","uid":"c0e75cff-db7c-11e9-80e9-52540049c4b8"}
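For context, here is a minimal sketch of the kind of websocket-to-unix-socket proxying that console.go performs; the package and function names, buffer size, and error handling are illustrative assumptions, not the actual kubevirt code. If qemu closes its end of the virt-vnc socket, the read in the first goroutine fails and the whole proxy is torn down, which is consistent with the "use of closed network connection" message above.

```
// Sketch only: proxying a client websocket to qemu's VNC unix socket.
package vncproxy

import (
	"net"

	"github.com/gorilla/websocket"
)

func proxyVNC(ws *websocket.Conn, socketPath string) error {
	unixConn, err := net.Dial("unix", socketPath) // e.g. /var/run/kubevirt-private/<uid>/virt-vnc
	if err != nil {
		return err
	}
	defer unixConn.Close()
	defer ws.Close()

	errCh := make(chan error, 2)

	// unix socket -> websocket: forward raw RFB bytes from qemu as binary frames.
	go func() {
		buf := make([]byte, 32*1024)
		for {
			n, err := unixConn.Read(buf)
			if err != nil {
				errCh <- err // e.g. qemu closed the VNC socket
				return
			}
			if err := ws.WriteMessage(websocket.BinaryMessage, buf[:n]); err != nil {
				errCh <- err
				return
			}
		}
	}()

	// websocket -> unix socket: client input (key presses, pointer events, ...).
	go func() {
		for {
			_, payload, err := ws.ReadMessage()
			if err != nil {
				errCh <- err
				return
			}
			if _, err := unixConn.Write(payload); err != nil {
				errCh <- err
				return
			}
		}
	}()

	// The first error from either direction ends the proxy; the deferred Close
	// calls unblock the other goroutine.
	return <-errCh
}
```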
Arik, could it be that we somehow now don't send all data over the sockets? Something small which virt-viewer can deal with, but novnc cannot?
(In reply to Roman Mohr from comment #19)
> Arik, could it be that we somehow now don't send all data over the sockets?
> Something small which virt-viewer can deal with, but novnc cannot?

Unlikely
Marek, can you try an older version of the UI which we know worked for sure, and also try it on OpenShift 3? That would provide more context.
Fabian, FYI regarding supporting long-running requests through the apiserver, here is the result of some Slack discussions so far: https://github.com/kubernetes/kubernetes/issues/83200#issuecomment-535670932.
Marek, any results so far?
With the mid-April version of the web UI, which contains the older noVNC 1.0.0 (currently 1.1.0), I got the grey screen as well (VM: https://github.com/kubevirt/kubevirt/blob/master/examples/vm-cirros.yaml). Just once during multiple attempts, but as the original BZ description says, the steps to reproduce are non-deterministic. OCP: 4.2.0-0.nightly-2019-09-27-191818
Thus, what is the conclusion?
Per the summary stated above, I would keep searching on the backend.
> Per the summary stated above, I would keep searching on the backend.

Makes sense. I will investigate further.
Roman, after re-reading the comments in this thread, I have a couple of thoughts which might or might not be helpful. Prior to comment #18 there was a running theory that the screensaver might have something to do with the dropped connection. What if the guest has literally terminated the signal to the display, e.g. entered power-save mode? During bootup, some guests also change their display resolution.
*** Bug 1751244 has been marked as a duplicate of this bug. ***
Finally, here is the fix: https://github.com/kubevirt/kubevirt/pull/2872 I will backport it to 2.2 once it is merged, since it is already acked.
Verified on:

$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.2.0   Container-native virtualization Operator   2.2.0     kubevirt-hyperconverged-operator.v2.1.0   Succeeded

Windows, Fedora WS and RHEL - browsed to the Console tab. The VNC console displays the screen. Navigated to other tabs / windows and back - the console is still displayed. Rebooted the VM - the VNC console is back once the VM has booted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0307