Bug 1968520 - Watch of Service resource with WebSocket does not timeout appropriately
Summary: Watch of Service resource with WebSocket does not timeout appropriately
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Maciej Szulik
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1986003
Blocks:
 
Reported: 2021-06-07 13:37 UTC by Steven Hawkins
Modified: 2021-10-27 08:23 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-27 08:22:59 UTC
Target Upstream Version:
Embargoed:




Links
System                              ID               Last Updated
Github openshift kubernetes pull    888              2021-08-19 11:37:54 UTC
Red Hat Product Errata              RHBA-2021:3931   2021-10-27 08:23:18 UTC

Description Steven Hawkins 2021-06-07 13:37:08 UTC
This is in reference to https://github.com/kubernetes/kubernetes/issues/102464, where a WebSocket watch of Service resources simply stops receiving events but is not closed.

Since the WebSocket remains open and responsive to ping/pong after the timeout, the client cannot reliably detect that things are in a bad state.

See the upstream issue for a reproducer.
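
For context, the timeoutSeconds parameter on a watch request asks the apiserver to close the stream once the timeout expires. The minimal sketch below is not part of the original report; it uses client-go's regular HTTP watch rather than a WebSocket, and the kubeconfig path and 5-second timeout are only illustrative. It shows the expected contract: the result channel is closed when the server-side timeout fires. The bug is that the WebSocket transport does not honor this for watches of Service resources.

package main

import (
	"context"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Ask the apiserver to end the watch after 5 seconds.
	timeout := int64(5)
	watcher, err := clientset.CoreV1().Services(metav1.NamespaceAll).Watch(
		context.TODO(), metav1.ListOptions{TimeoutSeconds: &timeout})
	if err != nil {
		log.Fatal(err)
	}

	// With a correctly behaving apiserver the result channel is closed when the
	// server-side timeout expires, so this loop terminates on its own.
	for event := range watcher.ResultChan() {
		log.Printf("event: %s", event.Type)
	}
	log.Print("watch closed by the server")
}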

Comment 1 Stefan Schimanski 2021-06-07 14:16:52 UTC
This needs a 1.21.x rebase (the fix landed upstream 5 days ago).

Comment 2 Maciej Szulik 2021-06-08 11:53:27 UTC
I'm waiting for 1.21.2 and will update it as soon as it's out.

Comment 3 Maciej Szulik 2021-06-21 11:14:05 UTC
https://github.com/kubernetes/kubernetes/pull/102541 hasn't merged yet; we'll probably get it in 1.21.3.

Comment 4 Michal Fojtik 2021-07-21 11:23:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 8 Maciej Szulik 2021-08-19 11:37:54 UTC
This will be fixed in https://github.com/openshift/kubernetes/pull/888

Comment 9 Maciej Szulik 2021-10-15 10:59:39 UTC
The previous PR landed, moving to MODIFIED.

Comment 11 Ke Wang 2021-10-19 11:19:24 UTC
The fix PR https://github.com/openshift/kubernetes/pull/888 is on 4.8, so moving the Target Release to 4.8.z.

Comment 12 Maciej Szulik 2021-10-19 16:56:14 UTC
Wrong PR number; we got it in https://github.com/openshift/kubernetes/pull/935, which is for 4.7.

Comment 13 Ke Wang 2021-10-20 04:56:18 UTC
Verification steps:

Created a WebSocket test client program, testws-cm.go, under <$GOPATH>/src/github.com/openshift/kubernetes/test:

$ cat testws-cm.go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"path/filepath"

	"golang.org/x/net/websocket"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	var home = homedir.HomeDir()
	var kubeconfig = filepath.Join(home, ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err.Error())
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}

	var url = clientset.RESTClient().Get().URL()

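	// Watch configmaps over a WebSocket with a 5 second server-side timeout;
	// the apiserver is expected to close the connection once the timeout expires.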
	apiURL := "wss://" + url.Hostname() + ":" + url.Port() + "/api/v1/configmaps?watch=true&timeoutSeconds=5"
	wsc, err := websocket.NewConfig(apiURL, apiURL)
	if err != nil {
		log.Fatal(err)
	}
	wsc.TlsConfig, err = rest.TLSConfigFor(config)
	if err != nil {
		log.Fatal(err)
	}
	wsc.Header, err = headersForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	wsConn, err := websocket.DialConfig(wsc)
	if err != nil {
		log.Fatal(err)
	}
	defer wsConn.Close()

	buf := &bytes.Buffer{}
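	// Read watch events until the server closes the connection; Receive returns
	// io.EOF once the apiserver terminates the watch at the timeout.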
	for {
		var msg []byte
		if err := websocket.Message.Receive(wsConn, &msg); err != nil {
			if err == io.EOF {
				break
			}
			log.Fatalf("Failed to read completely from websocket %v", err)
		}
		if len(msg) == 0 {
			continue
		}
		log.Printf("Read %v %v", len(msg), string(msg))
		buf.Write(msg)
	}

	// The loop exits when the apiserver closes the connection; report how much
	// event data was received before the watch ended.
	log.Printf("Watch closed, read %d bytes of event data in total", buf.Len())
}

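// extractRT is a RoundTripper that never sends the request; it only captures the
// headers (e.g. the bearer token) that client-go would add, so headersForConfig
// can reuse them for the raw WebSocket dial.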
type extractRT struct {
	http.Header
}

func (rt *extractRT) RoundTrip(req *http.Request) (*http.Response, error) {
	rt.Header = req.Header
	return nil, nil
}

func headersForConfig(c *rest.Config) (http.Header, error) {
	extract := &extractRT{}
	rt, err := rest.HTTPWrappersForConfig(c, extract)
	if err != nil {
		return nil, err
	}
	if _, err := rt.RoundTrip(&http.Request{}); err != nil {
		return nil, err
	}
	return extract.Header, nil
}

Change the line ‘apiURL := "wss://" + url.Hostname() + ":" + url.Port() + "/api/v1/configmaps?watch=true&timeoutSeconds=5"’ to watch the services resource instead, and save the result as testws-s.go (see the snippet below).
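
For reference, the modified line in testws-s.go would look like the following (same url variable as above; /api/v1/services is the cluster-wide services endpoint):

	// Watch Service objects instead of ConfigMaps, with the same 5 second server-side timeout.
	apiURL := "wss://" + url.Hostname() + ":" + url.Port() + "/api/v1/services?watch=true&timeoutSeconds=5"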

Build them with the go command line:
go build testws-cm.go
go build testws-s.go

Point ~/.kube/config at the kubeconfig of your installed OCP cluster, then run the WebSocket client programs against the cluster.

Try on a 4.7 nightly payload:
$ oc get clusterversion --kubeconfig=/home/kewang/.kube/config
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-10-18-191324   True        False         31m     Cluster version is 4.7.0-0.nightly-2021-10-18-191324

$ oc get no --kubeconfig=/home/kewang/.kube/config
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-l07w5mt-f76d1-6zzz7-master-0         Ready    master   51m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-master-1         Ready    master   51m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-master-2         Ready    master   51m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-worker-a-z82nq   Ready    worker   41m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-worker-b-2h9sj   Ready    worker   41m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-worker-c-qhn69   Ready    worker   41m   v1.20.0+bbbc079

$ time ./testws-cm
...

real	0m8.704s
user	0m0.375s
sys	0m0.406s

$ time ./testws-s
...

real	0m5.716s
user	0m0.027s
sys	0m0.028s

After the timeout, the WebSocket watch of the Service resource terminates as expected, so moving the bug to VERIFIED.

Comment 16 errata-xmlrpc 2021-10-27 08:22:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.36 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3931

