Bug 1968520

Summary: Watch of Service resource with WebSocket does not timeout appropriately
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.7
Target Release: 4.7.z
Type: Bug
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: Steven Hawkins <shawkins>
Assignee: Maciej Szulik <maszulik>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, mfojtik, sttts, wking, xxia
Last Closed: 2021-10-27 08:22:59 UTC
Bug Depends On: 1986003

Description Steven Hawkins 2021-06-07 13:37:08 UTC
This is in reference to https://github.com/kubernetes/kubernetes/issues/102464, where a WebSocket watch of Service resources simply stops receiving events but is not closed.

Since the WebSocket remains open and responsive to ping/pong after the timeout, the client cannot reliably detect that the watch has gone stale.

See the upstream issue for a reproducer.
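
To illustrate the detection problem, here is a minimal client-side sketch, assuming golang.org/x/net/websocket (the same library used in the verification program later in this bug). It refreshes a read deadline slightly past the requested timeoutSeconds so that a stream which stays open but stops delivering events surfaces as an error instead of blocking forever. The package and function names are illustrative only, not part of any fix.

package watchguard

import (
	"io"
	"log"
	"time"

	"golang.org/x/net/websocket"
)

// ReadWatch reads watch events from an already-dialed WebSocket connection.
// It refreshes a read deadline before each read so that a stream which stays
// open but goes silent is reported instead of hanging forever.
func ReadWatch(wsConn *websocket.Conn, timeoutSeconds int) error {
	grace := time.Duration(timeoutSeconds+10) * time.Second
	for {
		// websocket.Conn implements net.Conn, so SetReadDeadline is available.
		if err := wsConn.SetReadDeadline(time.Now().Add(grace)); err != nil {
			return err
		}
		var msg []byte
		if err := websocket.Message.Receive(wsConn, &msg); err != nil {
			if err == io.EOF {
				log.Print("watch closed cleanly by the server")
				return nil
			}
			// A timeout here means the stream stayed open but went silent.
			return err
		}
		log.Printf("event (%d bytes): %s", len(msg), string(msg))
	}
}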

Comment 1 Stefan Schimanski 2021-06-07 14:16:52 UTC
This needs a 1.21.x rebase (the fix landed upstream 5 days ago).

Comment 2 Maciej Szulik 2021-06-08 11:53:27 UTC
I'm waiting for 1.21.2 and will update it as soon as it's out.

Comment 3 Maciej Szulik 2021-06-21 11:14:05 UTC
https://github.com/kubernetes/kubernetes/pull/102541 hasn't merged yet; we'll probably get it in 1.21.3.

Comment 4 Michal Fojtik 2021-07-21 11:23:06 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.

Comment 8 Maciej Szulik 2021-08-19 11:37:54 UTC
This will be fixed in https://github.com/openshift/kubernetes/pull/888

Comment 9 Maciej Szulik 2021-10-15 10:59:39 UTC
The PR above has landed, moving to MODIFIED.

Comment 11 Ke Wang 2021-10-19 11:19:24 UTC
The fix PR https://github.com/openshift/kubernetes/pull/888 is against 4.8, so moving the Target Release -> 4.8.z.

Comment 12 Maciej Szulik 2021-10-19 16:56:14 UTC
That was the wrong PR number; the fix landed in https://github.com/openshift/kubernetes/pull/935, which targets 4.7.

Comment 13 Ke Wang 2021-10-20 04:56:18 UTC
Verification steps:

Created the WebSocket test client program testws-cm.go under $GOPATH/src/github.com/openshift/kubernetes/test:

$ cat testws-cm.go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"path/filepath"

	"golang.org/x/net/websocket"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load connection settings from the default kubeconfig.
	var home = homedir.HomeDir()
	var kubeconfig = filepath.Join(home, ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err.Error())
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}

	// Derive the API server host and port from the client configuration.
	var url = clientset.RESTClient().Get().URL()

	// Watch configmaps over WebSocket with a 5-second server-side timeout.
	apiURL := "wss://" + url.Hostname() + ":" + url.Port() + "/api/v1/configmaps?watch=true&timeoutSeconds=5"
	wsc, err := websocket.NewConfig(apiURL, apiURL)
	if err != nil {
		log.Fatal(err)
	}
	// Reuse the kubeconfig's TLS settings and auth headers for the WebSocket dial.
	wsc.TlsConfig, err = rest.TLSConfigFor(config)
	if err != nil {
		log.Fatal(err)
	}
	wsc.Header, err = headersForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Open the WebSocket watch connection.
	wsConn, err := websocket.DialConfig(wsc)
	if err != nil {
		log.Fatal(err)
	}
	defer wsConn.Close()

	// Read watch events until the server closes the connection. With the fix
	// in place, the server terminates the stream shortly after timeoutSeconds.
	buf := &bytes.Buffer{}
	for {
		var msg []byte
		if err := websocket.Message.Receive(wsConn, &msg); err != nil {
			if err == io.EOF {
				break
			}
			log.Fatalf("Failed to read from websocket: %v", err)
		}
		if len(msg) == 0 {
			continue
		}
		log.Printf("Read %v %v", len(msg), string(msg))
		buf.Write(msg)
	}

	// The watch stream ended; report how much event data was received.
	log.Printf("Watch closed by the server after reading %d bytes", buf.Len())
}

// extractRT is a RoundTripper that never performs a request; it only captures
// the headers that client-go's auth wrappers would have sent.
type extractRT struct {
	http.Header
}

func (rt *extractRT) RoundTrip(req *http.Request) (*http.Response, error) {
	rt.Header = req.Header
	return nil, nil
}

// headersForConfig runs a dummy request through client-go's HTTP wrappers to
// extract the authentication headers for the given rest.Config.
func headersForConfig(c *rest.Config) (http.Header, error) {
	extract := &extractRT{}
	rt, err := rest.HTTPWrappersForConfig(c, extract)
	if err != nil {
		return nil, err
	}
	if _, err := rt.RoundTrip(&http.Request{}); err != nil {
		return nil, err
	}
	return extract.Header, nil
}

Change the line 'apiURL := "wss://" + url.Hostname() + ":" + url.Port() + "/api/v1/configmaps?watch=true&timeoutSeconds=5"' so that it watches the services resource instead of configmaps, and save the result as testws-s.go; the modified line is shown below.
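
Only the resource in the URL path changes (configmaps -> services):

apiURL := "wss://" + url.Hostname() + ":" + url.Port() + "/api/v1/services?watch=true&timeoutSeconds=5"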

Build both programs with the go command line:
go build testws-cm.go
go build testws-s.go

Point ~/.kube/config at the kubeconfig of the installed OCP cluster, then run the WebSocket client programs against the cluster.

Try on a 4.7 nightly payload:
$ oc get clusterversion --kubeconfig=/home/kewang/.kube/config
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-10-18-191324   True        False         31m     Cluster version is 4.7.0-0.nightly-2021-10-18-191324

$ oc get no --kubeconfig=/home/kewang/.kube/config
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-l07w5mt-f76d1-6zzz7-master-0         Ready    master   51m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-master-1         Ready    master   51m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-master-2         Ready    master   51m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-worker-a-z82nq   Ready    worker   41m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-worker-b-2h9sj   Ready    worker   41m   v1.20.0+bbbc079
ci-ln-l07w5mt-f76d1-6zzz7-worker-c-qhn69   Ready    worker   41m   v1.20.0+bbbc079

$ time ./testws-cm
...

real	0m8.704s
user	0m0.375s
sys	0m0.406s

$ time ./testws-s
...

real	0m5.716s
user	0m0.027s
sys	0m0.028s

Both watches return shortly after the 5-second timeoutSeconds elapses instead of hanging: the WebSocket watch of the services object now terminates as expected, so moving the bug to VERIFIED.

Comment 16 errata-xmlrpc 2021-10-27 08:22:59 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.36 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3931