Bug 1298500

Summary:	'--watch' for `oc get` does not work for old resource
Product:	OKD	Reporter:	Xingxing Xia <xxia>
Component:	Pod	Assignee:	Seth Jennings <sjenning>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Xingxing Xia <xxia>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.x	CC:	agoldste, aos-bugs, avagarwa, ccoleman, decarr, mmccomas
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-09-19 13:56:13 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Xingxing Xia 2016-01-14 09:42:53 UTC

Description of problem:
e.g. `oc get pod <pod name> --watch` can watch changes of given pod and the cursor keeps blinking until we press Ctrl + C. But watch fails for old pod ('how old' is not fixed, see 'Additional info')
Same issue exists for '/' format: oc get pod/<pod name> --watch
Same issue exists for '--watch-only'.

Version-Release number of selected component (if applicable):
oc/openshift v1.1-766-g2eb4afc
kubernetes v1.1.0-origin-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. oc login, create project

2. Create sth
1> Create new app
$ oc new-app -f origin/examples/sample-app/application-template-stibuild.json
2> After some time, create newer resource
$ oc create -f origin/examples/hello-openshift/hello-pod.json

3. Check the resource
$ oc get pod
NAME                        READY     STATUS      RESTARTS   AGE
database-1-1q3lc            1/1       Running     0          2h
frontend-1-pohcv            1/1       Running     0          2h
frontend-1-tdtd6            1/1       Running     0          2h
hello-openshift             1/1       Running     0          28m
ruby-sample-build-1-build   0/1       Completed   0          2h

4. Watch resource
1> Watch old pod
[xxia@local ~]$ oc get pod database-1-1q3lc --watch
NAME               READY     STATUS    RESTARTS   AGE
database-1-1q3lc   1/1       Running   0          2h
[xxia@local ~]$  # <--- `oc get` exit immediately
2> Watch newer pod
[xxia@local ~]$ oc get pod hello-openshift --watch
NAME              READY     STATUS    RESTARTS   AGE
hello-openshift   1/1       Running   0          28m
^C[xxia@local ~]$ # <--- `oc get` does not exit. Need press Ctrl + C

Actual results:
4.
As said in step 4 comment after '#'

Expected results:
4.
steps 1> and 2> should both watch the resource and not exit unless Ctrl + C pressed

Additional info:
How 'old' is not fixed because here is a test result where `oc get` exits while watching pod of AGE 15m (Note: here the result is from OSE testing, while the above steps are from Origin testing):
[tester@local ~]$ oc get pod hello-openshift --watch
NAME              READY     STATUS    RESTARTS   AGE
hello-openshift   1/1       Running   0          15m
[tester@local ~]  # <--- `oc get` exit immediately

Comment 1 Andy Goldstein 2016-02-09 15:53:16 UTC

The issue is that the "old" resource was last updated outside of the etcd watch window (which is currently 1000).

Clayton, is there any way we can have the client code start watching from "now"? Or is there anything else we can do? From an outside observer's standpoint, if you don't know anything about etcd and watch window, it seems like this should "just work" regardless of the age of the resource.

Comment 2 Clayton Coleman 2016-02-12 15:33:56 UTC

The API supports "get then watch" via `?watch=1&resourceVersion=0` which I think is what we need here.

Comment 3 Clayton Coleman 2016-02-12 15:34:35 UTC

And if the API returns an error there, I think that's a bug in the API

Comment 4 Seth Jennings 2016-02-15 17:07:48 UTC

I have a fix here

https://github.com/sjenning/kubernetes/pull/1

I am not saying it is pretty.  It parses the etcd error from the api server to get the current index and does a watch with the resourceVersion set to the current index.

I'm not seeing a way do to this in a more robust way that doesn't involve for invasive changes.  Review/comments/alternative approaches welcome!

Comment 5 Avesh Agarwal 2016-03-09 14:05:05 UTC

Hi Seth,

Not sure why you closed your PR, please let me know if you need any help and are not working on this actively.

Thanks
Avesh

Comment 6 Seth Jennings 2016-03-09 15:02:01 UTC

Avesh,

I never actually file the PR against upstream kube (the link is to a repo local PR).

The approach there does work and the changes are localized in kubectl, but it is fragile and involves parsing the error message returned by the api server.  If that error message changes format, this will break.

I guess I could file the PR upstream and see what becomes of it.

I'm not currently working on this if you have cycles and are interested.

Comment 7 Andy Goldstein 2016-03-09 15:03:29 UTC

Weren't we going to try to find a way to include the etcd index in an http response header that the client could use?

Comment 8 Seth Jennings 2016-03-09 15:06:21 UTC

Yes, but I couldn't find a way (not saying there isn't a way).  Since the api server doesn't passthrough the response from etcd, the current index doesn't make it to the client.

Comment 9 Avesh Agarwal 2016-03-09 16:39:14 UTC

Hi Seth,

I will try to reproduce it and see if I can make any progress.

Comment 10 Avesh Agarwal 2016-03-11 14:04:36 UTC

As per what Clayton suggested above seems like passing rv = 0 to the following call always in pkg/kubectl/cmd/get.go should do the trick:

w, err := r.Watch(rv)

However, I have been trying to reproduce it with etcd-2.2.1 and latest kube but can not. Event after 20 hours, watch keeps working.

#kubectl get pod hello-pod --watch
get.go rv ("36")
NAME        READY     STATUS    RESTARTS   AGE
hello-pod   1/1       Running   0          20h
hello-pod   1/1       Running   0         20h

And it does not exit.

Comment 11 Seth Jennings 2016-03-11 15:08:41 UTC

The watch window is not a fixed time but a delta between the resourceVersion of the object and the currentVersion in etcd, which increments when etcd changes.  I recreated it using Openshift, which make lots of regular changes to etcd, advancing the currentVersion in a reasonable amount of time.

Comment 12 Avesh Agarwal 2016-03-11 22:22:15 UTC

First of all passing rv="0" solves the problem. 

Also, I am observing that neither the createIndex nor delta theory seems to be working. For example, there are 4 pods:

oc get pods
NAME                   READY     STATUS    RESTARTS   AGE
bwide-pause-rc-kvyej   1/1       Running   0          1h
database-1-deploy      1/1       Running   0          1h
database-1-hook-pre    0/1       Running   0          1h
hello-openshift        0/1       Running   0          1h 

For each pod:

1. bwide-pause-rc-kvyej

From etcd:
        "createdIndex": 628,
        "modifiedIndex": 4300,

From oc edit: resourceVersion: 4300

Result: passed rv to Watch in oc get is 4300 and Watch works.

2. database-1-deploy

From etcd:
        "createdIndex": 661,
        "modifiedIndex": 4301,

From oc edit: resourceVersion: 4301

Result: passed rv to Watch in oc get is 4300 and Watch works.

3. database-1-hook-pre

From etcd:
        "createdIndex": 760,
        "modifiedIndex": 760,

From oc edit: resourceVersion: 760

Result: passed rv to Watch in oc get is 760 and Watch does NOT work.

4. hello-openshift
        "createdIndex": 727,
        "modifiedIndex": 727,

From oc edit: resourceVersion: 727

Result: passed rv to Watch in oc get is 727 and Watch does NOT work.

So even if we passed creatIndex which is anyway happening in case 3 and 4, watch would not work. And case 1 and 2 seem to be contradicting the delta theory. 

Anyway as I said, (based on Clayton's suggestion), I tested passing rv="0" and it works in all cases.

Comment 13 Avesh Agarwal 2016-03-14 13:19:56 UTC

I am wondering if it is worth investigating further or the solution with rv=0 seems fine. If the latter, please let me know and I will send a PR upstream.

Comment 14 Seth Jennings 2016-06-14 20:17:29 UTC

Opened a PR upstream to deal with this:
https://github.com/kubernetes/kubernetes/pull/27392

Comment 15 Andy Goldstein 2016-08-04 10:51:48 UTC

Seth, please cherry-pick the upstream PR to origin, as it's now merged.

Comment 16 Seth Jennings 2016-08-04 21:15:57 UTC

Origin PR:
https://github.com/openshift/origin/pull/10213

Comment 17 Seth Jennings 2016-08-05 13:16:44 UTC

origin PR has merged

Comment 18 Xingxing Xia 2016-08-08 05:56:09 UTC

VERIFIED in:
openshift v1.3.0-alpha.2+dc66809
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

Now it works for old resource:
[xxia@pc_vm3 oc]$ oc get pod database-1-m4bc4 --watch
NAME               READY     STATUS    RESTARTS   AGE
database-1-m4bc4   1/1       Running   0          3h

^C[xxia@pc_vm3 oc]$