Bug 1382142

Summary: router startup sometimes ignores scale
Product: OKD
Component: Deployments
Version: 3.x
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Aleksandar Kostadinov <akostadi>
Assignee: Michal Fojtik <mfojtik>
QA Contact: zhou ying <yinzhou>
CC: akostadi, aos-bugs, mkargaki, mmccomas, pweil, sross
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-05-30 12:50:09 UTC
Attachments:
loglevel 10 reproducer on origin

Description Aleksandar Kostadinov 2016-10-05 21:09:40 UTC
Description of problem:

Creating a router, then deleting it, then creating a router with the same name again ends up with 0 desired replicas about 50% of the time for me. I don't know whether the time between deletion and re-creation matters, but IMO the requested replica count should not be ignored either way.

I also don't know whether the issue is related to the errors about the existing service account and cluster role binding; those are tracked in bug 1381378.

See the two creation commands below. I executed this some 8-10 times and the desired replica count alternated: one run ended up with 0, the next with 2, and so on.

> [root@openshift-125 ~]# oadm router tc-518936 --images=registry.access.redhat.com/openshift3/ose-haproxy-router:v3.3.0.34 --replicas=2
> info: password for stats user admin has been set to cUsztwMFRB
> --> Creating router tc-518936 ...
>     error: serviceaccounts "router" already exists
>     error: rolebinding "router-tc-518936-role" already exists
>     deploymentconfig "tc-518936" created
>     service "tc-518936" created
> --> Failed
> [root@openshift-125 ~]# oc get dc
> NAME              REVISION   DESIRED   CURRENT   TRIGGERED BY
> docker-registry   2          1         1         config
> router            4          0         0         config
> tc-518936         1          2         0         config
> [root@openshift-125 ~]# oc delete dc/tc-518936 svc/tc-518936
> deploymentconfig "tc-518936" deleted
> service "tc-518936" deleted
> [root@openshift-125 ~]# oadm router tc-518936 --images=registry.access.redhat.com/openshift3/ose-haproxy-router:v3.3.0.34 --replicas=2
> info: password for stats user admin has been set to EbtHgW5E2d
> --> Creating router tc-518936 ...
>     error: serviceaccounts "router" already exists
>     error: rolebinding "router-tc-518936-role" already exists
>     deploymentconfig "tc-518936" created
>     service "tc-518936" created
> --> Failed
> [root@openshift-125 ~]# oc get dc
> NAME              REVISION   DESIRED   CURRENT   TRIGGERED BY
> docker-registry   2          1         1         config
> router            4          0         0         config
> tc-518936         1          0         0         config


Version-Release number of selected component (if applicable):
> oc v3.3.0.34
> kubernetes v1.3.0+52492b4

How reproducible:
50%

Steps to Reproduce:
1. create a router
2. delete the router's dc and svc
3. create a router with the same name again (a shell sketch of these steps follows the expected results below)
4. oc get dc

Actual results:
desired replicas 0

Expected results:
desired replicas 2
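
A minimal shell sketch of those steps, reusing the router name and image from the transcript above (roughly half the runs come back with DESIRED 0 instead of 2):

> oadm router tc-518936 --images=registry.access.redhat.com/openshift3/ose-haproxy-router:v3.3.0.34 --replicas=2
> oc delete dc/tc-518936 svc/tc-518936
> oadm router tc-518936 --images=registry.access.redhat.com/openshift3/ose-haproxy-router:v3.3.0.34 --replicas=2
> oc get dc tc-518936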

Comment 1 Solly Ross 2016-10-07 19:53:18 UTC
can you reproduce with `--loglevel=10`, just so we can double-check the requests that `oadm router` is sending?
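
Something like this should capture the requests (the log file name is just an example):

> oadm router tc-518936 --images=registry.access.redhat.com/openshift3/ose-haproxy-router:v3.3.0.34 --replicas=2 --loglevel=10 &> oadm-router-loglevel10.log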

Comment 3 Aleksandar Kostadinov 2016-10-12 18:42:42 UTC
Created attachment 1209705 [details]
loglevel 10 reproducer on origin

It was not as consistent on origin but I could reproduce (see attached) with:

> oc v1.4.0-alpha.0+8f6030a
> kubernetes v1.4.0+776c994
> features: Basic-Auth GSSAPI Kerberos SPNEGO
> 
> Server https://172.18.14.117:8443
> openshift v1.4.0-alpha.0+8f6030a
> kubernetes v1.4.0+776c994

Comment 5 Solly Ross 2016-10-25 15:27:46 UTC
The actual requests and responses being sent to the API server look fine.  Is there any way we could get the controller manager logs (at log level at least 4) from when this was running, to see if anything looks off in the DC controller?  Also, can we get dumps of the DC and any deployments as YAML or JSON?
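
Something along these lines would do (assuming the master runs as the origin-master systemd unit and was started with --loglevel=4 or higher; adjust unit and router names as needed):

> # DC and deployment (replication controller) dumps
> oc get dc tc-518936 -o yaml
> oc get rc --selector=openshift.io/deployment-config.name=tc-518936 -o yaml
> # controller manager logs (part of the master process in 3.x)
> journalctl -u origin-master --no-pager > master.log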

Comment 6 Aleksandar Kostadinov 2016-10-25 15:39:59 UTC
If you tell me steps how to obtain that log, I'll give it a go. But perhaps it will be more time efficient to try reproducing in an environment you have access to.

Comment 11 Aleksandar Kostadinov 2016-11-22 15:23:08 UTC
I can no longer reproduce on 3.4

Comment 12 Michal Fojtik 2016-12-15 08:50:28 UTC
I agree; we are not going to backport this to 3.3.

QA: Can you verify this on 3.4?

Comment 13 zhou ying 2016-12-19 07:03:29 UTC
Can't reproduce this issue with the latest OCP 3.4:
openshift version
openshift v3.4.0.37+3b76456-1
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

[root@ip-172-18-11-194 ~]# oc get dc
NAME               REVISION   DESIRED   CURRENT   TRIGGERED BY
docker-registry    2          3         3         config
registry-console   1          1         1         config
tester             1          2         2         config
[root@ip-172-18-11-194 ~]# oc get po 
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-2-axuso    1/1       Running   0          45m
docker-registry-2-sy6yu    1/1       Running   0          45m
docker-registry-2-v0d2q    1/1       Running   0          45m
registry-console-1-cp1i6   1/1       Running   0          44m
tester-1-hni4v             1/1       Running   0          56s
tester-1-zfpzh             1/1       Running   0          56s