Bug 1566814

Summary: master api/controller should use the "ose-control-plane" image, not "ose"
Product: OpenShift Container Platform
Reporter: weiwei jiang <wjiang>
Component: Installer
Assignee: Scott Dodson <sdodson>
Status: CLOSED NEXTRELEASE
QA Contact: Weihua Meng <wmeng>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: aos-bugs, ccoleman, ekuric, jiazha, jmencak, jokerman, mfojtik, mifiedle, mmccomas, wjiang, wmeng
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-23 12:48:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description weiwei jiang 2018-04-13 03:10:33 UTC
Description of problem:
After installation, all of the deploy pods in the default namespace are in Error status:
[root@ip-172-18-6-45 ~]# oc get pods 
NAME                        READY     STATUS    RESTARTS   AGE
docker-registry-1-deploy    0/1       Error     0          7m
registry-console-1-deploy   0/1       Error     0          7m
router-1-deploy             0/1       Error     0          8m
[root@ip-172-18-6-45 ~]# oc logs -f docker-registry-2-deploy 
--> Scaling docker-registry-2 to 1
error: couldn't scale docker-registry-2 to 1: replicationcontrollers "docker-registry-2" is forbidden: User "system:serviceaccount:default:deployer" cannot get replicationcontrollers/scale in the namespace "default": User "system:serviceaccount:default:deployer" cannot get replicationcontrollers/scale in project "default"

This can currently be worked around with `oc policy add-role-to-user admin system:serviceaccount:default:deployer -n default`.
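
After applying the workaround, the following commands may help confirm that the deployer service account now has the permission and then retry the failed deployments (a sketch only; resource names are taken from the pod listing above):

# oc adm policy who-can get replicationcontrollers/scale -n default
# oc rollout retry dc/docker-registry -n default
# oc rollout retry dc/router -n default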


Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.10.0-0.21.0.git.0.0b1d180.el7.noarch.rpm
rpm -q ansible 
ansible-2.4.2.0-2.el7.noarch
ansible --version

How reproducible:
Always 
Steps to Reproduce:
1. Run the installation
2. Check pod status in the default namespace

Actual results:
Output from step 2:
[root@ip-172-18-6-45 ~]# oc get pods 
NAME                        READY     STATUS    RESTARTS   AGE
docker-registry-1-deploy    0/1       Error     0          7m
registry-console-1-deploy   0/1       Error     0          7m
router-1-deploy             0/1       Error     0          8m


Expected results:
All pods in the default namespace should be running correctly.

Comment 1 Jian Zhang 2018-04-13 03:16:24 UTC
Similar to bug 1566357

Comment 2 Michal Fojtik 2018-04-16 14:16:40 UTC
Is this 100% reproducible or only happens sometimes?

Comment 4 Michal Fojtik 2018-04-16 15:05:11 UTC
Can we see output of oc version?

I'm 100% sure you're running API server 1.9.1 against a new deployer image that was built after the 1.10 rebase landed. That is what is causing the problems. You need to upgrade the API server.

Comment 5 Michal Fojtik 2018-04-16 15:06:52 UTC
*** Bug 1566357 has been marked as a duplicate of this bug. ***

Comment 6 Mike Fiedler 2018-04-16 15:12:01 UTC
This is 100% reproducible in my cluster.

oc v3.10.0-0.21.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-33-11.us-west-2.compute.internal:8443
openshift v3.10.0-0.14.0
kubernetes v1.9.1+a0ce1bc657


Why does this puddle not have the rebase?  It landed last Monday in origin.

Comment 8 Scott Dodson 2018-04-16 19:28:45 UTC
*** Bug 1568031 has been marked as a duplicate of this bug. ***

Comment 9 Scott Dodson 2018-04-16 19:29:38 UTC
https://github.com/openshift/openshift-ansible/pull/7964/commits/d1861a0280b4f1dc4651fff0b55b9eb6177278a4 addresses this, but we still need to sort out image logistics for origin.

Comment 10 Scott Dodson 2018-04-16 20:02:11 UTC
*** Bug 1565442 has been marked as a duplicate of this bug. ***

Comment 11 Mike Fiedler 2018-04-16 21:29:50 UTC
Marking this as urgent: as the duplicate of bug 1568031, it blocks performance testing of 3.10.

Comment 12 weiwei jiang 2018-04-17 02:10:23 UTC
I think this issue is related to https://bugzilla.redhat.com/show_bug.cgi?id=1565442: since the image referenced by the installation script has not been updated, the reported version is the same as in https://bugzilla.redhat.com/show_bug.cgi?id=1566814#c6. Once the image is updated, this issue should be gone.

Comment 13 Johnny Liu 2018-04-17 03:25:10 UTC
Actually this bug should be a dup of 1565442, rather than 1565442 being closed as a dup of this one.

This issue is blocking all testing. The installer is trying to use the "ose" image to start the master static pods, but that image no longer exists on the registry.

# journalctl -f  -u atomic-openshift-node.service |grep E0
Apr 16 22:59:27 qe-smoke310-mrre-1 atomic-openshift-node[22075]: E0416 22:59:27.496745   22075 pod_workers.go:186] Error syncing pod a2126bf858dec6e7c76ce2bbf8325184 ("master-controllers-qe-smoke310-mrre-1_kube-system(a2126bf858dec6e7c76ce2bbf8325184)"), skipping: failed to "StartContainer" for "controllers" with ImagePullBackOff: "Back-off pulling image \"registry.reg-aws.openshift.com:443/openshift3/ose:v3.10.0-0.22.0\""
Apr 16 22:59:27 qe-smoke310-mrre-1 atomic-openshift-node[22075]: E0416 22:59:27.503245   22075 pod_workers.go:186] Error syncing pod 37410126583f9d3795a9d1dfa2c4c1fc ("master-api-qe-smoke310-mrre-1_kube-system(37410126583f9d3795a9d1dfa2c4c1fc)"), skipping: failed to "StartContainer" for "api" with ImagePullBackOff: "Back-off pulling image \"registry.reg-aws.openshift.com:443/openshift3/ose:v3.10.0-0.22.0\""
Apr 16 22:59:28 qe-smoke310-mrre-1 atomic-openshift-node[22075]: E0416 22:59:28.222772   22075 reflector.go:205] github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:460: Failed to list *v1.Node: Get https://qe-smoke310-mrre-1:8443/api/v1/nodes?fieldSelector=metadata.name%3Dqe-smoke310-mrre-1&limit=500&resourceVersion=0: dial tcp 10.240.0.21:8443: getsockopt: connection refused

# curl -H "Authorization: Bearer $(oc --config=~/.kube/reg-aws whoami -t)" https://registry.reg-aws.openshift.com/v2/openshift3/ose/tags/list  | python -m json.tool | grep v3.10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7703    0  7703    0     0  39178      0 --:--:-- --:--:-- --:--:-- 39101
        "v3.10.0-0.14.0",
        "v3.10.0-0.14.0.0",
        "v3.10.0-0.13.0.0",
        "v3.10.0",
        "v3.10",
        "v3.10.0-0.13.0",


# curl -H "Authorization: Bearer $(oc --config=~/.kube/reg-aws whoami -t)" https://registry.reg-aws.openshift.com/v2/openshift3/ose-control-plane/tags/list  | python -m json.tool | grep v3.10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   211  100   211    0     0   1219      0 --:--:-- --:--:-- --:--:--  1226
        "v3.10.0",
        "v3.10.0-0.16.0.0",
        "v3.10.0-0.20.0.0",
        "v3.10.0-0.21.0",
        "v3.10",
        "v3.10.0-0.15.0",
        "v3.10.0-0.15.0.0",
        "v3.10.0-0.16.0",
        "v3.10.0-0.20.0",
        "v3.10.0-0.21.0.0"


# cat /etc/origin/node/pods/apiserver.yaml |grep image
    image: registry.reg-aws.openshift.com:443/openshift3/ose:v3.10.0-0.22.0
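
Until a puddle with the fixed installer is available, one possible manual workaround (a sketch, not verified here; only apiserver.yaml is confirmed above, the controllers manifest filename is assumed) is to point the master static pod manifests at the ose-control-plane repository and a tag that actually exists in the registry listing above, then let the kubelet restart the static pods:

# sed -i 's|openshift3/ose:v3.10.0-0.22.0|openshift3/ose-control-plane:v3.10.0-0.21.0|' /etc/origin/node/pods/apiserver.yaml
# sed -i 's|openshift3/ose:v3.10.0-0.22.0|openshift3/ose-control-plane:v3.10.0-0.21.0|' /etc/origin/node/pods/controller.yaml   # filename assumed

Note the tag has to change as well, since the listing above shows no v3.10.0-0.22.0 tag for ose-control-plane, so some version skew against the 0.22.0 node bits is possible with this approach.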

Comment 16 Weihua Meng 2018-04-23 02:57:55 UTC
Fixed.
openshift-ansible-3.10.0-0.27.0.git.0.abed3b7.el7

  Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:GA:server
            Kernel: Linux 3.10.0-862.el7.x86_64
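
For reference, a quick sanity check on a master installed with the fixed openshift-ansible (a sketch; the exact registry and tag will differ):

# grep 'image:' /etc/origin/node/pods/*.yaml

Both the api and controllers manifests should now reference openshift3/ose-control-plane rather than openshift3/ose.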