Description of problem:
The second and third atomic-openshift-master-api services fail to start with:
"controller.go:128] Unable to perform initial IP allocation check: unable to refresh the service IP block: User "system:anonymous" cannot list all services in the cluster"

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.133

How reproducible:
Both install and upgrade hit this issue with OCP v3.6. At least 5 times.

Steps to Reproduce:
1. Container install OCP v3.5
2. Upgrade to OCP 3.6:
   ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml

Actual results:
[root@openshift-182 ~]# systemctl status atomic-openshift-master-api
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: activating (start-post) (Result: exit-code) since Wed 2017-07-05 01:23:06 EDT; 8s ago
     Docs: https://github.com/openshift/origin
  Process: 16535 ExecStop=/usr/bin/docker stop atomic-openshift-master-api (code=exited, status=1/FAILURE)
  Process: 16574 ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master-api --env-file=/etc/sysconfig/atomic-openshift-master-api -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin -v /etc/pki:/etc/pki:ro openshift3/ose:${IMAGE_VERSION} start master api --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 16567 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 16574 (code=exited, status=255); : 16575 (sleep)
   Memory: 96.0K
   CGroup: /system.slice/atomic-openshift-master-api.service
           └─control
             └─16575 /usr/bin/sleep 10

Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: Trace[1623241719]: [343.386731ms] [343.379485ms] Etcd node listed
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: Trace[1623241719]: [566.788618ms] [223.401887ms] Node list decoded
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.311299 1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.311781 1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.311998 1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.312187 1 reflector.go:201] github.com/openshift/origin/pkg/quota/generated/informers/internalversi...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.312730 1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: I0705 05:23:10.349300 1 serve.go:86] Serving securely on 0.0.0.0:8443
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: F0705 05:23:10.387662 1 controller.go:128] Unable to perform initial IP allocation check: unable to ref...he cluster
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Hint: Some lines were ellipsized, use -l to show in full.

Expected results:
The atomic-openshift-master-api service starts on all masters.

Additional info:
The issue may disappear when you reinstall.
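Since the status output above is ellipsized, fuller logs can be pulled from an affected master with standard systemd tooling; nothing OpenShift-specific is assumed here beyond the unit name shown above:

  # show the status output without truncating long lines
  systemctl status atomic-openshift-master-api -l

  # dump the unit's complete journal for the current boot
  journalctl -u atomic-openshift-master-api -b --no-pager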
Created attachment 1294447 [details]
The inventory and logs

Hit the same issue twice with the same inventory file:
1. Install OCP 3.5 and upgrade to OCP 3.6.133 with openshift-ansible v3.6.133: logs-20170705053310-upgrade
2. Install OCP 3.6 with openshift-ansible 3.6.96: logs-20170704145436-config
I'd be very surprised if this was an installer issue, but let's verify we can reproduce it and then see if we can get some better logs.
Couldn't reproduce :-(

Started a fresh 3-node cluster at image tag v3.5.5.24 (as specified in your inventory), then ran an upgrade with image tag v3.6.133 (per your log). The initial install landed docker-1.12.6-32.git88a4867.el7.x86_64; I upgraded docker manually to the newer docker-1.12.6-32.git88a4867.el7.x86_64 and rebooted. No changes. All atomic-openshift services are running and happy. Maybe it's just a fluke?
At least 5 times. Both jiajliu and I have hit this issue. I will try to reproduce it and leave the environment available.
Closing it since I couldn't reproduce it. I will reopen it when I hit the same issue again.
Reopening bug as multiple clusters have hit this issue recently.

Clusters installed with 3.6 or earlier, upgraded with either the 3.6 or 3.7 control plane upgrade playbooks:

# ansible-playbook -i </path/to/inventory/file> \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade_control_plane.yml

# ansible-playbook -i </path/to/inventory/file> \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml

end up with the API service failing due to a change in the openshift-master.kubeconfig file: the users.name entry changes to use the openshift_cluster_hostname, but the user set in the current context still points to a users.name built from the local master's hostname, and that user is no longer present in the users section.

Example:

BEFORE:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://master03.ocp36.example.com:8443
  name: master03-ocp36-example-com:8443
contexts:
- context:
    cluster: master03-ocp36-example-com:8443
    namespace: default
    user: system:openshift-master/master03-ocp36-example-com:8443
  name: default/master03-ocp36-example-com:8443/system:openshift-master
current-context: default/master03-ocp36-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/master03-ocp36-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

AFTER:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://master03.ocp36.example.com:8443
  name: master03-ocp36-example-com:8443
- cluster:
    certificate-authority-data: REDACTED
    server: https://cluster.ocp36.example.com:8443
  name: cluster-ocp36-example-com:8443
contexts:
- context:
    cluster: master03-ocp36-example-com:8443
    namespace: default
    user: system:openshift-master/master03-ocp36-example-com:8443
  name: default/master03-ocp36-example-com:8443/system:openshift-master
- context:
    cluster: cluster-ocp36-example-com:8443
    namespace: default
    user: system:openshift-master/cluster-ocp36-example-com:8443
  name: default/cluster-ocp36-example-com:8443/system:openshift-master
current-context: default/master03-ocp36-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/cluster-ocp36-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
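A quick way to confirm a master is in this state is to compare what the current context references with the users actually defined in the loopback kubeconfig. The commands below are illustrative, reusing the oc config view invocation shown later in this bug and assuming the default /etc/origin/master path; --minify keeps only the objects referenced by current-context, so it should complain (or show no user entry) when the context points at a missing user:

  # only the objects referenced by current-context; errors if the user is gone
  oc config view --config /etc/origin/master/openshift-master.kubeconfig --minify

  # current context name and all defined users, for a side-by-side check
  oc config view --config /etc/origin/master/openshift-master.kubeconfig -o jsonpath='{.current-context}{"\n"}'
  oc config view --config /etc/origin/master/openshift-master.kubeconfig -o jsonpath='{range .users[*]}{.name}{"\n"}{end}'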
Further research points to this happening on older clusters originally installed around the 3.2 timeframe. Bug 1306011 (https://bugzilla.redhat.com/show_bug.cgi?id=1306011) was related and was supposed to fix an issue with the loopback kubeconfig; I believe it left the kubeconfig in a state where the upgrade causes this issue to be hit.

The BEFORE example above is incorrect; I will try to get a correct example.

The root issue is in the following role:

roles/openshift_master/tasks/set_loopback_context.yml

The context is changed to use a user that is not present. To fix this we need to add a task that sets credentials whenever set_loopback_cluster is changed.

3.6 -> 3.9 playbooks: roles/openshift_master/tasks/set_loopback_context.yml
3.10+ playbooks: roles/openshift_control_plane/tasks/set_loopback_context.yml

ADD TASK:

- command: >
    {{ openshift.common.client_binary }} config set-credentials
    --client-certificate={{ openshift_master_config_dir }}/openshift-master.crt
    --client-key={{ openshift_master_config_dir }}/openshift-master.key
    --embed-certs=true
    {{ openshift.master.loopback_user }}
    --config={{ openshift_master_loopback_config }}
  when: set_loopback_cluster | changed
  register: set_loopback_credentials
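For a master that is already broken (before a fixed openshift-ansible is available), the same command the task above runs can be issued by hand. The concrete values below are illustrative only: they assume the default /etc/origin/master config directory, and the user name must be replaced with whatever current-context references in the affected openshift-master.kubeconfig (see the BEFORE/AFTER example above):

  oc config set-credentials system:openshift-master/master03-ocp36-example-com:8443 \
      --client-certificate=/etc/origin/master/openshift-master.crt \
      --client-key=/etc/origin/master/openshift-master.key \
      --embed-certs=true \
      --config=/etc/origin/master/openshift-master.kubeconfig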
Upstream PR: https://github.com/openshift/openshift-ansible/pull/10271
With https://github.com/openshift/openshift-ansible/pull/10325 merged this does in fact appear to work correctly. I've crafted a kubeconfig where items were capitalized as reported in the customer's logs, then ran an upgrade.

First, the full kubeconfig:

[root@ose3-master ~]# oc config view --config /etc/origin/master/openshift-master.kubeconfig
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://OSE3-MASTER.example.com:8443
  name: OSE3-MASTER-example-com:8443
- cluster:
    certificate-authority-data: REDACTED
    server: https://ose3-master.example.com:8443
  name: ose3-master-example-com:8443
contexts:
- context:
    cluster: OSE3-MASTER-example-com:8443
    namespace: default
    user: system:openshift-master/OSE3-MASTER-example-com:8443
  name: default/OSE3-MASTER-example-com:8443/system:openshift-master
- context:
    cluster: ose3-master-example-com:8443
    namespace: default
    user: system:openshift-master/ose3-master-example-com:8443
  name: default/ose3-master-example-com:8443/system:openshift-master
current-context: default/ose3-master-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/OSE3-MASTER-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
- name: system:openshift-master/ose3-master-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

Now the minified version, which only shows the current context:

[root@ose3-master ~]# oc config view --config /etc/origin/master/openshift-master.kubeconfig --minify
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://ose3-master.example.com:8443
  name: ose3-master-example-com:8443
contexts:
- context:
    cluster: ose3-master-example-com:8443
    namespace: default
    user: system:openshift-master/ose3-master-example-com:8443
  name: default/ose3-master-example-com:8443/system:openshift-master
current-context: default/ose3-master-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/ose3-master-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

Now, verify that it works:

[root@ose3-master ~]# oc get nodes --config /etc/origin/master/openshift-master.kubeconfig
NAME                      STATUS    AGE       VERSION
ose3-master.example.com   Ready     50m       v1.7.6+a08f5eeb62
ose3-node1.example.com    Ready     22m       v1.7.6+a08f5eeb62
ose3-node2.example.com    Ready     50m       v1.7.6+a08f5eeb62
*** Bug 1636238 has been marked as a duplicate of this bug. ***
There appear to be no active cases related to this bug. As such we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.