Bug 1467775

Summary: Unable to perform initial IP allocation check for container OCP
Product: OpenShift Container Platform Reporter: Anping Li <anli>
Component: Cluster Version Operator    Assignee: Scott Dodson <sdodson>
Status: CLOSED DEFERRED QA Contact: Anping Li <anli>
Severity: high Docs Contact:
Priority: high    
Version: 3.6.0    CC: aos-bugs, cynthia.devaraj, jfoots, jokerman, mmccomas, rhowe, sdodson, wmeng
Target Milestone: ---    Keywords: Reopened
Target Release: 3.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: The upgrade playbooks improperly modified /etc/origin/master/openshift-master.kubeconfig with the intent of correcting an error in environments provisioned in 3.5 and earlier. Consequence: Under certain circumstances this created API server errors. Fix: The process for updating the kubeconfig file has been updated to handle missing contexts. Result: The kubeconfig should be updated properly in all situations.
Story Points: ---
Clone Of:
: 1636238 1636558 1636559    Environment:
Last Closed: 2019-02-28 14:57:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1636238, 1636558, 1636559    
Attachments:
The inventory and logs (no flags)

Description Anping Li 2017-07-05 05:45:03 UTC
Description of problem:
The second and third atomic-openshift-master-api services could not be started, failing with:
controller.go:128] Unable to perform initial IP allocation check: unable to refresh the service IP block: User "system:anonymous" cannot list all services in the cluster

Version-Release number of selected component (if applicable):
openshift-ansible-3.6.133

How reproducible:
Both install and upgrade hit this issue with OCP v3.6; we have seen it at least 5 times.

Steps to Reproduce:
1. Containerized install of OCP v3.5
2. Upgrade to OCP 3.6:
   ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml

Actual results:
[root@openshift-182 ~]# systemctl status atomic-openshift-master-api
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: activating (start-post) (Result: exit-code) since Wed 2017-07-05 01:23:06 EDT; 8s ago
     Docs: https://github.com/openshift/origin
  Process: 16535 ExecStop=/usr/bin/docker stop atomic-openshift-master-api (code=exited, status=1/FAILURE)
  Process: 16574 ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master-api --env-file=/etc/sysconfig/atomic-openshift-master-api -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin -v /etc/pki:/etc/pki:ro openshift3/ose:${IMAGE_VERSION} start master api --config=${CONFIG_FILE} $OPTIONS (code=exited, status=255)
  Process: 16567 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 16574 (code=exited, status=255); Control PID: 16575 (sleep)
   Memory: 96.0K
   CGroup: /system.slice/atomic-openshift-master-api.service
           └─control
             └─16575 /usr/bin/sleep 10

Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: Trace[1623241719]: [343.386731ms] [343.379485ms] Etcd node listed
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: Trace[1623241719]: [566.788618ms] [223.401887ms] Node list decoded
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.311299       1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.311781       1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.311998       1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.312187       1 reflector.go:201] github.com/openshift/origin/pkg/quota/generated/informers/internalversi...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: E0705 05:23:10.312730       1 reflector.go:201] github.com/openshift/origin/pkg/authorization/generated/informers/inter...
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: I0705 05:23:10.349300       1 serve.go:86] Serving securely on 0.0.0.0:8443
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com atomic-openshift-master-api[16574]: F0705 05:23:10.387662       1 controller.go:128] Unable to perform initial IP allocation check: unable to ref...he cluster
Jul 05 01:23:10 openshift-182.lab.eng.nay.redhat.com systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=255/n/a
Hint: Some lines were ellipsized, use -l to show in full.

Expected results:
All atomic-openshift-master-api services start successfully on every master after the upgrade.

Additional info:
The issue may disappear when you reinstall.

Comment 1 Anping Li 2017-07-05 05:52:13 UTC
Created attachment 1294447 [details]
The inventory and logs

Hit the same issue twice with the same inventory file:

1. Install OCP 3.5 and upgrade to OCP 3.6.133 with openshift-ansible v3.6.133 (logs-20170705053310-upgrade).
2. Install OCP 3.6 with openshift-ansible 3.6.96 (logs-20170704145436-config).

Comment 2 Scott Dodson 2017-07-05 15:11:17 UTC
I'd be very surprised if this was an installer issue, but let's verify we can reproduce it and then see if we can get some better logs.

Comment 3 Tim Bielawa 2017-07-05 20:31:57 UTC
Couldn't reproduce :-(


Started a fresh 3-node cluster at image tag v3.5.5.24 (as specified in your inventory) and then ran an upgrade to image tag v3.6.133 (per your log).


Initial install landed docker-1.12.6-32.git88a4867.el7.x86_64, upgraded manually to newer docker-1.12.6-32.git88a4867.el7.x86_64. Rebooted. No changes.


All atomic-openshift services are running and happy. Maybe it's just a fluke?

Comment 4 Anping Li 2017-07-06 10:19:52 UTC
We have hit this at least 5 times; both jiajliu and I have seen it. I will try to reproduce it and keep the environment available.

Comment 5 Anping Li 2017-07-06 14:52:53 UTC
Closing this since I couldn't reproduce it. I will reopen it when I hit the same issue again.

Comment 8 Ryan Howe 2018-09-28 19:12:07 UTC
Reopening bug as multiple clusters have hit this issue recently: 

Clusters installed with 3.6 or earlier that run the control plane upgrade playbooks, with either the 3.6 or 3.7 playbooks:

# ansible-playbook -i </path/to/inventory/file> \
/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade_control_plane.yml

# ansible-playbook -i </path/to/inventory/file> \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_7/upgrade_control_plane.yml


They end up with the API service failing because of a change to the openshift-master.kubeconfig file: the users.name entry is rewritten to use the openshift_cluster_hostname, but the user referenced by the current context still points to a users.name based on the local master's hostname, and that user is no longer present in the users section. (A quick check for this state is sketched after the AFTER example below.)

Example: 

BEFORE: 

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://master03.ocp36.example.com:8443
  name: master03-ocp36-example-com:8443
contexts:
- context:
    cluster: master03-ocp36-example-com:8443
    namespace: default
    user: system:openshift-master/master03-ocp36-example-com:8443
  name: default/master03-ocp36-example-com:8443/system:openshift-master
current-context: default/master03-ocp36-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/master03-ocp36-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED


AFTER: 

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://master03.ocp36.example.com:8443
  name: master03-ocp36-example-com:8443
- cluster:
    certificate-authority-data: REDACTED
    server: https://cluster.ocp36.example.com:8443
  name: cluster-ocp36-example-com:8443
contexts:
- context:
    cluster: master03-ocp36-example-com:8443
    namespace: default
    user: system:openshift-master/master03-ocp36-example-com:8443
  name: default/master03-ocp36-example-com:8443/system:openshift-master
- context:
    cluster: cluster-ocp36-example-com:8443
    namespace: default
    user: system:openshift-master/cluster-ocp36-example-com:8443
  name: default/cluster-ocp36-example-com:8443/system:openshift-master
current-context: default/master03-ocp36-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/cluster-ocp36-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
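
A quick way to spot this state on an affected master (a sketch only, assuming the default /etc/origin/master/openshift-master.kubeconfig path) is to compare the user referenced by the contexts with the user entries that actually exist. When the user named in current-context has no matching entry under users:, requests made with this kubeconfig carry no client certificate and are treated as system:anonymous, which is exactly the "cannot list all services in the cluster" error from the original description.

# User referenced by the contexts:
grep 'user: system:openshift-master' /etc/origin/master/openshift-master.kubeconfig
# User entries that actually exist:
grep 'name: system:openshift-master' /etc/origin/master/openshift-master.kubeconfig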

Comment 9 Ryan Howe 2018-09-28 20:10:07 UTC
Further research points to this happening on older clusters installed around the 3.2 timeframe.

The bug below was related and was supposed to fix an issue with the loopback kubeconfig; I believe it left the kubeconfig in a state where a later upgrade causes this issue to be hit.

   https://bugzilla.redhat.com/show_bug.cgi?id=1306011

The BEFORE example above is incorrect; I will try to get a correct example.

The root issue is in the following role task file:

 roles/openshift_master/tasks/set_loopback_context.yml

The context is switched to a user that is not present. To fix this we need to add a task that sets credentials whenever set_loopback_cluster changes (see the ADD TASK below; a manual equivalent is sketched after it).


3.6 -> 3.9 playbooks
 roles/openshift_master/tasks/set_loopback_context.yml

3.10+ playbooks 
  roles/openshift_control_plane/tasks/set_loopback_context.yml


ADD TASK:

- command: >
    {{ openshift.common.client_binary }} config set-credentials
    --client-certificate={{ openshift_master_config_dir }}/openshift-master.crt
    --client-key={{ openshift_master_config_dir }}/openshift-master.key
    --embed-certs=true 
    {{ openshift.master.loopback_user }}
    --config={{ openshift_master_loopback_config }}
  when: set_loopback_cluster | changed
  register: set_loopback_credentials
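
For clusters already in the broken state, a hand-run equivalent of the task above should work as a workaround. This is a sketch only, not the merged fix: the user name is the one referenced by current-context in the AFTER example, and the paths assume the default /etc/origin/master config directory; adjust both for the local master.

# Restore the user entry that current-context still references, then restart the API service.
oc config set-credentials system:openshift-master/master03-ocp36-example-com:8443 \
    --client-certificate=/etc/origin/master/openshift-master.crt \
    --client-key=/etc/origin/master/openshift-master.key \
    --embed-certs=true \
    --config=/etc/origin/master/openshift-master.kubeconfig
systemctl restart atomic-openshift-master-api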

Comment 10 Ryan Howe 2018-09-28 20:57:14 UTC
Upstream PR: 

https://github.com/openshift/openshift-ansible/pull/10271

Comment 14 Scott Dodson 2018-10-15 20:29:25 UTC
With https://github.com/openshift/openshift-ansible/pull/10325 merged this does in fact appear to work correctly. I crafted a kubeconfig where items were capitalized as reported in the customer's logs and then ran an upgrade.

First, the full kubeconfig.

[root@ose3-master ~]# oc config view --config /etc/origin/master/openshift-master.kubeconfig
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://OSE3-MASTER.example.com:8443
  name: OSE3-MASTER-example-com:8443
- cluster:
    certificate-authority-data: REDACTED
    server: https://ose3-master.example.com:8443
  name: ose3-master-example-com:8443
contexts:
- context:
    cluster: OSE3-MASTER-example-com:8443
    namespace: default
    user: system:openshift-master/OSE3-MASTER-example-com:8443
  name: default/OSE3-MASTER-example-com:8443/system:openshift-master
- context:
    cluster: ose3-master-example-com:8443
    namespace: default
    user: system:openshift-master/ose3-master-example-com:8443
  name: default/ose3-master-example-com:8443/system:openshift-master
current-context: default/ose3-master-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/OSE3-MASTER-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
- name: system:openshift-master/ose3-master-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

Now the minified version which only shows the current context.

[root@ose3-master ~]# oc config view --config /etc/origin/master/openshift-master.kubeconfig --minify                                                                                                                                                                           
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://ose3-master.example.com:8443
  name: ose3-master-example-com:8443
contexts:
- context:
    cluster: ose3-master-example-com:8443
    namespace: default
    user: system:openshift-master/ose3-master-example-com:8443
  name: default/ose3-master-example-com:8443/system:openshift-master
current-context: default/ose3-master-example-com:8443/system:openshift-master
kind: Config
preferences: {}
users:
- name: system:openshift-master/ose3-master-example-com:8443
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

Now, verify that it works.

[root@ose3-master ~]# oc get nodes --config /etc/origin/master/openshift-master.kubeconfig 
NAME                      STATUS    AGE       VERSION
ose3-master.example.com   Ready     50m       v1.7.6+a08f5eeb62
ose3-node1.example.com    Ready     22m       v1.7.6+a08f5eeb62
ose3-node2.example.com    Ready     50m       v1.7.6+a08f5eeb62

Comment 15 Scott Dodson 2018-10-24 12:53:50 UTC
*** Bug 1636238 has been marked as a duplicate of this bug. ***

Comment 16 Scott Dodson 2019-02-28 14:57:02 UTC
There appear to be no active cases related to this bug. As such we're closing this bug in order to focus on bugs that are still tied to active customer cases. Please re-open this bug if you feel it was closed in error or a new active case is attached.