Bug 1656645
Summary: | Upgrade fails on Check for apiservices/v1beta1.metrics.k8s.io registration
Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
Component: | Installer | Assignee: | aos-install |
Installer sub component: | openshift-ansible | QA Contact: | Gaoyun Pei <gpei> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | agk, aos-bugs, bart.joris, dcaros, deads, dwatson, faltahe, gpei, jack.ottofaro, jkaur, jlee, jmalde, jokerman, jrosenta, ksalunkh, maszulik, mmccomas, openshift-bugs-escalate, palonsor, rpuccini, rspazzol, rsunog, sburke, scuppett, sdodson, snalawad, stwalter, travi |
Version: | 3.10.0 | Keywords: | Reopened |
Target Milestone: | --- | ||
Target Release: | 3.10.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
The tasks that verify that relevant API services have returned to service used the default kubeconfig, which may have been updated by the admin to use a user which doesn't have the requisite permissions to verify those APIs. The tasks have been updated to use the admin kubeconfig in all situations, avoiding this problem.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-02-11 21:30:04 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Steven Walter
2018-12-05 23:10:23 UTC
Additionally, we manually reconciled roles and migrated storage and it worked fine. But the issue continued in the upgrade script.

(In reply to Steven Walter from comment #1)
> Additionally, we manually reconciled roles and migrated storage and it worked fine. But the issue continued in the upgrade script.

Did you do this on the same host and user where the task failed, or on a different host?

Can you please provide me with:

oc get apiservices --loglevel=8

(In reply to Maciej Szulik from comment #6)
> Can you please provide me with: oc get apiservices --loglevel=8

Please find the output in case 02258260.

After some experiments in a 3.11 cluster, I see that apiservices/v1beta1.metrics.k8s.io should only be registered if metrics-server is installed, i.e. if openshift_metrics_server_install=true in the inventory. I have to confirm it in a 3.10 cluster. But if this is true, the solution would be to make this check (https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml#L27) dependent on whether `openshift_metrics_server_install` is "true".

For further investigation we need apiserver logs from the master which failed during the upgrade; if possible, the higher the loglevel the better. It looks like there's some problem with standing up APIServices (which is part of aggregation) on every API server.

Yes, comment 10 looks correct. If the metrics server is not present, then you should not check for the apiservice or the subsequent `get raw`. The check for service catalog may need to be similarly gated.

It is unusual to get the "error: the server doesn't have a resource type \"apiservices\"" message. If that is persistent from a particular server, it may warrant a separate investigation. The type is a standard type in 1.10. Sometimes a discovery cache is out of date and it self-corrects. If it is persistent, it's worth looking into.

> Yes, comment 10 looks correct.
> If the metrics server is not present, then you should not check for the apiservice or the subsequent `get raw`. [...]

I've talked with Scott about that issue; the ansible installer is written so that it should respond properly when that bit is not installed. Subsequent raw requests will not be invoked if the previous get failed. See [1] for details.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml#L36-L43

Is this issue resolved? I tried a 3.9 to 3.10 upgrade; the 3.9 env was launched with openshift_metrics_install_metrics=true as the above comments reported. It reproduced the error: "Error from server (NotFound): apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\" not found". But it did NOT reproduce the error: "stderr": "error: the server doesn't have a resource type \"apiservices\"". Though I met the NotFound error above, the upgrade result was SUCCESS, different from the bug report. The version in my try was openshift-ansible-3.10.83, which was also mentioned in https://github.com/openshift/openshift-ansible/issues/10784.

I ran into this problem in a 3-master environment, and 2 of the masters failed with "error: the server doesn't have a resource type \"apiservices\"". It turned out someone had run "oc login" on these masters with their own user id instead of system:admin, and their login token had expired. Doing "oc get apiservices/v1beta1.metrics.k8s.io" on these 2 masters gave the not found error for apiservices.
I had to "oc login system:admin" on these masters to clear things up, and the install was able to get past this.

(In reply to Daniel Caros from comment #51)

https://github.com/openshift/openshift-ansible/pull/10899 would probably address that. It's in openshift-ansible-3.10.90-1 and later.

(In reply to Daniel Caros from comment #51)
> I ran into this problem in a 3-master environment, and 2 of the masters failed with "error: the server doesn't have a resource type \"apiservices\"". [...]

Indeed, my customer faced the same, Daniel. Before upgrading, I asked them to run and check:

ansible masters -a 'oc whoami'

If a result is different from system:admin, then they should run:

ansible masters -a 'oc login -u system:admin'

Under the assumption that this was root-caused to an ansible_user with a bad kubeconfig and that it's resolved by the referenced pull request [1], I'm moving this back to the installer component and setting it ON_QA. The pull request is only relevant to OCP 3.10.x, as the 3.11 codebase specifies a path to the admin kubeconfig in all tasks. The QE test case would be to `oc login` as root to change to a user who cannot query for those APIs, confirm that upgrades fail on this task, then retest with openshift-ansible-3.10.90-1 and later.

[1] https://github.com/openshift/openshift-ansible/pull/10899

I also have a customer where the upgrade failed because of this error. In our case the cluster is configured without metrics. The expectation would be that this step is skipped altogether.

Fixed.
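The root cause and fix described above come down to which kubeconfig the check tasks use: a stale or non-admin `oc login` context on a master makes the apiservice queries fail. A minimal diagnostic sketch, assuming the default OCP 3.x admin kubeconfig path `/etc/origin/master/admin.kubeconfig` (the path used here is the stock location; adjust if your install differs):

```
# Show which user the default kubeconfig currently authenticates as;
# an expired or non-admin login here is what tripped the upgrade check.
oc whoami

# Querying with the admin kubeconfig explicitly sidesteps any stale login,
# which is effectively what the fixed tasks do:
oc --config=/etc/origin/master/admin.kubeconfig get apiservices v1beta1.metrics.k8s.io
```

If the first command reports anything other than system:admin on a master, the pre-3.10.90 check tasks would run with that user's (possibly expired) credentials.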
With openshift-ansible-3.10.101-1.git.0.5f32198.el7.noarch.rpm, upgrades with masters logged in as the system:admin user and as another user did not hit those errors.

There is also an issue with the "Check for apiservices/v1beta1.servicecatalog.k8s.io registration" and "Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered" tasks if the Service Catalog is disabled. Adding a check for the 'openshift_enable_service_catalog' hosts inventory variable ('when: openshift_enable_service_catalog | bool') to the "Check for apiservices/v1beta1.servicecatalog.k8s.io registration" task, and adding that check to the 'when' attribute ('when: openshift_enable_service_catalog | bool and servicecatalog_service_registration.rc == 0') of the "Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered" task in the 'check_master_api_is_ready.yml' playbook, resolves this issue for both the 'openshift-ansible-3.10.89-1' and 'openshift-ansible-3.11.59-1' openshift-ansible rpms (verified with the 'v3.10.89-2' and 'v3.11.59-2' ose-ansible containers).

Whether or not metrics or service catalog are installed should not matter. The only thing that matters is that the commands run as a user who has access to query for their existence, and that's what the pull request in comment #54 addresses. openshift-ansible-3.10.101-1.git.0.5f32198.el7, which shipped in RHBA-2019:0206 a week ago, included this change.
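For reference, the service-catalog gating proposed earlier in the thread could be sketched as below in check_master_api_is_ready.yml. Only the `when:` expressions are quoted from the comment; the task bodies are abbreviated, hypothetical reconstructions (and note that the shipped fix took the other approach of forcing the admin kubeconfig rather than gating the checks):

```yaml
# Hypothetical sketch: skip the service-catalog checks entirely when the
# catalog is disabled in the inventory, as suggested in the thread.
- name: Check for apiservices/v1beta1.servicecatalog.k8s.io registration
  command: oc get apiservices v1beta1.servicecatalog.k8s.io
  register: servicecatalog_service_registration
  failed_when: false
  when: openshift_enable_service_catalog | bool

- name: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered
  command: oc get --raw /apis/servicecatalog.k8s.io/v1beta1
  register: servicecatalog_api_available
  until: servicecatalog_api_available.rc == 0
  retries: 30
  delay: 10
  when: openshift_enable_service_catalog | bool and servicecatalog_service_registration.rc == 0
```

Because Jinja's `and` short-circuits, the second condition never evaluates `servicecatalog_service_registration.rc` when the catalog is disabled, so the skipped-register case is safe.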