Bug 1656645
Summary: | Upgrade fails on Check for apiservices/v1beta1.metrics.k8s.io registration
Product: | OpenShift Container Platform | Reporter: | Steven Walter <stwalter> |
Component: | Installer | Assignee: | aos-install |
Installer sub component: | openshift-ansible | QA Contact: | Gaoyun Pei <gpei> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | ||
Priority: | urgent | CC: | agk, aos-bugs, bart.joris, dcaros, deads, dwatson, faltahe, gpei, jack.ottofaro, jkaur, jlee, jmalde, jokerman, jrosenta, ksalunkh, maszulik, mmccomas, openshift-bugs-escalate, palonsor, rpuccini, rspazzol, rsunog, sburke, scuppett, sdodson, snalawad, stwalter, travi |
Version: | 3.10.0 | Keywords: | Reopened |
Target Milestone: | --- | ||
Target Release: | 3.10.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
The tasks that verify that relevant API services have returned to service used the default kubeconfig, which may have been updated by the admin to use a user which doesn't have the requisite permissions to verify those APIs. The tasks have been updated to use the admin kubeconfig in all situations, avoiding this problem.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-02-11 21:30:04 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Steven Walter
2018-12-05 23:10:23 UTC
Additionally, we manually reconciled roles and migrated storage and it worked fine. But the issue continued in the upgrade script.

(In reply to Steven Walter from comment #1)
> Additionally, we manually reconciled roles and migrated storage and it worked fine. But the issue continued in the upgrade script.

Did you do this on the same host and user where the task failed, or on a different host?

Can you please provide me with:

oc get apiservices --loglevel=8

(In reply to Maciej Szulik from comment #6)
> Can you please provide me with: oc get apiservices --loglevel=8

Please find the output in case 02258260.

After some experiments in a 3.11 cluster, I see that apiservices/v1beta1.metrics.k8s.io should only be registered if metrics-server is installed, i.e. if openshift_metrics_server_install=true in the inventory. I have to confirm it in a 3.10 cluster. But if this is true, the solution would be to make this check (https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml#L27) dependent on whether `openshift_metrics_server_install` is "true".

For further investigation we need apiserver logs from the master which failed during the upgrade; if possible, the higher the loglevel the better. It looks like there's some problem with standing up APIServices (which is part of aggregation) on every API server.

Yes, comment 10 looks correct. If the metrics server is not present, then you should not check for the apiservice or the subsequent `get raw`. The check for service catalog may need to be similarly gated.

It is unusual to get the "error: the server doesn't have a resource type \"apiservices\"" message. If that is persistent from a particular server, it may warrant a separate investigation. The type is a standard type in 1.10. Sometimes a discovery cache is out of date and it self-corrects. If it is persistent, it's worth looking into.

> Yes, comment 10 looks correct.
> If the metrics server is not present, then you should not check for the apiservice or the subsequent `get raw`. [...]

I've talked with Scott about that issue; the ansible installer is written so that it should respond properly when that bit is not installed. Subsequent raw requests will not be invoked if the previous get failed. See [1] for details.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml#L36-L43

Is this issue resolved? I tried a 3.9 to 3.10 upgrade; the 3.9 env was launched with openshift_metrics_install_metrics=true as the above comments reported. It reproduced the error: "Error from server (NotFound): apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\" not found". But it did NOT reproduce the error: "stderr": "error: the server doesn't have a resource type \"apiservices\"". Though I met the NotFound error above, the upgrade result was SUCCESS, different from the bug report. The version in my try was openshift-ansible-3.10.83, which was also mentioned in https://github.com/openshift/openshift-ansible/issues/10784.

I ran into this problem in a 3-master environment, and 2 of the masters failed with "error: the server doesn't have a resource type \"apiservices\"". It turned out someone had run "oc login" on these masters with their own user id instead of system:admin, and their login token had expired. Doing "oc get apiservices/v1beta1.metrics.k8s.io" on these 2 masters gave the not found error for apiservices.
I had to "oc login system:admin" on these masters to clear things up, and the install was able to get past this.

(In reply to Daniel Caros from comment #51)

https://github.com/openshift/openshift-ansible/pull/10899 would probably address that. It's in openshift-ansible-3.10.90-1 and later.

(In reply to Daniel Caros from comment #51)
> I ran into this problem in a 3-master environment, and 2 of the masters failed with "error: the server doesn't have a resource type \"apiservices\"". [...]

Indeed, my customer faced the same, Daniel. Before upgrading, I asked them to run and check:

ansible masters -a 'oc whoami'

If a result is different from system:admin, then they should run:

ansible masters -a 'oc login -u system:admin'

Under the assumption that this was root-caused to an ansible_user with a bad kubeconfig and that it's resolved by the referenced pull request [1], I'm moving this back to the installer component and setting it ON_QA. The pull request is only relevant to OCP 3.10.x, as the 3.11 codebase specifies a path to the admin kubeconfig in all tasks. The QE test case would be to `oc login` as root to change to a user who cannot query for those APIs, confirm that upgrades fail on this task, then retest with openshift-ansible-3.10.90-1 and later.

[1] https://github.com/openshift/openshift-ansible/pull/10899

I also have a customer where the upgrade failed because of this error. In our case the cluster is configured without metrics. The expectation would be that this step is skipped altogether.

Fixed.
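The root cause and fix described above come down to which kubeconfig the check tasks use: a stale or non-admin `oc login` context on a master makes the apiservice queries fail. A minimal diagnostic sketch, assuming the default OCP 3.x admin kubeconfig path `/etc/origin/master/admin.kubeconfig` (the path used here is the stock location; adjust if your install differs):

```
# Show which user the default kubeconfig currently authenticates as;
# an expired or non-admin login here is what tripped the upgrade check.
oc whoami

# Querying with the admin kubeconfig explicitly sidesteps any stale login,
# which is effectively what the fixed tasks do:
oc --config=/etc/origin/master/admin.kubeconfig get apiservices v1beta1.metrics.k8s.io
```

If the first command reports anything other than system:admin on a master, the pre-3.10.90 check tasks would run with that user's (possibly expired) credentials.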
With openshift-ansible-3.10.101-1.git.0.5f32198.el7.noarch.rpm, upgrades with masters logged in as the system:admin user and as another user did not hit those errors.

There is also an issue with the "Check for apiservices/v1beta1.servicecatalog.k8s.io registration" and "Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered" tasks if the Service Catalog is disabled. Adding a check for the 'openshift_enable_service_catalog' hosts inventory variable ('when: openshift_enable_service_catalog | bool') to the "Check for apiservices/v1beta1.servicecatalog.k8s.io registration" task, and adding that check to the 'when' attribute ('when: openshift_enable_service_catalog | bool and servicecatalog_service_registration.rc == 0') of the "Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered" task in the 'check_master_api_is_ready.yml' playbook, resolves this issue for both the 'openshift-ansible-3.10.89-1' and 'openshift-ansible-3.11.59-1' openshift-ansible rpms (verified with the 'v3.10.89-2' and 'v3.11.59-2' ose-ansible containers).

Whether or not metrics or service catalog are installed should not matter. The only thing that matters is that the commands run as a user who has access to query for their existence, and that's what the pull request in comment #54 addresses. openshift-ansible-3.10.101-1.git.0.5f32198.el7, which shipped in RHBA-2019:0206 a week ago, included this change.
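For reference, the service-catalog gating proposed earlier in the thread could be sketched as below in check_master_api_is_ready.yml. Only the `when:` expressions are quoted from the comment; the task bodies are abbreviated, hypothetical reconstructions (and note that the shipped fix took the other approach of forcing the admin kubeconfig rather than gating the checks):

```yaml
# Hypothetical sketch: skip the service-catalog checks entirely when the
# catalog is disabled in the inventory, as suggested in the thread.
- name: Check for apiservices/v1beta1.servicecatalog.k8s.io registration
  command: oc get apiservices v1beta1.servicecatalog.k8s.io
  register: servicecatalog_service_registration
  failed_when: false
  when: openshift_enable_service_catalog | bool

- name: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered
  command: oc get --raw /apis/servicecatalog.k8s.io/v1beta1
  register: servicecatalog_api_available
  until: servicecatalog_api_available.rc == 0
  retries: 30
  delay: 10
  when: openshift_enable_service_catalog | bool and servicecatalog_service_registration.rc == 0
```

Because Jinja's `and` short-circuits, the second condition never evaluates `servicecatalog_service_registration.rc` when the catalog is disabled, so the skipped-register case is safe.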