Description of problem:

Fails to reconcile due to an unexpected apiservices object. This object does not show up when we run "oc get apiservices".

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:
100% for the customer, unconfirmed in test

Actual results:

TASK [openshift_control_plane : Check for apiservices/v1beta1.metrics.k8s.io registration] ************************************************************************************************************************
FAILED - RETRYING: Check for apiservices/v1beta1.metrics.k8s.io registration (30 retries left). Result was:
{
    "attempts": 1,
    "changed": true,
    "cmd": [
        "oc",
        "get",
        "apiservices/v1beta1.metrics.k8s.io"
    ],
    . . .
    "msg": "non-zero return code",
    "rc": 1,
    "retries": 31,
    "start": "2018-12-05 23:41:57.832274",
    "stderr": "Error from server (NotFound): apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\" not found",
    "stderr_lines": [
        "Error from server (NotFound): apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\" not found"
    ],
    "stdout": "",
    "stdout_lines": []
}
. . .
},
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-12-05 23:44:47.672341",
    "stderr": "error: the server doesn't have a resource type \"apiservices\"",
    "stderr_lines": [
        "error: the server doesn't have a resource type \"apiservices\""
    ],
    "stdout": "",
    "stdout_lines": []
}
. . .

Failure summary:

1. Hosts:   rbsr012.example.com
   Play:    Reconcile Cluster Roles and Cluster Role Bindings and Security Context Constraints
   Task:    Check for apiservices/v1beta1.metrics.k8s.io registration
   Message: non-zero return code

2. Hosts:   localhost
   Play:    Gate on reconcile
   Task:    fail
   Message: Upgrade cannot continue. The following masters did not finish reconciling: rbsr012.example.com

The customer had a deployment for kube-state-metrics in the kube-system namespace, which we thought might be the issue. We removed that deployment and its service, but the issue persists.
Additionally, we manually reconciled the roles and migrated storage, and that worked fine, but the issue continued in the upgrade script.
(In reply to Steven Walter from comment #1)
> Additionally we manually reconciled roles and migrated storage and it worked
> fine. But issue continued in upgrade script

Did you do this on the same host and as the same user where the task failed, or on a different host?
Can you please provide me with the output of:

oc get apiservices --loglevel=8
(In reply to Maciej Szulik from comment #6)
> Can you please provide me with: oc get apiservices --loglevel=8

Please find the output in case 02258260.
After some experiments in a 3.11 cluster, I see that apiservices/v1beta1.metrics.k8s.io should only be registered if metrics-server is installed, i.e. if openshift_metrics_server_install=true in the inventory. I still have to confirm this in a 3.10 cluster. But if it holds, the solution would be to make this check (https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml#L27) dependent on whether `openshift_metrics_server_install` is "true".
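A minimal sketch of what that gating could look like. The task name matches the one in the failure output; the command body, `failed_when`, and the `default(false)` fallback for hosts that leave the variable unset are assumptions, not the actual contents of check_master_api_is_ready.yml:

```yaml
# Sketch only: skip the metrics apiservice check unless the metrics
# server was requested in the inventory. The default(false) fallback
# is an assumption for inventories that never set the variable.
- name: Check for apiservices/v1beta1.metrics.k8s.io registration
  command: oc get apiservices/v1beta1.metrics.k8s.io
  register: metrics_service_registration
  failed_when: false
  when: openshift_metrics_server_install | default(false) | bool
```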
For further investigation we need apiserver logs from the master that failed during the upgrade. If possible, the higher the loglevel the better. It looks like there's some problem with standing up APIServices (which are part of aggregation) on every API server.
Yes, comment 10 looks correct. If the metrics server is not present, then you should not check for the apiservice or perform the subsequent `get raw`. The check for the service catalog may need to be similarly gated. It is unusual to get the "error: the server doesn't have a resource type \"apiservices\"" message. If that is persistent from a particular server, it may warrant a separate investigation. The type is a standard type in 1.10. Sometimes a discovery cache is out of date and it self-corrects. If it is persistent, it's worth looking into.
> Yes, comment 10 looks correct. If the metrics server is not present, then
> you should not check for the apiservice or the subsequent `get raw`. The
> check for service catalog may need to be similarly gated.
> It is unusual to get the "error: the server doesn't have a resource type
> \"apiservices\" message. If that is persistent from a particular server, it
> may warrant a separate investigation. The type is a standard type in 1.10.
> Sometimes a discovery cache is out of date and it self-corrects. If it is
> persistent, it's worth looking into.

I've talked with Scott about that issue. The ansible installer is written so that it should respond properly when that bit is not installed: subsequent raw requests will not be invoked if the previous get failed. See [1] for details.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.10/roles/openshift_control_plane/tasks/check_master_api_is_ready.yml#L36-L43
Is this issue resolved?
I tried a 3.9 ~ 3.10 upgrade; the 3.9 env was launched with openshift_metrics_install_metrics=true, as the above comments reported.

It reproduced the error:
"Error from server (NotFound): apiservices.apiregistration.k8s.io \"v1beta1.metrics.k8s.io\" not found"

But it did NOT reproduce the error:
"stderr": "error: the server doesn't have a resource type \"apiservices\""

Though I hit the above NotFound error, the upgrade result was SUCCESS, different from the bug report. The version in my try was openshift-ansible-3.10.83, which was also mentioned in https://github.com/openshift/openshift-ansible/issues/10784 .
I ran into this problem in a 3 master environment, and 2 of the masters failed with "error: the server doesn't have a resource type \"apiservices\"". It turned out someone had run "oc login" on these masters with their user id instead of system:admin, and their login token had expired. Doing an "oc get apiservices/v1beta1.metrics.k8s.io" on these 2 masters gave the not found error for apiservices. I had to "oc login system:admin" on these masters to clear things up, and the install was then able to get past this.
(In reply to Daniel Caros from comment #51) https://github.com/openshift/openshift-ansible/pull/10899 would probably address that. It's in openshift-ansible-3.10.90-1 and later.
(In reply to Daniel Caros from comment #51)
> I ran into this problem in a 3 master environment and 2 of the masters
> failed with "error: the server doesn't have a resource type
> \"apiservices\"". Turned out someone had "oc login" on these masters with
> their user id instead of system:admin and their login token had expired.
> Doing a "oc get apiservices/v1beta1.metrics.k8s.io" on these 2 masters gave
> the the not found error for apiservices. I had to "oc login system:admin" on
> these masters to clear things up and the install was able to get past this.

Indeed, my customer faced the same issue, Daniel. Before upgrading, I asked them to run and check:

ansible masters -a 'oc whoami'

If any result is different from system:admin, then they should run:

ansible masters -a 'oc login -u system:admin'
Under the assumption that this was root caused to an ansible_user with a bad kubeconfig, and that it's resolved by the referenced pull request[1], I'm moving this back to the installer component and setting it ON_QA. The pull request is only relevant to OCP 3.10.x, as the 3.11 codebase specifies a path to the admin kubeconfig in all tasks.

The QE test case would be to `oc login` as root to change to a user who cannot query for those APIs, then confirm that upgrades fail on this task, then retest with openshift-ansible-3.10.90-1 and later.

[1] https://github.com/openshift/openshift-ansible/pull/10899
I also have a customer where the upgrade failed because of this error. In our case the cluster is configured without metrics. The expectation would be that this step is skipped altogether.
Fixed. With openshift-ansible-3.10.101-1.git.0.5f32198.el7.noarch.rpm, upgrades with masters logged in as the system:admin user and as another user did not hit those errors.
There is also an issue with the "Check for apiservices/v1beta1.servicecatalog.k8s.io registration" and "Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered" tasks if the Service Catalog is disabled.

Adding a check for the 'openshift_enable_service_catalog' host inventory variable ('when: openshift_enable_service_catalog | bool') to the "Check for apiservices/v1beta1.servicecatalog.k8s.io registration" task, and adding that check to the 'when' attribute ('when: openshift_enable_service_catalog | bool and servicecatalog_service_registration.rc == 0') of the "Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered" task, in the 'check_master_api_is_ready.yml' playbook, resolves this issue for both the 'openshift-ansible-3.10.89-1' and 'openshift-ansible-3.11.59-1' openshift-ansible rpms (verified with the 'v3.10.89-2' and 'v3.11.59-2' ose-ansible containers).
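A sketch of the gating described above. Only the `when:` wiring reflects the proposed change; the task names and the `servicecatalog_service_registration` variable are taken from this thread, while the command bodies, retry counts, and `failed_when` handling are assumptions about what check_master_api_is_ready.yml roughly contains:

```yaml
# Sketch only: gate both service-catalog readiness tasks on the
# inventory flag so they are skipped when the catalog is disabled.
- name: Check for apiservices/v1beta1.servicecatalog.k8s.io registration
  command: oc get apiservices/v1beta1.servicecatalog.k8s.io
  register: servicecatalog_service_registration
  failed_when: false
  when: openshift_enable_service_catalog | bool

# Only poll the aggregated API if the apiservice was actually found.
- name: Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered
  command: oc get --raw /apis/servicecatalog.k8s.io/v1beta1
  register: servicecatalog_api
  until: servicecatalog_api.rc == 0
  retries: 30
  delay: 10
  when: openshift_enable_service_catalog | bool and servicecatalog_service_registration.rc == 0
```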
Whether or not metrics or the service catalog are installed should not matter. The only thing that matters is that the commands run as a user who has access to query for their existence, and that is what the pull request in comment #54 addresses.
openshift-ansible-3.10.101-1.git.0.5f32198.el7, which shipped in RHBA-2019:0206 a week ago, included this change.