Bug 1925673

Summary: fence_vmware_rest: Handle error 507 or 500 for more than 4000 VMs in vSphere 7
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: fence-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED NOTABUG
QA Contact: cluster-qe <cluster-qe>
Severity: low
Docs Contact:
Priority: medium
Version: 8.3
CC: cluster-maint, mjuricek, sbradley
Target Milestone: rc
Keywords: Triaged
Target Release: 8.5
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-30 12:42:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1906502

Description Reid Wahl 2021-02-05 20:21:59 UTC
Description of problem:

In BZ1654058, we added handling for the case where the vCenter REST API throws a 400 error for more than 1000 VMs.

Today while looking for something else in the API doc, I noticed that both the limitation and the error code have changed.
  - Now, the limitation is 4000 VMs instead of 1000 VMs.
  - Now, the error code for "more than 4000 VMs" is either 507 or 500 (I'll post more details on the discrepancy below).
  - Now, error 400 only means "com.vmware.vapi.std.errors.invalid_argument : if the VM.FilterSpec.power-states field contains a value that is not supported by the server."


The API documentation at https://developer.vmware.com/docs/vsphere-automation/latest/vcenter/rest/vcenter/vm/get/ (which I accessed via https://developer.vmware.com/) makes no mention of which API version or vSphere version introduced this change. It only says "latest" for the API version in the URL. I see no changelog, no way to choose which API version to view docs for, and no indication of what the current latest version is.

Via a search engine, I found an alternative API documentation tree, which shows through comparison that the change was introduced in vSphere 7.0.

Here's the more extensive doc tree with version choices:
  - https://code.vmware.com/web/sdk/6.7/vsphere-automation-rest
  - https://code.vmware.com/web/sdk/7.0/vsphere-automation-rest


The info about the old vSphere 6.7 interface for vcenter/vm/list can be found here:
  - vSphere Automation API 6.7 (https://code.vmware.com/apis/366/vsphere-automation)
  - vSphere Automation API Reference 6.7 U1 (https://vmware.github.io/vsphere-automation-sdk-rest/6.7.1/operations/com/vmware/vcenter/vm.list-operation.html)
  - Note: Both of the above (6.7 and 6.7 U1) agree on the details of the vcenter/vm list operation.


There are two conflicting documents for the vSphere 7.0 API, and both are linked from the doc tree that I provided above.
  - vSphere Automation API 7.0 (https://code.vmware.com/apis/991/vsphere-automation)
    - This redirects to the developer.vmware.com documentation that I linked near the top.
    - This says that the limit is now 4000 VMs, and that the GET call for the list operation returns error **507** if there are more than 4000 VMs.
  - vSphere Automation API 7.0U1 (https://code.vmware.com/apis/1119/vsphere-automation)
    - This says that the limit is now 4000 VMs, and that the GET call for the list operation returns error **500** if there are more than 4000 VMs.


The fence_vmware_rest fence agent uses the following exception handling logic:
~~~
        try:
                command = "vcenter/vm"
                if "--filter" in options:
                        command = command + "?" + options["--filter"]
                res = send_command(conn, command)
        except Exception as e:
                logging.debug("Failed: {}".format(e))
                if str(e).startswith("400"):
                        if options.get("--original-action") == "monitor":
                                return outlets
                        else:
                                logging.error("More than 1000 VMs returned. Use --filter parameter to limit which VMs to list.")
                                fail(EC_STATUS)
                else:
                        fail(EC_STATUS)
~~~

So there are at least two changes we need to make.
  - We need to look for 507 or 500 as the "too many VMs" error code to avoid failing with a generic EC_STATUS.
  - We need to modify the error message to say "More than 4000 VMs".
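
As a rough illustration of those two changes, here is a minimal standalone sketch. It assumes, as the existing `str(e).startswith("400")` check implies, that the exception text begins with the HTTP status code. The names `TOO_MANY_VMS_CODES`, `too_many_vms_limit` and `list_error_message` are hypothetical and do not exist in fence_vmware_rest.

```python
# Hypothetical helpers; not part of fence_vmware_rest.
# Maps an HTTP status code to the VM limit that the API docs associate
# with a "too many VMs" failure of the vcenter/vm list operation.
TOO_MANY_VMS_CODES = {
    "400": 1000,  # vSphere 6.7 (400 is also used for invalid_argument there)
    "500": 4000,  # vSphere 7.0 U1 docs
    "507": 4000,  # vSphere 7.0 docs at the time of filing
}


def too_many_vms_limit(error_text):
    """Return the documented VM limit if error_text starts with a
    status code listed above, else None."""
    code = error_text.strip()[:3]
    return TOO_MANY_VMS_CODES.get(code)


def list_error_message(error_text):
    """Build the log message for a 'too many VMs' error, or return None
    if the status code is not one of the codes listed above."""
    limit = too_many_vms_limit(error_text)
    if limit is None:
        return None
    return ("More than {} VMs returned. "
            "Use --filter parameter to limit which VMs to list.".format(limit))
```

Note that 500 is also the generic "internal server error" status, so treating every 500 as "too many VMs" could mislabel unrelated server failures. A real change would probably want to look at the response body as well.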


A related problem that you can see in the vSphere 6.7 API doc that I linked above: Error code 400 can mean two different things.

400 	invalid_argument 	if the vcenter.VM.filter_spec.power_states field contains a value that is not supported by the server.
400 	unable_to_allocate_resource 	if more than 1000 virtual machines match the vcenter.VM.filter_spec.

So we don't actually know which one is meant just by checking the error code. On vSphere 7.0, code 507 or 500 is used for "too many VMs", so we have distinct error codes.
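
One way to tell the two 400 cases apart would be to look at the error type in the response body rather than only the status code. The sketch below assumes the body is JSON with a top-level "type" field holding a name such as "com.vmware.vapi.std.errors.invalid_argument". That body shape is an assumption for illustration, and `error_kind` is a hypothetical helper, not something the agent provides.

```python
import json


def error_kind(body_text):
    """Return the short error name (e.g. 'invalid_argument') from a JSON
    error body, or None if the body does not have the assumed shape."""
    try:
        body = json.loads(body_text)
    except ValueError:
        # Not JSON at all (json.JSONDecodeError is a ValueError subclass).
        return None
    if not isinstance(body, dict):
        return None
    err_type = body.get("type")
    if not isinstance(err_type, str):
        return None
    # Keep only the last dotted component of the fully qualified name.
    return err_type.rsplit(".", 1)[-1]
```

With something like this, a 400 whose kind is "unable_to_allocate_resource" could be reported as "too many VMs", while "invalid_argument" could be reported as a bad filter value.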

-----

Version-Release number of selected component (if applicable):

fence-agents-vmware-rest-4.2.1-53.el8_3.1

-----

How reproducible:

Presumably always

-----

Steps to Reproduce:
1. Set up more than 4000 VMs on a vCenter running version 7.0 or greater.
2. Run `fence_vmware_rest <options> -o list`.

-----

Actual results:

Based on the current agent code, the agent will fail with only "Unable to connect/login to fencing device".

-----

Expected results:

The agent fails with "More than 4000 VMs returned. Use --filter parameter to limit which VMs to list" in addition to the "Unable to connect/login to fencing device" error.

-----

Additional info:

We should coordinate with VMware here to verify which of the 500 vs. 507 error codes is correct, or when the change from one to the other was introduced.

There is technically no impact, as this only affects logging. However, if the agent interprets error codes incorrectly because their meanings have changed, it can make troubleshooting more difficult.

Comment 7 Reid Wahl 2021-11-19 19:34:13 UTC
It's been reported that the error code 507 for too many VMs is not an issue in current testing. I looked into that, and here's what I found.

In comment 0 I said:

>   - vSphere Automation API 7.0 (https://code.vmware.com/apis/991/vsphere-automation)
>     - This redirects to the developer.vmware.com documentation that I linked near the top.
>     - This says that the limit is now 4000 VMs, and that the GET call for the list operation returns error **507** if there are more than 4000 VMs.


If you go to that URL now and follow it to the vcenter/vm GET endpoint page, the page now says error code **500** for too many VMs. The other (direct) link I provided, https://developer.vmware.com/docs/vsphere-automation/latest/vcenter/rest/vcenter/vm/get/, also says error code **500** for too many VMs. So the documentation has changed between the day I filed this bug and today. It looks as if VMware has reverted the error code from **507** to **500** since then.

-----

I want to point something else out: fence_vmware_rest connects to "{api_host}/rest". This is deprecated as of v7.0 U2. The new REST APIs are served under "{api_host}/api".

VMware says: "There is no immediate impact out of this change as the old REST APIs will continue to work.  We are not removing the old REST APIs or their support. We intend to remove it only after 2 major vSphere releases, also subject to customer feedback."
  - https://core.vmware.com/blog/vsphere-7-update-2-rest-api-modernization

So that's good news for us. However, it is something we need to be aware of. At some point we'll want to update fence_vmware_rest to make "/api" the default api_path; users on legacy vSphere deployments can still override this by setting api_path="/rest". **Eventually**, using "/rest" may break.
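
To make the default-versus-override idea concrete, here is a small sketch of how a base URL could be assembled with "/api" as the default and "/rest" as an explicit override. `build_base_url` and its parameters are hypothetical names for illustration and are not taken from the agent's code. The two API trees may differ in more than the path prefix, so this shows only the override mechanism.

```python
def build_base_url(host, api_path="/api", use_ssl=True, port=None):
    """Join scheme, host, optional port and API path into a base URL.

    api_path defaults to "/api"; pass "/rest" for legacy deployments.
    """
    scheme = "https" if use_ssl else "http"
    netloc = host if port is None else "{}:{}".format(host, port)
    # Normalize so "rest", "/rest" and "/rest/" all yield "/rest".
    path = "/" + api_path.strip("/")
    return "{}://{}{}".format(scheme, netloc, path)
```

For example, `build_base_url("vcenter.example.com", api_path="/rest")` would keep a legacy deployment on the old tree while everything else picks up the new default.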

We could either change our default in RHEL 9 (or 10, since we've probably got a while) or just wait until we're working on officially supporting vSphere 9.

Either way, that's a separate BZ.

Comment 8 Oyvind Albrigtsen 2021-11-22 10:17:39 UTC
Nice to know.

Do you know how far back support for /api is available? If it's only 7.0+

Comment 9 Reid Wahl 2021-11-22 19:19:40 UTC
(In reply to Oyvind Albrigtsen from comment #8)
> Do you know how far back support for /api is available? If it's only 7.0+

All I can do is read the VMware announcement that's linked above:

    All REST APIs from 6.0 to 6.7 were served under /rest and referred to as old REST APIs.
    Starting from vSphere 7, REST APIs are served under /api and referred to as new REST APIs.

    With the release of vSphere 7 Update 2, VMware announces the deprecation of old REST APIs.


"/api" didn't appear in the docs I was looking through until 7.0 Update 2.

Comment 10 Oyvind Albrigtsen 2021-11-30 12:42:24 UTC
Closing, since the 500/507 issue has been fixed in the API.