Bug 1544713 - Incorrect results being returned by marker pagination logic in the nova-api
Summary: Incorrect results being returned by marker pagination logic in the nova-api
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-02-13 11:14 UTC by Punit Kundal
Modified: 2023-03-21 18:44 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-25 09:12:56 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-8812 0 None None None 2022-08-09 11:24:55 UTC

Comment 2 Matthew Booth 2018-02-16 13:36:39 UTC
Some notes on the above:

The relevant code here is compute.API.get_all(..., marker, ...). This functions returns:

        # The ordering of instances will be
        # [sorted instances with no host] + [sorted instances with host].
        # This means BuildRequest and cell0 instances first, then cell
        # instances

The separate searches (BuildRequest, cell0, cells) are done consecutively. We look for the marker in each. If it's found the subsequent queries run with marker=None, i.e. all results returned. This isn't quite what's documented, but BuildRequests should generally have created_at > anything in a cell, because they are still being built, and cell0 is a bit different. It's close enough.

Applying this to the scenarios above:

1. This is the baseline test, and is working as expected.

2. We would expect this to return nothing, as we've set the marker to the last result returned by (1). However, it appears to be returning 4 BuildRequests which are all newer than the marker, then subsequently no cell0 or cell instances. The mystery here is why it's returning the 4 BuildRequests, as the other 2 are as expected.

3. This working as expected. The marker given is the final BuildRequest returned by (2). Either:

* this is still a BuildRequest at the time of the query, in which case we expect there to be no BuildRequests older than this (hence zero returned), and subsequently return all cell0 and cell instances

* this is now an instance, in which case we will return no BuildRequests (marker not found), no cell0 instances (assuming no scheduling error), and all cell instances older than the marker, which is almost all of them.

The focus has to be on (2) above, as that's the only case which isn't working as expected. BuildRequestList.get_by_filters() should not find a matching marker, and therefore not return any results.

Comment 3 Matthew Booth 2018-02-16 15:16:59 UTC
Note: build request is created with an instance object which hasn't been persisted in the DB, and therefore may not have a created_at yet.

Comment 4 Matthew Booth 2018-02-20 15:48:23 UTC
I had a theory that when we create the instance in the db in compute.API._provision_instances() the instance might have been created with a NULL created_at, which might have interacted with the pagination code. I note that the latest pagination code allows for nullfirst and nulllast in the sort direction.

However, I just tested it and it seems it is present.

Comment 5 Matthew Booth 2018-02-20 15:52:56 UTC
My current hypothesis is that the issue is solely related to the cell query. This is because from code inspection I can't see how the BuildRequest query could return results if the marker is not present in that list, which it isn't. I can't currently see how the problem is occurring, though.

Could you please run the test again, and this time post DEBUG logs from the controllers. Also could you please include exact timestamps in your pagination query output so we can see exactly when the query was executed?

Comment 7 Michael Bayer 2018-02-20 16:38:08 UTC
If possible, if we can enable SQLAlchemy debug logging, which in nova should be via using the oslo.db parameter connection_debug to 100, e.g. nova.conf -> [database] -> connection_debug=100, then restart.   this will generate an enormous amount of logs, so I would advise running the reproduction case quickly, then stop the server again and remove the connection debug.   The purpose of this logging will be to illustrate not just the SQL being sent but also the results coming back.    we can grep that out and see exactly what SQL is emitted and what comes back which hopefully can isolate the issue.

Comment 13 Matthew Booth 2018-03-23 12:21:51 UTC
Punit,

Unfortunately this isn't quite what we're looking for. I see some sqlalchemy debug logs in nova-api.log, but unfortunately this is for the 'openstack show ...' command, not for the paginated query. Would it be possible to do this again, but this time to get the output from the pagination query?

Comment 20 Matthew Booth 2018-05-25 09:12:56 UTC
I have closed this bug as it has been waiting for more info for at least 2 weeks. We only do this to ensure that we don't accumulate stale bugs which can't be addressed. If you are able to provide the requested information, please feel free to re-open this bug.


Note You need to log in before you can comment on or make changes to this bug.