Some notes on the above: The relevant code here is compute.API.get_all(..., marker, ...). This functions returns: # The ordering of instances will be # [sorted instances with no host] + [sorted instances with host]. # This means BuildRequest and cell0 instances first, then cell # instances The separate searches (BuildRequest, cell0, cells) are done consecutively. We look for the marker in each. If it's found the subsequent queries run with marker=None, i.e. all results returned. This isn't quite what's documented, but BuildRequests should generally have created_at > anything in a cell, because they are still being built, and cell0 is a bit different. It's close enough. Applying this to the scenarios above: 1. This is the baseline test, and is working as expected. 2. We would expect this to return nothing, as we've set the marker to the last result returned by (1). However, it appears to be returning 4 BuildRequests which are all newer than the marker, then subsequently no cell0 or cell instances. The mystery here is why it's returning the 4 BuildRequests, as the other 2 are as expected. 3. This working as expected. The marker given is the final BuildRequest returned by (2). Either: * this is still a BuildRequest at the time of the query, in which case we expect there to be no BuildRequests older than this (hence zero returned), and subsequently return all cell0 and cell instances * this is now an instance, in which case we will return no BuildRequests (marker not found), no cell0 instances (assuming no scheduling error), and all cell instances older than the marker, which is almost all of them. The focus has to be on (2) above, as that's the only case which isn't working as expected. BuildRequestList.get_by_filters() should not find a matching marker, and therefore not return any results.
Note: build request is created with an instance object which hasn't been persisted in the DB, and therefore may not have a created_at yet.
I had a theory that when we create the instance in the db in compute.API._provision_instances() the instance might have been created with a NULL created_at, which might have interacted with the pagination code. I note that the latest pagination code allows for nullfirst and nulllast in the sort direction. However, I just tested it and it seems it is present.
My current hypothesis is that the issue is solely related to the cell query. This is because from code inspection I can't see how the BuildRequest query could return results if the marker is not present in that list, which it isn't. I can't currently see how the problem is occurring, though. Could you please run the test again, and this time post DEBUG logs from the controllers. Also could you please include exact timestamps in your pagination query output so we can see exactly when the query was executed?
If possible, if we can enable SQLAlchemy debug logging, which in nova should be via using the oslo.db parameter connection_debug to 100, e.g. nova.conf -> [database] -> connection_debug=100, then restart. this will generate an enormous amount of logs, so I would advise running the reproduction case quickly, then stop the server again and remove the connection debug. The purpose of this logging will be to illustrate not just the SQL being sent but also the results coming back. we can grep that out and see exactly what SQL is emitted and what comes back which hopefully can isolate the issue.
Punit, Unfortunately this isn't quite what we're looking for. I see some sqlalchemy debug logs in nova-api.log, but unfortunately this is for the 'openstack show ...' command, not for the paginated query. Would it be possible to do this again, but this time to get the output from the pagination query?
I have closed this bug as it has been waiting for more info for at least 2 weeks. We only do this to ensure that we don't accumulate stale bugs which can't be addressed. If you are able to provide the requested information, please feel free to re-open this bug.