Bug 1575753 - nova sends multiple request to bind port on different hosts for same VM instance [NEEDINFO]
Summary: nova sends multiple request to bind port on different hosts for same VM instance
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: nova-maint
QA Contact: nova-maint
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-07 19:47 UTC by bigswitch
Modified: 2019-09-09 14:38 UTC (History)
13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-31 11:02:19 UTC
Target Upstream Version:
mwitt: needinfo? (rhosp-bugs-internal)


Attachments
complete neutron-server.log (8.98 MB, text/plain)
2018-05-07 19:47 UTC, bigswitch
events for the port in question on 3 different threads with timestamps (108.04 KB, image/jpeg)
2018-05-07 19:54 UTC, bigswitch
heat stack executed to generate the issue (4.15 KB, application/x-gzip)
2018-05-22 15:17 UTC, bigswitch

Description bigswitch 2018-05-07 19:47:18 UTC
Created attachment 1432802 [details]
complete neutron-server.log

Description of problem:
While running a heat template based deployment of VMs, we ran into a weird situation where nova sends multiple requests for bind_port on different hosts, before neutron has had a chance to complete the first request.

On the neutron side, I see port_update requests with two different host_ids within a span of less than 30 seconds. Typically there are about three port updates before the port is finally bound and in ACTIVE state. I couldn't figure out where the aggressive timeout is that forces a retry on another host.

This is a Newton deployment with the Big Switch Networks neutron plugin.
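For future occurrences, one quick way to confirm the pattern from neutron-server.log is to scan for port_update events on the suspect port and flag cases where two different binding host_ids appear within a 30-second window. This is only a sketch: the log-line shape matched by the regex below is an assumption for illustration, and would need adjusting to the actual log format produced by the deployment.

```python
import re
from datetime import datetime, timedelta

# Hypothetical log-line shape; real neutron-server.log lines will differ,
# so the regex must be adapted before running this against the attachment.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*port_update"
    r".*port_id=(?P<port>[0-9a-f-]+).*host_id=(?P<host>\S+)"
)

def hosts_within_window(lines, port_id, window=timedelta(seconds=30)):
    """Return (host_a, host_b) pairs where consecutive port_update events
    for port_id name different hosts within `window` of each other."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("port") == port_id:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group("host")))
    suspicious = []
    for (t1, h1), (t2, h2) in zip(events, events[1:]):
        if h1 != h2 and t2 - t1 <= window:
            suspicious.append((h1, h2))
    return suspicious
```

Run against the attached log with the port_id from comment 1, a non-empty result would confirm that two hosts were asked to bind the same port within the window.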

Version-Release number of selected component (if applicable):


How reproducible:
This is intermittently reproducible; roughly once every 3-4 tries I can recreate the situation.

Steps to Reproduce:
1. Create a stack using the provided heat template.
2. Wait for stack creation to complete.

*The heat template contains some constants such as the keypair, external network, and cinder volume name. These can be changed, or identically named items can be created before running the heat stack create.

Actual results:
bind_port is received for the same port_id but with different binding host_ids within a span of 30 seconds.

Expected results:
bind_port is received for a given port with only one binding host_id, and is retried on another host only after the first request has returned from neutron.


Additional info:

Comment 1 bigswitch 2018-05-07 19:52:51 UTC
The port_id in question is 'df612d2d-f168-4d6d-8ac3-c1c3d8892e7d'. I have a PID (process ID) based timeline of events in an image that I am attaching.
Legend for the image:
top of each column = 6digit PID
C0 = compute 0
C1 = compute 1
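The per-PID timeline shown in the attached image can be approximated in text form by grouping log lines by worker PID. A minimal sketch, assuming oslo-style log lines of the form `<date> <time> <pid> <LEVEL> ...`; the exact format in a given deployment may differ:

```python
import re
from collections import defaultdict

# Assumed line shape: "<YYYY-MM-DD> <HH:MM:SS.mmm> <pid> <LEVEL> <message>"
PID_RE = re.compile(r"^(?P<ts>\S+ \S+) (?P<pid>\d+) \S+ (?P<rest>.*)$")

def timeline_by_pid(lines, needle):
    """Group log lines mentioning `needle` (e.g. a port UUID) into
    per-PID columns of (timestamp, message) tuples."""
    columns = defaultdict(list)
    for line in lines:
        m = PID_RE.match(line)
        if m and needle in m.group("rest"):
            columns[m.group("pid")].append((m.group("ts"), m.group("rest")))
    return dict(columns)
```

Printing each column side by side, ordered by timestamp, reproduces the kind of three-thread view captured in the attachment.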

Comment 2 bigswitch 2018-05-07 19:54:43 UTC
Created attachment 1432804 [details]
events for the port in question on 3 different threads with timestamps

Comment 3 melanie witt 2018-05-11 17:29:31 UTC
It sounds like nova is failing to build on some host and then rescheduling to try the build on another host, and so on. We need to take a look at the nova logs to see why the build fails and the reschedule happens. Could you please attach the nova-compute, nova-scheduler, and nova-conductor logs to this BZ?

We don't set a timeout for our requests to neutron, so something else is going on during the instance build when you hit this issue. We'll continue investigating when we have the logs.

Comment 4 bigswitch 2018-05-22 15:17:22 UTC
Created attachment 1440247 [details]
heat stack executed to generate the issue

stack.yaml includes server.yaml as a sub-part, and server.yaml in turn uses user-data.

Comment 5 bigswitch 2018-05-22 15:23:35 UTC
Hi Melanie,

Apologies for the delay. This happened in a customer setup the first time. We collected neutron logs, but unfortunately did not get the sosreport. So nova logs are missing.
All of my understanding of the issue is based on the analysis of neutron logs.
We haven't been able to reproduce this in a local setup and neither has the customer hit the issue again.

I'll ensure that if this happens again, all the service logs, including nova and heat are collected.
This doesn't happen in a regular instance creation via horizon GUI. It only happens when creating instances using heat-stack. I've attached a tar.gz file containing the heat stack template.

I guess for now this will have to be paused/hibernated until reproduced again.

Thanks!
- Aditya

Comment 6 Artom Lifshitz 2018-05-31 11:02:19 UTC
> I guess for now this will have to be paused/hibernated until reproduced
> again.

Thanks for letting us know. I'll close the bz for now then; by all means re-open it to make it come out of hibernation if you're able to reproduce and collect logs.

Cheers!

