Created attachment 1432802 [details]
Description of problem:
While running a heat template based deployment of VMs, we ran into a situation where nova sends multiple bind_port requests for different hosts before neutron has had a chance to complete the first request.
On the neutron side, I see port_update requests with two different hosts within a span of less than 30 seconds. Typically there are about three port updates before the port is finally bound and in ACTIVE state. I couldn't figure out where the aggressive timeout is that forces a retry on another host.
This is a Newton deployment with the Big Switch Networks neutron plugin.
Version-Release number of selected component (if applicable):
This is intermittently reproducible; roughly once every 3-4 tries I can recreate the situation.
Steps to Reproduce:
1. create stack using the provided heat template
2. stack create complete
* The template contains some hardcoded names (keypair, external network, cinder volume name). These can be changed, or identically named items can be created before running the heat stack create.
Actual results:
bind_port is received for the same port_id but with different binding host_ids within a span of 30 seconds
Expected results:
bind_port is received for a given port with only one binding host_id, and is retried on another host only after neutron has answered the first request
The port_id in question is 'df612d2d-f168-4d6d-8ac3-c1c3d8892e7d'. I have a PID (process ID) based timeline of events in an image that I am attaching.
Legend for the image:
top of each column = 6digit PID
C0 = compute 0
C1 = compute 1
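The timeline in the attached image was assembled by hand; a quick way to rebuild it from the neutron server log is to grep for the port UUID and pull out the timestamp, PID, and host fields. A minimal sketch, assuming the default oslo.log line format; the sample log lines, message text, and PIDs below are fabricated for illustration, not the actual customer log:

```shell
#!/bin/sh
# Build a per-PID timeline of bind_port events for one port from the
# neutron server log. Assumes the default oslo.log line format:
#   "YYYY-MM-DD HH:MM:SS.mmm PID LEVEL module [req-...] message"
PORT_ID='df612d2d-f168-4d6d-8ac3-c1c3d8892e7d'
LOG=/tmp/neutron-server.log   # normally /var/log/neutron/server.log

# Illustrative sample input (fabricated lines, not real log data):
cat > "$LOG" <<EOF
2018-04-20 10:01:02.123 201145 DEBUG neutron.plugins.ml2.managers [-] bind_port host compute-0 port $PORT_ID
2018-04-20 10:01:15.456 201146 DEBUG neutron.plugins.ml2.managers [-] bind_port host compute-1 port $PORT_ID
EOF

# Print timestamp, PID, and host for every bind_port hit on this port,
# in chronological order, so each PID/host pair is easy to see.
grep "bind_port.*$PORT_ID" "$LOG" | awk '{print $1, $2, $3, $9}' | sort
```

With real logs, pointing LOG at /var/log/neutron/server.log on the controller produces the same per-PID view as the attached image.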
Created attachment 1432804 [details]
events for the port in question on 3 different threads with timestamps
It sounds like nova is failing to build on some host and then rescheduling to try the build on another host, and so on. We need to take a look at the nova logs to see why the build fails and the reschedule happens. Could you please attach the nova-compute, nova-scheduler, and nova-conductor logs to this BZ?
We don't set a timeout for our requests to neutron, so something else is going on during the instance build when you hit this issue. We'll continue investigating when we have the logs.
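To make attaching the requested logs easier, here is a minimal sketch that bundles them into one tarball. The /var/log/nova paths are the usual locations on a Newton node but are an assumption and may differ per deployment; this sketch uses a temporary directory with stand-in files so it runs anywhere:

```shell
#!/bin/sh
# Bundle the nova logs requested above into one tarball for the BZ.
# On a real node, set LOGDIR=/var/log/nova (path is an assumption; adjust as needed).
LOGDIR=/tmp/demo-nova-logs
mkdir -p "$LOGDIR"
# Stand-in empty files so the sketch is runnable; on a real node these already exist.
for f in nova-compute.log nova-scheduler.log nova-conductor.log; do
    : > "$LOGDIR/$f"
done
# -C changes into LOGDIR so the archive contains bare filenames, not full paths.
tar czf /tmp/nova-logs.tar.gz -C "$LOGDIR" \
    nova-compute.log nova-scheduler.log nova-conductor.log
```

On a multi-node deployment the same would need to be run on each compute node as well as the controller, since nova-compute logs live on the computes.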
Created attachment 1440247 [details]
heat stack executed to generate the issue
stack.yaml includes server.yaml as a nested template, and server.yaml in turn uses user-data.
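For context, the nesting described above would look roughly like this. This is a hedged sketch, not the attached template; the resource names, count, and heat_template_version are illustrative:

```yaml
# stack.yaml -- parent template (illustrative sketch, not the attached file)
heat_template_version: 2016-10-14
resources:
  servers:
    type: OS::Heat::ResourceGroup
    properties:
      count: 2                # several servers created concurrently
      resource_def:
        type: server.yaml     # nested template; boots one server with user-data
```

A ResourceGroup like this creates its nested stacks concurrently, which is one way several port bindings can be in flight at the same time, unlike a single instance launched from Horizon.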
Apologies for the delay. This first happened in a customer setup. We collected neutron logs but unfortunately did not get a sosreport, so the nova logs are missing.
All of my understanding of the issue is based on the analysis of neutron logs.
We haven't been able to reproduce this in a local setup, and the customer hasn't hit the issue again either.
I'll ensure that if this happens again, all the service logs, including nova and heat are collected.
This doesn't happen with regular instance creation via the Horizon GUI; it only happens when creating instances via a heat stack. I've attached a tar.gz file containing the heat stack template.
I guess for now this will have to be paused/hibernated until reproduced again.
> I guess for now this will have to be paused/hibernated until reproduced
Thanks for letting us know. I'll close the BZ for now, then; by all means re-open it to bring it out of hibernation if you're able to reproduce and collect logs.