Bug 2216702

Summary: [OVN] Instance creation fails occasionally because of network-vif-plugged related timeout
Product: Red Hat OpenStack Reporter: Alex Stupnikov <astupnik>
Component: openstack-neutron Assignee: Rodolfo Alonso <ralonsoh>
Status: CLOSED NOTABUG QA Contact: Eran Kuris <ekuris>
Severity: high Docs Contact:
Priority: unspecified    
Version: 16.2 (Train) CC: chrisw, dalvarez, dhill, ebarrera, mlavalle, ralonsoh, rhos-maint, scohen, shtiwari, twilson
Target Milestone: --- Keywords: Triaged
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-24 18:22:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alex Stupnikov 2023-06-22 10:08:41 UTC
Description of problem:

The customer has a rather unusual usage model for his RHOSP 16.2 deployment with ML2/OVN: numerous VMs are created and deleted daily. We have other bugs reported for his deployments:
- bug #2203811
- bug #2203848

After applying the hotfix for bug #2203848 (it may be unrelated, but leaving a note just in case), the customer reported another problem: 10-20% of VMs are not able to start because of a timeout while waiting for the network-vif-plugged event.

It looks like this event should come from the Neutron server. We have collected logs for these events and kindly ask for help figuring out what went wrong there.


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.2.4 (Train)


How reproducible:

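#!/bin/bash
# Arguments: COUNT = number of servers to create per round, NET = network name or ID.
COUNT=${1}
NET=${2}
for i in {1..10}; do
  echo "round ${i}"
  # Boot COUNT servers in a single request (IMAGE_UUID is a placeholder for a real image ID).
  openstack server create --image IMAGE_UUID --use-config-drive --min "${COUNT}" --max "${COUNT}" --flavor m1.tiny --network "${NET}" launch-test
  sleep 30
  # Delete only the servers that reached ACTIVE; servers in other states are left in place.
  openstack server delete $(openstack server list --status ACTIVE -c ID -f value | xargs)
done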


Actual results:
10-20% of servers fail because of network-vif-plugged timeout
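
The failure rate quoted above could be measured after each round with something like the sketch below. It is not part of the original reproducer: the helper name failure_rate is hypothetical, and it assumes that servers hitting the network-vif-plugged timeout end up in ERROR status.

```shell
#!/bin/bash
# Hypothetical helper (not from the original report): reads one server
# status per line on stdin and prints the integer percentage of lines
# whose status is ERROR. Blank lines are ignored.
failure_rate() {
  awk '
    NF { total++ }
    $1 == "ERROR" { failed++ }
    END {
      if (total == 0) { print 0; exit }
      printf "%d\n", (failed * 100) / total
    }
  '
}

# Possible usage between the create and delete steps of the reproducer
# (assumption: failed servers are reported with status ERROR):
#   openstack server list --name launch-test -c Status -f value | failure_rate
```

The percentage is truncated to an integer, which is enough to tell a 10-20% failure rate apart from an occasional single failure.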

Expected results:
All servers start properly

Comment 31 Miguel Lavalle 2023-07-24 18:22:06 UTC
Hi Alex,

We discussed the reopening of this BZ during our triage meeting. We concluded that, in any case, there won't be any code changes as a result of your question in comment 30 above, so the BZ shouldn't have been reopened. Please ask in Slack for the confirmation you requested in comment 30.