Description of problem:
Instance goes to ERROR state with the following fault:

| fault | {"message": "Build of instance 50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5 aborted: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 1840, in _do_build_and_run_instance |

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-9.el7ost.noarch.rpm

How reproducible:
Sporadically

Steps to Reproduce:
1. Run this job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/

Actual results:
Instance stuck in ERROR state.

Expected results:
Instance should be ACTIVE with connectivity.

Additional info:
There is a similar bug, but there the nova fault indicates that no compute host was available to spawn the instance. In this case the failure is due to network allocation. We'll need to dig through the neutron and ODL logs first, I think.

This is the specific place to see the error in the job's robot log:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/robot/report/log.html#s1-s5-t11-k4-k1-k3-k5-k2

(NOTE: sometimes the Jenkins server does not serve up the robot HTML files and you get a blank page. If that happens, just wget the log.html file to your local system and view it there.)

Controller logs (including neutron, ODL, etc.) are here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-13_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-csit/32/artifact/
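For anyone trying to triage a reproduction, this is a quick way to pull the full nova fault and the state of the instance's neutron ports from the CLI. This is only a sketch assuming a standard python-openstackclient setup; the UUID is the one from the fault above.

$ # show only the fault column for the failed instance, as JSON
$ openstack server show 50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5 -f json -c fault
$ # list the ports nova tried to bind for that instance and their status
$ openstack port list --device-id 50c22b9b-7e4c-4c8e-a075-ee8c4e857ff5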
Please attach logs from neutron & ODL.
Created attachment 1435512 [details]
nova, neutron, odl logs

Logs that should be included:

$ tree -s
.
├── [ 4096]  compute-0
│   └── [ 8754072]  nova.log
├── [ 4096]  compute-1
│   └── [ 5343585]  nova.log
├── [ 4096]  controller-0
│   ├── [ 1563711]  neutron-dhcp.log
│   ├── [ 7526355]  neutron.log
│   └── [ 19361348]  odl.log
├── [ 4096]  controller-1
│   ├── [ 1542974]  neutron-dhcp.log
│   ├── [ 7361950]  neutron.log
│   └── [ 14770551]  odl.log
├── [ 4096]  controller-2
│   ├── [ 1528786]  neutron-dhcp.log
│   ├── [ 8322507]  neutron.log
│   └── [ 36139353]  odl.log
└── [ 2649048]  logs.tar.xz
From the neutron logs it seems that between 2018-05-03 03:25:34.160 and 2018-05-03 03:52:51.995 the whole cluster was unresponsive; it either reported 404 or the connection timed out entirely.
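For reference, this is roughly how that window can be scanned in the attached logs. It is only a sketch: the exact error strings the neutron ODL driver emits may differ from the patterns below, so treat them as assumptions.

$ # pull the 03:25-03:52 window out of each controller's neutron log and look for 404s / timeouts
$ for c in controller-0 controller-1 controller-2; do
    grep -E '^2018-05-03 03:(2[5-9]|[3-4][0-9]|5[0-2])' $c/neutron.log | grep -Ei '404|timed out' | head
  done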
*** Bug 1573973 has been marked as a duplicate of this bug. ***
Mike, what does "whole cluster" mean in this context? These are just REST API calls from neutron to our HAProxy VIP, right? A 404 would mean that at least RESTCONF is working, but "unresponsive" could mean HAProxy is sending the requests to a downed ODL. I am digging through the ODL logs for cluster state changes to see if I can map anything to the problem timestamps in the robot log, or to what you see in the neutron logs.
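In case it helps whoever looks next, this is the kind of check I'd run on a controller to see whether HAProxy was forwarding to a dead ODL and what the datastore shard thought its raft state was. A sketch only: the stats socket path, the 8081 port, the admin/admin credentials and the member-1 shard name are deployment-specific assumptions.

$ # ask HAProxy which opendaylight backends it considers UP/DOWN (socket path is an assumption)
$ echo "show stat" | sudo socat stdio /var/lib/haproxy/stats | grep -i opendaylight
$ # ask one ODL member directly for the default config shard's raft state (port/credentials/member name are assumptions)
$ curl -s -u admin:admin "http://127.0.0.1:8081/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore" | python -m json.tool | grep -Ei 'RaftState|Leader'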
I am also having this problem. Has anyone found a solution? If you could, please help me out.
*** Bug 1574739 has been marked as a duplicate of this bug. ***
I was making the assumption that this bug was only hitting us in our d/s jobs, so I thought we could focus on how we did our deployments, but that is not the case. I see this in u/s ODL CSIT too: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.html.gz#s1-s5-t21-k3-k1-k3-k1-k2
(In reply to jamo luhrsen from comment #8)
> I was making the assumption that this bug was only hitting us in our d/s
> jobs, so I thought we could focus on how we did our deployments, but that
> is not the case. I see this in u/s ODL CSIT too:
>
> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-openstack-queens-upstream-stateful-oxygen/270/robot-plugin/log_full.html.gz#s1-s5-t21-k3-k1-k3-k1-k2

So it's not related to HAProxy...
(In reply to jdchester7 from comment #6)
> I am also having this problem. Has anyone found a solution? If you could,
> please help me out.

We haven't found a solution yet, but we're interested in any information you might have. Do you have logs from OpenStack and OpenDaylight from when the problem occurs? (At least karaf.log from the OpenDaylight controllers.)
This still shows up in Oxygen CSIT.
As per the deprecation notice [1], closing this bug. Please reopen if this is still relevant for RHOSP13, as that is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality