This is from the updates-ovn jobs: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

There is an OC instance created during the overcloud update. This instance has a Cinder volume which is in ERROR state, and that prevents the live migration.

The instance and volume attachment happen here (created before the OC reboot): https://github.com/openstack/tripleo-upgrade/blob/c5038babb5966d3998e807b668c9ecc477175cb0/templates/workload_launch.sh.j2#L211

The attachment error can be seen here:

/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:21.075 1 ERROR nova.volume.cinder [req-9cc1f5b0-023d-4eca-b071-3d102d2776d4 ec75d0e6c48c4e8e9ee9818e0b88bdf5 8fa2e29ca76f4106955124dd1dea5331 - default default] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Create attachment failed for volume 68bb0540-d8f1-4cdf-962b-7f66e2523821. Error: Unable to create attachment for volume

and also here, which causes the live migration to halt:

/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [-] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Pre live migration failed at compute-0.localdomain: RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)
...
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)

Cinder also confirms the error with the volume:

/var/log/containers/cinder/cinder-volume.log:2019-01-17 10:23:42.514 70 ERROR cinder.volume.manager Stderr: u' Failed to find logical volume "cinder-volumes/volume-68bb0540-d8f1-4cdf-962b-7f66e2523821"\n'

To test this with more jobs, I've created an infrared commit which tries to check for and remove the volume from the workload instance before the OC_reboot procedure: https://review.gerrithub.io/#/c/redhat-openstack/infrared/+/441197/3

You can reproduce this using:
IR_GERRIT_CHANGE: 441197/3

The patch tries to do:
openstack server remove volume $vol_id $server_workload_id

Result: after the overcloud update, the instance volume is in ERROR state: http://pastebin.test.redhat.com/699078

This result reproduces again and again, and I believe it is the source of the problem here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/71/artifact/.sh/ir-tripleo-overcloud-reboot.log
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/70/artifact/.sh/ir-tripleo-overcloud-reboot.log

Since the previous rabbit-neutron post-reboot issue is no longer the prime suspect (this happens pre OC_reboot), I'm redirecting this bug towards the errored volume after the OC upgrade.
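The pre-reboot guard the infrared patch attempts could be sketched roughly as below. This is a hypothetical helper, not the actual patch content; the commented `openstack` commands show how the status would really be fetched on a live deployment:

```shell
# Sketch of a pre-OC_reboot volume check (hypothetical; volume_ok is an
# illustration, not part of the infrared patch). On a live cloud:
#   status=$(openstack volume show "$vol_id" -f value -c status)
volume_ok() {
    # only these states are safe to carry through a controller reboot
    case "$1" in
        available|in-use) return 0 ;;
        *) return 1 ;;
    esac
}

# simulate the state seen in the failing job
status="error"
if volume_ok "$status"; then
    echo "volume healthy, proceeding with reboot"
else
    echo "volume in '$status' state: detaching before OC_reboot"
    # openstack server remove volume "$server_workload_id" "$vol_id"
fi
```

The point of checking before the reboot is that a detach against an already-errored attachment can still succeed, while a live migration against it cannot.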
The automation workaround is:
IR_GERRIT_CHANGE: 441357

This removes the volume creation and its attachment to an instance during workload creation, before the update stages begin.

Tested in: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/88/
The problem is due to using Cinder's LVM backend in a multi-controller environment. This is not a valid configuration, so the CI job will need to be updated. The issue is that Cinder's LVM backend stores volumes locally on whatever controller node is currently running the c-vol service. Rebooting the active controller causes pacemaker to move the c-vol service onto another node. Unfortunately, the Cinder volume data isn't present on the new node, which results in the failure you're seeing. Any multi-controller CI jobs that involve Cinder volumes require that Cinder use a shared storage backend (e.g. Ceph).
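A quick way to confirm this on the deployment is to compare where pacemaker runs c-vol against which node actually holds the LVs. The helper below is a hypothetical verification sketch fed with canned `lvs` output; the commented commands are what you would actually run on the controllers:

```shell
# Hypothetical verification sketch. On the controllers one would gather:
#   pcs status | grep cinder-volume                  # which node runs c-vol now
#   sudo lvs --noheadings -o lv_name cinder-volumes  # which LVs exist locally
# If the node that took over c-vol after the reboot has no LV for the volume,
# attachment creation fails exactly as in the logs above.
have_lv() {
    # $1 = lvs output, $2 = cinder volume id
    echo "$1" | grep -q "volume-$2"
}

# canned output simulating the new c-vol node after failover,
# which holds other volumes but not the one from this bug
lvs_after_failover="  volume-aaaa1111
  volume-bbbb2222"

vol="68bb0540-d8f1-4cdf-962b-7f66e2523821"
if have_lv "$lvs_after_failover" "$vol"; then
    echo "LV present locally"
else
    echo "LV for $vol missing on this node: attach will fail with HTTP 500"
fi
```

This mirrors the cinder-volume.log line above: the new c-vol host runs `lvs` internally, fails to find `cinder-volumes/volume-68bb0540-...`, and the attachment API returns 500.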
Ack, I'll push for an update (+ceph) on the reproducer jobs and post results.
The issue did not reproduce with the update+ceph topology; tested in the pidone DFG job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/view/updates/job/DFG-pidone-updates-14_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-sanity/

Removing the blocker flag, but this is still an automation blocker for OVN.

Eran, can you retest your updates job with the said patch after 3ceph has been added (Arie's patch, ir_gerrit_patch: 442177) and tell us if you are still seeing the issue?
(In reply to pkomarov from comment #8)
> issue did not reproduce with update+ceph topology , tested in pidone dfg
> job:
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/
> view/updates/job/DFG-pidone-updates-14_director-rhel-virthost-
> 3cont_2comp_3ceph-ipv4-vxlan-sanity/
>
> removing the blocker flag , but this is still an automation blocker for ovn
> ,
>
> Eran can you retest your updates job with the said patch after :3ceph has
> been added (Arie's patch)
> (ir_gerrit_patch: 442177)
> and tell us if you are still seeing an issue ?

It's still under testing; I will update as soon as I have some results.
This BZ has nothing to do with updates.