Bug 1668568 - After UC/OC update : Pre live migration failed at compute-x: RemoteError: Remote error: ClientException Unable to create attachment for volume.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-23 07:12 UTC by pkomarov
Modified: 2019-03-31 06:40 UTC (History)
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-31 06:40:58 UTC
Target Upstream Version:
Embargoed:
abishop: needinfo-



Description pkomarov 2019-01-23 07:12:48 UTC
This is from the updates-ovn jobs: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

An OC instance is created during the Overcloud update. This instance has a Cinder volume in ERROR state, which prevents the live migration:

Instance and volume attachment happen here (created before the OC reboot): https://github.com/openstack/tripleo-upgrade/blob/c5038babb5966d3998e807b668c9ecc477175cb0/templates/workload_launch.sh.j2#L211

The attachment error can be seen here:
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:21.075 1 ERROR nova.volume.cinder [req-9cc1f5b0-023d-4eca-b071-3d102d2776d4 ec75d0e6c48c4e8e9ee9818e0b88bdf5 8fa2e29ca76f4106955124dd1dea5331 - default default] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Create attachment failed for volume 68bb0540-d8f1-4cdf-962b-7f66e2523821. Error: Unable to create attachment for volume

and also here, where it causes the live migration to halt:
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [-] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Pre live migration failed at compute-0.localdomain: RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)
...
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)

Cinder also confirms the error with the volume: 
/var/log/containers/cinder/cinder-volume.log:2019-01-17 10:23:42.514 70 ERROR cinder.volume.manager Stderr: u'  Failed to find logical volume "cinder-volumes/volume-68bb0540-d8f1-4cdf-962b-7f66e2523821"\n'
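Correlating the two errors by volume UUID makes the link explicit. A minimal sketch over the log lines quoted above (the excerpts are shortened to the error text; against the live system the same grep can be pointed at the log files directly):

```shell
# Extract the volume UUID from each quoted error line and confirm
# that nova and cinder are complaining about the same volume.
uuid_re='[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'

nova_line='Create attachment failed for volume 68bb0540-d8f1-4cdf-962b-7f66e2523821. Error: Unable to create attachment for volume'
cinder_line='Failed to find logical volume "cinder-volumes/volume-68bb0540-d8f1-4cdf-962b-7f66e2523821"'

nova_vol=$(printf '%s\n' "$nova_line" | grep -Eo "$uuid_re" | head -n1)
cinder_vol=$(printf '%s\n' "$cinder_line" | grep -Eo "$uuid_re" | head -n1)

if [ "$nova_vol" = "$cinder_vol" ]; then
  echo "same volume: $nova_vol"
fi
```

On the nodes themselves the equivalent would be `grep -Eo "$uuid_re" /var/log/containers/nova/nova-compute.log` and the matching grep on /var/log/containers/cinder/cinder-volume.log.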

To test this with more jobs, I've created an infrared commit which checks for and removes the volume from the workload instance before the OC_reboot procedure:
https://review.gerrithub.io/#/c/redhat-openstack/infrared/+/441197/3
You can reproduce this using : IR_GERRIT_CHANGE: 441197/3

The patch runs: openstack server remove volume $vol_id $server_workload_id
Results: after the Overcloud update, the instance volume is in Error state
http://pastebin.test.redhat.com/699078

This result reproduces consistently, and I believe it is the source of the problem here:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/71/artifact/.sh/ir-tripleo-overcloud-reboot.log
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/70/artifact/.sh/ir-tripleo-overcloud-reboot.log

Seeing as the previous rabbit/neutron post-reboot issue is no longer the prime suspect here (this happens before the OC reboot), I'm redirecting this bug toward the errored volume after the OC update.

Comment 1 pkomarov 2019-01-23 12:06:28 UTC
The automation workaround is: IR_GERRIT_CHANGE: 441357
It removes the volume creation and attachment to an instance during workload creation, before the update stages begin.
Tested in:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/88/

Comment 2 Alan Bishop 2019-01-23 15:21:21 UTC
The problem is due to using cinder's LVM backend in a multi-controller environment. This is not a valid configuration, so the CI job will need to be updated.

The issue is that cinder's LVM backend stores volumes locally on whichever controller node is currently running the c-vol service. Rebooting the active controller causes pacemaker to move the c-vol service onto another node. Unfortunately, the cinder volume data isn't present on the new node, which results in the failure you're seeing.

Any multi-controller CI jobs that involve cinder volumes require that cinder use a shared storage backend (e.g. ceph).
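For reference, a minimal sketch of what switching a TripleO deployment to a shared backend looks like. These are the standard tripleo-heat-templates parameters; whether the CI job sets them through these exact parameters (rather than a predefined ceph environment file) is an assumption:

```yaml
# Hypothetical overcloud environment snippet: disable the node-local
# LVM/iSCSI backend and enable the shared Ceph RBD backend for cinder.
parameter_defaults:
  CinderEnableIscsiBackend: false  # LVM data lives on a single controller
  CinderEnableRbdBackend: true     # Ceph RBD is visible to all controllers
```

With RBD enabled, c-vol can fail over between controllers without losing access to the volume data, which is exactly the failure mode described above.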

Comment 3 pkomarov 2019-01-25 14:27:34 UTC
ack, I'll push for an update (+ceph) on the reproducer jobs
and post results

Comment 8 pkomarov 2019-01-27 06:28:24 UTC
The issue did not reproduce with the update+ceph topology; tested in the pidone DFG job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/view/updates/job/DFG-pidone-updates-14_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-sanity/

Removing the blocker flag, but this is still an automation blocker for OVN.

Eran, can you retest your updates job with the said patch, now that 3ceph has been added (Arie's patch,
ir_gerrit_patch: 442177),
and tell us whether you still see the issue?

Comment 10 Eran Kuris 2019-01-29 07:54:57 UTC
(In reply to pkomarov from comment #8)
> issue did not reproduce with update+ceph topology , tested in pidone dfg
> job: 
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/pidone/
> view/updates/job/DFG-pidone-updates-14_director-rhel-virthost-
> 3cont_2comp_3ceph-ipv4-vxlan-sanity/
> 
> removing the blocker flag , but this is still an automation blocker for ovn
> , 
> 
> Eran can you retest your updates job with the said patch after :3ceph has
> been added (Arie's patch) 
> (ir_gerrit_patch: 442177)
> and tell us if you are still seeing an issue ?

It's still under testing; I will update as soon as I have some results.

Comment 13 Raviv Bar-Tal 2019-02-04 13:41:20 UTC
This BZ has nothing to do with update.

