Bug 2161733

Summary: Make sure errors of nova-manage attachment refresh command are shown
Product: Red Hat OpenStack
Component: openstack-nova
Version: 16.2 (Train)
Reporter: Artom Lifshitz <alifshit>
Assignee: Amit Uniyal <auniyal>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Status: ON_QA
Severity: low
Priority: medium
Target Milestone: z6
Target Release: 16.2 (Train on RHEL 8.4)
Keywords: Patch, Triaged
CC: dasmith, eglynn, jhakimra, kchamart, sbauza, sgordon, udesale, vromanso
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-nova-20.6.2-2.20230713165111.8a24acd.el8ost
Type: Bug

Comment 2 Artom Lifshitz 2023-01-17 19:51:57 UTC
Work item: Improve output upon error. From the email thread:

>> 2- Missing error messages
>>
>> There were times when the refresh command didn't output any errors on
>> the terminal though it had failed.
>>
>> This happened, for example, when we ran into the issue described in the
>> previous point, where the exit code from the script would be 1 but there
>> would be no message printed on the screen.
>>
>> This led the support team to believe during the weekend session that the
>> command had run successfully, so they didn't understand why the proposed
>> steps in our document would not work.
> Yep, that was mentioned by Melanie or gibi as well, I think on Monday; that can definitely be improved.

When I heard this I thought maybe the output was being captured to a log
file instead of going to the console. But I googled again and found this:

https://stackoverflow.com/questions/55325145/python-not-printing-output

Maybe we just need to flush stdout before exiting nova-manage? I'm
surprised we never ran into this before though.

We need to understand what's going on here - whether it's just a matter of calling flush() as Melanie suggested, or whether we need to add calls to LOG.error in some places.
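
As a rough sketch of the flush idea (a standalone illustration, not the actual nova-manage code; do_refresh() is a hypothetical stand-in for the refresh logic), the error path would print the failure and flush both streams before exiting non-zero:

    import sys


    def do_refresh():
        # Placeholder for the real attachment refresh logic; raises to
        # simulate the failure path described above.
        raise RuntimeError("connection_info could not be refreshed")


    def main():
        try:
            do_refresh()
        except Exception as exc:
            # Print the error explicitly and flush both streams so the
            # message is not lost in a buffer when the process exits 1.
            print("Refresh failed: %s" % exc, file=sys.stderr)
            sys.stdout.flush()
            sys.stderr.flush()
            return 1
        return 0


    if __name__ == "__main__":
        sys.exit(main())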

Comment 3 Artom Lifshitz 2023-01-17 19:52:52 UTC
Work item: fix handling of instance locking. From the email thread:

>> 3- VMs in locked state
>>
>> This may be by design, but I'll say it here and let the compute team
>> decide on the correct behavior.
>>
>> On some failures, like the one from step #1, the refresh script leaves
>> the instance in a locked state instead of clearing it.
> 
> Yeah, that's kind of a bug.
> We put it in the locked state to make sure the end user cannot take any action, like hard rebooting the instance,
> while we are messing with the DB. That is also why we require the VM to be off, so that they can't power it off
> by SSHing in.
> 
> Regardless of success or failure, the refresh command should restore the lock state:
> 
> if it was locked before, leave it locked, and if it was unlocked, leave it unlocked.
> So this sounds like a bug in our error handling and cleanup.

+1
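
As a minimal sketch of the intended behavior (not nova's actual code; lock(), unlock() and do_refresh() are hypothetical stand-ins for the compute API calls and the refresh logic), the command would remember the pre-existing lock state and restore it in a finally block even when the refresh fails:

    def refresh_with_lock_preserved(instance, lock, unlock, do_refresh):
        # Remember the lock state as it was before we touched the instance.
        was_locked = instance.get("locked", False)
        if not was_locked:
            lock(instance)
        try:
            do_refresh(instance)
        finally:
            # Success or failure, only unlock if we were the ones who
            # locked it; a pre-existing lock stays in place.
            if not was_locked:
                unlock(instance)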

Comment 4 Artom Lifshitz 2023-01-17 19:54:14 UTC
Work item: disconnecting the volume from the correct host. From the email thread:

>> 5- Disconnecting from the wrong host
>>
>> There were cases where the instance was said to live on compute#1 but the
>> connection_info in the BDM record was for compute#2, and when the script
>> called `remove_volume_connection`, nova would call os-brick on
>> compute#1 (the wrong node) and try to detach it.
>>
>> In some cases os-brick would mistakenly think that the volume was
>> attached (because the target and LUN matched an existing volume on the
>> host) and would try to disconnect it, resulting in errors in the compute
>> logs.
>>
>> It wasn't a problem (besides creating some confusion and noise) because
>> the removal of the multipath device failed since it was in use by an instance.
>>
>> I believe it may be necessary to change the code here:
>>
>>                  compute_rpcapi.remove_volume_connection(
>>                      cctxt, instance, volume_id, instance.host)
>>
>> To use the "host" from the connector properties in the
>> bdm.connection_info if it is present.
> 
> Yeah, that also sounds like a clear bug.
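
A minimal sketch of that suggestion (assuming the connector properties live under a 'connector' key inside connection_info, which is an assumption about the schema rather than something confirmed here): prefer the connector's host and fall back to the instance's current host:

    import json


    def pick_connection_host(connection_info, default_host):
        # connection_info may be stored as a JSON string in the BDM record.
        if isinstance(connection_info, str):
            connection_info = json.loads(connection_info)
        connector = (connection_info or {}).get("connector") or {}
        # Prefer the host the connection was actually made from; otherwise
        # fall back to the instance's current host.
        return connector.get("host") or default_host


    # Hypothetical call site, mirroring the snippet quoted above:
    # host = pick_connection_host(bdm.connection_info, instance.host)
    # compute_rpcapi.remove_volume_connection(cctxt, instance, volume_id, host)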