Work item: Improve output upon error. From the email thread:

>> 2- Missing error messages
>>
>> There were times when the refresh command didn't output any errors on
>> the terminal though it had failed.
>>
>> This happened for example when we ran into the issue described in the
>> previous point, where the exit code from the script would be 1 but there
>> would be no message printed on the screen.
>>
>> This led the support team to believe during the weekend session that the
>> command did run successfully, so they didn't understand why the proposed
>> steps in our document would not work.
>
> yep, that was mentioned by Melanie or gibi as well I think on Monday. That definitely can be improved.

When I heard this I thought maybe the output was being captured to a log file instead of going to the console. But I googled again and found this: https://stackoverflow.com/questions/55325145/python-not-printing-output

Maybe we just need to flush stdout before exiting nova-manage? I'm surprised we never ran into this before, though. We need to understand what's going on here - whether it's just a matter of calling flush() like Melanie said, or whether we need to add calls to LOG.error in some places.
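A minimal sketch of the flush-before-exit idea, with a hypothetical `do_refresh()` standing in for the real command body (none of these names are the actual nova-manage code):

```python
import sys


def do_refresh(volume_id):
    # Hypothetical stand-in for the real refresh logic; it always fails
    # here so the error path is exercised.
    raise RuntimeError('connection_info refresh failed for %s' % volume_id)


def refresh_command(volume_id):
    try:
        do_refresh(volume_id)
    except Exception as exc:
        # Make the failure visible on the terminal (and/or via LOG.error)
        # instead of silently returning 1, and flush in case stdout is
        # block-buffered (e.g. when output is redirected to a file).
        print('Refresh failed: %s' % exc)
        sys.stdout.flush()
        return 1
    return 0


if __name__ == '__main__':
    sys.exit(refresh_command('vol-1234'))
```

Whether the real fix is the flush, the missing print/LOG.error, or both, is exactly what we need to confirm.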
Work item: fix handling of instance locking. From the email thread:

>> 3- VMs in locked state
>>
>> This may be by design, but I'll say it here and let the compute team
>> decide on the correct behavior.
>>
>> On some failures, like the one from step #1, the refresh script leaves
>> the instance in a locked state instead of clearing it.
>
> ya, that is kind of a bug.
> We put it in the locked state to make sure the end user cannot take any action, like hard rebooting the instance,
> while we are messing with the db. That is also why we require the vm to be off, so that they can't power it off
> by sshing in.
>
> Regardless of success or failure, the refresh command should restore the lock state:
> if it was locked before, leave it locked, and if it was unlocked, leave it unlocked.
> So this sounds like a bug in our error handling and cleanup.

+1
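A rough sketch of the save/restore behaviour being asked for, with a plain dict and a hypothetical `refresh_fn` standing in for the real nova instance object and refresh logic:

```python
def refresh_with_lock_restore(instance, refresh_fn):
    """Lock the instance while we touch the DB, then put the lock state
    back to whatever it was before, whether the refresh worked or not."""
    was_locked = instance.get('locked', False)  # remember pre-existing state
    instance['locked'] = True                   # block user actions meanwhile
    try:
        refresh_fn(instance)
    finally:
        # If it was locked before leave it locked; if it was unlocked,
        # unlock it again - even when refresh_fn raised.
        instance['locked'] = was_locked


# Example: a failing refresh no longer leaves the instance locked.
instance = {'uuid': 'fake-uuid', 'locked': False}

def failing_refresh(inst):
    raise RuntimeError('boom')

try:
    refresh_with_lock_restore(instance, failing_refresh)
except RuntimeError:
    pass

assert instance['locked'] is False
```

The key point is the try/finally: the restore has to run on the error path too, which is where the current cleanup apparently bails out.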
Work item: disconnecting the volume from the correct host. From the email thread:

>> 5- Disconnecting from the wrong host
>>
>> There were cases where the instance was said to live on compute#1 but the
>> connection_info in the BDM record was for compute#2, and when the script
>> called `remove_volume_connection` nova would call os-brick on
>> compute#1 (the wrong node) and try to detach it.
>>
>> In some cases os-brick would mistakenly think that the volume was
>> attached (because the target and lun matched an existing volume on the
>> host) and would try to disconnect it, resulting in errors in the compute
>> logs.
>>
>> It wasn't a problem (besides creating some confusion and noise) because
>> the removal of the multipath failed since it was in use by an instance.
>>
>> I believe it may be necessary to change the code here:
>>
>>     compute_rpcapi.remove_volume_connection(
>>         cctxt, instance, volume_id, instance.host)
>>
>> to use the "host" from the connector properties in the
>> bdm.connection_info if it is present.
>
> ya, that also sounds like a clear bug
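A hedged sketch of the proposed change: prefer the host recorded in the connector properties of the BDM's connection_info and only fall back to instance.host when it is absent. Plain dicts stand in for the real nova objects, and the RPC call itself is only referenced in a comment:

```python
def pick_disconnect_host(instance, connection_info):
    """Return the host that should be asked to disconnect the volume."""
    connector = (connection_info or {}).get('connector') or {}
    # Prefer the host the volume was actually connected from, as recorded
    # in the connector properties of the BDM's connection_info; fall back
    # to instance.host when no connector host is present.
    return connector.get('host') or instance['host']


# Example: the BDM says the connection was made from compute#2, so the
# disconnect should be sent there, not to compute#1 from the instance record.
instance = {'host': 'compute1'}
connection_info = {'connector': {'host': 'compute2'}}
assert pick_disconnect_host(instance, connection_info) == 'compute2'

# The selected host would then replace instance.host in the
# compute_rpcapi.remove_volume_connection(cctxt, instance, volume_id, host)
# call quoted above.
```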