Description of problem:
We have detected that Ceph monitor IPs are hardcoded in the table nova.block_device_mapping:

select * from block_device_mapping where connection_info is not null limit 1\G
*************************** 1. row ***************************
           created_at: 2015-06-22 13:26:11
           updated_at: 2015-06-22 13:26:29
           deleted_at: 2015-06-22 13:38:28
                   id: 629
          device_name: /dev/vda
delete_on_termination: 0
          snapshot_id: NULL
            volume_id: f170f647-0495-4920-b16e-2f6d44a74696
          volume_size: 60
            no_device: NULL
      connection_info: {"driver_volume_type": "rbd", "serial": "f170f647-0495-4920-b16e-2f6d44a74696", "data": {"secret_type": "ceph", "device_path": null, "name": "volumes/volume-f170f647-0495-4920-b16e-2f6d44a74696", "secret_uuid": "11424f9e-0414-4162-9e73-68c69bfc6abc", "qos_specs": null, "hosts": ["10.72.0.43", "10.72.3.21", "10.72.3.27", "10.72.3.30", "10.72.3.31"], "auth_enabled": true, "access_mode": "rw", "auth_username": "volumes", "ports": ["6789", "6789", "6789", "6789", "6789"]}}
        instance_uuid: e41f87a5-492d-4f9f-8726-f7f382f90e06
              deleted: 629
          source_type: image
     destination_type: volume
         guest_format: NULL
          device_type: disk
             disk_bus: virtio
           boot_index: 0
             image_id: f773e032-47cc-45ba-bec7-41400e01801f
1 row in set (0.00 sec)

If the Ceph monitors are replaced this could be a real problem, and we do not have any tool to update all these fields in the Nova database.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
This is an urgent matter, as we need to replace the Ceph monitors and we do not really want to execute queries directly on the OSP databases.

Additional info:
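For reference, the connection_info column is plain JSON, so the monitor endpoints a mapping points at can be listed without touching the database schema. A minimal read-only sketch in Python, using the row shown above (the database query itself is not shown):

```python
import json

# connection_info as stored in nova.block_device_mapping (trimmed to the
# fields relevant here; values taken from the row in the bug description)
connection_info = '''{"driver_volume_type": "rbd",
 "serial": "f170f647-0495-4920-b16e-2f6d44a74696",
 "data": {"name": "volumes/volume-f170f647-0495-4920-b16e-2f6d44a74696",
          "hosts": ["10.72.0.43", "10.72.3.21", "10.72.3.27", "10.72.3.30", "10.72.3.31"],
          "ports": ["6789", "6789", "6789", "6789", "6789"]}}'''

info = json.loads(connection_info)

# The Ceph monitor endpoints are the zipped hosts/ports lists
monitors = list(zip(info["data"]["hosts"], info["data"]["ports"]))
print(monitors)
```

This makes it easy to audit, across all rows, which mappings still reference a decommissioned monitor IP.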
It's worth noting that the connection_info is updated upon a hard reboot or a stop/start of an instance (CLI commands 'nova reboot --hard' and 'nova stop' / 'nova start'). So instances can be refreshed with the new Ceph monitor IPs by hard-rebooting or stop/starting them.
Let me check with the customer, because they are not deploying with OSPd. The installation dates from 2014, starting with OSP5 and upgraded release by release since then. Could this be something left over from the first deployment?
As I mentioned earlier [1], it is possible to refresh an instance's block device mapping connection_info with 'nova stop <instance>' followed by 'nova start <instance>', or by hard-rebooting the instance with 'nova reboot --hard <instance>'. Using stop/start is safer, as it gracefully shuts down the instance instead of doing a hard poweroff. After 'nova stop' followed by 'nova start', the instance should have the new Ceph monitor IPs.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1414124#c1
Hi, do we have any update about this BZ?
We have been discussing the possibility of a solution suitable for a large number of instances, one that would not require numerous individual commands to update instances. Can you provide more detail about the problem? Specifically, we want to know:

1. Does a change in Ceph monitor IPs affect running instances? That is, do running instances lose connection to the monitor after a change in monitor IP?

2. After a monitor IP change, are there specific actions that cause the instance to lose connection to the monitor after they complete? For example: hard reboot, resize, start?
Hello,

1. Does a change in Ceph monitor IPs affect running instances? That is, do running instances lose connection to the monitor after a change in monitor IP?

No, it does not affect running instances. QEMU is responsible for the instance's connections to the Ceph cluster. Because the QEMU layer is connected to the cluster, it will learn when a monitor has been added, deleted, or replaced.

2. After a monitor IP change, are there specific actions that cause the instance to lose connection to the monitor after they complete? For example: hard reboot, resize, start?

If a monitor IP has changed and the nova.block_device_mapping table has not been updated with the new monitor IP, the instance connection will still try to connect to the old monitor IP. So, in my opinion, here is the procedure Nova should follow after a monitor IP has changed:

- Change the monitor IP.
- Freeze Cinder and Nova operations.
- Update the Nova database with the new values.
- Unfreeze Nova and Cinder operations.
- Once instances are hard-rebooted or stop/started, they will get the new monitor IP in the qemu XML file.

Thanks.
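The "update the Nova database with the new values" step boils down to rewriting the hosts list inside each row's connection_info JSON. A hypothetical sketch of that transformation (the JSON rewrite only; actually writing the result back to nova.block_device_mapping is exactly the direct DB edit this report is trying to avoid, so any real fix would need supported tooling):

```python
import json

def replace_monitors(connection_info_json, ip_map):
    """Return connection_info JSON with old monitor IPs swapped for new ones.

    ip_map maps old monitor IP -> new monitor IP; hosts not present in the
    map are left untouched. Illustrative helper only, not a Nova API.
    """
    info = json.loads(connection_info_json)
    info["data"]["hosts"] = [ip_map.get(h, h) for h in info["data"]["hosts"]]
    return json.dumps(info)

# Example: one monitor (10.72.0.43) is being replaced by a hypothetical new IP
old = ('{"driver_volume_type": "rbd", '
       '"data": {"hosts": ["10.72.0.43", "10.72.3.21"], "ports": ["6789", "6789"]}}')
new = replace_monitors(old, {"10.72.0.43": "10.72.9.10"})
```

The ports list is left alone here since the monitor port (6789) typically stays the same across a replacement.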
(In reply to Daniel Dominguez from comment #15)

> No, it does not affect running instances. Qemu is responsible of the
> instance's connections with the ceph cluster. Once a monitor has been
> added/deleted and because the Qemu layer is connected to the ceph cluster it
> will know that a monitor has been replaced.

Thanks for confirming that.

> If a monitor IP has changed and nova.block_device_mapping table has not been
> updated with the new monitor IP, instance connection will still trying to
> connect to old monitor IP.

Yes, I wanted to know if you had noticed specific instance operations that result in the instance losing connection to the monitor. For example, I suspect operations such as hard reboot, resize, and start from a stop will cause the instance to read the stale monitor IP from the database and leave the instance unable to reconnect to the monitor.

> So, in my opinion here is the procedure nova
> should follow after a monitor IP has changed:
>
> -Change monitor IP
> -Freeze cinder and nova operations.
> -Update nova database with new values.
> -Unfreeze nova and cinder operations.
> -Once instances are hard rebooted or stop/start, they will get the new
> monitor IP on the qemu XML file.

We have been thinking of an auto-heal approach where we 1) identify the operations that cause the instance to pull the stale monitor IP from the database and 2) do an auto-heal during those specific operations that queries the current IPs from Cinder and updates the Nova database first, before proceeding with the rest of the operation. That way the fix is transparent to users, and no special action would be needed to update instances. This is something we would work on upstream and bring into OSP.
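The core of the auto-heal check described above is just comparing the monitor endpoints Nova has on record with what Cinder would hand out now. A hypothetical sketch of that comparison (monitors_stale is an invented helper, and both arguments are plain connection_info dicts standing in for the Nova-stored and Cinder-fresh copies):

```python
def monitors_stale(stored_info, current_info):
    """Return True if Nova's stored connection_info references different
    monitor endpoints than the connection info Cinder returns now.

    Order is ignored: only the set of (host, port) pairs matters, since a
    reordered monitor list does not require a refresh.
    """
    stored = set(zip(stored_info["data"]["hosts"], stored_info["data"]["ports"]))
    current = set(zip(current_info["data"]["hosts"], current_info["data"]["ports"]))
    return stored != current

# Example: one monitor was replaced by a hypothetical new IP
stored = {"data": {"hosts": ["10.72.0.43", "10.72.3.21"], "ports": ["6789", "6789"]}}
current = {"data": {"hosts": ["10.72.9.10", "10.72.3.21"], "ports": ["6789", "6789"]}}
```

During one of the identified operations (hard reboot, resize, start, migrate, ...), a check like this would decide whether to refresh the block_device_mapping row before continuing.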
(In reply to melanie witt from comment #16)

> Yes, I wanted to know if you had noticed specific instance operations that
> result in the instance losing connection to the monitor. For example, I
> suspect operations such as: hard reboot, resize, and start from a stop will
> cause the instance to read the stale monitor IP from the database and cause
> the instance not to be able to reconnect to the monitor.

I think operations such as migrate, live-migration, evacuate, host-evacuate, host-evacuate-live, host-servers-migrate, and shelve/unshelve will also cause the instance to read the stale monitor IP from the database and leave it unable to reconnect to the monitor.

> We have been thinking of an auto-heal approach where we 1) identify the
> operations that cause the instance to pull stale monitor IP from the
> database 2) do an auto-heal during those specific operations that queries
> the current IPs from Cinder and updates the Nova database first before it
> proceeds with the rest of the operation. That way, the fix is transparent to
> users and no special action would be needed to update instances. This is
> something we would work on upstream and bring into OSP.

That is a much better option. Thanks for your help.
Hi, is there any news about this RFE?
Dropping the FutureFeature keyword; this is really a bugfix.
I am closing this bug as it has not been addressed for a very long time. Please feel free to reopen if it is still relevant.