Bug 1414124 - Ceph Monitor hardcoded IPs in Nova database
Summary: Ceph Monitor hardcoded IPs in Nova database
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Lee Yarwood
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-17 20:25 UTC by Edu Alcaniz
Modified: 2024-03-25 14:58 UTC
CC: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-25 14:37:30 UTC
Target Upstream Version:
Embargoed:




Links:
OpenStack gerrit 579004 (NEW): block_device: Optionally recreate attachments when refreshing connection_info (last updated 2021-01-25 14:28:16 UTC)
Red Hat Issue Tracker OSP-23306 (last updated 2023-03-21 18:44:09 UTC)

Description Edu Alcaniz 2017-01-17 20:25:20 UTC
Description of problem:
We have detected that the Ceph monitor IPs are hard-coded in the nova.block_device_mapping table:
select * from block_device_mapping where connection_info is not null limit 1\G
*************************** 1. row ***************************
           created_at: 2015-06-22 13:26:11
           updated_at: 2015-06-22 13:26:29
           deleted_at: 2015-06-22 13:38:28
                   id: 629
          device_name: /dev/vda
delete_on_termination: 0
          snapshot_id: NULL
            volume_id: f170f647-0495-4920-b16e-2f6d44a74696
          volume_size: 60
            no_device: NULL
      connection_info: {"driver_volume_type": "rbd", "serial": "f170f647-0495-4920-b16e-2f6d44a74696", "data": {"secret_type": "ceph", "device_path": null, "name": "volumes/volume-f170f647-0495-4920-b16e-2f6d44a74696", "secret_uuid": "11424f9e-0414-4162-9e73-68c69bfc6abc", "qos_specs": null, "hosts": ["10.72.0.43", "10.72.3.21", "10.72.3.27", "10.72.3.30", "10.72.3.31"], "auth_enabled": true, "access_mode": "rw", "auth_username": "volumes", "ports": ["6789", "6789", "6789", "6789", "6789"]}}
        instance_uuid: e41f87a5-492d-4f9f-8726-f7f382f90e06
              deleted: 629
          source_type: image
     destination_type: volume
         guest_format: NULL
          device_type: disk
             disk_bus: virtio
           boot_index: 0
             image_id: f773e032-47cc-45ba-bec7-41400e01801f
1 row in set (0.00 sec)

If the Ceph monitors are replaced, this could become a real problem, and we do not have any tool to update all of these fields in the Nova database.
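For reference, a rough sketch (not an official tool) of how to estimate the scope of the problem; it assumes direct access to the Nova database, and the monitor IP used is simply the first one from the example row above:

  mysql nova -e "SELECT COUNT(*) FROM block_device_mapping WHERE deleted = 0 AND connection_info LIKE '%\"10.72.0.43\"%';"

Non-deleted rows (deleted = 0) whose connection_info still references an old monitor IP are the ones that would need to be refreshed.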




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
This is an urgent matter, as we need to replace the Ceph monitors and we do not really want to execute queries directly against the OSP databases.

Additional info:

Comment 1 melanie witt 2017-01-20 17:42:09 UTC
It's worth noting that the connection_info is updated upon hard reboot or stop/start of an instance (CLI commands 'nova reboot --hard' and 'nova stop'/'nova start'). So instances can be refreshed with new Ceph monitor IPs by hard rebooting or stop/starting them.

Comment 2 Edu Alcaniz 2017-02-15 14:20:14 UTC
Let me check with the customer, because they are not deploying with OSPd. The installation dates from 2014, starting with OSP5 and performing rolling upgrades since then. Could this be something coming from the first deployment?

Comment 5 melanie witt 2017-02-18 02:42:50 UTC
As I mentioned earlier [1], it is possible to refresh an instance's block device mapping connection_info by running 'nova stop <instance>' followed by 'nova start <instance>', or by hard-rebooting the instance with 'nova reboot --hard <instance>'. Using stop/start is safer, as it gracefully shuts down the instance instead of performing a hard power-off.

After 'nova stop' followed by 'nova start', the instance should have the new Ceph monitor IPs.
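For a large number of instances, a rough sketch of batching this follows; it assumes admin credentials are sourced and that the 'openstack' client is available for listing servers. Note that every instance is shut down, so this incurs guest downtime:

  for uuid in $(openstack server list --all-projects --status ACTIVE -f value -c ID); do
      nova stop "$uuid"
      # wait for the guest to shut down cleanly before starting it again
      until [ "$(openstack server show "$uuid" -f value -c status)" = "SHUTOFF" ]; do
          sleep 5
      done
      nova start "$uuid"
  done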

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1414124#c1

Comment 13 Edu Alcaniz 2017-03-23 13:20:48 UTC
Hi, do we have any update about this BZ?

Comment 14 melanie witt 2017-04-04 17:09:29 UTC
We have been discussing the possibility of a solution suitable for a large number of instances that would not require numerous individual commands to update instances.

Can you provide more detail about the problem? Specifically, we want to know:

1. Does a change in Ceph monitor IPs affect running instances? That is, do running instances lose connection to the monitor after a change in monitor IP?

2. After a monitor IP change, are there specific actions that cause the instance to lose connection to the monitor after they complete? For example, hard reboot, resize, start?

Comment 15 Daniel Dominguez 2017-04-20 09:54:27 UTC
Hello,

1. Does a change in Ceph monitor IPs affect running instances? That is, do running instances lose connection to the monitor after a change in monitor IP?

No, it does not affect running instances. QEMU is responsible for the instance's connections to the Ceph cluster. Because the QEMU layer is already connected to the Ceph cluster, it will know when a monitor has been added, deleted, or replaced.

2. After a monitor IP change, are there specific actions that cause the instance to lose connection to the monitor after they complete? For example, hard reboot, resize, start?

If a monitor IP has changed and the nova.block_device_mapping table has not been updated with the new monitor IP, the instance will still try to connect to the old monitor IP. So, in my opinion, here is the procedure Nova should follow after a monitor IP has changed (the database update step is sketched below the list):

	-Change monitor IP
	-Freeze cinder and nova operations.
	-Update nova database with new values.
	-Unfreeze nova and cinder operations.
	-Once instances are hard rebooted or stopped/started, they will get the new monitor IP in the QEMU XML file.
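
A purely illustrative sketch of the "update Nova database" step, not a supported procedure: it assumes Nova and Cinder operations are frozen as above and that a database backup has been taken first; the replacement monitor IP 10.72.0.50 is hypothetical, and 10.72.0.43 is taken from the example row in the description:

  mysql nova -e "UPDATE block_device_mapping \
                 SET connection_info = REPLACE(connection_info, '\"10.72.0.43\"', '\"10.72.0.50\"') \
                 WHERE deleted = 0 \
                   AND connection_info LIKE '%\"10.72.0.43\"%';"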

Thanks.

Comment 16 melanie witt 2017-04-20 15:20:28 UTC
(In reply to Daniel Dominguez from comment #15) 
> No, it does not affect running instances. Qemu is responsible of the
> instance's connections with the ceph cluster. Once a monitor has been
> added/deleted and because the Qemu layer is connected to the ceph cluster it
> will know that a monitor has been replaced.

Thanks for confirming that.

> If a monitor IP has changed and nova.block_device_mapping table has not been
> updated with the new monitor IP, instance connection will still trying to
> connect to old monitor IP.

Yes, I wanted to know whether you had noticed specific instance operations that result in the instance losing its connection to the monitor. For example, I suspect operations such as hard reboot, resize, and start after a stop will cause the instance to read the stale monitor IP from the database, leaving it unable to reconnect to the monitor.

> So, in my opinion here is the procedure nova
> should follow after a monitor IP has changed:
> 
> 	-Change monitor IP
> 	-Freeze cinder and nova operations.
> 	-Update nova database with new values.
> 	-Unfreeze nova and cinder operations.
> 	-Once instances are hard rebooted or stop/start, they will get the new
> monitor IP on the qemu XML file.

We have been thinking of an auto-heal approach where we 1) identify the operations that cause the instance to pull the stale monitor IPs from the database, and 2) auto-heal during those specific operations by querying the current IPs from Cinder and updating the Nova database before proceeding with the rest of the operation. That way, the fix is transparent to users and no special action would be needed to update instances. This is something we would work on upstream and bring into OSP.

Comment 17 Daniel Dominguez 2017-05-08 15:02:11 UTC
(In reply to melanie witt from comment #16)
> Yes, I wanted to know if you had noticed specific instance operations that
> result in the instance losing connection to the monitor. For example, I
> suspect operations such as: hard reboot, resize, and start from a stop will
> cause the instance to read the stale monitor IP from the database and cause
> the instance not to be able to reconnect to the monitor.

I think operations such as migrate, live-migration, evacuate, host-evacuate, host-evacuate-live, host-servers-migrate and shelve/unshelve will also cause the instance to read the stale monitor IP from the database, leaving it unable to reconnect to the monitor.

> We have been thinking of an auto-heal approach where we 1) identify the
> operations that cause the instance to pull stale monitor IP from the
> database 2) do an auto-heal during those specific operations that queries
> the current IPs from Cinder and updates the Nova database first before it
> proceeds with the rest of the operation. That way, the fix is transparent to
> users and no special action would be needed to update instances. This is
> something we would work on upstream and bring into OSP.

That is a much better option. Thanks for your help.

Comment 24 Edu Alcaniz 2018-01-22 07:25:16 UTC
Hi, is there any news about this RFE?

Comment 25 Lee Yarwood 2018-03-19 11:41:02 UTC
Dropping the FutureFeature keyword; this is really a bug fix.

Comment 31 Matthew Booth 2019-10-15 09:28:47 UTC
I am closing this bug as it has not been addressed for a very long time. Please feel free to reopen if it is still relevant.

