Description of problem:
When importing an external cluster that was deployed by the OSP-10 Director, the "Import Cluster" task succeeds but most of the individual Ceph nodes fail to register with the Storage Console.
Version-Release number of selected component (if applicable):
Red Hat Storage Console 2.0
Red Hat Ceph Storage 2.0
How reproducible: Always
Steps to Reproduce:
1. Deploy a Ceph cluster on baremetal hardware using OSP-10 Director
2. Deploy Storage Console in a VM, console agent on OSP Ceph nodes
3. Import the cluster into the Storage Console
Actual results:
The Storage Console shows most Ceph nodes (MON and OSD) with a red X, and hovering over the X displays "Failed." The Import Cluster task reports it was successful, and there are no failed tasks.
Each time I repeat the cluster import, one node gets a green check mark and a link to a page that displays stats for that node. It's not always the same node that succeeds; it might be a MON or an OSD, but it's always just one node.
Expected results:
All OSD and MON nodes are successfully registered with the Storage Console.
I have an sosreport for the Storage Console VM that I'll post as soon as possible (currently having network issues).
Created attachment 1240506 [details]
sosreport for Storage Console VM
At first look, this seems to me like a problem with a duplicated machine-id across the storage nodes.
Could you please check whether the content of the file '/etc/machine-id' is different on each storage node?
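For reference, one quick way to compare machine IDs across the nodes is to collect them over SSH and flag duplicates (the hostnames below are placeholders; assumes passwordless SSH to each node):

```shell
# Print each node's machine ID, then show only duplicated values.
# mon0/osd0/osd1 are hypothetical hostnames -- substitute your own.
for node in mon0 osd0 osd1; do
    ssh "$node" cat /etc/machine-id
done | sort | uniq -d
# Any line printed here is a machine ID shared by two or more nodes;
# no output means all IDs are unique.
```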
The machine IDs are unique because I made them so.
The first time I tried importing an external cluster they weren't unique (due to Bug #1270860), and that was apparent because a number of import tasks failed with the error, "Unable to add details of node: <Node’s FQDN> to DB, error: Node with id:<Node’s ID> already exists." I was alerted to the duplicate machine ID problem, which I resolved by generating new IDs for each node in the OSP overcloud.
Then I re-deployed a fresh Storage Console VM and tried to import the cluster again. The task failure messages no longer occur now that the machine IDs are unique, but I'm still seeing the symptom in the bug description.
I have one question regarding the IDs. After you changed the machine IDs, did you perform https://access.redhat.com/documentation/en/red-hat-storage-console/2.0/single/administration-guide/#troubleshooting_nodes_configuration - issue 2? Note that you can remove all keys from the console machine with the 'salt-key -D' command.
BTW, thanks for the sosreport, but the most important logs are not there; the contents of /var/log/skyring and /var/log/salt would be useful too.
(In reply to Alan Bishop from comment #5)
> The machine IDs are unique because I made them so.
> The first time I tried importing an external cluster they weren't unique
> (due to Bug #1270860), and that was apparent because a number of import
> tasks failed with the error, "Unable to add details of node: <Node’s FQDN>
> to DB, error: Node with id:<Node’s ID> already exists." I was alerted to the
> duplicate machine ID problem, which I resolved by generating new IDs for
> each node in the OSP overcloud.
> Then I re-deployed a fresh Storage Console VM and tried to import the
> cluster again. The task failure messages no longer occur now that the
> machine IDs are unique, but I'm still seeing the symptom in the bug
> description.
Hi Alan, can you please respond to comment 6, which I have just removed the private flag from? Thanks
(In reply to Lubos Trilety from comment #6)
No, I did not follow those steps (I think you mean for issue 3, not 2). I had just finished resolving the duplicate machine-id problem, and then I tried another import.
Unfortunately the original setup has been torn down, but I nearly have a clean replacement setup to try the import operation again. That is, I have a totally fresh OSP and Ceph deployment, and fresh Storage Console VM.
I'll try another import and report back.
I think that was it! (Issue 3 in https://access.redhat.com/documentation/en/red-hat-storage-console/2.0/single/administration-guide/#troubleshooting_nodes_configuration.) This time I modified the machine IDs to ensure they're unique *before* installing the storage console agent, and the import task was successful and all nodes are properly registered.
I think my original trouble was that I updated the machine IDs *after* installing the console agent, and I didn't know I needed to execute the corrective action outlined in the troubleshooting guide. Thanks, Lubos!
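For anyone who hits the same thing: a machine ID can be regenerated with systemd-machine-id-setup, run on each node *before* the console agent is installed. A minimal sketch; it is pointed at a scratch root here purely so the demo does not clobber the live system's own ID:

```shell
# On a real node you would remove /etc/machine-id and then run
# systemd-machine-id-setup with no arguments. The --root option is
# used here only to keep the demo away from this machine's real ID.
mkdir -p /tmp/demo-root/etc
systemd-machine-id-setup --root=/tmp/demo-root
cat /tmp/demo-root/etc/machine-id   # a fresh 32-character hex ID
```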
Closing as NOTABUG.