Bug 1413152

Summary: Error registering Ceph nodes when importing external Ceph cluster into Storage Console
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Alan Bishop <alan_bishop>
Component: node-monitoring
Assignee: anmol babu <anbabu>
Status: CLOSED NOTABUG
QA Contact: sds-qe-bugs
Severity: unspecified
Priority: unspecified
Version: 2
CC: alan_bishop, arkady_kanevsky, cdevine, christopher_dearborn, dahorak, dcain, John_walsh, kasmith, kurt_hey, ltrilety, mkarnik, morazi, nthomas, randy_perryman, rkanade, sankarshan, smerrow, sreichar
Target Milestone: ---
Target Release: 3
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-01-19 20:44:38 UTC
Type: Bug
Bug Blocks: 1335596, 1356451    
Attachments:
sosreport for Storage Console VM

Description Alan Bishop 2017-01-13 18:30:06 UTC
Description of problem:

When importing an external cluster that was deployed by the OSP-10 Director, the "Import Cluster" task succeeds but most of the individual Ceph nodes fail to register with the Storage Console.

Version-Release number of selected component (if applicable):

Red Hat Storage Console 2.0
Red Hat Ceph Storage 2.0
OSP-10

How reproducible: Always


Steps to Reproduce:
1. Deploy a Ceph cluster on bare-metal hardware using OSP-10 Director (roughly as sketched below)
2. Deploy Storage Console in a VM, console agent on OSP Ceph nodes
3. Import the cluster into the Storage Console
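
For step 1, the Director deployment was along these lines. This is only a rough sketch: the environment file path and scale flags are the stock OSP-10 ones as I recall them, and the exact templates, flavors, and node counts are specific to my lab, not something this bug depends on.

    # run on the undercloud as the stack user (sketch, not the literal command used)
    source ~/stackrc
    openstack overcloud deploy --templates \
      -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
      --control-scale 3 --compute-scale 1 --ceph-storage-scale 3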

Actual results:

The Storage Console shows most Ceph nodes (MON and OSD) with a red X, and hovering over the X displays "Failed." The Import Cluster task reports it was successful, and there are no failed tasks.

Each time I repeat the cluster import, exactly one node gets a green check mark and a link to a page that displays stats for that node. It's not always the same node, and it may be either a MON or an OSD, but only ever one node registers successfully.

Expected results:

All OSD and MON nodes are successfully registered with the Storage Console.

Additional info:

Comment 2 Alan Bishop 2017-01-13 19:14:05 UTC
I have an sosreport for the Storage Console VM that I'll post as soon as possible (currently having network issues).

Comment 3 Alan Bishop 2017-01-13 20:28:33 UTC
Created attachment 1240506 [details]
sosreport for Storage Console VM

Comment 4 Daniel Horák 2017-01-16 09:45:42 UTC
At first glance, this looks to me like a problem with duplicated machine-ids across the storage nodes.

Could you please check whether the content of the file '/etc/machine-id' is different on each storage node?
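
For example, something like this, run from a host that can reach all the storage nodes, would show it quickly (the hostnames and SSH user below are just placeholders for your environment):

    # print each node's machine-id; any repeated value means duplicates
    for node in ceph-node-1 ceph-node-2 ceph-node-3; do
        printf '%s ' "$node"
        ssh heat-admin@"$node" cat /etc/machine-id
    done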

Comment 5 Alan Bishop 2017-01-16 12:51:26 UTC
The machine IDs are unique because I made them so. 

The first time I tried importing an external cluster they weren't unique (due to Bug #1270860), and that was apparent because a number of import tasks failed with the error, "Unable to add details of node: <Node’s FQDN> to DB, error: Node with id:<Node’s ID> already exists." I was alerted to the duplicate machine ID problem, which I resolved by generating new IDs for each node in the OSP overcloud.
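
For reference, regenerating the IDs amounted to something like the following on each overcloud node (a sketch run as root, not the literal commands):

    # wipe the duplicated id and let systemd generate a fresh random one
    rm -f /etc/machine-id
    systemd-machine-id-setup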

Then I re-deployed a fresh Storage Console VM and tried to import the cluster again. The task failure messages no longer occur now that the machine IDs are unique, but I'm still seeing the symptom in the bug description.

Comment 6 Lubos Trilety 2017-01-17 07:38:02 UTC
I have one question regarding the IDs. After you changed the machine IDs, did you perform the steps in https://access.redhat.com/documentation/en/red-hat-storage-console/2.0/single/administration-guide/#troubleshooting_nodes_configuration - issue 2? Note that you can remove all keys from the console machine with the 'salt-key -D' command.
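
In short, the cleanup described there boils down to something like this on the console machine (a sketch; the linked guide has the exact steps):

    salt-key -L       # list the minion keys the console currently knows about
    salt-key -D -y    # delete all keys so the nodes can re-register under their new machine-ids
    # then restart salt-minion on each storage node so it re-submits its key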

BTW, thanks for the sosreport, but the most important logs are not there; the contents of /var/log/skyring and /var/log/salt would be useful too.

(In reply to Alan Bishop from comment #5)
> The machine IDs are unique because I made them so. 
> 
> The first time I tried importing an external cluster they weren't unique
> (due to Bug #1270860), and that was apparent because a number of import
> tasks failed with the error, "Unable to add details of node: <Node’s FQDN>
> to DB, error: Node with id:<Node’s ID> already exists." I was alerted to the
> duplicate machine ID problem, which I resolved by generating new IDs for
> each node in the OSP overcloud.
> 
> Then I re-deployed a fresh Storage Console VM and tried to import the
> cluster again. The task failure messages no longer occur now that the
> machine IDs are unique, but I'm still seeing the symptom in the bug
> description.

Comment 7 Sean Merrow 2017-01-19 18:16:32 UTC
Hi Alan, can you please respond to comment 6, which I have just removed the private flag from? Thanks

Comment 8 Alan Bishop 2017-01-19 18:29:00 UTC
(In reply to Lubos Trilety from comment #6)
No, I did not follow those steps (I think you mean for issue 3, not 2). I had just finished resolving the duplicate machine-id problem, and then I tried another import.

Unfortunately the original setup has been torn down, but I nearly have a clean replacement setup for trying the import operation again: a totally fresh OSP and Ceph deployment and a fresh Storage Console VM.

I'll try another import and report back.

Comment 9 Alan Bishop 2017-01-19 20:44:38 UTC
I think that was it! (Issue 3 in https://access.redhat.com/documentation/en/red-hat-storage-console/2.0/single/administration-guide/#troubleshooting_nodes_configuration.) This time I modified the machine IDs to ensure they're unique *before* installing the storage console agent, and the import task succeeded and all nodes are properly registered.

I think my original trouble was that I updated the machine IDs *after* installing the console agent and didn't know I needed to execute the corrective action outlined in the troubleshooting guide. Thanks, Lubos!
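
For anyone who hits the same thing, the order that worked for me was roughly this (a sketch of the sequence, not exact commands):

    # on every MON/OSD node, BEFORE installing the storage console agent:
    rm -f /etc/machine-id
    systemd-machine-id-setup      # each node now has its own unique machine-id
    # then install/start the console agent on the nodes and run the import from the console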

Closing as NOTABUG.