Description of problem:
When importing an external cluster that was deployed by the OSP-10 Director, the "Import Cluster" task succeeds but most of the individual Ceph nodes fail to register with the Storage Console.
Version-Release number of selected component (if applicable):
Red Hat Storage Console 2.0
Red Hat Ceph Storage 2.0
How reproducible: Always
Steps to Reproduce:
1. Deploy a Ceph cluster on baremetal hardware using OSP-10 Director
2. Deploy Storage Console in a VM, console agent on OSP Ceph nodes
3. Import the cluster into the Storage Console
Actual results:
The Storage Console shows most Ceph nodes (MON and OSD) with a red X, and hovering over the X displays "Failed." The Import Cluster task reports it was successful, and there are no failed tasks.
Each time I repeat the cluster import, one node gets a green check mark and a link to a page that displays stats for that node. It's not always the same node that succeeds; it might be a MON or an OSD, but it's always just one node.
Expected results:
All OSD and MON nodes are successfully registered with the Storage Console.
I have an sosreport for the Storage Console VM that I'll post as soon as possible (currently having network issues).
Created attachment 1240506 [details]
sosreport for Storage Console VM
At first look, this seems to me like a problem with a duplicated machine-id across the storage nodes.
Could you please check whether the content of the file '/etc/machine-id' is different on each storage node?
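For reference, one quick way to compare machine IDs across the nodes is to collect them over SSH and flag duplicates (the hostnames below are placeholders; assumes passwordless SSH to each node):

```shell
# Print each node's machine ID, then show only duplicated values.
# mon0/osd0/osd1 are hypothetical hostnames -- substitute your own.
for node in mon0 osd0 osd1; do
    ssh "$node" cat /etc/machine-id
done | sort | uniq -d
# Any line printed here is a machine ID shared by two or more nodes;
# no output means all IDs are unique.
```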
The machine IDs are unique because I made them so.
The first time I tried importing an external cluster they weren't unique (due to Bug #1270860), and that was apparent because a number of import tasks failed with the error, "Unable to add details of node: <Node’s FQDN> to DB, error: Node with id:<Node’s ID> already exists." I was alerted to the duplicate machine ID problem, which I resolved by generating new IDs for each node in the OSP overcloud.
Then I re-deployed a fresh Storage Console VM and tried to import the cluster again. The task failure messages no longer occur now that the machine IDs are unique, but I'm still seeing the symptom in the bug description.
I have one question regarding the IDs. After you changed the machine IDs, did you perform https://access.redhat.com/documentation/en/red-hat-storage-console/2.0/single/administration-guide/#troubleshooting_nodes_configuration - issue 2? Note that you can remove all keys from the console machine with the 'salt-key -D' command.
BTW, thanks for the sosreport, but the most important logs are not there; the contents of /var/log/skyring and /var/log/salt would be useful too.
(In reply to Alan Bishop from comment #5)
> The machine IDs are unique because I made them so.
> The first time I tried importing an external cluster they weren't unique
> (due to Bug #1270860), and that was apparent because a number of import
> tasks failed with the error, "Unable to add details of node: <Node’s FQDN>
> to DB, error: Node with id:<Node’s ID> already exists." I was alerted to the
> duplicate machine ID problem, which I resolved by generating new IDs for
> each node in the OSP overcloud.
> Then I re-deployed a fresh Storage Console VM and tried to import the
> cluster again. The task failure messages no longer occur now that the
> machine IDs are unique, but I'm still seeing the symptom in the bug
> description.
Hi Alan, can you please respond to comment 6, which I have just removed the private flag from? Thanks
(In reply to Lubos Trilety from comment #6)
No, I did not follow those steps (I think you mean for issue 3, not 2). I had just finished resolving the duplicate machine-id problem, and then I tried another import.
Unfortunately the original setup has been torn down, but I nearly have a clean replacement setup to try the import operation again. That is, I have a totally fresh OSP and Ceph deployment, and fresh Storage Console VM.
I'll try another import and report back.
I think that was it! (Issue 3 in https://access.redhat.com/documentation/en/red-hat-storage-console/2.0/single/administration-guide/#troubleshooting_nodes_configuration.) This time I modified the machine IDs to ensure they're unique *before* installing the storage console agent, and the import task was successful and all nodes are properly registered.
I think my original trouble was that I updated the machine IDs *after* installing the console agent, and I didn't know I needed to execute the corrective action outlined in the troubleshooting guide. Thanks, Lubos!
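For anyone who hits the same thing: a machine ID can be regenerated with systemd-machine-id-setup, run on each node *before* the console agent is installed. A minimal sketch; it is pointed at a scratch root here purely so the demo does not clobber the live system's own ID:

```shell
# On a real node you would remove /etc/machine-id and then run
# systemd-machine-id-setup with no arguments. The --root option is
# used here only to keep the demo away from this machine's real ID.
mkdir -p /tmp/demo-root/etc
systemd-machine-id-setup --root=/tmp/demo-root
cat /tmp/demo-root/etc/machine-id   # a fresh 32-character hex ID
```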
Closing as NOTABUG.