Bug 1353987

Summary: OSD Nodes are not detected
Product: [Red Hat Storage] Red Hat Storage Console
Reporter: Chris Blum <cblum>
Component: core
Sub component: events
Assignee: Nishanth Thomas <nthomas>
QA Contact: sds-qe-bugs
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
CC: branto, gmeno, mbukatov, mkudlej, shtripat
Version: 2
Target Milestone: ---
Target Release: 3
Hardware: Unspecified
OS: Unspecified
Bug Depends On: 1349786    
Bug Blocks:    
Attachments:
api-v2-cluster-<FSID>-osd
api-v2-cluster-<FSID>-sync_object-osd_map
cthulhu.log
Screenshot of the Hosts page

Description Chris Blum 2016-07-08 15:32:24 UTC
Description of problem:
When deploying a cluster with the current version of USM, the OSDs are not detected. As a result the OSD counter stays at 0 and the user cannot create pools; pool creation in the frontend is blocked with the message 'There is no OSD in the selected cluster'.
On the backend the cluster was deployed healthy and 'ceph osd tree' lists the OSDs as expected. Additionally, the roles of the OSD hosts are correctly displayed in the "Hosts" tab of USM.

Version-Release number of selected component (if applicable):
recent - vagrant

How reproducible:
Always - confirmed by japplewh

Steps to Reproduce:
1. Deploy cluster with USM
2. Try to create pool in new cluster

Actual results:
Pool creation fails because USM does not know of any OSDs

Expected results:
Pool creation succeeds

Additional info:

Comment 2 Martin Bukatovic 2016-07-11 10:30:58 UTC
(In reply to Chris Blum from comment #0)
> When deploying a cluster with the current version of USM, 

Could you add a list of the specific versions of the rhscon* and ceph packages on the RHSC
2.0 server machine? You can do that by running `rpm -qa rhscon*; rpm -qa ceph*`
there.

Comment 3 Chris Blum 2016-07-11 13:09:05 UTC
[root@rhs-c vagrant]# rpm -qa rhscon*; rpm -qa ceph*
rhscon-core-0.0.32-1.el7scon.x86_64
rhscon-core-selinux-0.0.32-1.el7scon.noarch
rhscon-ui-0.0.46-1.el7scon.noarch
rhscon-ceph-0.0.31-1.el7scon.x86_64
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-installer-1.0.11-1.el7scon.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64
ceph-ansible-1.0.5-19.el7scon.noarch

[root@rhcs1 vagrant]# rpm -qa rhscon*; rpm -qa ceph* <-- One of the MONs
rhscon-agent-0.0.14-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-release-1-1.el7.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64

[vagrant@rhcs2 ~]$ rpm -qa rhscon*; rpm -qa ceph* <-- One of the OSDs
rhscon-agent-0.0.14-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-release-1-1.el7.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64

[root@rhcs1 vagrant]# ceph -s
    cluster e78ce715-bd8d-4c17-92f5-0ac87d33a3c2
     health HEALTH_OK
     monmap e3: 3 mons at {rhcs1=192.168.15.100:6789/0,rhcs4=192.168.15.103:6789/0,rhcs5=192.168.15.104:6789/0}
            election epoch 12, quorum 0,1,2 rhcs1,rhcs4,rhcs5
     osdmap e13: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v31: 0 pgs, 0 pools, 0 bytes data, 0 objects
            67940 kB used, 199 GB / 199 GB avail

[root@rhcs1 vagrant]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.19519 root default
-2 0.09760     host rhcs2
 0 0.09760         osd.0       up  1.00000          1.00000
-3 0.09760     host rhcs3
 1 0.09760         osd.1       up  1.00000          1.00000

Comment 4 Nishanth Thomas 2016-07-13 19:04:05 UTC
This bug will be resolved automatically once the fix for bz 1349786 is available. The workaround is to use the older stable version of calamari, i.e. calamari-server-1.4.2-1.
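
A minimal sketch of that workaround, assuming the older calamari-server build is still reachable from the configured yum repositories (hostnames and the service to restart afterwards depend on the setup):

  # on the calamari/MON node(s); assumes the 1.4.2-1 build is available via yum
  yum downgrade calamari-server-1.4.2-1
  # restart whatever manages the calamari processes on this setup, e.g. supervisord
  systemctl restart supervisord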

Comment 5 Chris Blum 2016-07-15 17:30:33 UTC
I redeployed today and now get
calamari-server-1.4.6-1.el7cp.x86_64
With this version the OSD count in USM is 1 (even though there are two OSDs). The number is still wrong, but I can now create pools and RBDs again, so I'm no longer blocked.

The previous broken version was calamari-server-1.4.5

Comment 6 Chris Blum 2016-07-20 07:03:13 UTC
Since the puddle.ceph repo has been discontinued for RHS-C, I moved to the http://download.eng.bos.redhat.com/rcm-guest/ceph-drops/auto/ repositories.
This change can be seen in this commit:
https://github.com/red-hat-storage/RHCS-vagrant/commit/74f1d6cc39686b942fe11566ad47f5879f03da0f

Now calamari-server-1.4.7-1.el7cp.x86_64 gets installed on the MONs and this bug has reappeared. Sadly I'm blocked again, since no OSDs are detected in the USM setup and thus I cannot create pools.

Comment 7 Christina Meno 2016-07-20 18:02:00 UTC
I got access briefly and didn't learn much.

I need some additional details to trace this down.

What I'd like is /var/log/calamari/cthulhu.log*
and the data returned by the calamari API endpoint /api/v2/cluster/<FSID>/sync_object/osd_map
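
Something along these lines should capture both (a sketch; the calamari host, credentials, and auth scheme are placeholders and depend on how the API is exposed on this setup):

  # collect the cthulhu logs
  tar czf cthulhu-logs.tgz /var/log/calamari/cthulhu.log*
  # dump the osd_map sync object from the calamari REST API
  FSID=$(ceph fsid)
  curl -s -k -u <user>:<password> "https://<calamari-host>/api/v2/cluster/${FSID}/sync_object/osd_map" -o osd_map.json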

thanks

Comment 8 Chris Blum 2016-07-20 18:46:48 UTC
Created attachment 1182217 [details]
api-v2-cluster-<FSID>-osd

Comment 9 Chris Blum 2016-07-20 18:47:17 UTC
Created attachment 1182218 [details]
api-v2-cluster-<FSID>-sync_object-osd_map

Comment 10 Chris Blum 2016-07-20 18:47:38 UTC
Created attachment 1182220 [details]
cthulhu.log

Comment 11 Christina Meno 2016-07-20 19:42:48 UTC
Based on what I see here, I don't know of a reason why calamari would be preventing the OSDs from appearing in the storage console.

Shubhendu would you take a look and see if I'm missing something?

Comment 12 Shubhendu Tripathi 2016-07-20 19:51:35 UTC
Looking at the api/v2/cluster/<fsid>/osd output, I suspect one thing here:
the server names listed for the OSDs have no domain part, while USM may be storing full FQDNs such as dhcp42-13.eng.lab.blr.redhat.com.
Such a name mismatch would cause the OSDs created during the create-cluster flow to be deleted: later, while syncing the OSD status, the host names would not match, so the OSDs would be removed from the USM DB.

To verify, you can attach a screenshot of the hosts list in the USM UI here, or provide access to the setup and I can debug this further. A quick comparison of the two name sets is sketched below.
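
A quick way to compare them (a sketch; host, credentials and FSID are placeholders, and it assumes the endpoint returns a JSON list of OSD objects with a "server" field as described above):

  # server names calamari reports for the OSDs
  curl -s -k -u <user>:<password> "https://<calamari-host>/api/v2/cluster/<FSID>/osd" \
    | python -c 'import json,sys; print(sorted({o.get("server") for o in json.load(sys.stdin)}))'
  # names the OSD nodes report about themselves (run on each OSD host)
  hostname -s; hostname -f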

Comment 13 Chris Blum 2016-07-22 11:36:00 UTC
Created attachment 1182825 [details]
Screenshot of the Hosts page

This shows that the hosts in USM are not known by any FQDN/domain

Comment 14 Chris Blum 2016-07-22 11:38:47 UTC
My upload shows that the hosts don't have a domain in USM, so I guess that contradicts your theory. Any other reason why the OSD count could be zero?

Comment 15 Shubhendu Tripathi 2016-07-22 17:26:52 UTC
At the moment I cannot think of anything else.
Maybe you should query the calamari URL api/v2/cluster/<fsid>/osd and check whether the values of the server attribute are populated properly.

I believe this is the same data Gregory asked for in comment #7.

It would certainly be better if I could get access to this setup and look into it in detail.

Comment 16 Chris Blum 2016-08-03 19:02:23 UTC
I have just redeployed RHSCon and changed all hostname mentions to lowercase letters. Now all hosts are lowercase in the UI and in salt-key, but the OSD count is still 0.
Were there any changes in the code regarding this?

Version: 0.0.39
Provider: Ceph, Version: 0.0.39
Monitoring: Graphite, Version: 0.9.15
Database: MongoDB, Version: 2.6.11

Setup is currently running - ping me in Slack if you want access

Comment 17 Alexandre Marangone 2017-03-13 21:07:34 UTC
I see the same behavior where the OSD count is 0 in the UI.

GET /api/v2/cluster/<fsid>/osd returns nothing from the API:
curl: (52) Empty reply from server
This cluster has a lot of OSDs (960). 

Running the same API call on a test cluster, I get proper output.

From dmesg, I see the following every time I run the API call:
[Mar13 14:01] /opt/calamari/v invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[  +0.000005] /opt/calamari/v cpuset=/ mems_allowed=0-1
[  +0.000003] CPU: 5 PID: 964516 Comm: /opt/calamari/v Not tainted 3.10.0-514.6.2.el7.x86_64 #1
[  +0.000001] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.35 10/20/2016
[  +0.000002]  ffff881fb7b81f60 00000000e922335f ffff883e50bfbcc0 ffffffff816861ac
[  +0.000003]  ffff883e50bfbd50 ffffffff81681157 ffffffff812ae86b 00000000000000d0
[  +0.000002]  ffff883e50bfbd20 ffffffff811f16ae fffeefff00000000 000000000000000a
[  +0.000002] Call Trace:
[  +0.000006]  [<ffffffff816861ac>] dump_stack+0x19/0x1b
[  +0.000004]  [<ffffffff81681157>] dump_header+0x8e/0x225
[  +0.000005]  [<ffffffff812ae86b>] ? cred_has_capability+0x6b/0x120
[  +0.000005]  [<ffffffff811f16ae>] ? mem_cgroup_reclaim+0x4e/0x120
[  +0.000006]  [<ffffffff8118475e>] oom_kill_process+0x24e/0x3c0
[  +0.000004]  [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
[  +0.000002]  [<ffffffff811f3121>] mem_cgroup_oom_synchronize+0x551/0x580
[  +0.000002]  [<ffffffff811f2570>] ? mem_cgroup_charge_common+0xc0/0xc0
[  +0.000003]  [<ffffffff81184fe4>] pagefault_out_of_memory+0x14/0x90
[  +0.000002]  [<ffffffff8167ef47>] mm_fault_error+0x68/0x12b
[  +0.000004]  [<ffffffff81691ed5>] __do_page_fault+0x395/0x450
[  +0.000002]  [<ffffffff81691fc5>] do_page_fault+0x35/0x90
[  +0.000002]  [<ffffffff8168e288>] page_fault+0x28/0x30
[  +0.000003] Task in /system.slice/supervisord.service killed as a result of limit of /system.slice/supervisord.service
[  +0.000002] memory: usage 1048576kB, limit 1048576kB, failcnt 1966
[  +0.000001] memory+swap: usage 1048576kB, limit 9007199254740988kB, failcnt 0
[  +0.000001] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[  +0.000000] Memory cgroup stats for /system.slice/supervisord.service: cache:12KB rss:1048564KB rss_huge:32768KB mapped_file:0KB swap:0KB inactive_anon:4KB ac
[  +0.000014] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  +0.000565] [113030]     0 113030    55517     3340      59        0             0 supervisord
[  +0.000014] [964516]     0 964516   621542   261952     672        0             0 /opt/calamari/v
[  +0.000006] Memory cgroup out of memory: Kill process 966289 (/opt/calamari/v) score 971 or sacrifice child
[  +0.010157] Killed process 964516 (/opt/calamari/v) total-vm:2486168kB, anon-rss:1036312kB, file-rss:11496kB, shmem-rss:0kB


I uncovered this this morning: https://bugzilla.redhat.com/show_bug.cgi?id=1431787, but it turns out that 1 GB of RAM is not enough to serve this API call. I just bumped the limit to 4 GB and now I get a complete output.
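
For reference, raising that cgroup limit can be done with a systemd drop-in along these lines (a sketch; the shipped unit may set the original 1 GB limit elsewhere, cf. bug 1431787):

  # /etc/systemd/system/supervisord.service.d/memory.conf
  [Service]
  MemoryLimit=4G

  # apply and restart the supervised calamari processes
  systemctl daemon-reload
  systemctl restart supervisord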

Comment 20 Shubhendu Tripathi 2018-11-19 05:44:02 UTC
This product is EOL now