Description of problem:
When deploying a cluster with the current version of USM, the OSDs are not detected. As a result the OSD counter shows 0 and the user is unable to create pools. Pool creation in the frontend is blocked with the message 'There is no OSD in the selected cluster'. In the backend, the cluster was deployed healthy and 'ceph osd tree' lists the OSDs as expected. Additionally, the roles of the OSD hosts are correctly displayed in the "Hosts" tab of USM.

Version-Release number of selected component (if applicable):
recent - vagrant

How reproducible:
Always - confirmed by japplewh

Steps to Reproduce:
1. Deploy cluster with USM
2. Try to create a pool in the new cluster

Actual results:
Pool creation fails because USM does not know of any OSDs.

Expected results:
Pool creation succeeds.

Additional info:
(In reply to Chris Blum from comment #0)
> When deploying a cluster with the current version of USM,

Could you add a list of the specific versions of rhscon* and ceph packages on the RHSC 2.0 server machine? You can do that by running `rpm -qa rhscon*; rpm -qa ceph*` there.
[root@rhs-c vagrant]# rpm -qa rhscon*; rpm -qa ceph*
rhscon-core-0.0.32-1.el7scon.x86_64
rhscon-core-selinux-0.0.32-1.el7scon.noarch
rhscon-ui-0.0.46-1.el7scon.noarch
rhscon-ceph-0.0.31-1.el7scon.x86_64
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-installer-1.0.11-1.el7scon.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64
ceph-ansible-1.0.5-19.el7scon.noarch

[root@rhcs1 vagrant]# rpm -qa rhscon*; rpm -qa ceph*    <-- One of the MONs
rhscon-agent-0.0.14-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-release-1-1.el7.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64

[vagrant@rhcs2 ~]$ rpm -qa rhscon*; rpm -qa ceph*    <-- One of the OSDs
rhscon-agent-0.0.14-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-release-1-1.el7.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64

[root@rhcs1 vagrant]# ceph -s
    cluster e78ce715-bd8d-4c17-92f5-0ac87d33a3c2
     health HEALTH_OK
     monmap e3: 3 mons at {rhcs1=192.168.15.100:6789/0,rhcs4=192.168.15.103:6789/0,rhcs5=192.168.15.104:6789/0}
            election epoch 12, quorum 0,1,2 rhcs1,rhcs4,rhcs5
     osdmap e13: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v31: 0 pgs, 0 pools, 0 bytes data, 0 objects
            67940 kB used, 199 GB / 199 GB avail

[root@rhcs1 vagrant]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.19519 root default
-2 0.09760     host rhcs2
 0 0.09760         osd.0       up  1.00000          1.00000
-3 0.09760     host rhcs3
 1 0.09760         osd.1       up  1.00000          1.00000
This bug will be automatically fixed when the fix for bz 1349786 is available. The workaround is to use an older stable version of calamari, i.e. calamari-server-1.4.2-1.
I redeployed today and now get calamari-server-1.4.6-1.el7cp.x86_64. With this version I can see that the OSD count in USM is 1 (even though there are two OSDs). Although the number is not correct, I can now create pools and RBDs again, so I'm not blocked any more. The previously broken version was calamari-server-1.4.5.
Since the puddle.ceph repo has been discontinued for RHS-C, I moved to the http://download.eng.bos.redhat.com/rcm-guest/ceph-drops/auto/ repositories. This change can be seen in this commit: https://github.com/red-hat-storage/RHCS-vagrant/commit/74f1d6cc39686b942fe11566ad47f5879f03da0f

Now calamari-server-1.4.7-1.el7cp.x86_64 gets installed on the MONs and this bug has reappeared. Sadly, I'm currently blocked again, since no OSDs are detected in the USM setup and thus I cannot create pools.
I got access briefly and didn't learn much. I need some additional details to trace this down. What I'd like is:
- /var/log/calamari/cthulhu.log*
- the data returned by the calamari API endpoint /api/v2/cluster/<FSID>/sync_object/osd_map

Thanks
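For reference, a small sketch of how the requested endpoint URL is composed (the localhost:8002 address is an assumption based on the usual calamari default; adjust host, port, and credentials for the actual setup):

```shell
# Build the calamari sync_object URL for a given cluster FSID.
# Assumes calamari listens on localhost:8002 (adjust if yours differs).
calamari_osd_map_url() {
  printf 'http://localhost:8002/api/v2/cluster/%s/sync_object/osd_map\n' "$1"
}

# FSID taken from the `ceph -s` output earlier in this bug:
calamari_osd_map_url e78ce715-bd8d-4c17-92f5-0ac87d33a3c2
# Fetch it (substitute real API credentials):
# curl -s -u <user>:<pass> "$(calamari_osd_map_url e78ce715-bd8d-4c17-92f5-0ac87d33a3c2)"
```

The log files can be collected alongside with `tar czf cthulhu-logs.tgz /var/log/calamari/cthulhu.log*`.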
Created attachment 1182217 [details]
api-v2-cluster-<FSID>-osd

Created attachment 1182218 [details]
api-v2-cluster-<FSID>-sync_object-osd_map

Created attachment 1182220 [details]
cthulhu.log
Based on what I see here, I don't know of a reason why calamari would be preventing the OSDs from appearing in the storage console. Shubhendu, would you take a look and see if I'm missing something?
Looking at the api/v2/cluster/<fsid>/osd output, I suspect one thing here. The server names listed for the OSDs are without a domain name, and I suspect USM could be holding full FQDNs like dhcp42-13.eng.lab.blr.redhat.com. This name mismatch would result in the deletion of the OSDs that get created during the create-cluster flow: later, while syncing the OSD status etc., the host names might not match, and so the OSDs would be deleted from the USM DB.

To verify, you can attach a screenshot of the hosts list in the USM UI here as well, or provide access to the setup and I can debug this further.
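The suspected failure mode can be sketched as follows (a minimal illustration, not the actual USM code; the function name and data shapes are hypothetical):

```python
# Sketch of the suspected name mismatch: if USM stores FQDNs while calamari
# reports short host names, a sync that matches OSDs to hosts by exact name
# would find no match and drop every OSD from the DB.
def sync_osds(db_hosts, reported_osds):
    """Keep only OSDs whose reported server name exactly matches a known host."""
    return [osd for osd in reported_osds if osd["server"] in db_hosts]

db_hosts = {"dhcp42-13.eng.lab.blr.redhat.com"}      # FQDN stored by USM
reported_osds = [{"id": 0, "server": "dhcp42-13"}]   # short name from calamari

print(len(sync_osds(db_hosts, reported_osds)))  # prints 0: all OSDs dropped
```

A tolerant comparison (e.g. matching on the short name, `name.split(".")[0]`) would avoid this class of problem.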
Created attachment 1182825 [details]
Screenshot of the Hosts page

This shows that the hosts in USM are not known by any FQDN/domain.
My upload shows that the hosts don't have a domain in USM, so I guess that contradicts your theory. Any other reason why the OSD number could be zero?
At the moment I cannot think of anything else. Maybe you should verify the URL api/v2/cluster/<fsid>/osd for calamari and see if the values for the server attribute are populated properly; I believe this is the same thing Gregory asked for in comment #7. It certainly would be better if I got access to this setup and could look at it in detail.
I have just redeployed RHSCon and set all hostname mentions to lowercase letters. Now all hosts are lowercase in the UI and in salt-key, but the OSD count is still 0. Were there any changes in the code regarding this?

Version: 0.0.39
Provider: Ceph
Version: 0.0.39
Monitoring: Graphite
Version: 0.9.15
Database: Mongo DB
Version: 2.6.11

The setup is currently running - ping me in Slack if you want access.
I see the same behavior where the OSD count is 0 in the UI. GET /api/v2/cluster/<fsid>/osd returns nothing from the API:

curl: (52) Empty reply from server

This cluster has a lot of OSDs (960). Running the same API call on a test cluster, I get proper output. From dmesg, I see the following every time I run the API call:

[Mar13 14:01] /opt/calamari/v invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[  +0.000005] /opt/calamari/v cpuset=/ mems_allowed=0-1
[  +0.000003] CPU: 5 PID: 964516 Comm: /opt/calamari/v Not tainted 3.10.0-514.6.2.el7.x86_64 #1
[  +0.000001] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.35 10/20/2016
[  +0.000002] ffff881fb7b81f60 00000000e922335f ffff883e50bfbcc0 ffffffff816861ac
[  +0.000003] ffff883e50bfbd50 ffffffff81681157 ffffffff812ae86b 00000000000000d0
[  +0.000002] ffff883e50bfbd20 ffffffff811f16ae fffeefff00000000 000000000000000a
[  +0.000002] Call Trace:
[  +0.000006]  [<ffffffff816861ac>] dump_stack+0x19/0x1b
[  +0.000004]  [<ffffffff81681157>] dump_header+0x8e/0x225
[  +0.000005]  [<ffffffff812ae86b>] ? cred_has_capability+0x6b/0x120
[  +0.000005]  [<ffffffff811f16ae>] ? mem_cgroup_reclaim+0x4e/0x120
[  +0.000006]  [<ffffffff8118475e>] oom_kill_process+0x24e/0x3c0
[  +0.000004]  [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
[  +0.000002]  [<ffffffff811f3121>] mem_cgroup_oom_synchronize+0x551/0x580
[  +0.000002]  [<ffffffff811f2570>] ? mem_cgroup_charge_common+0xc0/0xc0
[  +0.000003]  [<ffffffff81184fe4>] pagefault_out_of_memory+0x14/0x90
[  +0.000002]  [<ffffffff8167ef47>] mm_fault_error+0x68/0x12b
[  +0.000004]  [<ffffffff81691ed5>] __do_page_fault+0x395/0x450
[  +0.000002]  [<ffffffff81691fc5>] do_page_fault+0x35/0x90
[  +0.000002]  [<ffffffff8168e288>] page_fault+0x28/0x30
[  +0.000003] Task in /system.slice/supervisord.service killed as a result of limit of /system.slice/supervisord.service
[  +0.000002] memory: usage 1048576kB, limit 1048576kB, failcnt 1966
[  +0.000001] memory+swap: usage 1048576kB, limit 9007199254740988kB, failcnt 0
[  +0.000001] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[  +0.000000] Memory cgroup stats for /system.slice/supervisord.service: cache:12KB rss:1048564KB rss_huge:32768KB mapped_file:0KB swap:0KB inactive_anon:4KB ac
[  +0.000014] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  +0.000565] [113030]    0 113030    55517     3340      59        0             0 supervisord
[  +0.000014] [964516]    0 964516   621542   261952     672        0             0 /opt/calamari/v
[  +0.000006] Memory cgroup out of memory: Kill process 966289 (/opt/calamari/v) score 971 or sacrifice child
[  +0.010157] Killed process 964516 (/opt/calamari/v) total-vm:2486168kB, anon-rss:1036312kB, file-rss:11496kB, shmem-rss:0kB

I uncovered https://bugzilla.redhat.com/show_bug.cgi?id=1431787 this morning, but it turns out that 1GB of RAM is not enough to make this API call. I just bumped the value to 4GB and now I get complete output.
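The limit being hit above is the cgroup memory limit on supervisord.service (1048576kB = 1GB). One way to raise it to the 4GB mentioned above is a systemd drop-in override (a sketch; the drop-in path and unit name are assumptions, adjust for the actual system):

```
# /etc/systemd/system/supervisord.service.d/memory.conf  (hypothetical path)
[Service]
MemoryLimit=4G
```

After creating the file, run `systemctl daemon-reload` and `systemctl restart supervisord` for the new limit to take effect.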
This product is EOL now