Bug 1353987
| Field | Value |
|---|---|
| Summary | OSD Nodes are not detected |
| Product | [Red Hat Storage] Red Hat Storage Console |
| Component | core |
| Core sub component | events |
| Status | CLOSED EOL |
| Reporter | Chris Blum <cblum> |
| Assignee | Nishanth Thomas <nthomas> |
| QA Contact | sds-qe-bugs |
| Severity | unspecified |
| Priority | unspecified |
| CC | branto, gmeno, mbukatov, mkudlej, shtripat |
| Version | 2 |
| Target Milestone | --- |
| Target Release | 3 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Bug Depends On | 1349786 |
Description — Chris Blum, 2016-07-08 15:32:24 UTC
(In reply to Chris Blum from comment #0)
> When deploying a cluster with the current version of USM,

Could you add a list of the specific versions of the rhscon* and ceph* packages on the RHSC 2.0 server machine? You can get them by running `rpm -qa rhscon*; rpm -qa ceph*` there.

```
[root@rhs-c vagrant]# rpm -qa rhscon*; rpm -qa ceph*
rhscon-core-0.0.32-1.el7scon.x86_64
rhscon-core-selinux-0.0.32-1.el7scon.noarch
rhscon-ui-0.0.46-1.el7scon.noarch
rhscon-ceph-0.0.31-1.el7scon.x86_64
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-installer-1.0.11-1.el7scon.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64
ceph-ansible-1.0.5-19.el7scon.noarch
```

One of the MONs:

```
[root@rhcs1 vagrant]# rpm -qa rhscon*; rpm -qa ceph*
rhscon-agent-0.0.14-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-release-1-1.el7.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64
```

One of the OSDs:

```
[vagrant@rhcs2 ~]$ rpm -qa rhscon*; rpm -qa ceph*
rhscon-agent-0.0.14-1.el7scon.noarch
rhscon-core-selinux-0.0.32-1.el7scon.noarch
ceph-common-10.2.1-13.el7cp.x86_64
ceph-mds-10.2.1-13.el7cp.x86_64
ceph-deploy-1.5.33-1.el7cp.noarch
ceph-release-1-1.el7.noarch
ceph-selinux-10.2.1-13.el7cp.x86_64
ceph-mon-10.2.1-13.el7cp.x86_64
ceph-osd-10.2.1-13.el7cp.x86_64
ceph-radosgw-10.2.1-13.el7cp.x86_64
ceph-base-10.2.1-13.el7cp.x86_64
ceph-10.2.1-13.el7cp.x86_64
```

Cluster status:

```
[root@rhcs1 vagrant]# ceph -s
    cluster e78ce715-bd8d-4c17-92f5-0ac87d33a3c2
     health HEALTH_OK
     monmap e3: 3 mons at {rhcs1=192.168.15.100:6789/0,rhcs4=192.168.15.103:6789/0,rhcs5=192.168.15.104:6789/0}
            election epoch 12, quorum 0,1,2 rhcs1,rhcs4,rhcs5
     osdmap e13: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v31: 0 pgs, 0 pools, 0 bytes data, 0 objects
            67940 kB used, 199 GB / 199 GB avail

[root@rhcs1 vagrant]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.19519 root default
-2 0.09760     host rhcs2
 0 0.09760         osd.0       up  1.00000          1.00000
-3 0.09760     host rhcs3
 1 0.09760         osd.1       up  1.00000          1.00000
```

This bug will be fixed automatically when the fix for bz 1349786 is available. The workaround is to use the older stable version of calamari, i.e. calamari-server-1.4.2-1.

I redeployed today and now get calamari-server-1.4.6-1.el7cp.x86_64. With this version the OSD number shown in USM is 1 (even though there are two OSDs). The number is not correct, but I can now create pools and RBDs again, so I am no longer blocked. The previous broken version was calamari-server-1.4.5.

Since the puddle.ceph repo has been discontinued for RHS-C, I moved to the http://download.eng.bos.redhat.com/rcm-guest/ceph-drops/auto/ repositories. This change can be seen in this commit: https://github.com/red-hat-storage/RHCS-vagrant/commit/74f1d6cc39686b942fe11566ad47f5879f03da0f
Now calamari-server-1.4.7-1.el7cp.x86_64 is installed on the MONs and this bug has reappeared. Sadly, I am currently blocked again: no OSDs are detected in the USM setup, so I cannot create pools.

I got access briefly and didn't learn much. I need some additional details to trace this down. What I'd like is /var/log/calamari/cthulhu.log* and the data returned by the calamari API endpoint /api/v2/cluster/<FSID>/sync_object/osd_map. Thanks.

Created attachment 1182217 [details]
api-v2-cluster-<FSID>-osd
Created attachment 1182218 [details]
api-v2-cluster-<FSID>-sync_object-osd_map
Created attachment 1182220 [details]
cthulhu.log
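For reference, the attachments above are dumps of calamari v2 REST API endpoints. A small helper like the following rebuilds those endpoint URLs; the host name `rhs-c` and the credentials in the comment are assumptions, while the FSID is the one shown in the `ceph -s` output earlier in this bug:

```shell
# Hypothetical helper: build the calamari v2 API URLs requested above.
# "rhs-c" (the console VM from this setup) and the credentials are
# placeholders, not values confirmed anywhere in this bug.
calamari_url() {
    local host=$1 fsid=$2 resource=$3
    echo "https://${host}/api/v2/cluster/${fsid}/${resource}"
}

# FSID taken from the `ceph -s` output in the comments above.
calamari_url rhs-c e78ce715-bd8d-4c17-92f5-0ac87d33a3c2 sync_object/osd_map

# The printed URL can then be fetched with, e.g.:
#   curl -s -k -u <user>:<password> "$(calamari_url rhs-c <FSID> osd)"
```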
Based on what I see here, I don't know of a reason why calamari would be preventing the OSDs from appearing in Storage Console. Shubhendu, would you take a look and see if I'm missing something?

Looking at the api/v2/cluster/<fsid>/osd output, I suspect one thing here. The server names listed for the OSDs are without a domain name, and I suspect USM could be holding full FQDNs like dhcp42-13.eng.lab.blr.redhat.com. This name mismatch would result in deletion of the OSDs that get created during the create-cluster flow: later, while syncing the OSD status etc., the host names might not match, and so the OSDs would be deleted from the USM DB. To verify, you can attach a screenshot of the hosts list in the USM UI here, or provide access to the setup and I can debug this further.

Created attachment 1182825 [details]
Screenshot of the Hosts page
This shows that the hosts in USM are not known by any FQDN/domain
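The FQDN-mismatch theory can be checked mechanically: if USM stored FQDNs while calamari reported short names, stripping the domain part would show how the two forms relate. A minimal sketch, assuming the example FQDN from the comment above:

```shell
# Short name as calamari reports it vs. FQDN as USM might store it.
# The FQDN below is the example name from the comment, used here only
# for illustration.
fqdn="dhcp42-13.eng.lab.blr.redhat.com"
short="${fqdn%%.*}"   # drop everything after the first dot
echo "$short"

# On a live node the two forms can be compared directly:
#   [ "$(hostname -s)" = "$(hostname -f)" ] && echo "no domain configured"
```

In this setup the hosts page shows no domain at all, so both forms coincide, which is why the mismatch theory does not explain the missing OSDs here.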
My upload shows that the hosts don't have a domain in USM, so I guess that contradicts your theory. Any other reason why the OSD number could be zero?

At the moment I cannot think of anything else. Maybe you should verify the URL api/v2/cluster/<fsid>/osd for calamari and see whether the values for the server attribute are populated properly; this is the same thing Gregory asked for in comment #7. It would certainly be better if I got access to this setup and could look at it in detail.

I have just redeployed RHSCon and set all hostname mentions to lowercase letters. Now all hosts are lowercase in the UI and in salt-key, but the OSD count is still 0. Were there any changes in the code regarding this?

Version: 0.0.39
Provider: Ceph, Version: 0.0.39
Monitoring: Graphite, Version: 0.9.15
Database: MongoDB, Version: 2.6.11

The setup is currently running; ping me in Slack if you want access.

I see the same behavior where the OSD count is 0 in the UI. GET /api/v2/cluster/<fsid>/osd returns nothing from the API:

```
curl: (52) Empty reply from server
```

This cluster has a lot of OSDs (960). Running the same API call on a test cluster, I get proper output. From dmesg, I see the following every time I run the API call:

```
[Mar13 14:01] /opt/calamari/v invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[  +0.000005] /opt/calamari/v cpuset=/ mems_allowed=0-1
[  +0.000003] CPU: 5 PID: 964516 Comm: /opt/calamari/v Not tainted 3.10.0-514.6.2.el7.x86_64 #1
[  +0.000001] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 3.35 10/20/2016
[  +0.000002] ffff881fb7b81f60 00000000e922335f ffff883e50bfbcc0 ffffffff816861ac
[  +0.000003] ffff883e50bfbd50 ffffffff81681157 ffffffff812ae86b 00000000000000d0
[  +0.000002] ffff883e50bfbd20 ffffffff811f16ae fffeefff00000000 000000000000000a
[  +0.000002] Call Trace:
[  +0.000006] [<ffffffff816861ac>] dump_stack+0x19/0x1b
[  +0.000004] [<ffffffff81681157>] dump_header+0x8e/0x225
[  +0.000005] [<ffffffff812ae86b>] ? cred_has_capability+0x6b/0x120
[  +0.000005] [<ffffffff811f16ae>] ? mem_cgroup_reclaim+0x4e/0x120
[  +0.000006] [<ffffffff8118475e>] oom_kill_process+0x24e/0x3c0
[  +0.000004] [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
[  +0.000002] [<ffffffff811f3121>] mem_cgroup_oom_synchronize+0x551/0x580
[  +0.000002] [<ffffffff811f2570>] ? mem_cgroup_charge_common+0xc0/0xc0
[  +0.000003] [<ffffffff81184fe4>] pagefault_out_of_memory+0x14/0x90
[  +0.000002] [<ffffffff8167ef47>] mm_fault_error+0x68/0x12b
[  +0.000004] [<ffffffff81691ed5>] __do_page_fault+0x395/0x450
[  +0.000002] [<ffffffff81691fc5>] do_page_fault+0x35/0x90
[  +0.000002] [<ffffffff8168e288>] page_fault+0x28/0x30
[  +0.000003] Task in /system.slice/supervisord.service killed as a result of limit of /system.slice/supervisord.service
[  +0.000002] memory: usage 1048576kB, limit 1048576kB, failcnt 1966
[  +0.000001] memory+swap: usage 1048576kB, limit 9007199254740988kB, failcnt 0
[  +0.000001] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[  +0.000000] Memory cgroup stats for /system.slice/supervisord.service: cache:12KB rss:1048564KB rss_huge:32768KB mapped_file:0KB swap:0KB inactive_anon:4KB ac
[  +0.000014] [ pid ]   uid  tgid total_vm    rss nr_ptes swapents oom_score_adj name
[  +0.000565] [113030]    0 113030    55517   3340      59        0             0 supervisord
[  +0.000014] [964516]    0 964516   621542 261952     672        0             0 /opt/calamari/v
[  +0.000006] Memory cgroup out of memory: Kill process 966289 (/opt/calamari/v) score 971 or sacrifice child
[  +0.010157] Killed process 964516 (/opt/calamari/v) total-vm:2486168kB, anon-rss:1036312kB, file-rss:11496kB, shmem-rss:0kB
```

I uncovered https://bugzilla.redhat.com/show_bug.cgi?id=1431787 this morning, but it turns out that 1 GB of RAM is not enough to make this API call. I just bumped the value to 4 GB and now I get complete output.

This product is EOL now.
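The trace above shows the calamari API worker being killed at the 1048576 kB (1 GB) memory cgroup limit of supervisord.service. The comment does not say exactly how the limit was raised to 4 GB; one plausible way, sketched here as an assumption (drop-in path and directive chosen by me, not taken from the actual setup), is a systemd drop-in:

```shell
# Hypothetical workaround: raise the memory cap on supervisord.service,
# whose cgroup limit is what killed the calamari worker in the trace above.
# Requires root; MemoryLimit is the cgroup-v1 era directive matching the
# RHEL 7 kernel shown in the trace.
mkdir -p /etc/systemd/system/supervisord.service.d
cat > /etc/systemd/system/supervisord.service.d/memory.conf <<'EOF'
[Service]
MemoryLimit=4G
EOF
systemctl daemon-reload
systemctl restart supervisord
```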