Bug 1244322
Summary: | [CentOS 6.6 UPGRADE]: Monitor crash after upgrade to RHEL 7.1 CEPH-1.3.0 | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | shylesh <shmohan>
Component: | Documentation | Assignee: | ceph-docs <ceph-docs>
Status: | CLOSED CURRENTRELEASE | QA Contact: | ceph-qe-bugs <ceph-qe-bugs>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 1.3.0 | CC: | ceph-eng-bugs, dzafman, flucifre, hnallurv, kchai, kdreyer, ngoswami, shmohan, sjust
Target Milestone: | rc | |
Target Release: | 1.3.1 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-12-18 09:59:24 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
shylesh
2015-07-17 18:58:02 UTC
I purged the packages on the crashing node and did a fresh install, but the mon still fails to start with the same crash as mentioned above.

Is /usr/bin/ceph-mon segfaulting here? Kefu, can you please look into this?

Ken, yeah, it seems to segfault at startup; I will look at it. Shylesh, could you please post more details, ideally the log file of the crashed monitor? It should have the backtrace and more log messages. Thanks.

Hi Shylesh,

Did you upgrade Ceph 1.2.3 on RHEL 6.6 to Ceph 1.3 on RHEL 7.1 in a single step? I mean, did you enable the Ceph 1.3 repos along with the RHEL 7 repos and then run the Preupgrade Assistant?

If that is what happened, it is not the right way to do it. Ideally the upgrade should go from RHEL 6.6 to RHEL 7.1 first, i.e. the OS upgrade only, not the Ceph upgrade. Once the OS is on RHEL 7.1, then upgrade Ceph from 1.2.3 to 1.3.

That your cluster is currently in a mixed state, i.e. all OSDs on Ceph 1.2.3 / RHEL 6.6 and two MONs on Ceph 1.3 / RHEL 7.1, suggests that you might have done the two upgrades (OS + Ceph) in one step.

(In reply to Nilamdyuti from comment #7)

Nilam,

No, I upgraded RHEL first and then upgraded Ceph. One mon is on 1.3.0 Async because we couldn't do the ISO upgrade due to a bug in Calamari which won't allow you to get packages from the ISO mount, so in that case we did a CDN upgrade and it accidentally went to 1.3.0 Async. Mon2 is the one on which we did a proper upgrade from the ISO (also after upgrading RHEL to 7.1), but there the monitor process is crashing. Mon3 I haven't touched yet, so it is still on 1.2.3.

(In reply to shylesh from comment #8)

Okay, I get it. Thanks for the clarification! :)
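For clarity, a rough sketch of the upgrade order being agreed on here, one node at a time; the Preupgrade Assistant / redhat-upgrade-tool step is the standard RHEL 6-to-7 in-place upgrade path rather than anything specified in this bug, and the repo URL is a placeholder:

# Step 1: upgrade the operating system only (RHEL 6.6 -> RHEL 7.1).
preupg                                                  # run the Preupgrade Assistant and review its report
redhat-upgrade-tool --network 7.1 --instrepo <rhel7-repo-url>
reboot

# Step 2: only after the node is back up on RHEL 7.1, upgrade Ceph 1.2.3 -> 1.3.0
# from the ISO or CDN repos (for example via "ceph-deploy install" from the admin node),
# then bring the remaining packages up to their el7 builds as well.
yum update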
Since there's nothing upstream for this today, let's re-target to 1.3.2. (If the fix is trivial and we can land it before the 1.3.1 dev freeze, we'll do so.)

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000000000880db9 in LevelDBStore::LevelDBWholeSpaceIteratorImpl::lower_bound (this=0x32e8110, prefix=..., to=...) at os/LevelDBStore.h:253
Python Exception <type 'exceptions.IndexError'> list index out of range:
#2  0x000000000087f8aa in LevelDBStore::get (this=0x3339080, prefix=..., keys=std::set with 1 elements, out=0x7fffffffcdd0) at os/LevelDBStore.cc:194
#3  0x0000000000564b96 in MonitorDBStore::get (this=this@entry=0x33391e0, prefix="monitor", key="magic", bl=...) at mon/MonitorDBStore.h:497
#4  0x000000000054bf2b in main (argc=<optimized out>, argv=0x7fffffffe3d8) at ceph_mon.cc:521
(gdb) f 3
#3  0x0000000000564b96 in MonitorDBStore::get (this=this@entry=0x33391e0, prefix="monitor", key="magic", bl=...) at mon/MonitorDBStore.h:497
497         db->get(prefix, k, &out);

Let's see what we can do to get the bug into 1.3.1 (we have a longer development phase than expected when Ken pushed), because it will be painful not to have this worked out until 1.3.2; we would most likely be forced to fix it in an errata, so it may as well be in the release to begin with.

The crash was caused by leveldb-1.7.0-2.el6.x86_64. It is not necessarily a bug in leveldb; chances are that this particular package fails to work on RHEL 7. For example, the ABI could have changed in glibc.

After upgrading leveldb to leveldb.x86_64 0:1.12.0-5.el7cp, I get the following error:

# ceph-mon -i magna105 --public-addr 10.8.128.105
2015-07-24 03:34:06.418136 7f143bcb87c0 -1 unable to read magic from mon data

So I am wondering what we have in the mon store:

$ strings /var/lib/ceph/mon/ceph-magna105/store.db/MANIFEST-000038
leveldb.BytewiseComparator

The MANIFEST should contain the important metadata of this leveldb, but it turns out it has barely anything in it, while a working mon store should have the following tables:

$ strings ~/dev/ceph/src/dev/mon.a/store.db/MANIFEST-000004
leveldb.BytewiseComparatorM
mkfs
keyring
monitor
magic

(monitor/magic is what was being retrieved when the monitor crashed.) Worse, the .sst file (which is now .ldb in recent leveldb) is missing:

$ ls /var/lib/ceph/mon/ceph-magna105/store.db/
000039.log  CURRENT  LOCK  LOG  LOG.old  MANIFEST-000038

So a wild guess is that the monitor store was nuked by the crash.

Shylesh, could you try the upgrade again, but this time please make sure that the system is updated against the RHEL 7 repo, or at least that leveldb is upgraded. I'd also suggest doing the upgrade as RHEL 6.6 / Ceph 1.2.3 ----> RHEL 7.1 / Ceph 1.3.0, just to avoid other failures due to possible ABI incompatibility.
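For reference, the checks described above gathered into one place, run on the affected monitor host (the mon ID magna105 and the paths are the ones from this bug; adjust per node):

# Which leveldb build is installed? The crash occurred with the el6 build
# (leveldb-1.7.0-2.el6) still present on the RHEL 7.1 node.
rpm -q leveldb

# A healthy monitor store MANIFEST lists keys such as mkfs, keyring, monitor and magic;
# the broken store above only showed the comparator name.
strings /var/lib/ceph/mon/ceph-magna105/store.db/MANIFEST-*

# A healthy store also contains at least one table file (.sst with leveldb 1.7,
# .ldb with newer releases); on the broken node only the log and MANIFEST were left.
ls /var/lib/ceph/mon/ceph-magna105/store.db/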
(In reply to Kefu Chai from comment #14)

This mon is already on RHEL 7.1 / 1.3.0. What I understood from your comment is: "I have to reinstall RHEL 6.6 with Ceph 1.2.3 on this node again, then upgrade to RHEL 7.1 first, and then upgrade Ceph from 1.2.3 to 1.3.0." Correct me if I am wrong.

Shylesh, the issue was caused by leveldb-1.7.0-2.el6; please note this is an el6 package. My guess is that to make ceph-mon work we need to upgrade all of its dependencies to their RHEL 7 versions. In this case leveldb-1.7.0-2.el6.x86_64 fails, so I wonder whether the other packages that were installed as dependencies of ceph-mon *before* the system was upgraded to RHEL 7 will work for us. A safe bet is to upgrade all ceph-mon dependencies to their el7 versions. Please ping me if you are confused.

Shylesh, we probably need step 7 in addition to the recipe you put in https://bugzilla.redhat.com/show_bug.cgi?id=1244322#c0:

1. Created a cluster with 3 mons, 3 OSD nodes and 1 admin/calamari node with CentOS 6.6 / Ceph 1.2.2.
2. Upgraded the cluster from CentOS 6.6 / Ceph 1.2.2 -----> CentOS 6.6 / Ceph 1.2.3 while I/O was in progress. The upgrade was successful.
3. Upgraded the same cluster from CentOS 6.6 / Ceph 1.2.3 -------> RHEL 6.6 / Ceph 1.2.3; everything was fine.
4. Upgraded the same cluster from RHEL 6.6 / Ceph 1.2.3 -------> RHEL 7.1 / Ceph 1.3.0. Since there are 3 mons, I upgraded them one by one.
5. The calamari node was upgraded first, then the first mon. Due to the bug https://bugzilla.redhat.com/show_bug.cgi?id=1230679 I couldn't do an ISO upgrade on mon1, so I did a CDN upgrade, which is why it moved to RHEL 7.1 / Ceph 1.3.0 Async.
6. Then picked mon2 for upgrade and was able to do an ISO-based upgrade by following the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1230679#c12. Mon2 is now on RHEL 7.1 / Ceph 1.3.0, but after the upgrade the monitor does not start; it crashes.
7. Upgrade all dependencies of ceph-mon to their latest versions, something like:

   yum update `yum deplist ceph-mon 2>/dev/null | grep provider | awk '{print $2}' | uniq`

   but some of them are not installable, for example gperftools-libs.
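As a minimal sketch of what step 7 amounts to on an already-upgraded monitor node (the deplist one-liner is the one from the comment above; a plain yum update is the blunt fallback, and no package names beyond leveldb and gperftools-libs come from this bug):

# An .el6 build of leveldb still installed on the RHEL 7.1 node is the failure mode hit here.
rpm -q leveldb

# Step 7: pull every ceph-mon dependency up to its el7 build.
yum update `yum deplist ceph-mon 2>/dev/null | grep provider | awk '{print $2}' | uniq`

# Blunt but safe alternative (what the docs change below proposes): update everything on the
# node after the "ceph-deploy install" step and before starting the monitor again.
yum update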
Thanks a ton, Kefu, for tracking down that leveldb issue. There's nothing in the ceph RPM that will cause it to upgrade leveldb from 1.7 to 1.12 when yum upgrades ceph. To be on the safe side, we should update the docs [1] to say "Run yum update on each node" after the "ceph-deploy install" operations (and before starting ceph back up). The "CDN" instructions already include a "yum update" command, but the "ISO" instructions do not. This bug is showing me that it's not safe to simply update the ceph package alone with ceph-deploy, because there could still be el6 packages on the nodes.

I'm curious about this leveldb change in particular. Will it be possible to read the old data with leveldb 1.12? Your comment 15 makes me think it is not possible to read leveldb 1.7's old data?

[1] https://access.redhat.com/beta/documentation/en/red-hat-ceph-storage-13-installation-guide-for-rhel-x86-64/chapter-26-upgrading-v123-to-v13-for-iso-based-installations

Kefu,

It worked since you had already upgraded leveldb to leveldb.x86_64 1.12.0-5.el7cp, and I upgraded gperftools (not sure if that really had something to do with it). The monitor now starts successfully, but ps shows output something like

#root 15460 1.1 0.1 301252 64544 ? Sl 13:45 0:02 ceph-mon -i magna105 --public-addr 10.8.128.105

on magna105, compared to

#root 25813 0.1 0.2 311092 76200 ? Sl Jul17 14:24 /usr/bin/ceph-mon -i magna107 --pid-file /var/run/ceph/mon.magna107.pid -c /etc/ceph/ceph.conf --cluster ceph -f

on magna107. Is this OK? While upgrading the next node I will make sure that all dependency packages are upgraded to el7; otherwise I have to do it manually. Not sure why "ceph-deploy install" didn't do it even though it was pointing to the right repo. Let me know if I should check something else.

> Will it be possible to read the old data with leveldb 1.12? Your comment 15 makes me think it is not possible to read leveldb 1.7's old data?

## leveldb-1.7
$ ls mon.b/store.db/
000005.sst  000006.log  CURRENT  LOCK  LOG  LOG.old  MANIFEST-000004

## leveldb-1.18 (1.18 in my case)
$ ls mon.b/store.db/
000005.ldb  000006.log  CURRENT  LOCK  LOG  LOG.old  MANIFEST-000004

Good question. Please see https://github.com/google/leveldb/releases:

> New sstables will have the file extension .ldb. .sst files will continue to be recognized.

Shylesh,
> but ps shows output something like
> #root 15460 1.1 0.1 301252 64544 ? Sl 13:45 0:02 ceph-mon -i magna105 --public-addr 10.8.128.105
>
this was probably started by me. I killed it and started the monitor using the sysv init script,
and the ps output now looks like the one from magna107 =)
root 5970 1.1 0.0 256604 21260 ? Sl 09:29 0:00 /usr/bin/ceph-mon -i magna105 --pid-file /var/run/ceph/mon.magna105.pid -c /etc/ceph/ceph.conf --cluster ceph -f
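A small sketch of restarting the monitor the way described above, so it runs with the standard arguments (pid file, cluster config, -f) instead of the hand-started form; the mon ID is the one from this node:

# Start the monitor through the sysv init script rather than by hand.
/etc/init.d/ceph start mon.magna105        # equivalently: service ceph start mon.magna105

# Confirm the expected command line is in use.
ps aux | grep '[c]eph-mon'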
This is a doc change -> resetting the component accordingly.

*** Bug 1247711 has been marked as a duplicate of this bug. ***

After upgrading leveldb the monitor works fine. As per comment 22, a yum update would also bring the mon and OSD dependencies to their latest versions. Hence marking this as verified.

Fixed for the 1.3.1 release.
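For completeness, the sort of post-upgrade check the verification note implies, with the node-specific names taken from this bug (a sketch, not a documented procedure):

# After "yum update", the monitor's dependencies should all be el7 builds.
rpm -q leveldb gperftools-libs

# The monitor should start via the init script and rejoin the cluster.
/etc/init.d/ceph start mon.magna105
ceph -s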