Bug 1883197
Summary: | monitors reported down - upgraded from nautilus to pacific - can't decode unknown message type 140 MSG_AUTH=17 | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasishta <vashastr> |
Component: | RADOS | Assignee: | Brad Hubbard <bhubbard> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Manohar Murthy <mmurthy> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | 5.0 | CC: | akupczyk, bhubbard, ceph-eng-bugs, dzafman, gfarnum, kchai, nojha, rzarzyns, sseshasa, tserlin, vumrao |
Target Milestone: | rc | Keywords: | UpgradeBlocker |
Target Release: | 5.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-09-30 15:12:22 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Hehe, so it appears to be true. The opposite of what is expected has taken place. The older version has the newer feature and the newer version does not.

$ ls -F
ceph-14.2.11/  ceph-14.2.11-33.el7cp.src.rpm  ceph-16.0.0-5535.el8cp.src.rpm  ceph-16.0.0-5535-gebdb8e56e5/

$ find . -name Message.h -print -exec ag MSG_.*140 {} \;
./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
./ceph-14.2.11/src/msg/Message.h
44:#define MSG_MON_PING 140

So I guess we raced here and grabbed the source for 5 before the upstream commit had gone in. I did notice this however?

http://download.englab.bne.redhat.com/brewroot/packages/ceph/16.0.0/5974.el8cp/

Testing with that looks more promising.

$ ls -F
ceph-14.2.11/  ceph-14.2.11-33.el7cp.src.rpm  ceph-16.0.0-5535.el8cp.src.rpm  ceph-16.0.0-5535-gebdb8e56e5/  ceph-16.0.0-5974.el8cp.src.rpm  ceph-16.0.0-5974-gba395abd/

$ find . -name Message.h -print -exec ag MSG_.*140 {} \;
./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
./ceph-14.2.11/src/msg/Message.h
44:#define MSG_MON_PING 140
./ceph-16.0.0-5974-gba395abd/src/msg/Message.h
49:#define MSG_MON_PING 140

So this is probably a non-event since it would take care of itself. Can we see if we can reproduce this with ceph-16.0.0-5974.el8cp since that seems far less likely?

(In reply to Brad Hubbard from comment #10)
> Hehe, so it appears to be true. The opposite of what is expected has taken
> place. The older version has the newer feature and the newer version does
> not.
>
> $ ls -F
> ceph-14.2.11/ ceph-14.2.11-33.el7cp.src.rpm ceph-16.0.0-5535.el8cp.src.rpm
> ceph-16.0.0-5535-gebdb8e56e5/
>
> $ find . -name Message.h -print -exec ag MSG_.*140 {} \;
> ./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
> ./ceph-14.2.11/src/msg/Message.h
> 44:#define MSG_MON_PING 140
>
> So I guess we raced here and grabbed the source for 5 before the upstream
> commit
> had gone in. I did notice this however?

How did that happen? Do we have anything that is warning us on this, at least?
(In reply to Yaniv Kaul from comment #11)
>
> How did that happen? Do we have anything that is warning us on this, at
> least?

On Sep 17th we pushed a bunch of commits for https://bugzilla.redhat.com/show_bug.cgi?id=1800382 in this commit to the ceph-4.2-rhel-8 branch.

commit 1604834060e7ae78a5d1645103927b8789f5ae45
Author: Ken Dreyer <kdreyer>
Date: Thu Sep 17 01:00:09 2020 -0400

    ceph-14.2.11-24

The upstream equivalent merged the next day.

commit 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69
Merge: 2aae7196536 a8963ccd326
Author: Neha Ojha <nojha>
Date: Fri Sep 18 14:31:45 2020 -0700

    Merge pull request #35906 from gregsfortytwo/wip-stretch-mode

    Add a new stretch mode for 2-site Ceph clusters

    Reviewed-by: Josh Durgin <jdurgin>

1604834060e7ae78a5d1645103927b8789f5ae45 got pulled into ceph-14.2.11-33.el7cp, which was built on 23-Sep-2020. ceph-16.0.0-5535.el8cp, which was built on 15-Sep-2020, did not include 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69; however, ceph-16.0.0-5974.el8cp, which was built on 29-Sep-2020, does.

I guess the testing exposed a temporary situation that would have resolved itself long before 5 shipped? As to why it happened this way, I can't explain that in any more detail as I wasn't involved.

Hi Brad,

(In reply to Brad Hubbard from comment #10)
> if we can reproduce this with ceph-16.0.0-5974.el8cp since that seems far
> less
> likely?

Tried to upgrade the same cluster to 16.0.0-5974.el8cp (ceph-5.0-rhel-8-containers-candidate-83100-20200929173915). Worked fine.

Can we move this BZ to the ON_QA state? (We will move it to VERIFIED.)

Regards,
Vasishta Shastry
QE, Ceph
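
Editorial sketch: the header check from comment #10 relies on `ag`, which may not be installed on every host. A roughly equivalent check using only `find` and `grep` might look like the following. The function name `check_msg_type` and its output format are hypothetical, chosen for illustration; they are not part of Ceph or of any tooling mentioned in this bug.

```shell
#!/bin/sh
# check_msg_type ROOT ID
# For every Message.h found under ROOT, report whether it contains a
# "#define MSG_<NAME> <ID>" line. Hypothetical helper, not part of Ceph.
check_msg_type() {
    root=$1
    id=$2
    find "$root" -name Message.h | sort | while IFS= read -r header; do
        # Match e.g. "#define MSG_MON_PING 140" but not "... 1400".
        match=$(grep -E "^#define[[:space:]]+MSG_[A-Za-z0-9_]+[[:space:]]+${id}([^0-9]|\$)" "$header" | head -n 1)
        if [ -n "$match" ]; then
            printf 'present %s: %s\n' "$header" "$match"
        else
            printf 'missing %s\n' "$header"
        fi
    done
}
```

Run from the directory holding the extracted source trees, `check_msg_type . 140` prints one `present` or `missing` line per `Message.h`, so a build that lacks the message type shows up explicitly instead of as a file name with no match beneath it.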
Description of problem:
Tried to upgrade a nautilus cluster to pacific; a monitor was reported down saying
>> can't decode unknown message type 140 MSG_AUTH=17

Version-Release number of selected component (if applicable):
ceph-4.2-rhel-8-containers-candidate-38535-20200923165012 (ceph version 14.2.11-33.el8cp)
to
ceph-5.0-rhel-8-containers-candidate-82880-20200915232213 (16.0.0-5535.el8cp)

How reproducible:
Tried once

Steps to Reproduce:
1. Configure rhcs 4.2 cluster
2. Upgrade to 5.0 using ceph-ansible 5.0

Actual results:
Monitor reported down saying can't decode unknown message type 140 MSG_AUTH=17

Expected results:
All daemons should be up and running

Additional info: