Description of problem:
Tried to upgrade a cluster from nautilus to pacific; a monitor was reported
down saying
>> can't decode unknown message type 140 MSG_AUTH=17

Version-Release number of selected component (if applicable):
ceph-4.2-rhel-8-containers-candidate-38535-20200923165012 (ceph version
14.2.11-33.el8cp)
to
ceph-5.0-rhel-8-containers-candidate-82880-20200915232213 (16.0.0-5535.el8cp)

How reproducible:
Tried once

Steps to Reproduce:
1. Configure an RHCS 4.2 cluster
2. Upgrade to 5.0 using ceph-ansible 5.0

Actual results:
Monitor reported down saying "can't decode unknown message type 140
MSG_AUTH=17"

Expected results:
All daemons should be up and running

Additional info:
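For triage it may help to capture the per-daemon version spread while the
upgrade is in flight. These are standard ceph CLI commands (run from a node
with an admin keyring); exact output will vary by cluster:

$ ceph versions       # per-daemon-type breakdown; a 14.2.11/16.0.0 mix is expected mid-upgrade
$ ceph health detail  # names the monitor that is down and why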
Hehe, so it appears to be true. The opposite of what is expected has taken
place: the older version has the newer feature and the newer version does not.

$ ls -F
ceph-14.2.11/  ceph-14.2.11-33.el7cp.src.rpm  ceph-16.0.0-5535.el8cp.src.rpm
ceph-16.0.0-5535-gebdb8e56e5/

$ find . -name Message.h -print -exec ag 'MSG_.*140' {} \;
./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
./ceph-14.2.11/src/msg/Message.h
44:#define MSG_MON_PING 140

So I guess we raced here and grabbed the source for 5 before the upstream
commit had gone in. I did notice this, however:
http://download.englab.bne.redhat.com/brewroot/packages/ceph/16.0.0/5974.el8cp/

Testing with that looks more promising.

$ ls -F
ceph-14.2.11/                   ceph-14.2.11-33.el7cp.src.rpm
ceph-16.0.0-5535.el8cp.src.rpm  ceph-16.0.0-5535-gebdb8e56e5/
ceph-16.0.0-5974.el8cp.src.rpm  ceph-16.0.0-5974-gba395abd/

$ find . -name Message.h -print -exec ag 'MSG_.*140' {} \;
./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
./ceph-14.2.11/src/msg/Message.h
44:#define MSG_MON_PING 140
./ceph-16.0.0-5974-gba395abd/src/msg/Message.h
49:#define MSG_MON_PING 140

So this is probably a non-event, since it would take care of itself. Can we
see if we can reproduce this with ceph-16.0.0-5974.el8cp, since that seems
far less likely?
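FWIW, a quick way to see every MSG_ define that diverges between the two
trees (a sketch using bash process substitution; assumes the same directory
layout as the ls output above):

$ diff <(grep '^#define MSG_' ceph-14.2.11/src/msg/Message.h) \
       <(grep '^#define MSG_' ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h)

Lines prefixed with '<' are message types the nautilus tree defines that the
pacific tree does not; MSG_MON_PING should be among them here.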
(In reply to Brad Hubbard from comment #10)
> Hehe, so it appears to be true. The opposite of what is expected has taken
> place: the older version has the newer feature and the newer version does
> not.
>
> $ ls -F
> ceph-14.2.11/  ceph-14.2.11-33.el7cp.src.rpm  ceph-16.0.0-5535.el8cp.src.rpm
> ceph-16.0.0-5535-gebdb8e56e5/
>
> $ find . -name Message.h -print -exec ag 'MSG_.*140' {} \;
> ./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
> ./ceph-14.2.11/src/msg/Message.h
> 44:#define MSG_MON_PING 140
>
> So I guess we raced here and grabbed the source for 5 before the upstream
> commit had gone in. I did notice this, however:

How did that happen? Do we have anything that warns us about this, at least?
(In reply to Yaniv Kaul from comment #11)
> How did that happen? Do we have anything that warns us about this, at
> least?

On Sep 17th we pushed a bunch of commits for
https://bugzilla.redhat.com/show_bug.cgi?id=1800382 to the ceph-4.2-rhel-8
branch in this commit:

commit 1604834060e7ae78a5d1645103927b8789f5ae45
Author: Ken Dreyer <kdreyer>
Date:   Thu Sep 17 01:00:09 2020 -0400

    ceph-14.2.11-24

The upstream equivalent merged the next day:

commit 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69
Merge: 2aae7196536 a8963ccd326
Author: Neha Ojha <nojha>
Date:   Fri Sep 18 14:31:45 2020 -0700

    Merge pull request #35906 from gregsfortytwo/wip-stretch-mode

    Add a new stretch mode for 2-site Ceph clusters

    Reviewed-by: Josh Durgin <jdurgin>

1604834060e7ae78a5d1645103927b8789f5ae45 got pulled into ceph-14.2.11-33.el7cp,
which was built on 23-Sep-2020. ceph-16.0.0-5535.el8cp, built on 15-Sep-2020,
did not include 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69; ceph-16.0.0-5974.el8cp,
built on 29-Sep-2020, does.

I guess the testing exposed a temporary situation that would have resolved
itself long before 5 shipped? As to why it happened this way, I can't explain
it in any more detail, as I wasn't involved.
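Note the build NVRs embed the source snapshot as a g<hash> suffix
(gebdb8e56e5 for 5535, gba395abd for 5974), so the ancestry can be checked
directly. A sketch, assuming a local clone in which those snapshot commits
and the upstream merge all resolve:

$ cd ceph
$ git merge-base --is-ancestor 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69 ebdb8e56e5 \
      && echo present || echo missing
$ git merge-base --is-ancestor 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69 ba395abd \
      && echo present || echo missing

Given the build dates above, the first check should report "missing" and the
second "present".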
Hi Brad,

(In reply to Brad Hubbard from comment #10)
> Can we see if we can reproduce this with ceph-16.0.0-5974.el8cp, since that
> seems far less likely?

Tried to upgrade the same cluster to 16.0.0-5974.el8cp
(ceph-5.0-rhel-8-containers-candidate-83100-20200929173915). It worked fine.

Can we move this BZ to the ON_QA state? (We will move it to VERIFIED.)

Regards,
Vasishta Shastry
QE, Ceph