Bug 1883197

Summary: monitors reported down - upgraded from nautilus to pacific - can't decode unknown message type 140 MSG_AUTH=17
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: RADOS
Assignee: Brad Hubbard <bhubbard>
Status: CLOSED CURRENTRELEASE
QA Contact: Manohar Murthy <mmurthy>
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.0
CC: akupczyk, bhubbard, ceph-eng-bugs, dzafman, gfarnum, kchai, nojha, rzarzyns, sseshasa, tserlin, vumrao
Target Milestone: rc
Keywords: UpgradeBlocker
Target Release: 5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-09-30 15:12:22 UTC
Type: Bug

Description Vasishta 2020-09-28 11:58:59 UTC
Description of problem:
Tried to upgrade a nautilus cluster to pacific; a monitor was reported down saying
>> can't decode unknown message type 140 MSG_AUTH=17
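
For context, here is a minimal sketch of why a daemon logs this. It is
hypothetical code, not the actual Ceph messenger; apart from the tag value 140
(MSG_MON_PING) and the error text, all names and values below are made up for
illustration.

#include <cstdint>
#include <iostream>
#include <memory>

// Every wire message carries a numeric type tag, like the constants in
// src/msg/Message.h. A build can only decode the tags it was compiled with.
struct Message { virtual ~Message() = default; };
struct KnownMessage : Message {};

constexpr uint16_t MSG_EXAMPLE_KNOWN = 65;  // made-up tag for this sketch

// Decode side: map the tag back to a concrete message type. A source tree
// that predates the commit introducing tag 140 has no case for it, so
// decoding falls through to the default branch.
std::unique_ptr<Message> decode_message(uint16_t type) {
  switch (type) {
  case MSG_EXAMPLE_KNOWN:
    return std::make_unique<KnownMessage>();
  default:
    // This is the condition the monitor log is reporting (the log line in
    // this bug additionally shows MSG_AUTH=17).
    std::cerr << "can't decode unknown message type " << type << "\n";
    return nullptr;
  }
}

int main() {
  // A peer built with the newer constant sends tag 140 (MSG_MON_PING); this
  // binary cannot reconstruct the message, the connection is treated as bad,
  // and the monitor ends up being reported down.
  if (!decode_message(140))
    std::cerr << "dropping undecodable message\n";
  return 0;
}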

Version-Release number of selected component (if applicable):
ceph-4.2-rhel-8-containers-candidate-38535-20200923165012
(ceph version 14.2.11-33.el8cp)
to
ceph-5.0-rhel-8-containers-candidate-82880-20200915232213
(16.0.0-5535.el8cp)

How reproducible:
Tried once

Steps to Reproduce:
1. Configure an RHCS 4.2 cluster
2. Upgrade to 5.0 using ceph-ansible 5.0

Actual results:
Monitor reported down saying 
can't decode unknown message type 140 MSG_AUTH=17

Expected results:
All daemons should be up and running

Additional info:

Comment 10 Brad Hubbard 2020-09-29 21:50:34 UTC
Hehe, so it appears to be true. The opposite of what is expected has taken
place. The older version has the newer feature and the newer version does not.

$ ls -F
ceph-14.2.11/  ceph-14.2.11-33.el7cp.src.rpm  ceph-16.0.0-5535.el8cp.src.rpm  ceph-16.0.0-5535-gebdb8e56e5/

$ find . -name Message.h -print -exec ag MSG_.*140 {} \;
./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
./ceph-14.2.11/src/msg/Message.h
44:#define MSG_MON_PING               140
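
In other words, the direction of the mismatch matters: the nautilus build that
carries the backport is the one that knows about and sends tag 140, and the
pacific build cut before the upstream merge is the one that has to reject it.
A toy model of that (only the value 140 / MSG_MON_PING comes from the trees
above; the other tag values are placeholders):

#include <cstdint>
#include <iostream>
#include <set>

int main() {
  // Message tags each build knows how to decode (placeholder values, plus 140).
  std::set<uint16_t> nautilus_14_2_11_33 = {41, 64, 66, 140};  // has the backport
  std::set<uint16_t> pacific_16_0_0_5535 = {41, 64, 66};       // pre-merge tree

  uint16_t on_the_wire = 140;  // MSG_MON_PING sent by the backported build

  // The newer daemon is the one that cannot decode the message, which is why
  // the error surfaced during the upgrade.
  if (pacific_16_0_0_5535.count(on_the_wire) == 0)
    std::cout << "can't decode unknown message type " << on_the_wire << "\n";
  return 0;
}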

So I guess we raced here and grabbed the source for 5 before the upstream commit
had gone in. I did notice this, however:

http://download.englab.bne.redhat.com/brewroot/packages/ceph/16.0.0/5974.el8cp/

Testing with that looks more promising.

$ ls -F
ceph-14.2.11/  ceph-14.2.11-33.el7cp.src.rpm  ceph-16.0.0-5535.el8cp.src.rpm  ceph-16.0.0-5535-gebdb8e56e5/  ceph-16.0.0-5974.el8cp.src.rpm  ceph-16.0.0-5974-gba395abd/

$ find . -name Message.h -print -exec ag MSG_.*140 {} \;
./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
./ceph-14.2.11/src/msg/Message.h
44:#define MSG_MON_PING               140
./ceph-16.0.0-5974-gba395abd/src/msg/Message.h
49:#define MSG_MON_PING               140

So this is probably a non-event since it would take care of itself. Can we see
if we can reproduce this with ceph-16.0.0-5974.el8cp since that seems far less
likely?

Comment 11 Yaniv Kaul 2020-09-30 06:59:26 UTC
(In reply to Brad Hubbard from comment #10)
> Hehe, so it appears to be true. The opposite of what is expected has taken
> place. The older version has the newer feature and the newer version does
> not.
> 
> $ ls -F
> ceph-14.2.11/  ceph-14.2.11-33.el7cp.src.rpm  ceph-16.0.0-5535.el8cp.src.rpm
> ceph-16.0.0-5535-gebdb8e56e5/
> 
> $ find . -name Message.h -print -exec ag MSG_.*140 {} \;
> ./ceph-16.0.0-5535-gebdb8e56e5/src/msg/Message.h
> ./ceph-14.2.11/src/msg/Message.h
> 44:#define MSG_MON_PING               140
> 
> So I guess we raced here and grabbed the source for 5 before the upstream
> commit had gone in. I did notice this, however:

How did that happen? Do we have anything that warns us about this, at least?

Comment 12 Brad Hubbard 2020-09-30 09:20:35 UTC
(In reply to Yaniv Kaul from comment #11)
> 
> How did that happen? Do we have anything that warns us about this, at least?

On Sep 17th we pushed a bunch of commits for
https://bugzilla.redhat.com/show_bug.cgi?id=1800382 to the ceph-4.2-rhel-8
branch, in this commit:

commit 1604834060e7ae78a5d1645103927b8789f5ae45
Author: Ken Dreyer <kdreyer>
Date:   Thu Sep 17 01:00:09 2020 -0400

    ceph-14.2.11-24

The upstream equivalent merged the next day.

commit 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69
Merge: 2aae7196536 a8963ccd326
Author: Neha Ojha <nojha>
Date:   Fri Sep 18 14:31:45 2020 -0700

    Merge pull request #35906 from gregsfortytwo/wip-stretch-mode

    Add a new stretch mode for 2-site Ceph clusters

    Reviewed-by: Josh Durgin <jdurgin>

1604834060e7ae78a5d1645103927b8789f5ae45 got pulled into ceph-14.2.11-33.el7cp,
which was built on 23-Sep-2020. ceph-16.0.0-5535.el8cp, built on 15-Sep-2020,
did not include 8ba0a61a514d1e5bb7c870cb56da7c7341e0ae69, but
ceph-16.0.0-5974.el8cp, built on 29-Sep-2020, does. I guess the testing exposed
a temporary situation that would have resolved itself long before 5 shipped? As
to why it happened this way, I can't explain it in any more detail as I wasn't
involved.

Comment 13 Vasishta 2020-09-30 11:40:51 UTC
Hi Brad,

(In reply to Brad Hubbard from comment #10)

> if we can reproduce this with ceph-16.0.0-5974.el8cp since that seems far
> less
> likely?

Tried to upgrade the same cluster to 16.0.0-5974.el8cp (ceph-5.0-rhel-8-containers-candidate-83100-20200929173915).
It worked fine.

Can we have this BZ in the ON_QA state?
(We will move it to VERIFIED.)

Regards,
Vasishta Shastry
QE, Ceph