Bug 2215793 - After upgrade from RHCS 4 to RHCS 5, MDS will not stabilize; active MDS crashes
Summary: After upgrade from RHCS 4 to RHCS 5, MDS will not stabilize; active MDS crashes
Keywords:
Status: CLOSED DUPLICATE of bug 2071592
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 6.1z2
Assignee: Kotresh HR
QA Contact: Hemanth Kumar
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-06-18 19:13 UTC by Bob Emerson
Modified: 2023-08-04 12:16 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-12 06:31:45 UTC
Embargoed:
khiremat: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-6846 0 None None None 2023-06-18 19:14:41 UTC

Description Bob Emerson 2023-06-18 19:13:52 UTC
Created attachment 1971455 [details]
Active MDS service log capture

Description of problem:
-----------------------
The customer upgraded from RHCS 4 to RHCS 5. There were some issues during the upgrade that we worked through, but ultimately the upgrade completed successfully.

One issue that seems related is that the upgrade initially failed during the MDS service upgrade, which left one of the MDS services masked. These are the two nodes where the MDS service is running:

Name:   uswix679.kohlerco.com
Address: 10.20.55.100

Name:   uswix662.kohlerco.com
Address: 10.20.55.72

I had the customer rerun the rolling upgrade playbook, and the upgrade then completed successfully.

All services across the whole cluster are now upgraded.
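
For reference, a rough sketch of the kind of checks that can be used to find and clear a masked MDS unit before rerunning the rolling upgrade. The unit name format and the inventory path are assumptions for illustration, not taken from the case:

# On each MDS node, look for a ceph-mds unit left masked by the failed run:
systemctl list-unit-files 'ceph-mds@*' --state=masked

# If one shows up, unmask and start it so the playbook can manage it again
# (the unit name here is an assumption based on the ceph-ansible naming scheme):
systemctl unmask ceph-mds@uswix679
systemctl start ceph-mds@uswix679

# Then rerun the rolling upgrade from the admin node ("hosts" inventory is illustrative):
ansible-playbook -i hosts infrastructure-playbooks/rolling_update.yml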

The problem now is that the active MDS will not stay up.

A snippet of the crash:

conn(0x56144468dc00 0x56144fb8a800 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.363-0500 7fe3402d9700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.196.189:0/4004017584 conn(0x56144ca8f000 0x5614400ff800 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.391-0500 7fe3402d9700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/596720768 conn(0x56144fca3400 0x56144fb4e800 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 2), sending RESETSESSION
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.393-0500 7fe340ada700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/3119510130 conn(0x56144fc1d400 0x56144fb51000 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 2), sending RESETSESSION
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.393-0500 7fe3412db700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/3964568566 conn(0x56144468dc00 0x56144fb4e000 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 2), sending RESETSESSION
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.405-0500 7fe33dad4700  0 mds.0.server  ignoring msg from not-open sessionclient_reconnect(0 caps 0 realms ) v3
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.405-0500 7fe33dad4700  0 mds.0.server  ignoring msg from not-open sessionclient_reconnect(0 caps 0 realms ) v3
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.406-0500 7fe33dad4700  0 mds.0.server  ignoring msg from not-open sessionclient_reconnect(0 caps 0 realms ) v3
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.406-0500 7fe3402d9700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/596720768 conn(0x56144fb45400 0x56144fbe9800 :6841 s=OPENED pgs=402 cs=1 l=0).fault server, going to standby
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.406-0500 7fe340ada700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/3119510130 conn(0x56144f643c00 0x56144fb8c800 :6841 s=OPENED pgs=1456 cs=1 l=0).fault server, going to standby
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.407-0500 7fe3412db700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/3964568566 conn(0x56144ca8f000 0x5614400ff800 :6841 s=OPENED pgs=1315 cs=1 l=0).fault server, going to standby
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.408-0500 7fe340ada700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/3119510130 conn(0x56144de58000 0x56144fb39000 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.409-0500 7fe3402d9700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/3964568566 conn(0x56144fb44400 0x56144fb4e800 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.409-0500 7fe3412db700  0 --1- [v2:10.20.55.100:6840/625657455,v1:10.20.55.100:6841/625657455] >> v1:10.20.197.155:0/596720768 conn(0x561449f03400 0x56144fb3b000 :6841 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: /builddir/build/BUILD/ceph-16.2.10/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7fe3392cb700 time 2023-06-18T12:58:13.496055-0500
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: /builddir/build/BUILD/ceph-16.2.10/src/mds/Server.cc: 7950: FAILED ceph_assert(in->first <= straydn->first)
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fe34671e7b8]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  2: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7fe34671e9d2]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  3: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x10da) [0x56143c6e3f5a]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x341) [0x56143c6e8a91]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x56143c7176fb]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x33) [0x56143c7c08d3]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  7: (MDSContext::complete(int)+0x203) [0x56143c97e383]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  8: (MDSRank::_advance_queues()+0x84) [0x56143c6730d4]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  9: (MDSRank::ProgressThread::entry()+0xc5) [0x56143c6737f5]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  10: /lib64/libpthread.so.0(+0x81ca) [0x7fe3456fd1ca]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  11: clone()
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: *** Caught signal (Aborted) **
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  in thread 7fe3392cb700 thread_name:mds_rank_progr
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: 2023-06-18T12:58:13.496-0500 7fe3392cb700 -1 /builddir/build/BUILD/ceph-16.2.10/src/mds/Server.cc: In function 'void Server::_unlink_local(MDRequestRef&, CDentry*, CDentry*)' thread 7fe3392cb700 time 2023-06-18T12:58:13.496055-0500
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]: /builddir/build/BUILD/ceph-16.2.10/src/mds/Server.cc: 7950: FAILED ceph_assert(in->first <= straydn->first)
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  ceph version 16.2.10-172.el8cp (00a157ecd158911ece116ae43095de793ed9f389) pacific (stable)
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fe34671e7b8]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  2: /usr/lib64/ceph/libceph-common.so.2(+0x2799d2) [0x7fe34671e9d2]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  3: (Server::_unlink_local(boost::intrusive_ptr<MDRequestImpl>&, CDentry*, CDentry*)+0x10da) [0x56143c6e3f5a]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  4: (Server::handle_client_unlink(boost::intrusive_ptr<MDRequestImpl>&)+0x341) [0x56143c6e8a91]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  5: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf1b) [0x56143c7176fb]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  6: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x33) [0x56143c7c08d3]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  7: (MDSContext::complete(int)+0x203) [0x56143c97e383]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  8: (MDSRank::_advance_queues()+0x84) [0x56143c6730d4]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  9: (MDSRank::ProgressThread::entry()+0xc5) [0x56143c6737f5]
Jun 18 12:58:13 uswix679 ceph-mds-uswix679[1827124]:  10: /lib64/libpthread.so.0(+0x81ca) [0x7fe3456fd1ca]
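
The failed assert is ceph_assert(in->first <= straydn->first) at Server.cc:7950 in Server::_unlink_local, hit from the MDS progress thread while dispatching a client unlink request. A sketch of how the crash record and more verbose MDS logs could be collected for the next occurrence (standard Ceph CLI; the crash ID is a placeholder and the debug levels are only suggestions, not values requested in this case):

# Dump the crash record kept by the mgr crash module:
ceph crash ls
ceph crash info <crash-id>      # <crash-id> is a placeholder for the entry above

# Optionally raise MDS logging ahead of the next crash (suggested levels only):
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1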

Version-Release number of selected component (if applicable):
----------------------------------------------------------------
ceph-16.2.10




Actual results:

Active MDS crashes with the above assert.



Expected results:

The active MDS should remain stable.




Additional info:

This Ceph cluster serves as an external cluster for two OpenShift clusters.

Comment 1 Greg Farnum 2023-06-18 19:34:42 UTC
This backtrace is the metadata corruption assert we've seen many times before. There are KCSes about it and you should consult Patrick's support doc for his early-detection patches (though this isn't an early detection): https://docs.google.com/document/d/1jW3raTDF19TcDfiQCqrw9DGzH_RCB_qLmFnIBu5EWM4/edit#heading=h.lwsj527j2ihk
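
As a possible follow-up, a sketch of how existing metadata damage could be checked on this cluster, assuming a single CephFS file system named cephfs (neither the fs name nor these exact steps come from the KCS articles or the doc linked above):

# Check whether rank 0 has already recorded metadata damage
# ("cephfs" is an assumed file system name):
ceph tell mds.cephfs:0 damage ls

# Start a recursive forward scrub from the root and watch its progress:
ceph tell mds.cephfs:0 scrub start / recursive
ceph tell mds.cephfs:0 scrub status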

