Upstream tracker: #13301
- One more enhancement would be great here.
- We recommend running the Ceph server-side daemons and the clients at the same version; if they are not at the same version, log a warning in ceph.log (the cluster log).
- This would help a lot during the upgrade procedure for Nova/KVM client instances. For example, suppose the Nova instances (qemu-kvm processes) are running firefly and the Ceph cluster daemons are running hammer.
- Now we upgrade our nova-compute nodes to hammer, but we neither stop and start the Nova instances (qemu-kvm processes), which needs downtime, nor live-migrate them to other nova-compute nodes, which does not.
- As a result, the in-memory qemu-kvm processes are still running firefly code while the servers are running hammer code. If we then change the tunables to hammer (and the bucket algorithm to straw2), all instances that are still running firefly code in memory will crash, with the following in the instance logs:
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: *unsupported bucket algorithm: 5*
- This feature will help to avoid such mistakes.
- If we still see the version mismatch warning in the cluster log (ceph.log) after upgrading the clients, we will know that we still need to either stop and start the instances or live-migrate them.
Hmm, we probably shouldn't store information for every connected client on the mon. However, adding a config that causes a warning when a client with an older version than the config connects would be plausible.
The log line should include the IP address of the connecting client.
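A minimal sketch of that idea, in Python for brevity rather than the C++ the monitor is written in. The function name, the way the minimum release is passed in, and the log wording are all hypothetical; only the release names and their order are real.

```python
import logging

# Ceph release names in chronological order, firefly through luminous.
RELEASES = ["firefly", "giant", "hammer", "infernalis",
            "jewel", "kraken", "luminous"]

def warn_if_older(client_release, client_addr, min_release,
                  log=logging.getLogger("mon")):
    """Log a warning and return True if the client predates min_release.

    min_release stands in for the proposed config option (hypothetical).
    """
    if RELEASES.index(client_release) < RELEASES.index(min_release):
        # The warning carries the client address, as requested above.
        log.warning("client %s is running %s, older than configured minimum %s",
                    client_addr, client_release, min_release)
        return True
    return False
```

Nothing is stored per client here: the check runs once at connect time and only leaves a log line behind.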
There are two places where we can/should do the auditing:
1. when the messenger rejects a connect message from a client with CEPH_MSGR_TAG_FEATURES; this happens in the messenger stack,
- in Pipe::accept() // simple msgr
- in AsyncConnection::handle_connect_msg() // async msgr
2. when AuthMonitor authorizes the client and starts a session
- in AuthMonitor::prep_auth()
At the two places above, we can print the client's address and its feature bits to the log if the peer's feature bits are smaller than the local ones.
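As a rough illustration of that check, here is a Python sketch. Because the features are a bitmap, it reads "smaller" as "the peer lacks bits that are set locally", which is an assumption; the function names and the log format are hypothetical.

```python
def missing_features(local, peer):
    """Return the feature bits set locally but absent on the peer."""
    return local & ~peer

def audit_line(addr, local, peer):
    """Build a log line for a peer lacking local features, else None."""
    missing = missing_features(local, peer)
    if missing == 0:
        return None
    # Hypothetical format: address, peer feature bits, missing bits (hex).
    return f"client {addr} features 0x{peer:x} missing 0x{missing:x}"
```

For example, `audit_line("10.0.0.5:0/1", 0b111, 0b101)` reports the peer's features as `0x5` and the missing bit as `0x2`.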
But the feature bits are just a bitmap. To make the report human-readable, we need a lookup table from the bits to the corresponding version. It would only be a rough mapping, though, and we might need an offline tool to do the translation, because:
1. we add feature bits in new point releases,
2. a feature bit can be reused once it is deprecated and no longer checked by the server side.
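A sketch of what such a lookup could look like, in Python. The table contents are made up for illustration (three hypothetical bits, three releases) and do not reflect the real Ceph feature masks; the point is only the shape of the translation and why it is approximate.

```python
# Hypothetical table, oldest first: release name and the mask of all
# feature bits that release is expected to advertise.
RELEASE_FEATURES = [
    ("firefly", 0b001),
    ("hammer", 0b011),
    ("jewel", 0b111),
]

def guess_release(features):
    """Return the newest release whose whole mask is present, or None."""
    best = None
    for name, mask in RELEASE_FEATURES:
        if (features & mask) == mask:
            best = name
    return best
```

With this table a client advertising `0b101` is reported as firefly even though it carries one newer bit, which is exactly the kind of imprecision the two caveats above lead to.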
case 1. The server has features 0b111, while the client has 0b101 and has all the features required by the assigned policy. Feature 0b010 was deprecated and then reused, and the monitor sets it to 1 because it supports the new feature. So in this case the difference is expected.
- But what if the client has 0b111, and its 0b010 predates the point at which 0b010 was marked deprecated?
case 2. The client is rejected by the messenger.
- We don't have access to clog there. (Messenger does not have a LogClient, which in turn actually uses Messenger.)
So it's not hard to have a log (not clog) implementation, but we would need to collect the log files from all monitors to find out the distribution of versions (feature bits) of the clients connected to (or rejected by) the cluster.
But as the version string is not part of our messaging protocol, I don't think it's worth adding it to MAuth. Maybe the version derived from the feature bits would suffice.
And maybe we can simply ignore the clients rejected by the msgr, as they are pretty visible (not functional)?
Implemented as 'ceph features' to report on connected clients, and 'ceph osd set-require-min-compat-client $releasename' to guard against earlier clients being able to connect.
This is what's merged in luminous:
1) the ability to set a minimum required release for clients, to prevent new connections from older clients ('ceph osd set-require-min-compat-client jewel') - this defaults to jewel in new clusters, and can be viewed as part of 'ceph osd dump'.
2) 'ceph features' to report the total number of clients and daemons at given featuresets and releases (e.g.:
3) logging at debug mon = 10 in the monitor logs for connecting/disconnecting client address and features
Since this bug is pretty cluttered already, I'll close it referencing the relevant PRs:
Further changes can be handled in new tickets.
@Bara, sorry for the latency.
> * If the debugging level for Monitors is set to `10` (`debug mon = 10`), addresses and features of connecting and disconnecting clients are logged to the main cluster log.
This is not accurate. We have a "cluster" log, which is persisted by the monitors and can be watched using "ceph -w". In addition to the "cluster" log, we can also watch log messages in the "audit" channel.
But the "log" mentioned by Josh in the context of
> logging at debug mon = 10 in the monitor logs for connecting/disconnecting client address and features
refers to log messages that are normally written to a file on the local file system.
This is configurable, though: if log_to_syslog is enabled and syslog is configured to forward logs to a remote syslog server, the log will be written to that remote server. This might be out of scope for this BZ; I mention it here just for completeness.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.