Currently the containers are stopped if we detect either a config change or an RPM update for Docker. Perhaps we could make the stops/failovers less frequent by one of the following:

A) Only check for an RPM update. (A config change alone probably wouldn't leave Docker unable to manage its previous containers after a restart. (?))

B) Go a bit further and actually parse what kind of Docker RPM change the update makes. Perhaps we could get away with keeping the containers running if only the patch/release number of the RPM changes but the version number stays the same. But can we depend on patch-number-only RPM updates being "safe" and never causing the issue where we lose the ability to manage the containers that were left running? (A rough sketch of such a check is below the list.)

C) There was also a suggestion that we'd only stop containers managed by Paunch and Pacemaker. (We'd also have to think about software managed by external installers like ceph-ansible, so the approach would probably end up being "stop everything except the Neutron-managed containers".)

I think we haven't yet completely ruled out the possibility of the persisting containers still becoming unmanageable, though.
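As a rough illustration of option B only: a minimal sketch of a version-vs-release comparison, assuming a yum-based node and the package name `docker`. The parsing and the policy here are assumptions for discussion, not what the update flow actually does.

```bash
# Hypothetical sketch for option B (not the posted patch): skip stopping
# containers when the pending docker RPM update only changes the
# release/patch component, assuming a yum-based node and package "docker".
installed_ver=$(rpm -q --qf '%{VERSION}\n' docker)

# Version offered by the repos, with epoch and release stripped.
candidate_ver=$(yum -q list updates docker 2>/dev/null \
  | awk '/^docker\./ {print $2}' \
  | sed -e 's/^[0-9]*://' -e 's/-.*//')

if [ -z "$candidate_ver" ] || [ "$candidate_ver" = "$installed_ver" ]; then
  echo "Release-only (or no) docker update; containers could stay running."
else
  echo "Version change ${installed_ver} -> ${candidate_ver}; stop containers first."
fi
```

Whether such a check is actually safe is exactly the open question in B, so this would at most narrow the cases where we stop containers, not remove the risk.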
Posted a Queens-only patch (Rocky+ has this done differently and already doesn't seem to stop containers on a Docker config change). I tested by editing the Docker config by hand and running a minor update. The config got set to the state dictated by Puppet and the Docker service got restarted (visible in the `systemctl status docker` uptime), while the containers remained up (visible in the `docker ps` uptime). There didn't seem to be any duplicate services running (checked e.g. via `pgrep -a nova-compute`).
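For reference, roughly the checks described above (a sketch of the verification, not the exact session; container and service names will vary per role):

```bash
# docker itself restarted -> fresh uptime on the service
systemctl status docker | grep 'Active:'

# containers were not restarted -> long "Up ..." times in STATUS
docker ps --format 'table {{.Names}}\t{{.Status}}'

# no duplicate service processes left behind after the update
pgrep -a nova-compute
```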
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3587