Description of problem:

When crudini attempts to read/write values from/to a config file, it creates a lock file. The lock file management is weak and can lead to a blocking situation.

Version-Release number of selected component (if applicable):

crudini-0.5-1.el7ost.noarch

How reproducible:

Hard to say. I hit this bug because the node was rebooted while some operation from the installer was in progress, which left a stale lock file behind.

Steps to Reproduce:
1. touch /etc/nova/.nova.conf.crudini.lck
2. crudini --get /etc/nova/nova.conf neutron admin_auth_url

Actual results:

strace shows crudini trying over and over to get the lock file; it will loop forever.

Expected results:

The lock file should include PID information, and the lock code should check whether that PID is still running and/or was generated by another crudini. A lock file without a matching running PID should be considered stale and removed. If the PID is running, report an error instead of looping indefinitely.

Additional info:
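The PID-stamping scheme suggested above could look roughly like the following shell sketch. This is a hypothetical illustration, not crudini's actual code: the `acquire_lock` function name is made up, and (as the maintainer notes elsewhere in this report) a PID-based check is not safe on distributed file systems.

```shell
#!/bin/sh
# Hypothetical sketch of a PID-stamped lock with stale-lock recovery.
# Not crudini's implementation; ignores NFS/distributed-FS caveats.

acquire_lock() {
    lck=$1
    # O_EXCL-style creation: noclobber makes '>' fail if the file exists.
    if ( set -C; echo $$ > "$lck" ) 2>/dev/null; then
        return 0                           # acquired, stamped with our PID
    fi
    pid=$(cat "$lck" 2>/dev/null)
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "lock $lck held by running pid $pid" >&2
        return 1                           # report an error, don't loop
    fi
    rm -f "$lck"                           # stale: owning PID is gone
    ( set -C; echo $$ > "$lck" ) 2>/dev/null   # one retry after cleanup
}
```

With this scheme, the reboot scenario above recovers automatically: the stale `.nova.conf.crudini.lck` names a PID that no longer exists, so it is removed and re-acquired instead of being spun on forever.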
This is tricky. I wrote these notes about crudini locking:

  # Note we can't combine these methods to provide separated locks
  # which are immune to stale file deadlock, as once the separated
  # file is unlinked or renamed, you introduce a race with 3 or more
  # users if there is an associated fcntl lock.

  # Also a pid based method doesn't work with distributed file systems.

Note you can call crudini with the --inplace option to avoid this issue. Though there are then these caveats:

  # Caveats in --inplace mode:
  # - File must be writeable
  # - File should be generally non readable to avoid read lock DoS.
  # - Not Atomic as readers may see incomplete data for a while.
  # - Not Consistent as multiple (non crudini) writers may overlap.
  # - Less Durable as existing data truncated before I/O completes.
  # - Requires write access to file rather than write access to dir.

Isn't UNIX great. I'd love to come up with a scheme to avoid these issues.

For reference the latest crudini is at:
https://github.com/pixelb/crudini/blob/master/crudini
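The atomicity caveat is the crux of the trade-off: the caveats above imply that the default mode replaces the file via a rename (hence needing directory write access and lock-file coordination), while --inplace rewrites the file directly. A rough sketch of the two strategies, demonstrated on a scratch file rather than the real nova.conf:

```shell
# Contrast of the two write strategies implied by the caveats above,
# using a scratch copy; paths and values here are only examples.
d=$(mktemp -d)
f=$d/nova.conf

# 1) Replace-via-rename (the lock-file-coordinated mode): readers
#    always see a complete file, but it needs write access to the
#    directory, and concurrent writers must serialize -- hence the
#    .crudini.lck file.
printf '[neutron]\nadmin_auth_url = http://example:5000/v2.0\n' > "$f.tmp"
mv "$f.tmp" "$f"

# 2) --inplace style: truncate and rewrite the existing file.  No lock
#    file, and only file (not directory) write access is needed, but
#    readers may briefly observe truncated/incomplete data.
printf '[neutron]\nadmin_auth_url = http://example:5000/v2.0\n' > "$f"
```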
(In reply to Pádraig Brady from comment #4)

> This is tricky.

Yep, I know :)

> I wrote these notes about crudini locking:
>
> # Note we can't combine these methods to provide separated locks
> # which are immune to stale file deadlock, as once the separated
> # file is unlinked or renamed, you introduce a race with 3 or more
> # users if there is an associated fcntl lock.
>
> # Also a pid based method doesn't work with distributed file systems.

Agreed.

> Note you can call crudini with the --inplace option to avoid this issue.
> Though there are then these caveats:
>
> # Caveats in --inplace mode:
> # - File must be writeable

This isn't a problem.

> # - File should be generally non readable to avoid read lock DoS.
>
> # - Not Atomic as readers may see incomplete data for a while.

In theory this isn't a problem once the deployment is completed, but we can't assume anything.

> # - Not Consistent as multiple (non crudini) writers may overlap.
> # - Less Durable as existing data truncated before I/O completes.
> # - Requires write access to file rather than write access to dir.
>
> Isn't UNIX great. I'd love to come up with a scheme to avoid these issues.
> For reference the latest crudini is at:
> https://github.com/pixelb/crudini/blob/master/crudini

I am using the one available in OSP7. We need some kind of fix/agreement for the product; I don't have the bandwidth to test upstream pieces of code, sorry.

The issue here boils down to a combination of scripts that need to access config files to get/set some values as services migrate between cluster nodes. If any of those scripts fails for any random reason, the node can potentially be rebooted hard (poweroff/poweron). On reboot, if a lock file is stale, the subsequent operations will continue to fail and we enter an infinite loop of lock -> reboot -> lock -> ...
We can easily adjust the scripts to handle an error from crudini and avoid the loop (potentially reporting the error up to the user), but an infinite locking loop triggers a script execution timeout, and we can't distinguish that from a non-recoverable failure. If fixing the locking is "impossible", a "simple" try-10-times-then-give-up with exit 1 would also be acceptable (even though it would require some extra code changes in different packages).
Raising severity. This is blocking Instance HA deployments and it's causing controller and compute nodes to fail badly when stray lock files are left around.
https://github.com/pixelb/crudini/commit/10875e6a2bb
I have been using 0.7.1-1 for almost 3 days now, with dozens and dozens of failovers, and have never hit this locking issue. The fix looks good from engineering testing. Thanks, Fabio
Can you please suggest a way to verify this bug is solved?
Checked on RHEL 7.1, crudini 0.7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548