Bug 1223624
Summary: | lock file management is weak and can block | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Fabio Massimo Di Nitto <fdinitto> |
Component: | crudini | Assignee: | Pádraig Brady <pbrady> |
Status: | CLOSED ERRATA | QA Contact: | Itzik Brown <itbrown> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.0 (Kilo) | CC: | abeekhof, apevec, jschluet, lhh, mlopes, ohochman, pbrady, sclewis, tfreger, yeylon |
Target Milestone: | ga | ||
Target Release: | 7.0 (Kilo) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | crudini-0.7-1.el7ost | Doc Type: | Bug Fix |
Doc Text: |
Prior to this update, separate lock files were used while updating config files. In addition, directory entries were not correctly synchronized during an update.
As a result, a crash during this process could cause deadlock issues on subsequent config update attempts, or very occasionally result in corrupted (empty) config files.
This update adds more robust locking and synchronization to the 'crudini' utility. As a result, config file updates are now more likely to survive a system crash intact.
|
Last Closed: | 2015-08-05 13:24:25 UTC | Type: | Bug |
Bug Blocks: | 1185030, 1251948, 1261487 |
Description
Fabio Massimo Di Nitto
2015-05-21 04:22:48 UTC
This is tricky. I wrote these notes about crudini locking:

# Note we can't combine these methods to provide separated locks
# which are immune to stale file deadlock, as once the separated
# file is unlinked or renamed, you introduce a race with 3 or more
# users if there is an associated fcntl lock.

# Also a pid based method doesn't work with distributed file systems.

Note you can call crudini with the --inplace option to avoid this issue. Though there are then these caveats:

# Caveats in --inplace mode:
# - File must be writeable
# - File should be generally non readable to avoid read lock DoS.
# - Not Atomic as readers may see incomplete data for a while.
# - Not Consistent as multiple (non crudini) writers may overlap.
# - Less Durable as existing data truncated before I/O completes.
# - Requires write access to file rather than write access to dir.

Isn't UNIX great. I'd love to come up with a scheme to avoid these issues. For reference the latest crudini is at: https://github.com/pixelb/crudini/blob/master/crudini

(In reply to Pádraig Brady from comment #4)
> This is tricky.

Yeps I know :)

> I wrote these notes about crudini locking:
>
> # Note we can't combine these methods to provide separated locks
> # which are immune to stale file deadlock, as once the separated
> # file is unlinked or renamed, you introduce a race with 3 or more
> # users if there is an associated fcntl lock.
>
> # Also a pid based method doesn't work with distributed file systems.

agreed.

>
> Note you can call crudini with the --inplace option to avoid this issue.
> Though there are then these caveats:
>
> # Caveats in --inplace mode:
> # - File must be writeable

this isn't a problem

> # - File should be generally non readable to avoid read lock DoS.
>
> # - Not Atomic as readers may see incomplete data for a while.

In theory this isn't a problem once the deployment is completed, but we can't assume anything.

> # - Not Consistent as multiple (non crudini) writers may overlap.
> # - Less Durable as existing data truncated before I/O completes.
> # - Requires write access to file rather than write access to dir.
>
> Isn't UNIX great. I'd love to come up with a scheme to avoid these issues.
> For reference the latest crudini is at:
> https://github.com/pixelb/crudini/blob/master/crudini

I am using the one available in OSP7. We need some kind of fix/agreement for product. I don't have the bandwidth to test upstream pieces of code, sorry.

The issue here boils down to a combination of scripts that need to access config files to get/set some values as services are migrating between cluster nodes. If any of those scripts fail due to "any random reason", the node can potentially be rebooted hard (poweroff/poweron). On reboot, if a lock file is stale, the subsequent operations will continue to fail and we will enter an infinite loop of lock -> reboot -> lock ....

We can easily adjust the scripts to deal with an error from crudini and avoid the loop (potentially reporting the error up to the user), but a locking infinite loop will trigger a script execution timeout, and we can't distinguish that from a non-recoverable failure.

If fixing the locking is "impossible", a "simple" try 10 times and then give up, exit 1 would also be acceptable (even though it will require some extra code changes in different packages).

Raising severity. This is blocking Instance HA deployments and it's causing controller and compute nodes to fail badly when stray lock files are left around.

I have been using 0.7.1-1 for almost 3 days now, with dozens and dozens of failovers and never hit this locking issue. The fix looks good from engineering testing.

Thanks Fabio. Can you please suggest a way to verify this bug is solved?

Checked on RHEL 7.1, crudini 0.7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548
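For illustration only, the "try 10 times and then give up, exit 1" fallback requested in the comments could be sketched in Python roughly as follows. This is not the code that shipped in crudini 0.7. The function name `lock_with_retries` and its parameters are hypothetical, and the sketch assumes a local Linux file system where `flock()` locks are dropped by the kernel when the holding process dies or the node reboots, so no stale lock file can survive a hard poweroff.

```python
import errno
import fcntl
import time


def lock_with_retries(path, attempts=10, delay=0.5):
    """Try to take an exclusive advisory lock on the config file itself.

    Hypothetical helper: returns an open file object holding the lock,
    or None if the lock could not be obtained within `attempts` tries.
    """
    f = open(path, "r+")
    for _ in range(attempts):
        try:
            # LOCK_NB makes flock() fail immediately instead of blocking,
            # so the caller is never stuck behind another lock holder.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return f
        except OSError as e:
            # EAGAIN/EACCES mean "someone else holds the lock"; anything
            # else is a real error and is passed on to the caller.
            if e.errno not in (errno.EAGAIN, errno.EACCES):
                f.close()
                raise
            time.sleep(delay)
    f.close()
    return None
```

A calling script would exit with status 1 when the helper returns None, which gives it a bounded, distinguishable failure instead of an execution timeout. Because the lock sits on the config file's own inode, this approach only makes sense when the file is then updated in place, so the --inplace caveats quoted above would still apply. Behaviour on distributed file systems may differ.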