Description of problem: In '3.6. Recovering Failed Node Hosts' we should mention backing up /opt/rh/ruby193/root/etc/mcollective/. On most systems this information will be stored in a configuration management tool but we should document it here for completeness in case someone wishes to use snapshots for disaster recovery of configuration files.
Brenton, do you think adding a note box would be enough for this BZ? I'm thinking something like: "Note: If you wish to use snapshots for disaster recovery of configuration files, it is important to back up the /opt/rh/ruby193/root/etc/mcollective directory." I then have two questions: 1. How would they back up that directory? Is it as easy as saving another copy to the same location? (I think the command is mv /opt/rh/ruby193/root/etc/mcollective.old, but I could be wrong.) 2. The Important box and step 1.a seem to give conflicting information, unless you can only change the IP address of non-scaled apps, in which case I need to make that clearer in the procedure. So my question is: when attempting to change the IP address of a node, does it matter whether the app is scaled or not?
I would simply state the directories and files that should be restored from the backup; admins will know what to do. I'm fairly certain there was another location in the docs where we explicitly told admins which directories and files to back up. Looking now at step 3, "recreate /etc/passwd entries for all the gears", we should also advise admins to back up that file and restore it. The steps for recreation should only be needed if the backup is lost. The Important box and 1.a are slightly different. I'll have to follow up to find out exactly what problem a scaled application will have and the approach for recovery if the host IPs change.
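For reference, the backup step itself can be as simple as copying the directory and file aside before the host fails. A minimal sketch of the idea, using throwaway temp paths as stand-ins for /opt/rh/ruby193/root/etc/mcollective and /etc/passwd (since the real paths only exist on a node host), with an assumed backup destination:

```shell
# Stand-ins for the real paths; on a node host you would instead set
# SRC=/opt/rh/ruby193/root/etc/mcollective and PASSWD=/etc/passwd.
SRC=$(mktemp -d)
PASSWD=$(mktemp)
BACKUP=$(mktemp -d)     # assumed backup destination

echo "plugin.psk = unset" > "$SRC/server.cfg"                      # dummy config file
echo "gear1:x:1000:1000::/var/lib/openshift/gear1:/bin/sh" > "$PASSWD"  # dummy gear entry

# cp -a preserves ownership, permissions, and timestamps,
# which matters when these files are later restored as-is.
cp -a "$SRC" "$BACKUP/mcollective"
cp -a "$PASSWD" "$BACKUP/passwd"

ls "$BACKUP/mcollective"
```

The point of the sketch is only that a straight recursive copy of the directory and the passwd file is sufficient; how and where admins store the copies is up to them.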
At least part of the problem was addressed by this commit to make the gear registry store host names instead of IP addresses: https://github.com/openshift/origin-server/commit/93ddb0bf34e3f1f538e52fa13bf66c89d14cc0f5
Thanks Miciah. I did vaguely remember a change like that going in. I'll see if anyone else knows the current situation in more detail before diving too deep.
Brenton, sure. Sounds like I should add a paragraph before the procedure saying something like: "Ensure backups of the /opt/rh/ruby193/root/etc/mcollective directory and the /etc/passwd file have been performed. If not, use the following procedure to recover a failed node host:" Alternatively, I could just say something like "back up the appropriate files" if you think admins will know which files to update and the specific files would be too many to mention. As for the other location of the backup info, the only info I could find in the admin guide was in 3.11, Changing Front-End HTTP Server Plug-in Configuration: http://docbuilder.usersys.redhat.com/20822/#Changing_Front-end_HTTP_Server_Plug-in_Configuration It just has a <replaceable> filename option in the command and doesn't go into any detail. Is this what you meant?
Looking through the docs now, I guess I was imagining the section I referred to in the first paragraph of Comment #4. That said, thinking about this as an admin, I would only need to know exactly which files need to be backed up and which can be recreated in the event of a catastrophe. I wouldn't need to know how to back up the files in these cases.
(In reply to Brenton Leanhardt from comment #8)
> Looking through the docs now, I guess I was imagining the section I referred
> to in the first paragraph of Comment #4.
>
> That said, thinking about this as an admin, I would only need to know
> exactly which files need to be backed up and which can be recreated in the
> event of a catastrophe. I wouldn't need to know how to back up the files in
> these cases.

Brenton, with this in mind, I had a discussion with Bilhar about adding another section to the Admin Guide named 'Backing Up and Recovering Node Hosts', containing a new section named 'Suggested Files to Back Up' alongside the already existing 'Recovering a Failed Node Host'. Take a look: http://docbuilder.usersys.redhat.com/20822/#sect-Backing_Up_and_Recovering_Node_Hosts There's not much to it at the moment, but I think it fills a gap in the docs that can be improved upon down the line. I've rearranged some of the 'Recovering Failed Node Hosts' info so it is in line with what we've discussed here. Let me know if there's anything you'd like to suggest.
This looks much clearer. I would state that /var/lib/openshift must be backed up as well. In the http://docbuilder.usersys.redhat.com/20822/#Recovering_Failed_Node_Hosts section we say to mount that directory, assuming that the admin is using some sort of SAN or something similar. It's more accurate to say that the directory needs to be backed up; how they make the storage available on the host is an implementation detail.
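Agreed that the mechanism is an implementation detail. Whether the storage lives on a SAN, a snapshot, or plain archives, the essential requirement is that the directory contents survive the host. A sketch of an archive-style backup of the gear data, with a temp directory standing in for the real /var/lib/openshift:

```shell
GEARS=$(mktemp -d)                      # stands in for /var/lib/openshift
mkdir -p "$GEARS/examplegear/app-root"  # dummy gear directory (name is made up)
echo "data" > "$GEARS/examplegear/app-root/data.txt"

ARCHIVE=$(mktemp -u).tar.gz
# -p preserves permissions; -C makes the archived paths relative,
# so the archive can be extracted into a fresh /var/lib/openshift later.
tar -C "$GEARS" -czpf "$ARCHIVE" .

# Verify the archive contains the gear content
tar -tzf "$ARCHIVE" | grep app-root
```

This is only one possible mechanism; the docs should stay neutral and just name the directory.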
Makes sense. Brenton, I added the /var/lib/openshift file to the 'Suggested Files to Back Up' section and reworded the sentence you suggested in 3.7.2 to: "Replace /var/lib/openshift on the new node host with the same file from the original, failed node host." I feel this works better in conjunction with the Files to Back Up section. Please let me know if there's anything else. I agree this part of the Admin Guide is a lot clearer now.
One minor change: /var/lib/openshift is technically a directory, so it would be more accurate to say something like, "Replace /var/lib/openshift on the new node host with the content from the original, failed node host."
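For what it's worth, "replace with the content" maps to a copy of the directory's contents (not the directory node itself) onto the new host. A sketch, with temp directories standing in for the backed-up content and the new host's /var/lib/openshift:

```shell
BACKUP=$(mktemp -d)     # stands in for the backed-up /var/lib/openshift content
NEWHOST=$(mktemp -d)    # stands in for /var/lib/openshift on the new node host
mkdir -p "$BACKUP/examplegear/app-root"           # dummy gear (name is made up)
echo "data" > "$BACKUP/examplegear/app-root/data.txt"

# The "/." suffix copies the *content* of BACKUP into NEWHOST rather than
# nesting the BACKUP directory inside it; -a preserves ownership and perms,
# which the gears need to keep matching their /etc/passwd entries.
cp -a "$BACKUP/." "$NEWHOST/"

ls "$NEWHOST"
```

That distinction between the directory and its content is exactly the wording fix suggested above.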
Done. I also edited the 'Suggested Files to Back Up' section to say "The following is a list of the files and directories Red Hat recommends backing up in case of node failure:" If there's nothing else, I'll move this on to QA. Thanks, Brenton.