Description of problem: In '3.6. Recovering Failed Node Hosts' we should mention backing up /opt/rh/ruby193/root/etc/mcollective/. On most systems this information will be stored in a configuration management tool but we should document it here for completeness in case someone wishes to use snapshots for disaster recovery of configuration files.
Brenton, do you think adding a note box would be enough for this BZ? I'm thinking something like: "Note: If you wish to use snapshots for disaster recovery of configuration files, it is important to back up the /opt/rh/ruby193/root/etc/mcollective directory." I then have two questions: 1. How would they back up that directory? Is it as easy as saving another copy to the same location? (I think the command is mv /opt/rh/ruby193/root/etc/mcollective.old, but I could be wrong.) 2. The Important box and step 1.a seem to give conflicting information, unless you can only change the IP address of non-scaled apps, in which case I need to make that clearer in the procedure. So my question is: when attempting to change the IP address of a node, does it matter whether the app is scaled or not?
I would simply state the directories and files that should be restored from the backup; admins will know what to do. I'm fairly certain there was another location in the docs where we explicitly told admins which directories and files to back up. Looking now at step 3, "recreate /etc/passwd entries for all the gears", we should also advise admins to back up that file and restore it. The steps for recreation should only be needed if the backup is lost. The Important box and 1.a are slightly different. I'll have to follow up to find out exactly what problem a scaled application will have and the approach for recovery if the host IPs change.
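For reference, the backup step itself can be as simple as copying the directory and file aside before the host fails. A minimal sketch of the idea, using throwaway temp paths as stand-ins for /opt/rh/ruby193/root/etc/mcollective and /etc/passwd (since the real paths only exist on a node host), with an assumed backup destination:

```shell
# Stand-ins for the real paths; on a node host you would instead set
# SRC=/opt/rh/ruby193/root/etc/mcollective and PASSWD=/etc/passwd.
SRC=$(mktemp -d)
PASSWD=$(mktemp)
BACKUP=$(mktemp -d)     # assumed backup destination

echo "plugin.psk = unset" > "$SRC/server.cfg"                      # dummy config file
echo "gear1:x:1000:1000::/var/lib/openshift/gear1:/bin/sh" > "$PASSWD"  # dummy gear entry

# cp -a preserves ownership, permissions, and timestamps,
# which matters when these files are later restored as-is.
cp -a "$SRC" "$BACKUP/mcollective"
cp -a "$PASSWD" "$BACKUP/passwd"

ls "$BACKUP/mcollective"
```

The point of the sketch is only that a straight recursive copy of the directory and the passwd file is sufficient; how and where admins store the copies is up to them.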
At least part of the problem was addressed by this commit to make the gear registry store host names instead of IP addresses: https://github.com/openshift/origin-server/commit/93ddb0bf34e3f1f538e52fa13bf66c89d14cc0f5
Thanks Miciah. I did vaguely remember a change like that going in. I'll see if anyone else knows the current situation in more detail before diving too deep.
Brenton, sure. Sounds like I should add a paragraph before the procedure saying something like: "Ensure backups of the /opt/rh/ruby193/root/etc/mcollective directory and the /etc/passwd file have been performed. If not, use the following procedure to recover a failed node host:" Alternatively, I could just say something like "back up the appropriate files" if you think admins will know which files to update and the specific files would be too many to mention. As for the other location of the backup info, the only info I could find in the admin guide was in 3.11, Changing Front-End HTTP Server Plug-in Configuration: http://docbuilder.usersys.redhat.com/20822/#Changing_Front-end_HTTP_Server_Plug-in_Configuration It just has a <replaceable> filename option in the command and doesn't go into any detail. Is this what you meant?
Looking through the docs now, I guess I was imagining the section I referred to in the first paragraph of Comment #4. That said, thinking about this as an admin, I would only need to know exactly which files need to be backed up and which can be recreated in the event of a catastrophe. I wouldn't need to know how to back up the files in these cases.
(In reply to Brenton Leanhardt from comment #8)
> Looking through the docs now, I guess I was imagining the section I referred
> to in the first paragraph of Comment #4.
>
> That said, thinking about this as an admin, I would only need to know
> exactly which files need to be backed up and which can be recreated in the
> event of a catastrophe. I wouldn't need to know how to back up the files in
> these cases.

Brenton, with this in mind, I had a discussion with Bilhar about adding another section to the Admin Guide named 'Backing Up and Recovering Node Hosts', containing a new section named 'Suggested Files to Back Up' alongside the already existing 'Recovering a Failed Node Host'. Take a look: http://docbuilder.usersys.redhat.com/20822/#sect-Backing_Up_and_Recovering_Node_Hosts There's not much to it at the moment, but I think it fills a gap in the docs that can be improved upon down the line. I've rearranged some of the 'Recovering Failed Node Hosts' info so it is in line with what we've discussed here. Let me know if there's anything you'd like to suggest.
This looks much clearer. I would state that /var/lib/openshift must be backed up as well. In the http://docbuilder.usersys.redhat.com/20822/#Recovering_Failed_Node_Hosts section we say to mount that directory, assuming that the admin is using some sort of SAN or something similar. It's more accurate to say that the directory needs to be backed up; how they make the storage available on the host is an implementation detail.
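Agreed that the mechanism is an implementation detail. Whether the storage lives on a SAN, a snapshot, or plain archives, the essential requirement is that the directory contents survive the host. A sketch of an archive-style backup of the gear data, with a temp directory standing in for the real /var/lib/openshift:

```shell
GEARS=$(mktemp -d)                      # stands in for /var/lib/openshift
mkdir -p "$GEARS/examplegear/app-root"  # dummy gear directory (name is made up)
echo "data" > "$GEARS/examplegear/app-root/data.txt"

ARCHIVE=$(mktemp -u).tar.gz
# -p preserves permissions; -C makes the archived paths relative,
# so the archive can be extracted into a fresh /var/lib/openshift later.
tar -C "$GEARS" -czpf "$ARCHIVE" .

# Verify the archive contains the gear content
tar -tzf "$ARCHIVE" | grep app-root
```

This is only one possible mechanism; the docs should stay neutral and just name the directory.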
Makes sense. Brenton, I added the /var/lib/openshift file to the 'Suggested Files to Back Up' section and reworded the sentence you suggested in 3.7.2 to: "Replace /var/lib/openshift on the new node host with the same file from the original, failed node host." I feel this works better in conjunction with the Files to Back Up section. Please let me know if there's anything else. I agree this part of the Admin Guide is a lot clearer now.
One minor change: /var/lib/openshift is technically a directory, so it would be more accurate to say something like, "Replace /var/lib/openshift on the new node host with the content from the original, failed node host."
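For what it's worth, "replace with the content" maps to a copy of the directory's contents (not the directory node itself) onto the new host. A sketch, with temp directories standing in for the backed-up content and the new host's /var/lib/openshift:

```shell
BACKUP=$(mktemp -d)     # stands in for the backed-up /var/lib/openshift content
NEWHOST=$(mktemp -d)    # stands in for /var/lib/openshift on the new node host
mkdir -p "$BACKUP/examplegear/app-root"           # dummy gear (name is made up)
echo "data" > "$BACKUP/examplegear/app-root/data.txt"

# The "/." suffix copies the *content* of BACKUP into NEWHOST rather than
# nesting the BACKUP directory inside it; -a preserves ownership and perms,
# which the gears need to keep matching their /etc/passwd entries.
cp -a "$BACKUP/." "$NEWHOST/"

ls "$NEWHOST"
```

That distinction between the directory and its content is exactly the wording fix suggested above.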
Done. I also edited the 'Suggested Files to Back Up' section to say "The following is a list of the files and directories Red Hat recommends backing up in case of node failure:" If there's nothing else, I'll move this on to QA. Thanks, Brenton.