Bug 1304321 - Most of the instances fail to boot. Need RCA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 6.0 (Juno)
Hardware: All Linux
Priority: unspecified Severity: urgent
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assigned To: Jon Bernard
Reported: 2016-02-03 05:06 EST by Jaison Raju
Modified: 2016-04-26 14:19 EDT

Doc Type: Bug Fix
Last Closed: 2016-02-25 08:53:06 EST
Type: Bug

Attachments
console (49.75 KB, image/png)
2016-02-03 05:18 EST, Jaison Raju
console (53.79 KB, image/png)
2016-02-03 05:19 EST, Jaison Raju
For the attachment to comment 26 (40.76 KB, application/x-gzip)
2016-02-10 22:39 EST, Faiaz Ahmed

Description Jaison Raju 2016-02-03 05:06:08 EST
Description of problem:
All instances (Windows/Linux) fail to boot with the errors shown below.
The customer faced a RabbitMQ partition, which was recovered by a restart.
The customer uses an EMC storage backend.


SeaBIOS (version seabios-1.7.5-8.el7)
Machine UUID 4795a173-f718-ab4e-bc34-9a6b62a8c38b

iPXE (http://ipxe.org) 00:03.0 C900 PCI2.10 PnP PMM+BFF97C60+BFEF7C60 C900

Booting from Hard Disk...
.                                                         <================ This "." may indicate that content is being read from disk.
error: not a correct XFS inode.
error: file '/grub2/i386-pc/normal.mod' not found
Entering rescue mode...
grub rescue>


Windows failed to start. A recent hardware or software change might be the cause. To fix the problem:

  1. Insert your Windows installation disc and restart your computer
  2. Choose your language settings, and then click "Next."
  3. Click "Repair your computer."

If you do not have this disc, contact your system administrator or computer manufacturer for assistance.

   File: \Boot\BCD
   Status: 0xc000014c                                      <========== this may indicate that the disk or the boot files are damaged.
   Info: The Boot Configuration Data for your PC is missing or contains errors.

Version-Release number of selected component (if applicable):

How reproducible:
Always, on the customer's end.

Steps to Reproduce:

Actual results:
Instance boot fails.

Expected results:
Instance boots up the OS.

Additional info:
Comment 3 Jaison Raju 2016-02-03 05:18 EST
Created attachment 1120711 [details]
Comment 4 Jaison Raju 2016-02-03 05:19 EST
Created attachment 1120712 [details]
Comment 16 Jon Bernard 2016-02-04 10:47:41 EST
The only thing I can add is that the order of events is unclear (to me).  If a guest filesystem was mounted on the compute host, that could certainly cause the corruption we're seeing.  But if the corruption happened earlier, and the host mounting was an attempt to investigate and/or repair a previously corrupt filesystem, then this may not be the initial cause.
Comment 18 Sadique Puthen 2016-02-05 02:39:52 EST
Most of their instances (both Linux and Windows) fail to boot. We need to formulate an action plan to recover them together with Engineering.

- Assess the current level of corruption and understand if this can be recovered.
- If this can be recovered, create an action plan together with Engineering and execute.
- I would prefer to have a Bomgar session with CEE, Engineering and the Field team (Felix Tsang) to recover one Linux and one Windows VM. Then Engineering can disengage and GSS can do it for the rest of the instances.

We can then concentrate on RCA as sev2.
Comment 21 Sadique Puthen 2016-02-05 05:27:12 EST
Just heard from the SA that they are trying to rebuild these instances from images. RCA is now the first priority so that we can prevent this from happening again.
Comment 26 Eric Sandeen 2016-02-05 17:02:52 EST
Rather than a dd of the first 10 MB of the corrupted volume, please use the xfs_metadump tool to capture all metadata.  By default it obfuscates filenames and attributes; if these are not considered sensitive information, please use the "-o" option to disable the obfuscation.

Then compress it and attach it to this bug.
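
For reference, the capture-and-compress steps could be scripted roughly as follows. This is only a sketch under assumptions: the function names and the device and dump paths in the usage note are hypothetical, it assumes xfs_metadump is installed and run with sufficient privileges, and the volume should not be mounted read-write while the dump is taken.

```python
import gzip
import shutil
import subprocess


def metadump_command(device, dump_path, keep_names=False):
    """Build the xfs_metadump argument list for a device or image file."""
    cmd = ["xfs_metadump"]
    if keep_names:
        # -o disables obfuscation of file names and extended attributes.
        cmd.append("-o")
    cmd += [device, dump_path]
    return cmd


def capture_and_compress(device, dump_path, keep_names=False):
    """Run xfs_metadump, then gzip the dump so it can be attached to the bug."""
    subprocess.run(metadump_command(device, dump_path, keep_names), check=True)
    gz_path = dump_path + ".gz"
    with open(dump_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```

Example call with placeholder paths: capture_and_compress("/dev/mapper/example-volume", "/tmp/volume.metadump", keep_names=True). Leave keep_names at False if filenames are considered sensitive.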

Comment 30 Faiaz Ahmed 2016-02-10 22:39 EST
Created attachment 1123080 [details]
For the attachment to comment 26
Comment 37 Sergey Gotliv 2016-02-25 08:53:06 EST
This is not a Cinder bug, and most probably not even an OpenStack bug. Once you discover who is running os-prober, reopen and assign to the relevant component.
