Bug 1591434 - After hard rebooting computes in the setup, the launched instances don't boot due to disk errors
Summary: After hard rebooting computes in the setup, the launched instances don't boot ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 13.0 (Queens)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-14 17:43 UTC by Alexander Chuzhoy
Modified: 2019-09-09 15:05 UTC (History)
24 users

Fixed In Version: openstack-tripleo-heat-templates-8.0.2-36.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-27 13:58:15 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Launchpad 1773449 None None None 2018-06-17 08:16:21 UTC
OpenStack gerrit 576195 None MERGED Update CephX client.openstack keyring to use 'profile rbd' 2020-09-03 23:29:52 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:59:06 UTC

Description Alexander Chuzhoy 2018-06-14 17:43:48 UTC
Post minor update: After hard rebooting all nodes in the setup, the launched instances don't boot due to disk errors:
[   27.588540] end_request: I/O error, dev vda, sector 46275




Environment:
openstack-nova-compute-17.0.3-0.20180420001140.el7ost.noarch
instack-undercloud-8.4.1-4.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001140.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
python-nova-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-33.el7ost.noarch
puppet-nova-12.4.0-3.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001140.el7ost.noarch
ceph-ansible-3.1.0-0.1.rc3.el7cp.noarch
openstack-nova-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001140.el7ost.noarch



Steps to reproduce:
1. Deploy beta bits with composable roles (networker, messaging, database) + vlan + IPv6.

2. Minor update the setup to passed_phase2.

3. Launch an instance.

4. Hard reboot the setup (using power).

5. Start the instance (it is in SHUTOFF state).


Result:

Although the instance is in ACTIVE state, it doesn't boot properly.

Looking at the console of the instance:

[   27.284597] end_request: I/O error, dev vda, sector 32731
[   27.284597] end_request: I/O error, dev vda, sector 32741
[   27.284597] end_request: I/O error, dev vda, sector 46275
[   27.284597] end_request: I/O error, dev vda, sector 48839
[   27.544440] end_request: I/O error, dev vda, sector 0
[   27.548431] JBD: recovery failed
[   27.563190] EXT3-fs (vda1): error loading journal
[   27.584555] end_request: I/O error, dev vda, sector 16067
[   27.588540] end_request: I/O error, dev vda, sector 16261
[   27.588540] end_request: I/O error, dev vda, sector 16269
[   27.588540] end_request: I/O error, dev vda, sector 16293
[   27.588540] end_request: I/O error, dev vda, sector 16777
[   27.588540] end_request: I/O error, dev vda, sector 16999
[   27.588540] end_request: I/O error, dev vda, sector 17301
[   27.588540] end_request: I/O error, dev vda, sector 17307
[   27.588540] end_request: I/O error, dev vda, sector 26929
[   27.588540] end_request: I/O error, dev vda, sector 27005
[   27.588540] end_request: I/O error, dev vda, sector 32645
[   27.588540] end_request: I/O error, dev vda, sector 32681
[   27.588540] end_request: I/O error, dev vda, sector 32687
[   27.588540] end_request: I/O error, dev vda, sector 32693
[   27.588540] end_request: I/O error, dev vda, sector 32697
[   27.588540] end_request: I/O error, dev vda, sector 32705
[   27.588540] end_request: I/O error, dev vda, sector 32715
[   27.588540] end_request: I/O error, dev vda, sector 32721
[   27.588540] end_request: I/O error, dev vda, sector 32727
[   27.588540] end_request: I/O error, dev vda, sector 32731
[   27.588540] end_request: I/O error, dev vda, sector 32741
[   27.588540] end_request: I/O error, dev vda, sector 46275
[   27.588540] end_request: I/O error, dev vda, sector 48839
[   27.826491] JBD2: recovery failed
[   27.833509] EXT4-fs (vda1): error loading journal
mount: mounting /dev/vda1 on /newroot failed: Invalid argument
FATAL: ==== uh-oh, /dev/vda1 was there, but not after growroot ====
Executing /bin/sh. maybe you can help
/bin/sh: can't access tty; job control turned off
/ # 


Reproduced with another instance and additional reboot.

Comment 1 Alexander Chuzhoy 2018-06-14 18:23:43 UTC
The issue reproduced on a setup without minor update:

openstack-nova-scheduler-17.0.3-0.20180420001140.el7ost.noarch
puppet-nova-12.4.0-3.el7ost.noarch
instack-undercloud-8.4.1-4.el7ost.noarch
python-nova-17.0.3-0.20180420001140.el7ost.noarch
puppet-ceph-2.5.0-1.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-33.el7ost.noarch
ceph-ansible-3.1.0-0.1.rc8.el7cp.noarch
openstack-nova-placement-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-compute-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001140.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001140.el7ost.noarch



1) the setup has a networker composable role
2) it's a vxlan+ipv6 deployment

Comment 3 Omri Hochman 2018-06-14 19:59:26 UTC
Happens on a clean deploy of the OSP13 RC puddle - IPv4 environments as well.

Comment 4 Omri Hochman 2018-06-14 20:20:58 UTC
It comes back fine if the reboot is triggered via ssh, so the problem doesn't occur with a graceful reboot.

But power-outage scenarios for the compute nodes will be affected.

Comment 7 Alexander Chuzhoy 2018-06-14 20:45:09 UTC
The issue also reproduced when we hard rebooted only the computes.

Comment 8 Omri Hochman 2018-06-14 21:20:41 UTC
a similar issue:  https://bugs.launchpad.net/nova/+bug/1773449

Comment 9 Alexander Chuzhoy 2018-06-14 21:27:09 UTC
Running the following command on one mon host seems to resolve it on the affected setup:
sudo ceph auth caps client.openstack mon 'allow r, allow command "osd blacklist"' osd 'allow rwx'
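For comparison, the permanent fix tracked in the linked gerrit change (576195) moves the client.openstack keyring to CephX profiles, which already include the blacklist permission. A rough sketch of the two cap sets follows; these commands need a live Ceph cluster and admin keyring, and the pool names in the profile form are illustrative assumptions, not taken from this setup:

```
# Workaround caps from this comment: explicitly allow the "osd blacklist"
# mon command for client.openstack.
ceph auth caps client.openstack \
  mon 'allow r, allow command "osd blacklist"' \
  osd 'allow rwx'

# Post-fix caps in the spirit of the gerrit change: use 'profile rbd',
# which grants the blacklist permission. Pool names here are illustrative.
ceph auth caps client.openstack \
  mon 'profile rbd' \
  osd 'profile rbd pool=vms, profile rbd pool=volumes, profile rbd pool=images'

# Inspect the effective caps on any mon host:
ceph auth get client.openstack
```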

Comment 10 Dan Smith 2018-06-14 22:08:28 UTC
The disk corruption on hard reboot is exactly what I would expect from writeback caching normally. Our nova config text specifically warns against this for disk_cachemodes including writeback. Now, it sounds like there are other interactions with the rbd driver and its own cache (which I don't understand). It sounds like, if the guest is properly detecting the cache mode on the block device and performing flushes, those flushes should be honored through rbd; but it sure looks like that's not happening. I don't think nova has regressed here or caused this behavior. This seems like either an unsafe config, or something that makes it safe being broken in qemu or rbd.

Comment 11 melanie witt 2018-06-14 22:41:40 UTC
Here's a link to qemu-rbd documentation that we looked over today on IRC in #rhos-mgt:

http://docs.ceph.com/docs/master/rbd/qemu-rbd/#qemu-cache-options

where it recommends cache=writeback when rbd_cache=true. We (and others out in the world) have been configuring disk_cachemodes="network=writeback" for performance with ceph and presumably this test (hard reboot of compute hosts with ceph-backed instances) has worked fine in the past.

We should consult with qemu and rbd experts to take a look at this BZ and lend their comments on whether writeback is supposed to be safe with rbd caching and whether anything has regressed in qemu or rbd or qemu-rbd, causing this.
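For reference, the settings under discussion live in nova.conf on the computes and in the client section of ceph.conf. A minimal sketch; only disk_cachemodes and rbd_cache are named in this bug, the remaining lines are illustrative assumptions for a ceph-backed deployment:

```ini
; nova.conf on the compute nodes -- disk_cachemodes is the value quoted
; above; images_type = rbd is an illustrative assumption
[libvirt]
images_type = rbd
disk_cachemodes = "network=writeback"

; ceph.conf, [client] section -- rbd cache = true per the qemu-rbd docs
; linked above; the writethrough-until-flush line is an assumption
[client]
rbd cache = true
rbd cache writethrough until flush = true
```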

Comment 12 Martin Schuppert 2018-06-15 07:53:20 UTC
(In reply to Alexander Chuzhoy from comment #9)
> Running the following command on one mon host seems to resolve it on the
> affected setup:
> sudo ceph auth caps client.openstack mon 'allow r, allow command "osd
> blacklist"' osd 'allow rwx'

the mon logs on controller-1 show the blacklist deny messages starting 16:40:

Jun 14 16:40:48 controller-1 docker: 2018-06-14 16:40:48.563134 7f45564ba700  0 log_channel(audit) log [INF] : from='client.? [fd00:fd00:fd00:3000::16]:0/3291166411' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "[fd00:fd00:fd00:3000::16]:0/3368966685"}]:  access denied
Jun 14 16:40:48 controller-1 journal: 2018-06-14 16:40:48.563134 7f45564ba700  0 log_channel(audit) log [INF] : from='client.? [fd00:fd00:fd00:3000::16]:0/3291166411' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "[fd00:fd00:fd00:3000::16]:0/3368966685"}]:  access denied

This looks like what is mentioned in [1] when ceph cluster got upgraded from
Red Hat Ceph Storage 2.y to Red Hat Ceph Storage 3.y  

Maybe we need to check the ceph auth settings from before an update and
afterwards to see if they got changed if the same test worked before the
minor update.

[1] https://access.redhat.com/solutions/3377231
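The access-denied signature in the mon audit logs above is easy to check for mechanically. A minimal sketch (a hypothetical helper, not part of any shipped tooling) that flags denied "osd blacklist" attempts in a mon audit log:

```python
import re

# Matches mon audit-log entries where a client's "osd blacklist" command
# was rejected, as in the controller-1 logs quoted above.
DENIED_BLACKLIST = re.compile(
    r"""entity='(?P<entity>[^']+)'.*"prefix": "osd blacklist".*access denied"""
)

def denied_blacklist_entities(log_lines):
    """Return the entity names whose 'osd blacklist' commands were denied."""
    return [m.group("entity")
            for line in log_lines
            if (m := DENIED_BLACKLIST.search(line))]

# A sample line modeled on the audit log entries above.
sample = (
    "2018-06-14 16:40:48.563134 7f45564ba700  0 log_channel(audit) log [INF] : "
    "from='client.? [fd00:fd00:fd00:3000::16]:0/3291166411' "
    "entity='client.openstack' cmd=[{\"prefix\": \"osd blacklist\", "
    "\"blacklistop\": \"add\", "
    "\"addr\": \"[fd00:fd00:fd00:3000::16]:0/3368966685\"}]:  access denied"
)
print(denied_blacklist_entities([sample]))  # ['client.openstack']
```

Any non-empty result on a mon's audit log would point at the missing blacklist capability described in [1].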

Comment 14 Matthew Booth 2018-06-15 11:09:28 UTC
Martin's solution from comment 12 seems the most likely here. The upstream bug referenced in comment 8 was also resolved by fixing ceph auth settings. Please can we check them?

Assuming this is the issue we presumably need to fix whatever configures them.

Comment 15 Matthew Booth 2018-06-15 14:46:12 UTC
(In reply to Matthew Booth from comment #14)
> Martin's solution from comment 12 seems the most likely here. The upstream
> bug referenced in comment 8 was also resolved by fixing ceph auth settings.
> Please can we check them?
> 
> Assuming this is the issue we presumably need to fix whatever configures
> them.

Looking at comment 9 this seems to be confirmed already

Comment 22 Alexander Chuzhoy 2018-06-18 17:01:40 UTC
Checked a deployment with OSP13:
Applied patch https://code.engineering.redhat.com/gerrit/141750 before deploying overcloud.
Deployed overcloud (3 controllers, 2 computes, 3 ceph).

Started an instance.

Hard rebooted the computes.

Started the turned-off instance again after the computes became available.

Works fine. The issue doesn't reproduce.


Environment:
instack-undercloud-8.4.1-4.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-35.el7ost.noarch
ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch
+
https://code.engineering.redhat.com/gerrit/141750

Comment 27 Yogev Rabl 2018-06-21 14:55:28 UTC
Verified on openstack-tripleo-heat-templates-8.0.2-38.el7ost.noarch

Comment 29 errata-xmlrpc 2018-06-27 13:58:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

