Post minor update: after hard rebooting all nodes in the setup, the launched instances don't boot due to disk errors:

[   27.588540] end_request: I/O error, dev vda, sector 46275

Environment:
openstack-nova-compute-17.0.3-0.20180420001140.el7ost.noarch
instack-undercloud-8.4.1-4.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001140.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
python-nova-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-33.el7ost.noarch
puppet-nova-12.4.0-3.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001140.el7ost.noarch
ceph-ansible-3.1.0-0.1.rc3.el7cp.noarch
openstack-nova-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001140.el7ost.noarch

Steps to reproduce:
1. Deploy beta bits with composable roles (networker, messaging, database) + VLAN + IPv6.
2. Minor update the setup to passed_phase2.
3. Launch an instance.
4. Hard reboot the setup (by cutting power).
5. Start the instance (it is in SHUTOFF state).

Result: although the instance reaches ACTIVE state, it doesn't boot properly.
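For reference, the instance part of the reproduction (steps 3-5) can be sketched with the openstack CLI; the flavor, image, network, and server names below are placeholders, not taken from the original report:

```shell
# Step 3: launch a test instance (names are hypothetical).
openstack server create --flavor m1.small --image cirros \
    --network private test-instance

# Step 4: hard reboot the nodes by cutting power (e.g. via each node's
# BMC/IPMI interface; the exact commands depend on the environment).

# Step 5: after the computes are back, the instance is in SHUTOFF state.
openstack server start test-instance

# Watch the instance console for the I/O errors described above.
openstack console log show test-instance
```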
Looking at the console of the instance:

[   27.284597] end_request: I/O error, dev vda, sector 32731
[   27.284597] end_request: I/O error, dev vda, sector 32741
[   27.284597] end_request: I/O error, dev vda, sector 46275
[   27.284597] end_request: I/O error, dev vda, sector 48839
[   27.544440] end_request: I/O error, dev vda, sector 0
[   27.548431] JBD: recovery failed
[   27.563190] EXT3-fs (vda1): error loading journal
[   27.584555] end_request: I/O error, dev vda, sector 16067
[   27.588540] end_request: I/O error, dev vda, sector 16261
[   27.588540] end_request: I/O error, dev vda, sector 16269
[   27.588540] end_request: I/O error, dev vda, sector 16293
[   27.588540] end_request: I/O error, dev vda, sector 16777
[   27.588540] end_request: I/O error, dev vda, sector 16999
[   27.588540] end_request: I/O error, dev vda, sector 17301
[   27.588540] end_request: I/O error, dev vda, sector 17307
[   27.588540] end_request: I/O error, dev vda, sector 26929
[   27.588540] end_request: I/O error, dev vda, sector 27005
[   27.588540] end_request: I/O error, dev vda, sector 32645
[   27.588540] end_request: I/O error, dev vda, sector 32681
[   27.588540] end_request: I/O error, dev vda, sector 32687
[   27.588540] end_request: I/O error, dev vda, sector 32693
[   27.588540] end_request: I/O error, dev vda, sector 32697
[   27.588540] end_request: I/O error, dev vda, sector 32705
[   27.588540] end_request: I/O error, dev vda, sector 32715
[   27.588540] end_request: I/O error, dev vda, sector 32721
[   27.588540] end_request: I/O error, dev vda, sector 32727
[   27.588540] end_request: I/O error, dev vda, sector 32731
[   27.588540] end_request: I/O error, dev vda, sector 32741
[   27.588540] end_request: I/O error, dev vda, sector 46275
[   27.588540] end_request: I/O error, dev vda, sector 48839
[   27.826491] JBD2: recovery failed
[   27.833509] EXT4-fs (vda1): error loading journal
mount: mounting /dev/vda1 on /newroot failed: Invalid argument
FATAL: ==== uh-oh, /dev/vda1 was there, but not after growroot ====
Executing /bin/sh. maybe you can help
/bin/sh: can't access tty; job control turned off
/ #

Reproduced with another instance and an additional reboot.
The issue reproduced on a setup without the minor update:

openstack-nova-scheduler-17.0.3-0.20180420001140.el7ost.noarch
puppet-nova-12.4.0-3.el7ost.noarch
instack-undercloud-8.4.1-4.el7ost.noarch
python-nova-17.0.3-0.20180420001140.el7ost.noarch
puppet-ceph-2.5.0-1.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-33.el7ost.noarch
ceph-ansible-3.1.0-0.1.rc8.el7cp.noarch
openstack-nova-placement-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-compute-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001140.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001140.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001140.el7ost.noarch

1) The setup has the networker composable role.
2) It's a VXLAN + IPv6 deployment.
This also happens on a clean deploy of the OSP13 RC puddle, in IPv4 environments as well.
The instance comes back fine if the reboot is triggered via ssh, so the problem doesn't occur with a graceful reboot. Power-outage scenarios affecting the compute nodes will hit it, however.
The issue also reproduces if we hard reboot only the computes.
a similar issue: https://bugs.launchpad.net/nova/+bug/1773449
Running the following command on one mon host seems to resolve it on the affected setup:

sudo ceph auth caps client.openstack mon 'allow r, allow command "osd blacklist"' osd 'allow rwx'
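A minimal sketch of checking and then fixing the caps on a mon host; the client name client.openstack matches this deployment, so adjust it for others:

```shell
# Show the current caps for the openstack client key.
sudo ceph auth get client.openstack

# Allow the client to run "osd blacklist", which it needs in order to
# blacklist a dead client's stale RBD lock after an unclean shutdown.
sudo ceph auth caps client.openstack \
    mon 'allow r, allow command "osd blacklist"' \
    osd 'allow rwx'

# Confirm the new caps took effect.
sudo ceph auth get client.openstack
```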
The disk corruption on hard reboot is exactly what I would normally expect from writeback caching. Our nova config help text specifically warns against this for disk_cachemodes values including writeback. Now, it sounds like there are other interactions with the rbd driver and its own cache (which I don't understand). It sounds like if the guest properly detects the cache mode on the block device and performs flushes, those flushes should be honored through rbd. But it sure looks like that's not happening. I don't think nova has regressed here or caused this behavior. This seems like either an unsafe config, or something that normally makes it safe being broken in qemu or rbd.
Here's a link to the qemu-rbd documentation that we looked over today on IRC in #rhos-mgt: http://docs.ceph.com/docs/master/rbd/qemu-rbd/#qemu-cache-options It recommends cache=writeback when rbd_cache=true. We (and others out in the world) have been configuring disk_cachemodes="network=writeback" for performance with ceph, and presumably this test (hard reboot of compute hosts with ceph-backed instances) has worked fine in the past. We should have qemu and rbd experts take a look at this BZ and comment on whether writeback is supposed to be safe with rbd caching, and whether anything has regressed in qemu, rbd, or qemu-rbd to cause this.
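For context, the cache mode in question lives in nova.conf on the compute nodes; a typical ceph-backed configuration looks like the excerpt below (values are illustrative, not copied from this environment):

```shell
# /etc/nova/nova.conf on the compute nodes (excerpt):
#
#   [libvirt]
#   images_type = rbd
#   disk_cachemodes = network=writeback
#
# The effective value can be inspected with crudini:
crudini --get /etc/nova/nova.conf libvirt disk_cachemodes
```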
(In reply to Alexander Chuzhoy from comment #9)
> runing the following command on one mon host seems to resolve it on the
> affected setup:
> sudo ceph auth caps client.openstack mon 'allow r, allow command "osd
> blacklist"' osd 'allow rwx'

The mon logs on controller-1 show the blacklist deny messages starting at 16:40:

Jun 14 16:40:48 controller-1 docker: 2018-06-14 16:40:48.563134 7f45564ba700  0 log_channel(audit) log [INF] : from='client.? [fd00:fd00:fd00:3000::16]:0/3291166411' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "[fd00:fd00:fd00:3000::16]:0/3368966685"}]: access denied
Jun 14 16:40:48 controller-1 journal: 2018-06-14 16:40:48.563134 7f45564ba700  0 log_channel(audit) log [INF] : from='client.? [fd00:fd00:fd00:3000::16]:0/3291166411' entity='client.openstack' cmd=[{"prefix": "osd blacklist", "blacklistop": "add", "addr": "[fd00:fd00:fd00:3000::16]:0/3368966685"}]: access denied

This looks like what is mentioned in [1] for ceph clusters upgraded from Red Hat Ceph Storage 2.y to Red Hat Ceph Storage 3.y. If the same test worked before the minor update, we should compare the ceph auth settings from before and after the update to see whether they changed.

[1] https://access.redhat.com/solutions/3377231
Martin's solution from comment 12 seems the most likely here. The upstream bug referenced in comment 8 was also resolved by fixing the ceph auth settings. Can we please check them? Assuming this is the issue, we presumably need to fix whatever configures them.
(In reply to Matthew Booth from comment #14)
> Martin's solution from comment 12 seems the most likely here. The upstream
> bug referenced in comment 8 was also resolved by fixing ceph auth settings.
> Please can we check them?
>
> Assuming this is the issue we presumably need to fix whatever configures
> them.

Looking at comment 9, this seems to be confirmed already.
Checked a deployment with OSP13:

Applied patch https://code.engineering.redhat.com/gerrit/141750 before deploying the overcloud.
Deployed the overcloud (3 controllers, 2 computes, 3 ceph nodes).
Started an instance.
Hard rebooted the computes.
Started the powered-off instance again after the computes became available.

Works fine; the issue doesn't reproduce.

Environment:
instack-undercloud-8.4.1-4.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-35.el7ost.noarch
ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch
+ https://code.engineering.redhat.com/gerrit/141750
Verified on openstack-tripleo-heat-templates-8.0.2-38.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086