Bug 567113 - ext4 journal_data open O_SYNC etc. fails with EINVAL, blocks postgresql wal default sync method
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 13
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-02-21 18:07 UTC by Bruno Wolff III
Modified: 2010-05-05 19:25 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-05-05 19:25:32 UTC
Type: ---
Embargoed:


Description Bruno Wolff III 2010-02-21 18:07:13 UTC
Description of problem:
When upgrading systems from f12 to f13, postgres' server stopped working. When I got to one where I really needed it working, I looked into the problem and found that by setting wal_sync_method = fsync instead of getting the default, things started working again.

Version-Release number of selected component (if applicable):
postgresql-server-8.4.2-6.fc13.i686

How reproducible:
100%

Steps to Reproduce:
1. Install F13 with postgresql-server
2. service postgresql initdb
3. service postgresql start
4. examine pg_log/*
  
Actual results:
PANIC:  could not open file "pg_xlog/000000010000000000000000" (log file 0, segment 0): Invalid argument

Expected results:


Additional info:

Comment 1 Tom Lane 2010-02-22 00:18:25 UTC
Hm, I tried to install f13 so I could investigate this, but today's boot.iso freezes at the welcome screen :-(.  Could you provide a bit more detail about how you installed this?  Also, what filesystem did you pick?

Comment 2 Bruno Wolff III 2010-02-22 02:56:48 UTC
I did yum updates for the two systems I had been seeing this on. Both have /var/lib/pgsql on ext3.
However, this got me to try it on another system where /var/lib/pgsql is on an ext4 file system and it does not exhibit this problem. This machine was also yum updated, though doesn't go as far back as the other 2.
One of the problem machines had been running the corresponding f12 version of postgres and once I had things going again the existing data was usable.
Maybe there was a change to ext3 support that broke the default sync type (open_datasync according to the comments).
These 3 machines are all i686. Tomorrow I'll have access to an x86_64 machine (I am locked out right now due to openssh/selinux issue) and I'll report what I find there.

Comment 3 Bruno Wolff III 2010-02-22 03:04:43 UTC
Other notes. I tried switching selinux to permissive and checking the audit log and didn't find any reason to believe selinux is causing the problem.
I tried using both existing data directories (from the corresponding f12 version of postgres) and fresh initdb's. The results didn't seem to be related to that.
I haven't tried doing initdb's on different file systems on the same machine, though I think I have one that has both. If you want I could look at that?

Comment 4 Bruno Wolff III 2010-02-22 03:06:31 UTC
I also haven't tried any specific sync methods other than fsync. If you think it would be useful, I could try all of the other possible ones and see what results I get?

Comment 5 Tom Lane 2010-02-22 03:18:56 UTC
Yeah, please, on both points.  I think this is almost certainly a kernel regression, but we ought to try to narrow it down as much as possible before bugging those folk about it.

Comment 6 Bruno Wolff III 2010-02-24 06:09:27 UTC
I did some more testing but results seem mysterious to me. Whether open_sync and open_datasync worked seemed to be tied to the machine, but not the filesystem type.
All three machines should have packages at the same version, though not all had the same set of packages installed. 2 and 3 are mostly identical hardware, but 2 has a couple of extra pata drives thrown in. None are using lvm, all are using raid 1, and machines 1, 3 and 4 use luks.
However machines 1, 2 and 4 use noatime and barrier=1 as mount options and machine 3 doesn't. I'll be checking that shortly.
machine   arch    fs    wal_sync_method     result
machine1  x86_64  ext3  fsync               OK
machine1  x86_64  ext3  fsync_writethrough  invalid
machine1  x86_64  ext3  open_sync           fail
machine2  i686    ext3  default             fail
machine2  i686    ext3  open_datasync       fail
machine2  i686    ext3  fdatasync           OK
machine2  i686    ext3  fsync               OK
machine2  i686    ext3  fsync_writethrough  invalid
machine2  i686    ext3  open_sync           fail
machine2  i686    ext4  default             fail
machine2  i686    ext4  open_datasync       fail
machine2  i686    ext4  fdatasync           OK
machine2  i686    ext4  fsync               OK
machine2  i686    ext4  fsync_writethrough  invalid
machine2  i686    ext4  open_sync           fail
machine3  i686    ext4  default             OK
machine3  i686    ext4  open_datasync       OK
machine3  i686    ext4  fdatasync           OK
machine3  i686    ext4  fsync               OK
machine3  i686    ext4  fsync_writethrough  invalid
machine3  i686    ext4  open_sync           OK
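
The rows above can be tallied mechanically to see which sync methods fail on each machine and filesystem. The following is a minimal Python sketch, not something used in this report; the function name `failing_methods` is made up for illustration, and rows marked "invalid" are simply ignored.

```python
from collections import defaultdict

# Rows copied from the comment above: machine, arch, filesystem, method, result.
RESULTS = """\
machine1 x86_64 ext3 fsync OK
machine1 x86_64 ext3 fsync_writethrough invalid
machine1 x86_64 ext3 open_sync fail
machine2 i686 ext3 default fail
machine2 i686 ext3 open_datasync fail
machine2 i686 ext3 fdatasync OK
machine2 i686 ext3 fsync OK
machine2 i686 ext3 fsync_writethrough invalid
machine2 i686 ext3 open_sync fail
machine2 i686 ext4 default fail
machine2 i686 ext4 open_datasync fail
machine2 i686 ext4 fdatasync OK
machine2 i686 ext4 fsync OK
machine2 i686 ext4 fsync_writethrough invalid
machine2 i686 ext4 open_sync fail
machine3 i686 ext4 default OK
machine3 i686 ext4 open_datasync OK
machine3 i686 ext4 fdatasync OK
machine3 i686 ext4 fsync OK
machine3 i686 ext4 fsync_writethrough invalid
machine3 i686 ext4 open_sync OK
"""


def failing_methods(text):
    """Map (machine, filesystem) to the sorted methods whose result was 'fail'."""
    failures = defaultdict(list)
    for line in text.splitlines():
        machine, _arch, fs, method, result = line.split()
        key = (machine, fs)
        failures[key]  # register the key even when nothing failed
        if result == "fail":
            failures[key].append(method)
    return {key: sorted(methods) for key, methods in failures.items()}


if __name__ == "__main__":
    for (machine, fs), methods in sorted(failing_methods(RESULTS).items()):
        print(machine, fs, "fails:", ", ".join(methods) or "none")
```

Grouped this way, the failures track the machine rather than the filesystem type, and on the affected machines only default, open_datasync and open_sync fail while fsync and fdatasync work.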

Comment 7 Bruno Wolff III 2010-02-24 06:27:58 UTC
It doesn't look like either noatime or barrier=1 makes a difference. I am going to check journalling type next.

Comment 8 Bruno Wolff III 2010-02-24 06:35:27 UTC
Setting journal_data as a default mount option (with tune2fs) triggers the behavior.
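
To check whether a filesystem has journal_data among its superblock defaults, one option is to look at the "Default mount options:" line printed by `tune2fs -l`. Below is a hedged Python sketch that parses that listing from a string; the helper name `default_mount_options` and the `SAMPLE` text are illustrative assumptions, not output captured from the machines in this report. It will not see a data=journal option passed explicitly at mount time.

```python
def default_mount_options(tune2fs_listing):
    """Return the default mount options found in `tune2fs -l` style text."""
    for line in tune2fs_listing.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "Default mount options":
            options = value.split()
            # tune2fs prints "(none)" when no defaults are set.
            return [] if options == ["(none)"] else options
    return []


# Illustrative excerpt only (assumed, not from the reporter's systems).
SAMPLE = """\
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype
Default mount options:    journal_data
Filesystem state:         clean
"""

if __name__ == "__main__":
    options = default_mount_options(SAMPLE)
    print("journal_data set by default:", "journal_data" in options)
```

In practice the text would come from running `tune2fs -l` on the device that holds /var/lib/pgsql.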

Comment 9 Tom Lane 2010-02-26 22:41:20 UTC
OK, I finally got f-13 alpha rc4 installed here, and I confirm it: things are fine by default, but after you set

tune2fs -o journal_data /dev/sdaX

and reboot, the database no longer starts.  strace'ing the postmaster shows it fails here:

[pid  1662] open("pg_xlog/000000010000000000000000", O_RDWR|O_SYNC|O_DIRECT|O_LARGEFILE) = -1 EINVAL (Invalid argument)

I suspect the kernel is unhappy about either O_SYNC or O_DIRECT, but in any case this is a regression from previous behavior.  Reassigning to kernel to see what they say about it.

Reproduced here with:
kernel-PAE-2.6.33-1.fc13.i686
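
To narrow down which flag in the failing open() call is rejected, the combinations can be tried one at a time on a scratch file in the directory of interest. This is a minimal Python sketch under stated assumptions: `probe_open_flags` is a made-up helper, it only tests whether open() succeeds (not whether any I/O would), and the outcome depends on the filesystem and kernel underneath the chosen directory.

```python
import os
import tempfile


def probe_open_flags(directory):
    """Open a scratch file in `directory` with several flag sets.

    Returns {label: None if open() succeeded, else the errno it failed with}.
    """
    candidates = {
        "O_RDWR": os.O_RDWR,
        "O_RDWR|O_SYNC": os.O_RDWR | os.O_SYNC,
    }
    # O_DIRECT is Linux-specific; skip it where Python does not expose it.
    o_direct = getattr(os, "O_DIRECT", None)
    if o_direct is not None:
        candidates["O_RDWR|O_DIRECT"] = os.O_RDWR | o_direct
        candidates["O_RDWR|O_SYNC|O_DIRECT"] = os.O_RDWR | os.O_SYNC | o_direct

    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    results = {}
    try:
        for label, flags in candidates.items():
            try:
                os.close(os.open(path, flags))
                results[label] = None
            except OSError as exc:
                results[label] = exc.errno
    finally:
        os.unlink(path)  # always remove the scratch file
    return results
```

Calling `probe_open_flags("/var/lib/pgsql")` as a user who can write there returns one entry per flag combination; an `errno.EINVAL` value only for the entries that include O_DIRECT would point at O_DIRECT rather than O_SYNC.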

Comment 10 Bruno Wolff III 2010-04-07 04:02:55 UTC
If this isn't fixed by final release, it would probably be good to put something in the release notes about it under postgresql.

Comment 11 Eric Sandeen 2010-05-05 17:33:19 UTC
Sorry for not seeing this bug sooner.

There is no ->direct_IO method for ext3 or ext4 in journalled data mode - never has been, AFAIK.

In my testing, ext3 & ext4 fail the same way, it's not unique to ext4.

open("/mnt/test/blah", O_WRONLY|O_CREAT|O_TRUNC|O_DIRECT, 0666) = -1 EINVAL (Invalid argument)

__dentry_open() does this:

        /* NB: we're sure to have correct a_ops only after f_op->open */
        if (f->f_flags & O_DIRECT) {
                if (!f->f_mapping->a_ops ||
                    ((!f->f_mapping->a_ops->direct_IO) &&
                    (!f->f_mapping->a_ops->get_xip_mem))) {
                        fput(f);
                        f = ERR_PTR(-EINVAL);
                }
        }

so, you get EINVAL.
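
The effect of that check can be restated as a small truth table. The following is a toy Python model of the excerpt above, not kernel code; the function `open_check` and its boolean parameters are hypothetical stand-ins for the `a_ops` pointers.

```python
import errno

# x86 Linux value of O_DIRECT; any distinct bit would do in this model.
O_DIRECT = 0o40000


def open_check(flags, has_a_ops=True, has_direct_io=False, has_get_xip_mem=False):
    """Return 0 if the modeled open may proceed, or -EINVAL as in the excerpt."""
    if flags & O_DIRECT:
        if not has_a_ops or (not has_direct_io and not has_get_xip_mem):
            return -errno.EINVAL
    return 0


if __name__ == "__main__":
    # No direct_IO method (the data=journal case described above): rejected.
    print(open_check(O_DIRECT))
    # Same mapping without O_DIRECT in the flags: the check is skipped.
    print(open_check(0))
    # A mapping that does provide direct_IO: allowed.
    print(open_check(O_DIRECT, has_direct_io=True))
```

The model also shows why sync methods that do not pass O_DIRECT to open() are unaffected: the branch is never entered for them.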

If this "worked" before, I suspect that at a minimum you weren't actually getting direct IO...

(spot-checks RHEL5 ... same deal)

Did this really used to work?  I'm doubtful.

-Eric

Comment 12 Bruno Wolff III 2010-05-05 17:43:04 UTC
It worked in that postgres used to run with that setting. I have no idea if direct IO was really used or not, as I didn't test on that level.
Maybe the error code being returned changed or something of that kind.

Comment 13 Tom Lane 2010-05-05 18:02:29 UTC
I'm pretty sure that postgres' behavior didn't change in this respect between f12 and f13.  Are you sure that you had the filesystem set to journal_data mode before?

Comment 14 Bruno Wolff III 2010-05-05 18:14:18 UTC
I don't believe I touched anything related to the journal mode setting at the time I started seeing the postgresql server stop working. It's possible that some other behavior changed and that while I had requested journal_data, I wasn't really getting it.
I am running software raid 1 and I remember that for a long time I was getting errors related to that and write barriers. So maybe journalling was being disabled because of the lack of support for write barriers with software raid?

Comment 15 Eric Sandeen 2010-05-05 18:22:02 UTC
Barriers shouldn't be related.

From what I see, it's not just that DIO would fail; the O_DIRECT open itself would fail as well, at least since 2.6.18 kernels.

-Eric

Comment 16 Bruno Wolff III 2010-05-05 19:03:24 UTC
If this is the expected behavior, then it's not a regression and we can close this?
Probably not too many people will run into this.
It might make sense to document this somewhere in postgres.

Comment 17 Eric Sandeen 2010-05-05 19:06:56 UTC
Well, I do not think it is a regression, I just wish we knew for sure why it seemed to work for you before.  :)  Fine by me to close it though; there isn't any ->direct_IO for data=journal ... not even in rhel4 (just checked)

-Eric

Comment 18 Bruno Wolff III 2010-05-05 19:25:32 UTC
All of my machines are running F13 now. I have one that I can wipe, but I am short on time. Rebuilding that machine with F11 or early F12 will be low on my priority list.
If I can get postgres to work with the direct IO setting and data journaling, I'll try upgrading the kernel and see if that changes things.
But considering the low impact, I won't be rushing to do this relative to all of the other Fedora stuff I am supposed to be working on.

