Bug 836442
Summary: | shutdown -r now in virtual machine often leaves problems | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Trever Adams <trever> |
Component: | qemu | Assignee: | Fedora Virtualization Maintainers <virt-maint> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 17 | CC: | amit.shah, berrange, cfergeau, devrim, dwmw2, esandeen, hhorak, itamar, knoel, pbonzini, rjones, scottt.tw, tgl, virt-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2013-04-03 15:05:34 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Trever Adams
2012-06-29 06:27:42 UTC
This seems reminiscent of some reports that we saw back when postgres did not fsync the postmaster.pid file while creating it. 9.1.4 certainly does so, though, which says to me that this is probably a bug in the VM software (ie, failing to pass through an fsync request). Out of curiosity, what is in the postmaster.pid file exactly when you see this error? As the pid is on a tmpfs, I cannot explain this behavior at all. It will take me a while to reproduce this bug. I will post an answer as soon as I am able. You've got /var/lib/pgsql/data in a tmpfs? That's uncommon ... but if so, this is really weird, because the tmpfs ought to come up empty after a reboot. Sorry. I know most pids are in /var/run and my brain just burped. Yes, this pid is NOT tmpfs. Sorry for the confusion. I was unable to get a pid problem on the reboot I just did. I will keep trying until I can show you the contents. Trever, if you encounter the issue again, it would be good to attach also postgresql-related messages from the systemd log. For booting with more verbose systemd log you can use "log_buf_len=1M systemd.log_level=debug systemd.log_target=kmsg" as kernel parameters. The PID file is zero length. Jul 8 13:46:07 machine pg_ctl[8339]: pg_ctl: invalid data in PID file "/var/lib/pgsql/data/postmaster.pid" Jul 8 13:46:07 machine systemd[1]: postgresql.service: control process exited, code=exited status=1 Jul 8 13:46:07 machine systemd[1]: Unit postgresql.service entered failed state. I have appended those things ot my boot string, but I will have ot wait for them to take effect. (In reply to comment #6) > The PID file is zero length. That's pretty much what I expected. This indicates that the filesystem metadata showing that the PID file exists reaches disk, but the actual file contents do not, despite the fsync() call that PG has made. Now it could be that you're using unsafe filesystem settings, but otherwise this is a kernel or VM bug. I'm going to make a guess and reassign this to qemu, but that's mostly a guess ... /dev/mapper/vg_machine-lv_root on / type ext4 (rw,relatime,seclabel,data=ordered) is how it is mounted in the vm /dev/mapper/Family-lv_family on /var/Family type ext4 (rw,relatime,seclabel,data=ordered) is how where the drive images are stored is mounted I use virtio for network and drives (drive images in this case). I can provide more information when it is requested. I have tried to include everything I have that may be releavent. Is your VM stored in a qcow2 file and does qemu segfault during the reboot? (bug 836913) You would have to enable coredumps and make sure they are being captured centrally by something like abrt, or if you don't want to use abrt, then try a technique like the one described here: http://people.redhat.com/~rjones/xen-stress-tests/ (By the way the last thing there about restarting services to set the ulimit probably doesn't work with systemd) There is an ext4 bug which causes data loss on shutdown. Perhaps this is the same bug. Will try to search for it. Candidate bugs: https://bugzilla.redhat.com/show_bug.cgi?id=721114 https://bugzilla.redhat.com/show_bug.cgi?id=749527 https://bugzilla.redhat.com/show_bug.cgi?id=818172 (all dups of each other - the last one contains patches) Those patches are in 2.3.36, so F17 should have them. "It started showing up for me about three weeks ago." - and it was ok before then? Do you have logs of what packages got updated (either on the host or the guests, I guess?) So the .pid file is showing up as 0-length on a reboot? But a .pid file shouldn't even persist through a reboot should it? This is older postgresql, but: [root@sandeen ~]# /etc/init.d/postgresql status postmaster is stopped [root@sandeen ~]# ls /var/lib/pgsql/data/postmaster.pid ls: /var/lib/pgsql/data/postmaster.pid: No such file or directory [root@sandeen ~]# /etc/init.d/postgresql start Starting postgresql service: [ OK ] [root@sandeen ~]# ls /var/lib/pgsql/data/postmaster.pid /var/lib/pgsql/data/postmaster.pid [root@sandeen ~]# /etc/init.d/postgresql stop Stopping postgresql service: [ OK ] [root@sandeen ~]# ls /var/lib/pgsql/data/postmaster.pid ls: /var/lib/pgsql/data/postmaster.pid: No such file or directory so if things shut down cleanly there should be no pid file at all, right? I am using raw images as I plan on migrating these to real hardware at some point. And, no, qemu doesn't segfault on me. The .pid is 0 length but in existence. It seems to me either there is an fs problem, or a stop or start of postgresql problem. It might be the ext4 bug, but I believe I only see it with postgresql pid file. I cannot look at the bugs. Apparently there are security related things which lock me out. Should I attach the entire yum.log (a few months worth because of some logrotate problem)? I am not sure on the exact time the bug started. What are the timestamps on the pid file, maybe that's interesting, to see how they relate to the time of shutdown? I think it's up to /usr/bin/postmaster to create & remove the pidfile. Perhaps stracing it would help? Does a manual /etc/init.d/postmaster [stop|start] do the right thing outside of system shutdown? I will have to wait until the next failure to get timestampes and other things. I don't like to reboot these vms or the host machine more than I must. Manually shutting down (service postgresql restart) has failed to produce the problem. Is it possible for you to clone the VM and run the tests on a clone? It might be possible if I can run it on a different machine. The host in question is about maxed. I have access to others which are similar hardware, but not identical. [root@machine ~]# stat /var/lib/pgsql/data/postmaster.pid File: `/var/lib/pgsql/data/postmaster.pid' Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: fd00h/64768d Inode: 1704046 Links: 1 Access: (0600/-rw-------) Uid: ( 26/postgres) Gid: ( 26/postgres) Context: system_u:object_r:postgresql_db_t:s0 Access: 2012-07-20 00:15:22.736309720 -0600 Modify: 2012-07-19 19:39:01.596744269 -0600 Change: 2012-07-19 19:39:01.596744269 -0600 Birth: - [root@machine ~]# uptime 00:17:46 up 4:40, 1 user, load average: 0.64, 1.63, 1.38 [root@machine ~]# date Fri Jul 20 00:17:51 MDT 2012 So, it appears that the file is being modified on shutdown or reboot. I haven't done the math. The machine was rebooted for glibc/kernel changes about that time (the uptime is from the vm not the host). The host appears to be about the same. so at 00:17:46 you'd been up for 4h 40m, soo... last boot was at about 19:37:46 (you could verify in /var/log/messages I suppose), and the file was modified at 19:39:01...? If I figured right it looks more like it's been modified during bootup, not shutdown? I think you're going to need to find a way to trace this to get more info. I am not sure how to do a trace where this happens on startup. It is happening less and less, but it still happens occasionally. Eric, I haven't been able to trigger this any other place but boot. I haven't been able to trigger it with an strace going. It is getting harder and harder to trigger. Interestingly enough, most of the time I do not have to remove the pid file, just restart now. I have not been able to trigger this in the last few months. I think it may be fixed. I know several problems with disk corruption with kqemu, etc. have been fixed and I cannot trigger this any more. Ok, I'm going to go ahead &close this. If it comes up again, please do re-open. Thanks, -Eric |