Description of problem:
I have a 4-node cluster that has been running GFS2 on top of an EMC SAN for a while now, and for the past couple of months we have been randomly experiencing heavy write slowdowns. The write rate drops from about 30 MB/s to 10 kB/s. It can affect one node or several at the same time. Unmounting and remounting solves the problem on the affected node, but after some random interval (hours, days) it happens again.

Version-Release number of selected component (if applicable):
CentOS 5.6 x86_64

How reproducible:
Random

Steps to Reproduce:
1. Just wait until it happens.

Actual results:
File writes are 10 kB/s at most, regardless of file size.

Expected results:
About 30 MB/s, as usual.

Additional info:
Operating system: CentOS 5.6 x86_64
Kernel: 2.6.18-238.9.1.el5
cman: cman-2.0.115-68.el5_6.3
gfs2-utils: gfs2-utils-0.1.62-28.el5_6.1
3 nodes on 4 Gb Fibre Channel
1 node on 1 Gb iSCSI
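For anyone wanting to quantify the slowdown, here is a minimal sketch of a sequential-write test. The mount point is an assumption: set GFS_MNT to your GFS2 mount (e.g. /mnt/gfs); it defaults to /tmp here only so the script can be dry-run anywhere.

```shell
#!/bin/sh
# Hedged sketch: quick sequential-write throughput check.
# GFS_MNT is an assumed mount point; set it to your GFS2 mount.
GFS_MNT=${GFS_MNT:-/tmp}
TESTFILE="$GFS_MNT/ddtest.$$"
# conv=fsync forces the data to disk before dd reports a rate,
# so page-cache speed does not mask a slow device.
dd if=/dev/zero of="$TESTFILE" bs=1M count=50 conv=fsync 2>&1
rm -f "$TESTFILE"
```

On a healthy node this should report something near the usual 30 MB/s; on an affected node the reported rate should collapse accordingly.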
Hi Ramiro, Have you opened a ticket with Red Hat support (assuming you have an RH subscription) or with EMC? Which EMC storage array do you have? Thanks!
Hi, all systems are running CentOS 5.6 and I don't have any RH subscription at the moment. Our storage is an EMC CLARiiON CX3-10. Thank you.
Hi Ramiro, Red Hat Bugzilla is meant primarily to track issues our customers have with RHEL. If you use CentOS, we certainly appreciate hearing about the issues, but you are probably better off posting to the upstream lists. For GFS2, use: cluster-devel. As an EMC customer, you can probably open a ticket through them. One thing to try to avoid is thrashing locks between nodes. For example, running find on multiple nodes, creating files in a shared directory from more than one node, or possibly running a backup application could all contribute to uneven performance. Thanks!
Hi Ric, I created this report because I was advised to do so on the linux-cluster list. Also, some months ago, Steven Whitehouse posted some guidelines on how one should report an issue, and he said that members of the community should report potential bugs here. I apologize if I'm out of line here; I'm just trying to follow protocol in order to get some assistance. As for EMC, we don't have support right now. But even if we did, I don't think this is related to the storage, since unmounting and remounting solves the problem, and other nodes connected to the same storage with other filesystems (ext3) don't have any problems. I've gone through all our applications (web apps) and found nothing that could be compromising performance. The only application that might affect performance is a backup client (Amanda), which runs once a day. But it has been running for a while now and we saw no problems in the past. Thank you for your time.
Yes, I would like to encourage people to report bugs here. Provided it's clear what is being reported, that's fine. There are a couple of likely causes, so let me ask some questions to try to narrow down the possibilities:

1. How full is the filesystem in question?

2. Are you able to identify a file on the filesystem which was created while the write speed was very slow? If so, and assuming it is a reasonable size (say 1000 blocks or more), take a look at it with the filefrag tool. That will show you all of its extents. If the extents are very small (e.g. only a few blocks each) then you may have hit a known problem. In that case we have already fixed it and I can close this bug as a dup of that one. This is more likely if the filesystem is getting close to being full; you won't hit this bug on a filesystem that is nearly empty. Also note that this particular bug affects neither Fedora/upstream nor RHEL 6.

If that turns out not to be the case, then the next most likely issue is contention between the nodes, as Ric mentioned above. We have a document which describes how to deal with that situation, but I'm afraid it is only accessible to customers with an RHN account at the moment.
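The two checks above can be sketched as follows; the mount point and file name are placeholders, so substitute a mount point and a recently written file from your own filesystem:

```shell
#!/bin/sh
# Hedged sketch of the two diagnostic checks: fullness and fragmentation.
# MOUNT and the file name are examples; substitute your own.
MOUNT=${MOUNT:-/}
# 1. How full is the filesystem? (-P gives the stable POSIX column layout)
df -P "$MOUNT" | awk 'NR==2 {print "used:", $5}'
# 2. How fragmented is a recently written file of ~1000+ blocks?
#    Many small extents (a few blocks each) point at the known allocator bug.
if command -v filefrag >/dev/null 2>&1; then
    filefrag "$MOUNT/some-recent-file.tar.gz"
fi
```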
Hi Steve,

1. The filesystem is 60% full.

2. I've copied a file (2632 blocks) on an affected node and here is what I've got:

From the affected node:
filefrag /mnt/gfs/xymon-4.3.0.tar.gz
/mnt/gfs/xymon-4.3.0.tar.gz: 652 extents found

The same file copied from an unaffected node to another location:
filefrag /mnt/gfs/tmp/xymon-4.3.0.tar.gz
/mnt/gfs/tmp/xymon-4.3.0.tar.gz: 157 extents found

A file created in 2009 of about the same size:
filefrag /mnt/gfs/awstats/data/awstats022009.test.txt
/mnt/gfs/awstats/data/awstats022009.test.txt: 1 extent found

3. Where can I find that document? I don't have any entitlements right now, but maybe I can access it.

Thank you.
Sorry, the filesystem is actually 66% full: total: 200G, used: 132G, free: 69G.
It looks like you have more extents than I'd expect for only ~2600 blocks on the affected node, and your fs is probably full enough to have hit the problem. The document is, I'm afraid, only available to those with an RHN account, but the same information has been repeated by myself (and others) many times on the mailing lists. It is a question of taking the cache behaviour of GFS2 into account in order to get the most from it. I'm going to close this bz as a dup of the existing, known issue that is almost certainly causing the problem you've reported. I would also suggest that you consider either moving to a paid support contract with Red Hat or using Fedora. In the latter case, the particular problem has been fixed for a long time now, and Fedora is much more up to date. The problem with CentOS is that it does not pick up updates in a timely manner, so you are likely to run into issues that have long since been fixed in other distros.
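To put rough numbers on that, using the figures from the filefrag output above: 2632 blocks spread over 652 extents averages only about 4 blocks per extent on the affected node, versus roughly 16 blocks per extent for the copy made on the unaffected node (integer shell arithmetic):

```shell
#!/bin/sh
# Average extent sizes from the filefrag output above (integer arithmetic).
echo "affected copy:   $((2632 / 652)) blocks/extent"   # ~4
echo "unaffected copy: $((2632 / 157)) blocks/extent"   # ~16
```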
*** This bug has been marked as a duplicate of bug 683155 ***
Steve, is this problem fixed in RHEL 5? Is Fedora OK for production use? Isn't Fedora's policy to give updates for only 13 months on each release? Cheers
Yes, it will be fixed in RHEL 5.7, and the fix has also been released in the z-streams for 5.4.z and 5.6.z; however, to the best of my knowledge these z releases do not make their way into CentOS, so there will be no CentOS release with this fix until 5.7. Yes, it is Fedora policy to only release updates for a fairly limited period of time. My point was not that Fedora is suitable for production use, but rather that it gets fixes much more quickly than CentOS, which has to wait both for the RHEL process and then for its own processes before updates are ready. So Fedora is good for evaluation use, and also if you are willing to do your own support. CentOS has its uses, but it does appear to be a bit behind the times with regard to GFS2. I don't often use any of the Debian-based distros (Ubuntu, etc.), but they may potentially offer a solution closer to what you are looking for (a balance between update frequency and stability), though I can't vouch for that. If, on the other hand, you want the support done for you, then at the risk of sounding like an advert, there is really no substitute for paid-for distros.