Description of problem:
Heartbeat's init.d scripts stopped working after an update.

Version-Release number of selected component (if applicable):
heartbeat-3.0.4-1.el6.x86_64

How reproducible:
Try to restart the service.

Steps to Reproduce:
1. service heartbeat status, or service heartbeat start

Actual results:
Nothing (an empty line)

Expected results:
heartbeat OK [pid 1377 et al] is running on et [et]...
Or any error.

Additional info:
The logs contain no related information.
I wonder if this is maybe SELinux related? Can you try "setenforce 0"?
Also a 'sh -x /etc/init.d/heartbeat start' output might be good if you can attach it.
This appears to be related to the update of the resource-agents package (resource-agents-3.9.2-21.el6_4.8). The HA_BIN variable is set in /usr/lib/ocf/resource.d/heartbeat/.ocf-directories.

In previous versions of the package, it was set as follows:

    : ${HA_BIN:=/usr/lib64/heartbeat}

In resource-agents-3.9.2-21.el6_4.8, HA_BIN is set as follows:

    : ${HA_BIN:=/usr/libexec/heartbeat}

The heartbeat package from EPEL places the heartbeat binary in the /usr/lib64/heartbeat directory. A workaround of symlinking /usr/lib64/heartbeat/heartbeat to /usr/libexec/heartbeat/heartbeat seems to work for now.
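The symlink workaround can be sketched as a small shell function. This is only an illustration, assuming the two directories named in this comment. The function name `link_heartbeat` and its optional root-prefix argument are hypothetical, added so the sketch can be tried against a scratch directory before touching a real system.

```shell
#!/bin/sh
# Hypothetical helper: make the EPEL heartbeat binary reachable from the
# directory that the newer resource-agents HA_BIN default points at.
# $1 is an optional root prefix (an assumption, for dry runs); empty means "/".
link_heartbeat() {
    root="${1:-}"
    src="$root/usr/lib64/heartbeat/heartbeat"   # where EPEL installs the binary
    dst_dir="$root/usr/libexec/heartbeat"       # new HA_BIN default
    dst="$dst_dir/heartbeat"

    # Refuse to create a dangling link.
    if [ ! -f "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi
    # Leave an existing file or link alone.
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        echo "already present: $dst" >&2
        return 0
    fi
    mkdir -p "$dst_dir" && ln -s "$src" "$dst"
}
```

On an affected x86_64 machine it would be called as root with no argument, i.e. `link_heartbeat`. On 32-bit installs the source directory would presumably be /usr/lib/heartbeat instead, which this sketch does not handle.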
This isn't SELinux related. Nick Hope, thanks, that worked!
Would someone be willing to file a new bug or move this over to resource-agents then? it would be better if it would get addressed there if possible.
I cannot find resource-agents on components list.
Are you looking under RHEL? https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%206
David, it looks like this is a consequence of multilib support and PATH handling. Either heartbeat or resource-agents should ship a compatibility symlink. In any case this bug doesn't affect RHEL since we don't ship or support heartbeat.
I'd suggest doing the same change for HA_BIN in heartbeat package.
Cross-filed case 00979415 on the Red Hat customer portal, as this seems to break all of our existing legacy heartbeat setups. The case requests that it be solved either by the compatibility symlink or, if the compat symlink gets refused for RHEL, by Red Hat supporting the EPEL package maintainer.
(In reply to Robert Scheck from comment #11)
> Cross-filed case 00979415 on the Red Hat customer portal as it breaks all of
> our existing legacy heartbeat setups as it seems. The case requests that
> it's solved either or by the compatibility symlink and if the compat symlink
> gets refused for RHEL that Red Hat supports the EPEL package maintainer.

We have no problem helping the EPEL package maintainer. The reason the path was changed was to accommodate the multilib requirement for RHEL. I honestly don't know whether using a compatibility symlink will cause multilib issues, but we will definitely find a solution.

As for the customer case, please remember that we do not support heartbeat in RHEL and that even resource-agents for pacemaker (the heartbeat portions) are still TechPreview (not supported) by RHEL.
(In reply to Fabio Massimo Di Nitto from comment #12)
> I don't honestly know if using a compatibility symlink will cause multilib
> issues, but definitely we will find a solution.

That's why I wonder why you now use %{_libexecdir} instead of %{_libdir}, but that's outside my scope.

> As for the customer case, please remember that we do not support heartbeat
> in RHEL and that even resource-agents for pacemaker (the heartbeat portions)
> are still TechPreview (not supported) by RHEL.

Yes, I am aware of that. But it's still not nice to break things, so I try to provide valuable feedback, and I still care about our own customers at work. You should also keep in mind that, AFAIK, Linbit (the DRBD developers) still supports Heartbeat, and I guess their customers won't be amused about this. Correct me if I'm wrong, but a customer case is the only real chance to get some attention for IMHO not so well thought-out package changes during RHEL 6.x (sorry!).
FYI, from my side (heartbeat maintainer in EPEL), I'm happy to try and make changes or add any interested folks who would like to co-maintain that might have more time than I do to help maintain.
(In reply to Robert Scheck from comment #13)
> (In reply to Fabio Massimo Di Nitto from comment #12)
> > I don't honestly know if using a compatibility symlink will cause multilib
> > issues, but definitely we will find a solution.
>
> That's why I wonder why you use %{_libexecdir} instead of %{_libdir} now, but
> that's out of my target.
>
> > As for the customer case, please remember that we do not support heartbeat
> > in RHEL and that even resource-agents for pacemaker (the heartbeat portions)
> > are still TechPreview (not supported) by RHEL.
>
> Yes, I am aware about that. But it's still not nice to break things thus I
> try to provide valuable feedback - and still care about our own customers at
> work.

Yes, we all agree. That's why we will find a fix one way or another (be it in resource-agents or in heartbeat in EPEL).

> You also should keep in mind that AFAIK Linbit (the DRBD developers) still
> supports Heartbeat and their customers won't be amused about this, I guess.

Linbit ships their own set of packages. I doubt they will be affected by these changes. But then again, we never claimed full support, precisely so that we can change packages as necessary. EPEL and RHEL packaging guidelines are different.

> Correct me but a customer case is the only real chance to get some attention
> to IMHO not so well thought package changes during RHEL 6.x (sorry!).

Not really, no. The bug was getting attention without the customer case. GSS can't do much either way. They don't maintain EPEL, nor do they provide support for TP components. We can also agree that this breakage could have been avoided, though.
(In reply to Kevin Fenzi from comment #14)
> FYI, from my side (heartbeat maintainer in EPEL), I'm happy to try and make
> changes or add any interested folks who would like to co-maintain that might
> have more time than I do to help maintain.

Let's wait until Wednesday for David to come back and quickly discuss the correct fix.
My colleague pointed me to this bug. If you get other bugs regarding heartbeat or resource-agents, feel free to add me to Cc proactively, so later I cannot pretend I did not know about it ;-)

A new *resource-agents* version, 3.9.6, is overdue; it was announced to be released last month :-/ As soon as I find the time we'll release that. (Which should be by the end of this November (honestly!) (or someone else takes over)).

Then I can move all binaries and other stuff that does not belong in libdir (according to what guidelines? can someone shoot me a link please?) in the *heartbeat* package to libexecdir as well, also drop the useless legacy init script dependency on the HA_BIN definition, which has meanwhile been split off into the resource-agents package, tag a heartbeat 3.0.6, and have that require resource-agents >= 3.9.6 for good measure.

Meanwhile, symlinks or patching the heartbeat init script is necessary for recent (newer than ~ July 2013) resource-agents with old heartbeat.

AFAICS, the only heartbeat "dependency" on that variable is in fact the use of $HA_BIN/heartbeat in the init script, so only patching that would be an option as well, but the heartbeat package would still violate the "multilib guidelines", I guess (did I mention I'd like a pointer as to which guidelines to apply?), by putting executable binaries into libdir.

Other ideas?

Lars
*** Bug 1028957 has been marked as a duplicate of this bug. ***
(In reply to Lars Ellenberg from comment #17)
> My colleague pointed me to this bug.
> If you get other bugs regarding heartbeat or resource-agents,
> feel free to add me to Cc proactively,
> so later I can not pretend I did not know about it ;-)
>
> A new *resource-agents* version, 3.9.6 is overdue,
> it was announced to be released last month :-/
>
> As soon as I find the time we'll release that.
> (Which should be by the end of this November (honestly!)
> (or someone else takes over)).
>
> Then I can move all binaries and other stuff that does not belong into libdir
> (according to what guidelines? can someone shoot me a link please?)

Here are a couple of documents I found searching Google.

1. https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/s1-filesystem-fhs.html
   "/usr/lib, used for object files and libraries that are not designed to be directly utilized by shell scripts or users"
   "/usr/libexec, contains small helper programs called by other programs"

2. http://www.centos.org/docs/5/html/Deployment_Guide-en-US/s1-filesystem-fhs.html
   "lib/ contains object files and libraries that are not designed to be directly utilized by users or shell scripts"
   "libexec/ directory contains small helper programs called by other programs"

> in the *heartbeat* package to libexecdir as well,
> also drop the useless legacy init script dependency on the
> HA_BIN definition meanwhile split into the resource-agents package,
> tag a heartbeat 3.0.6, and have that require resource-agents >= 3.9.6
> for good measure.

Sounds great :)

> Meanwhile, symlinks or patching the heartbeat init script is necessary
> for recent (newer than ~ July 2013) resource-agents with old heartbeat.
>
> AFAICS, the only heartbeat "dependency" on that variable is in fact the
> use of $HA_BIN/heartbeat in the init script, so only patching that
> would be an option as well,

If the init script is the only reference to HA_BIN, then that seems like the best fix.

> but the the heartbeat package would still
> violate the "multilib guidelines", I guess (did I mention I'd like a pointer
> as to which guidelines to apply?), by putting executable binaries into
> libdir.

Have heartbeat install the binaries in /usr/libexec/heartbeat as well. But do not depend on the resource-agents package to create or even use that directory (no coupling of the two packages).

> Other ideas?
>
> Lars
Would it be worthwhile to also file a bug against EPEL for heartbeat (if not already done)?
(In reply to Christoph Galuschka from comment #20)
> would it be worthwhile to also file a bug against EPEL for heartbeat (if not
> already done)?

Actually, this bug should be moved to heartbeat.
David, can you do that, or is a new bug required? Thanks
Is there any workaround or intermediate solution that could go into the EPEL heartbeat package? I read about the symlink, so is there any reason that we do not simply put that one into the heartbeat package in EPEL?
(In reply to Robert Scheck from comment #23)
> Is there any workaround or intermediate solution that could go into the EPEL
> heartbeat package? I read about the symlink, so is there any reason that we
> do not simply put that one into the heartbeat package in EPEL?

I do not maintain that package, but from what I've gathered there's a workaround involving a symlink in the /usr/libexec/heartbeat folder that points to a binary in the /usr/lib/heartbeat folder.
David: Thanks
(In reply to David Vossel from comment #24)
> I do not maintain that package, but from what I've gathered there's a
> workaround involving a symlink in the /usr/libexec/heartbeat folder that
> points to a binary in the /usr/lib/heartbeat folder.

I tried to rebuild heartbeat with that symlink for the time being; however, any heartbeat rebuild on RHEL 6.5 fails with "error: 'HA_LIBHBDIR' undeclared (first use in this function)" due to cluster-glue-libs-devel-1.0.5-6.el6. It was first reported via bug #869826.
Please just fix the heartbeat init script for now. Something like this:

    sed -i -e 's,\$HA_BIN/heartbeat,$HEARTBEAT,g' heartbeat/init.d/heartbeat.in
    sed -i -e $'/^### END INIT INFO/ a\\\n\\\nHEARTBEAT=@libdir@/heartbeat\n\n' heartbeat/init.d/heartbeat.in

[sorry, right now I'm deep in other deep shit ;-) or I'd prepare, test and commit this "hotfix" upstream myself]

There is no point in using "$HA_BIN" in this init script at all. Much less so now that the variable is in fact provided by another project, which just happens to be a descendant of something that used to be part of heartbeat...
As long as heartbeat no longer builds on RHEL 6, we are not even able to ship this workaround; users have to execute these sed calls themselves.
Uhm, those sed calls were meant to be executed in a heartbeat source checkout ;) But yes, something similar would work to patch the init script on a "live" system. You are right, though: as long as heartbeat does not currently rebuild at all for your setups, this does not really help much.
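A live-system variant of those sed calls might look like the sketch below. It is untested against the real init script and rests on assumptions: the function name `patch_init_script` is hypothetical, the default binary path is the x86_64 location mentioned earlier in this bug, and it substitutes a literal path instead of introducing a HEARTBEAT variable (on an installed system there is no @libdir@ placeholder to expand). `sed -i` is the GNU sed in-place option.

```shell
#!/bin/sh
# Hypothetical helper: replace every use of $HA_BIN/heartbeat in an installed
# init script with a fixed path, keeping a backup copy next to it.
# $1 = init script to patch
# $2 = heartbeat binary (assumed default: the x86_64 EPEL location)
patch_init_script() {
    script="$1"
    binary="${2:-/usr/lib64/heartbeat/heartbeat}"

    # Keep the original so the change can be reverted by hand.
    cp -p "$script" "$script.orig" || return 1

    # \$ matches a literal dollar sign; "," is the delimiter so the slashes
    # in the path need no escaping.
    sed -i -e 's,\$HA_BIN/heartbeat,'"$binary"',g' "$script"
}
```

It would be run as root, for example as `patch_init_script /etc/init.d/heartbeat`. A later heartbeat package update would overwrite the patched script.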
(quote Lars Ellenberg from comment #29)
...
> But you are right, as long as heartbeat does not currently rebuild at all
> for your setups, this does not really help much.

And if rebuilding it is going to take some work, you might consider adding a little more work while you're at it and repackaging it so it doesn't depend on any of the current linux-ha stuff like cluster-glue and resource-agents.

The rationale is that a) heartbeat is unmaintained legacy code while the current stuff is still a moving target, so there are sure to be more incompatible changes coming to break things. And the "things" heartbeat tends to be used for are mission-critical installations, and when those break b) people tend to get really pissed off at Fedora, RedHat, and Linux-HA, which is not something anyone wants (including the pissed-off: it's bad for our blood pressure).
(In reply to Dimitri Maziuk from comment #30)
> (quote Lars Ellenberg from comment #29)
> ...
> > But you are right, as long as heartbeat does not currently rebuild at all
> > for your setups, this does not really help much.
>
> And if rebuilding it is going to take some work, you might consider adding a
> little more work while you're at it and repackaging it

Yes, as has been mentioned before, we likely need to repackage it according to the "guidelines": put binaries in "libexec", put dynamically loadable stuff in lib{,64}.

> so it doesn't depend
> on any of the current linux-ha stuff like cluster-glue and resource-agents.

Certainly impossible. That has all been one monolithic package. When the package split was done, the messaging and ipc were moved into "glue", so heartbeat will always have to depend on glue. And the resource agents, well, have been moved into "resource-agents", and what remains in heartbeat is only wrappers to use the "ocf" resource agents from the old "haresources" ResourceManager script. Which means this dependency is *also* a hard dependency.

> The rationale is that a) heartbeat is unmaintained legacy code

We very much *do* maintain it. We do use it in production a lot, and we do use it in production also with pacemaker on top.

Just that there have not been many code changes in a while does not mean it is unmaintained. It's "stable". It is pretty ugly on the inside in many places, but so are many other projects. But it does work. So apart from the occasional bug report about possible misbehaviour in certain corner cases, there won't be any "development" happening: it just has no intention to go anywhere, being "feature complete".

> while the curent stuff is still a moving target so there's sure to be more
> incompatible changes coming to break things.

Uhm, no. Glue was no moving target at all, it only tried to keep up with all the breakage caused by pacemaker progress ;-) And after having rewritten pretty much everything that used to be glue, using libqb as messaging layer, using its own re-written-from-scratch lrmd, using its own stonithd and so on, pacemaker no longer depends on glue in any way (unless you try to use it on top of heartbeat, as it then needs to use that messaging layer).

> And that "things" heartbeat
> tends to be used for are mission-critical installations and when those
> break b) people tend to get really pissed off at Fedora, RedHat, and
> Linux-HA, which is not something anyone wants (including the pissed-off:
> it's bad for our blood pressure).

Never change a running system ;-) Or use a supported stack.

(You also did notice that this "does not build" breakage is because of some incompatibility with a 3.5-year-old glue-devel, and that heartbeat *does* build fine against the current glue devel, right?)
(In reply to Lars Ellenberg from comment #31) ... > We very much *do* maintain it. > We do use it in production a lot, > and we do use it in production also with pacemaker on top. > > Just that there have not been much code changes in a while > does not mean it is unmaintained. > It's "stable". and > Or use a supported stack. When I did my Software Engineering 101 "maintained" meant "supported". (Admittedly it wasn't this century and I sort of stopped paying attention sometime after design patterns. So maybe it doesn't anymore, what do I know.) > Glue was no moving target at all, it only tried to keep up with all the > breakage caused by pacemaker progess ;-) "Keep up" by "not moving" is way too zen for me. So cluster-glue is "stable" and does not have to keep up with pacemaker progress anymore, great. Could there be a package heartbeat-resource-agents that similarly doesn't have to keep up while standing still? > Never change a running system ;-) Why bother releasing updates if I'm supposed to never install them?
(In reply to Dimitri Maziuk from comment #32)
> (In reply to Lars Ellenberg from comment #31)
> ...
> > We very much *do* maintain it.
> > We do use it in production a lot,
> > and we do use it in production also with pacemaker on top.
> >
> > Just that there have not been much code changes in a while
> > does not mean it is unmaintained.
> > It's "stable".
>
> and
>
> > Or use a supported stack.
>
> When I did my Software Engineering 101 "maintained" meant "supported".
> (Admittedly it wasn't this century and I sort of stopped paying attention
> sometime after design patterns. So maybe it doesn't anymore, what do I know.)

I don't think "maintained" and "supported" are exact synonyms. But even for your narrow definition: SuSE supports it. Linbit supports it. Others may, too...

> > Glue was no moving target at all, it only tried to keep up with all the
> > breakage caused by pacemaker progess ;-)
>
> "Keep up" by "not moving" is way too zen for me.

Think prey and archer (standing still, but keeping up his aim).

> So cluster-glue is "stable" and does not have to keep up with pacemaker
> progress anymore, great. Could there be a package heartbeat-resource-agents
> that similarly doesn't have to keep up while standing still?

Hey, I have been pissed off by all those unnecessary and incompatible package splits and breakages all the time, believe it or not.

But neither the "unmaintained" nor the "moving target" thing was your central point, I hope, and the "unmaintained" simply struck a nerve here. I very much dislike heartbeat being spoken of as not maintained, when it is.

> > Never change a running system ;-)
>
> Why bother releasing updates if I'm supposed to never install them?

But it is ok to blame upstream heartbeat for a breakage that happened in the 3.5-year-old RHEL-only glue package?

I was simply trying to get through that
 * you are blaming the wrong package
 * you are suggesting the wrong cure...

But alas, there is no point. It's never the fault of he who broke it, it is always the fault of he who no longer works.

I'll add an attachment to "make heartbeat compile again" in a minute, to make you all happy again... Who cares for the blame game; after all, we all want "it" to "just work", right? *sigh*
Created attachment 830886
make heartbeat compile against rhel cluster-glue-libs-devel 1.0.5 and fix init script

Make heartbeat compile against rhel cluster-glue-libs-devel 1.0.5 and fix the init script, which was broken by the recent HA_BIN redefinition in resource-agents. May be incomplete, and the resulting package is untested. But it should get you going. Let me know the final fix you settle on, so I can push something similar in heartbeat upstream, too.
Thanks very much Lars. I actually dug into this earlier in the week, but didn't get a fully building package, I was going to look again today. ;) Anyhow, I took your patch and added another one to fix another compile issue and have a scratch build: http://koji.fedoraproject.org/koji/taskinfo?taskID=6241293 Could folks please test this and provide feedback? If it looks good, I can push an update.
Kevin: I will try to test those builds next week on monday (hopefully). Thanks for providing them.
(In reply to Kevin Fenzi from comment #35)
> Could folks please test this and provide feedback? If it looks good, I can
> push an update.

The /etc/init.d/ patch gets heartbeat started -- I have had it running since the day it broke.

There are 11 other binaries and a bunch of .so's in subdirs in /usr/lib64/heartbeat installed by the heartbeat rpm. There are 3 binaries in /usr/libexec/heartbeat installed by resource-agents. Does anyone know if anything else in the location formerly known as $HA_BIN is ever used?
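One rough way to start answering that question is to list the shell scripts that still mention the variable. The sketch below is only a starting point under stated assumptions: the function name `find_ha_bin_users` is hypothetical, the directories in the usage line are the ones that appear in this bug (other locations may exist), and a text search cannot see hard-coded paths or references inside compiled binaries.

```shell
#!/bin/sh
# Hypothetical helper: print every file under the given directories that
# contains the string HA_BIN. Missing directories are skipped silently.
find_ha_bin_users() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # -r recurse, -l print file names only, -F fixed string.
            # grep exits non-zero when nothing matches; that is not an error here.
            grep -rlF 'HA_BIN' "$dir" 2>/dev/null || true
        fi
    done
}
```

For example: `find_ha_bin_users /etc/init.d /etc/ha.d /usr/lib/ocf`. Each hit would then need to be read by hand to see which file under $HA_BIN it actually expects.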
(In reply to Lars Ellenberg from comment #33)
> SuSE supports it.
> Linbit supports it.

As a centos user filing a bug against an epel rpm, I'm glad to hear that. Especially from a guy known to start his replies on the linux-ha mailing list with "are you a paying suse customer?"

You have a day job, we get it, so do I, so does Kevin. Until linbit and suse start paying him for his effort, the fact that they support their customers is irrelevant here.

> I was simply trying to get through that
> * you are blaming the wrong package
> * you are suggesting the wrong cure...

I am saying that *epel* package X is not being developed or updated, nor supported by anything other than the goodness of Kevin's heart. It depends on *rhel* package Y that is being developed as part of redhat's effort to bring software Z into their distro. To me that spells a high chance of X being broken by an update to Z, which will be unfixable in the current package layout.

You are suggesting that will never happen -- after all the history of linux-ha and all the splits and forks you yourself aren't happy about. I hope you're right, it'd sure make my life easier.
(In reply to Kevin Fenzi from comment #35) > > http://koji.fedoraproject.org/koji/taskinfo?taskID=6241293 > > Could folks please test this and provide feedback? If it looks good, I can > push an update. Kevin: I tested today with two VMs (x64 and i386) and 6.5 and it is looking good. I will do some more testing on real iron and for a longer period tomorrow.
Kevin. I now have the new heartbeat also running on real iron machines (6.5 with current resource-agents) where IPs are monitored - so far looking good.
(In reply to Dimitri Maziuk from comment #38) > (In reply to Lars Ellenberg from comment #33) > > > SuSE supports it. > > Linbit supports it. > > As a centos user filing a bug against epel rpm, I'm glad to hear that. > Especially from a guy known to start his replies on linux-ha mailing list > with "are you a paying suse customer?" You know, thank you, but I'm the Lars. (Ellenberg) *That* is *the other* Lars ;-) (Marowsky-Brée). > You have a day job, we get it, so do I, so does Kevin You realized that I was certainly NOT fighting with Kevin, but telling you (Dimitri) that even for your incorrect and too narrow definition of "maintained", you are wrong? > Until linbit and suse > start paying him for his effort, the fact that they support their customers > is irrlevant here. Absolutely. > I am saying that *epel* package X is not being developed or updated, You are complaining that an unsupported stack stopped working. And that trying to fix that by rebuilding the no longer working package does not work either, because yet an other package dropped a define and heartbeat has not compiled on the platform you chose *for over three years* And if I say "don't complain that it broke, you used an unsupported stack. Your options are to either use a supported stack, or fix who has broken it (not who was broken by it)", then you complain even louder, and forbid me to speak? :-) I would have skipped this comment altogether if you had not mistaken me for lmb; me is lge. So if you insist on arguing with me further about definitions, wording, and the blame game, take it to private mail ...
Guys, if you rebuild heartbeat anyway, please use the current mercurial tip, not the 3-year-old 3.0.4. Ok, "current" as in: was committed 8 months ago. (Strange. I thought I wrote those patches together with those other 2012 ones.)

There are several highly relevant fixes. A flaky network (first packet drop, then communication loss) could
 * potentially cause heartbeat core to eat up 100 % cpu,
 * potentially prevent heartbeat from ever connecting to that node again.
And
 * potentially heartbeat would segfault given bad timing of a node dead event,
 * potentially heartbeat would not even notice a node as dead if it had massive packet loss just before that,
 * in certain situations (again: packet loss helps to trigger it) the ccm would not converge, so nodes would not agree on membership.

If it helps I can tag that as 3.0.6 "soon". I'll cross-post this comment in the other bug, too.
Well, how about we push this old one out with the fix for this issue now... and then when you tag 3.0.6 push it out as soon as it's available? I'd prefer to get people working as they were before without too many changes in one update...
heartbeat-3.0.4-2.el6 has been submitted as an update for Fedora EPEL 6. https://admin.fedoraproject.org/updates/heartbeat-3.0.4-2.el6
(In reply to Lars Ellenberg from comment #41) > You know, thank you, but I'm the Lars. (Ellenberg) > *That* is *the other* Lars ;-) (Marowsky-Brée). Sorry, brain fart. Always knew working on weekend's bad for me. > And if I say "don't complain that it broke, you used an unsupported stack. > Your options are to either use a supported stack, > or fix who has broken it (not who was broken by it)", > then you complain even louder, and forbid me to speak? No, I'm saying we can't fix who has broken it because it has nothing to do with any of us. It's RHEL pulling in the other stack in order to pull in the other other stack (RDO) -- the situation known as "too many cooks". I suppose I can add resource-agents to yum.conf's exclude list...
> (In reply to Lars Ellenberg from comment #41) PS. I get it, it's heartbeat's bug: wrong path in /etc/init.d script. This time. Next time RHEL fixes something else in resource-agents and it won't be. What matters is heartbeat will be broke again.
Package heartbeat-3.0.4-2.el6:
* should fix your issue,
* was pushed to the Fedora EPEL 6 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing heartbeat-3.0.4-2.el6'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2013-12278/heartbeat-3.0.4-2.el6
then log in and leave karma (feedback).
Lars, Kevin, thank you very much for your time and work! The update in EPEL testing works here fine and as expected. No issues so far. Great work! :)
Kevin: Is it possible the change to heartbeat also changed the behaviour of ifconfig (as it does no longer return the HA-IP)? 'ip addr list' works however.
(In reply to Christoph Galuschka from comment #49) > Kevin: Is it possible the change to heartbeat also changed the behaviour of > ifconfig (as it does no longer return the HA-IP)? 'ip addr list' works > however. My guess is it's whatever resource-agent that ends up handling IPAddr that did it. I've a vague memory that the pacemaker's resource agent's been doing that for quite some time, it just probably never made it into redhat until now. Which is why I was bitching upthread: wrong $HA_BIN location is not the only thing that changed. RHEL is making more changes to resource-agents and they have no reason to maintain compatibility with EPEL's heartbeat RPM.
Can we stop piling on unrelated issues here please? The ip addr thing I seem to recall was a difference between using: /etc/ha.d/resource.d/IPaddr and /etc/ha.d/resource.d/IPaddr2 resources, but I don't recall fully. If you have another concrete bug, please file a new bug on it. Thanks.
(In reply to Kevin Fenzi from comment #51) > The ip addr thing I seem to recall was a difference between using: > /etc/ha.d/resource.d/IPaddr > and > /etc/ha.d/resource.d/IPaddr2 No. Unless you're saying the Lars got unstuck in space-time and rewrote several haresources files here to invoke IPaddr2 instead of IPaddr while I was yum-updating resource-agents.
heartbeat-3.0.4-2.el6 has been pushed to the Fedora EPEL 6 stable repository. If problems still persist, please make note of it in this bug report.
Hi almighties,

I just applied this minor update to a few of our clusters and guess what -> the clusters are dead. I explain below:

This new version update (3.0.4-1.el6 to 3.0.4-2.el6) just broke our clusters' unicast functionality. The breakage originates from a new patch pushed with the version built for this bug report.

Related broken patch: heartbeat-3.0.4-duplicate-ucast.patch

The result is that heartbeat cannot start, because ucast (used in /etc/ha.d/ha.cf) does not work, with the following errors in the logs:

info: glib: Starting serial heartbeat on tty /dev/ttyS1 (19200 baud)
info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on br1
info: glib: ucast: bound send socket to device: br1
ERROR: glib: ucast: error setting option SO_REUSEPORT(w): Protocol not available
ERROR: make_io_childpair: cannot open ucast br1
CRIT: Emergency Shutdown: Master Control process died.
CRIT: Killing pid 11194 with SIGTERM
CRIT: Killing pid 11198 with SIGTERM
CRIT: Killing pid 11199 with SIGTERM
CRIT: Emergency Shutdown(MCP dead): Killing ourselves.

When I downgrade to version 3.0.4-1.el6, everything works again. So the patch applied in this bug report creates a regression in unicast functionality.

Please roll back or finish/stabilize the patch "heartbeat-3.0.4-duplicate-ucast.patch". I can test a new version, if you want me to, before you push it to the stable repo.

Regards,
Aurelien Lemaire from Smile Hosting.
Actual bug is: SO_REUSEPORT defined by headers, but not supported by kernel.

> --- Comment #17 from Smile hosting <hosting> ---
> just applied this minor update to our few cluster and guess what -> clusters is
> dead . I explain below :
>
> This new version update( 3.0.4-1.el6 to 3.0.4-2.el6 ) just broke our clusters
> 's unicast fonctionnality taking origine to this new patch puches by this
> bugreport version.
>
> related broken patch : heartbeat-3.0.4-duplicate-ucast.patch
>
> the result is heartbeat cannot start cause ucast (used in /etc/ha.d/ha.cf)
> cannot work with following error in logs :
> info: glib: Starting serial heartbeat on tty /dev/ttyS1 (19200 baud)
> info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on br1
> info: glib: ucast: bound send socket to device: br1
> ERROR: glib: ucast: error setting option SO_REUSEPORT(w): Protocol not
> available

> When i downgrade to version 3.0.4-1.el6 it's all working back well.
> So the patch applied in this bug report create a regression on unicast
> functionality.

No, it does not.

But at the time that -1 binary package was built, SO_REUSEPORT was not defined... When the -2 binary package was built, apparently the define was there. But your kernel does not support it (yet).

If you try to rebuild the -1 package now, against the same headers the -2 package was built against, it will break with a compile-time error. That compile-time error is what said patch tries to trivially fix. Only that this then breaks at runtime when compiled against too recent headers but run on a too old linux kernel.

See upstream mercurial for my attempt at fixing this:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/37f57a36a2dd

I suggest you update to upstream mercurial, or replace your ucast patch with the above.

Cheers,
Lars

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Hi Lars,

Thanks for your answer. I now owe you the context:

I'm on a fully vanilla, up-to-date CentOS 6 (as this bug report talks about) without any home-cooked rebuild of any package, using the vanilla EL6 repo for the heartbeat packages.

uname -a:
Linux HOSTNAME 2.6.32-358.14.1.el6.x86_64 #1 SMP Tue Jul 16 23:51:20 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Vanilla packages:
me-filer5:~# rpm -qa |egrep 'heartbeat|kernel-'
heartbeat-libs-3.0.4-1.el6.x86_64 (working version)
heartbeat-3.0.4-1.el6.x86_64 (working version)
kernel-2.6.32-358.14.1.el6.x86_64

The EPEL EL6 repo currently proposes an update for both heartbeat packages to:
heartbeat-libs-3.0.4-2.el6.x86_64 (not working version)
heartbeat-3.0.4-2.el6.x86_64 (not working version)

The patch I referred to is the one I noticed in a diff of the -1 and -2 SRC.rpm of the heartbeat package made by the EPEL heartbeat package maintainer.

I confirm that the -2 version of those packages no longer works on a vanilla EL6 with the vanilla EL6 kernel, which I suppose is not intended but unfortunate.

Hope it helps in understanding the situation.

Regards,
Aurelien Lemaire
This might solve it: regarding the kernel age, the one you use is from 6.4 and thus from July last year. 2.6.32-431.5.1 would be the current version. All I can add is that those packages from EPEL work fine for me.
Please also note that Fedora EPEL 6 officially supports only the latest version of RHEL/CentOS 6, which is currently RHEL 6.5. It might work (or not) with RHEL 6.4, 6.3, etc. Aside from that, I cannot see any issues here...
Hi,

Now I owe you all my facepalm mea culpa. My Puppet servant started excluding my kernel from updates as of 2.6.32-358, so I was indeed using an old kernel.

In conclusion, after the fix and update: with a vanilla UP-TO-DATE kernel, the EPEL heartbeat packages work like a charm with the following package versions:

rpm -qa |egrep 'heartbeat|kernel-2.6'
heartbeat-libs-3.0.4-2.el6.x86_64
kernel-2.6.32-431.el6.x86_64
heartbeat-3.0.4-2.el6.x86_64
kernel-2.6.32-358.14.1.el6.x86_64

My bad.

Regards, Aurélien Lemaire
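Since the root cause here was an outdated running kernel, a rough way to compare kernel release strings is sketched below in Python. The helper `kernel_release_key` is hypothetical, and the comparison is a simplified numeric one, not RPM's own version-comparison algorithm. The release strings are the ones quoted in this thread; nothing in the thread states the exact build in which SO_REUSEPORT support arrived, so 2.6.32-431 is used only because it was reported working above.

```python
import re


def kernel_release_key(release):
    """Turn '2.6.32-358.14.1.el6.x86_64' into (2, 6, 32, 358, 14, 1).

    Hypothetical helper: keeps the leading numeric fields and stops at the
    first non-numeric one (such as 'el6').
    """
    key = []
    for token in re.split(r"[.-]", release):
        if not token.isdigit():
            break
        key.append(int(token))
    return tuple(key)


# Release strings taken from the comments above.
old_kernel = "2.6.32-358.14.1.el6.x86_64"
new_kernel = "2.6.32-431.el6.x86_64"

if kernel_release_key(old_kernel) < kernel_release_key(new_kernel):
    print(old_kernel, "is older than", new_kernel)
```

On a live system the string to check would come from `uname -r` (or `platform.release()` in Python), because having a newer kernel package installed does not mean it is the one currently booted.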
Hi all,

I have an issue: I have configured heartbeat to run on a 2-node httpd cluster. Heartbeat seems to be running when I check the logs, and I see that node1 comes up on the web page, but when I shut down heartbeat so that node2 takes over, the failover does not work.

This is the log I see on node1:

tailf /var/log/ha-log
Jan 28 09:48:04 node1 heartbeat: [2420]: info: Configuration validated. Starting heartbeat 3.0.4
Jan 28 09:48:04 node1 heartbeat: [2421]: info: heartbeat: version 3.0.4
Jan 28 09:48:04 node1 heartbeat: [2421]: info: Heartbeat generation: 1422435302
Jan 28 09:48:04 node1 heartbeat: [2421]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth0
Jan 28 09:48:04 node1 heartbeat: [2421]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth0 - Status: 1
Jan 28 09:48:04 node1 heartbeat: [2421]: info: G_main_add_TriggerHandler: Added signal manual handler
Jan 28 09:48:04 node1 heartbeat: [2421]: info: G_main_add_TriggerHandler: Added signal manual handler
Jan 28 09:48:04 node1 heartbeat: [2421]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Jan 28 09:48:04 node1 heartbeat: [2421]: info: Local status now set to: 'up'
Jan 28 09:48:04 node1 heartbeat: [2421]: info: Link node1:eth0 up.
Jan 28 09:50:05 node1 heartbeat: [2421]: WARN: node node2: is dead
Jan 28 09:50:05 node1 heartbeat: [2421]: info: Comm_now_up(): updating status to active
Jan 28 09:50:05 node1 heartbeat: [2421]: info: Local status now set to: 'active'
Jan 28 09:50:05 node1 heartbeat: [2421]: WARN: No STONITH device configured.
Jan 28 09:50:05 node1 heartbeat: [2421]: WARN: Shared disks are not protected.
Jan 28 09:50:05 node1 heartbeat: [2421]: info: Resources being acquired from node2.
harc(default)[2433]: 2015/01/28_09:50:05 info: Running /etc/ha.d//rc.d/status status
mach_down(default)[2469]: 2015/01/28_09:50:05 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down(default)[2469]: 2015/01/28_09:50:05 info: mach_down takeover complete for node node2.
Jan 28 09:50:05 node1 heartbeat: [2421]: info: mach_down takeover complete.
Jan 28 09:50:05 node1 heartbeat: [2421]: info: Initial resource acquisition complete (mach_down)
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.31.29.243)[2501]: 2015/01/28_09:50:05 INFO: Resource is stopped
Jan 28 09:50:05 node1 heartbeat: [2434]: info: Local Resource acquisition completed.
harc(default)[2588]: 2015/01/28_09:50:06 info: Running /etc/ha.d//rc.d/ip-request-resp ip-request-resp
ip-request-resp(default)[2588]: 2015/01/28_09:50:06 received ip-request-resp 172.31.29.243 OK yes
ResourceManager(default)[2611]: 2015/01/28_09:50:06 info: Acquiring resource group: node1 172.31.29.243 httpd
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.31.29.243)[2639]: 2015/01/28_09:50:06 INFO: Resource is stopped
ResourceManager(default)[2611]: 2015/01/28_09:50:06 info: Running /etc/ha.d/resource.d/IPaddr 172.31.29.243 start
IPaddr(IPaddr_172.31.29.243)[2737]: 2015/01/28_09:50:06 INFO: Adding inet address 172.31.29.243/20 with broadcast address 172.31.31.255 to device eth0
IPaddr(IPaddr_172.31.29.243)[2737]: 2015/01/28_09:50:06 INFO: Bringing device eth0 up
IPaddr(IPaddr_172.31.29.243)[2737]: 2015/01/28_09:50:06 INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-172.31.29.243 eth0 172.31.29.243 auto not_used not_used
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.31.29.243)[2723]: 2015/01/28_09:50:06 INFO: Success
Jan 28 09:50:16 node1 heartbeat: [2421]: info: Local Resource acquisition completed. (none)
Jan 28 09:50:16 node1 heartbeat: [2421]: info: local resource transition completed.
On node2 I see this:

tailf /var/log/ha-log
Jan 28 09:27:22 node2 heartbeat: [1646]: info: Configuration validated. Starting heartbeat 3.0.4
Jan 28 09:27:22 node2 heartbeat: [1647]: info: heartbeat: version 3.0.4
Jan 28 09:27:22 node2 heartbeat: [1647]: info: Heartbeat generation: 1422435301
Jan 28 09:27:22 node2 heartbeat: [1647]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth0
Jan 28 09:27:22 node2 heartbeat: [1647]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth0 - Status: 1
Jan 28 09:27:22 node2 heartbeat: [1647]: info: G_main_add_TriggerHandler: Added signal manual handler
Jan 28 09:27:22 node2 heartbeat: [1647]: info: G_main_add_TriggerHandler: Added signal manual handler
Jan 28 09:27:22 node2 heartbeat: [1647]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Jan 28 09:27:22 node2 heartbeat: [1647]: info: Local status now set to: 'up'
Jan 28 09:27:22 node2 heartbeat: [1647]: info: Link node2:eth0 up.
Jan 28 09:29:23 node2 heartbeat: [1647]: WARN: node node1: is dead
Jan 28 09:29:23 node2 heartbeat: [1647]: info: Comm_now_up(): updating status to active
Jan 28 09:29:23 node2 heartbeat: [1647]: info: Local status now set to: 'active'
Jan 28 09:29:23 node2 heartbeat: [1647]: WARN: No STONITH device configured.
Jan 28 09:29:23 node2 heartbeat: [1647]: WARN: Shared disks are not protected.
Jan 28 09:29:23 node2 heartbeat: [1647]: info: Resources being acquired from node1.
Jan 28 09:29:23 node2 heartbeat: [1656]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys node2] to acquire.
harc(default)[1655]: 2015/01/28_09:29:23 info: Running /etc/ha.d//rc.d/status status
mach_down(default)[1685]: 2015/01/28_09:29:23 info: Taking over resource group 172.31.29.243
ResourceManager(default)[1712]: 2015/01/28_09:29:23 info: Acquiring resource group: node1 172.31.29.243 httpd
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.31.29.243)[1740]: 2015/01/28_09:29:23 INFO: Resource is stopped
ResourceManager(default)[1712]: 2015/01/28_09:29:23 info: Running /etc/ha.d/resource.d/IPaddr 172.31.29.243 start
IPaddr(IPaddr_172.31.29.243)[1838]: 2015/01/28_09:29:23 INFO: Adding inet address 172.31.29.243/20 with broadcast address 172.31.31.255 to device eth0
IPaddr(IPaddr_172.31.29.243)[1838]: 2015/01/28_09:29:23 INFO: Bringing device eth0 up
IPaddr(IPaddr_172.31.29.243)[1838]: 2015/01/28_09:29:23 INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-172.31.29.243 eth0 172.31.29.243 auto not_used not_used
/usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_172.31.29.243)[1824]: 2015/01/28_09:29:23 INFO: Success
mach_down(default)[1685]: 2015/01/28_09:29:23 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down(default)[1685]: 2015/01/28_09:29:23 info: mach_down takeover complete for node node1.
Jan 28 09:29:23 node2 heartbeat: [1647]: info: mach_down takeover complete.
Jan 28 09:29:23 node2 heartbeat: [1647]: info: Initial resource acquisition complete (mach_down)
Jan 28 09:29:33 node2 heartbeat: [1647]: info: Local Resource acquisition completed. (none)
Jan 28 09:29:33 node2 heartbeat: [1647]: info: local resource transition completed.
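One thing that stands out in these two excerpts is that each node logs the other as dead and each then adds 172.31.29.243 to its own eth0. A small Python sketch for pulling those lines out of an ha-log is shown below. The function names are hypothetical, the sample lines are copied from the logs above, and the regular expressions assume exactly the line layout seen in this thread.

```python
import re

# Patterns assume the ha-log layout shown in this thread.
DEAD_RE = re.compile(r"(\S+) heartbeat: \[\d+\]: WARN: node (\S+): is dead")
ADDR_RE = re.compile(r"INFO: Adding inet address (\S+) with broadcast")


def summarize(log_text):
    """Return ([(reporting_node, dead_node), ...], [added_address, ...])."""
    return DEAD_RE.findall(log_text), ADDR_RE.findall(log_text)


def declared_each_other_dead(log_texts):
    """True if some pair of nodes each reported the other as dead."""
    pairs = set()
    for text in log_texts:
        pairs.update(DEAD_RE.findall(text))
    return any((dead, reporter) in pairs for reporter, dead in pairs)


NODE1_LOG = """\
Jan 28 09:50:05 node1 heartbeat: [2421]: WARN: node node2: is dead
IPaddr(IPaddr_172.31.29.243)[2737]: 2015/01/28_09:50:06 INFO: Adding inet address 172.31.29.243/20 with broadcast address 172.31.31.255 to device eth0
"""

NODE2_LOG = """\
Jan 28 09:29:23 node2 heartbeat: [1647]: WARN: node node1: is dead
IPaddr(IPaddr_172.31.29.243)[1838]: 2015/01/28_09:29:23 INFO: Adding inet address 172.31.29.243/20 with broadcast address 172.31.31.255 to device eth0
"""

if __name__ == "__main__":
    for name, text in (("node1", NODE1_LOG), ("node2", NODE2_LOG)):
        dead, addrs = summarize(text)
        print(name, "reported dead:", dead, "added addresses:", addrs)
    print("mutual 'is dead' reports:",
          declared_each_other_dead([NODE1_LOG, NODE2_LOG]))
```

This only summarizes what the logs already say; it does not by itself explain why the nodes fail to see each other's heartbeats. Note also that the timestamps in the two excerpts differ, so they may come from separate runs.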