Bug 115066

Summary: Multi-JVM caching for CMS does not work
Product: [Retired] Red Hat Enterprise CMS Reporter: Jon Orris <jorris>
Component: otherAssignee: Vadim Nasardinov <vnasardinov>
Status: CLOSED WONTFIX QA Contact: Jon Orris <jorris>
Severity: medium Docs Contact:
Priority: medium    
Version: nightlyCC: vnasardinov
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-09-05 17:33:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jon Orris 2004-02-06 00:12:41 UTC
Description of problem:
@40086

Multi-JVM caching for CMS does not work. 

Client code of CacheTable is violating it's fundamental assumption.

This will cause every page hit to CMS content sections, public sites,
etc, on Node 1 to invalidate the caches of Node 2...Node N, every time.

In addition, this fundamental assumption seems highly questionalbe.

1) Client code is violating fundamental assumption:
Vadim recently added comments to CacheTable.put detailing this assumption:
  "As a performance optimization, we try to be clever and prevent
valid objects from being flushed out unnecessarily from peer nodes in
the above scenario.  This is accomplished by including in the "remove"
message the hash code of the object that may need removal from peer
nodes.  If the hash code sent by node X matches the hash code of the
object mapped to the same key on node Y, then the cache entry 
maintained by Y need not be updated."

The important bit:
   "Note that for this method to work, any cached object must hash to
the same value regardless of the node on which the hash code is computed."

Entries into the table violate this assumption:
SiteNode - SiteNode, indirectly, overrides hashCode(). The override
defaults to 
DomainObject, which defaults to DataObjectImpl, which defaults to OID.
OID's hash code is unlikely to remain the same across JVMS, and, in
fact, does not.

    public int hashCode() {
        // here we rely on the values collection's hashcode method
        // to base its hashcode on the hashcodes of the contained values.
        return (m_type.getBasetype().hashCode() + m_values.hashCode());
    }

BaseDispatcher.ApplicationSpec - What is cached for the content
section admin pages.
This does not override hashCode().

2) Causes constant cache invalidation:
  The invalidation logic for DynamicList's removal of outdated entries
is (abbreviated):
  public void removeIfOutdated(String key, int hashCode) {
    // ...
    TimeStamped ts = getElementAt(node);
    if (ts.getValue().hashCode() != hashCode) {
      log(m_cacheID, "OUTDT", key);
      removeElementAt(node);

Because the hash codes are not the same across JVMS, the DynamicList
will invalidate an entry _every time_ it gets a removeIfOutdated
message from the CacheServlet. I've attached an example at the end of
this ticket.

Fixing the hash codes fixes this problem.


3) Questionable fundamental assumption
  Aside from wanting cross-jvm hash codes, I don't understand why hash
codes are used at all. 
  The logic of cache invalidation here seems to be the following:

    Node 1: See if I have Foobar in my cache.
    Node 1: Nope. Get it from the database & put in my cache.
    Node 1: Send a message to Nodes 2...N about my entry, since I have
the most up to date version.
    Node 2: Node 1 tells me it has Foobar version 7. Do I have Foobar,
and is it outdated?
    Node 2: Yep, mine's older. Better invalidate that entry, or..
    Node 2: Got the same version, or don't have it at all. Do nothing.

  This suggests a test for identity, or really state, between cached
objects. Hash code is not,  and _should never be_, a function of 
mutable state. Hash code should always be computed against immutable
variables. Thus, I don't understand why hash code is used at all.

Example of constant invalidation, by refreshing the content section
admin page.

Machine 1
caching.DynamicList - SiteNodeCache:  PUT  /content/
caching.CacheTable - Put key /content/ in cache called at
com.arsdigita.web.PathMapCache.get(PathMapCache.java:95)
caching.CacheTable - Object is of type class
com.arsdigita.kernel.SiteNode with hash code: 22759851


Machine 2
web.PathMapCache - normalizedPath=/content/
caching.DynamicList - BaseDispatcherCache:  MISS /content/
web.PathMapCache - cache miss for: /content/
web.PathMapCache - db hit for /content/
caching.DynamicList - BaseDispatcherCache:  PUT  /content/
caching.CacheTable - Put key /content/ in cache called at
com.arsdigita.web.PathMapCache.get(PathMapCache.java:95)
caching.CacheTable - Object is of type class
com.arsdigita.web.BaseDispatcher$ApplicationSpec with hash code: 3872665
caching.CacheServlet - notifying peers


Machine 1
caching.CacheServlet - Got remove request from goodeats.boston.redhat.com
caching.CacheServlet - Removing /content/ from cache BaseDispatcherCache
caching.DynamicList - BaseDispatcherCache: OUTDT /content/
caching.CacheTable - Removed  entry with key /content/ hash 3872665

Comment 1 Vadim Nasardinov 2004-02-06 13:31:58 UTC
> This suggests a test for ... state ... between cached objects. Hash
> code is not, and _should never be_, a function of mutable
> state. Hash code should always be computed against immutable
> variables.

"hashCode" should be computed against the same variables that "equals"
is computed against, in order to ensure that the implementations of
"hashCode" and "equals" are consistent with each other, as required by
these methods' contracts.

So, the question becomes, should "equals" be computed against mutable
variables?  More often than not, the answer seems to be a resounding
yes.

|$ cat Main.java 
|import java.awt.Point;
|
|public class Main {
|    public final static void main(String[] args) {
|        Point p1 = new Point(0, 0);
|        Point p2 = new Point(1, 1);
|        log("p1", p1);
|        log("p2", p2);
|        log (p1.equals(p2) ? "p1 equals p2" : "p1 does not equal p2");
|        log("Mutating p2");
|        p2.setLocation(0, 0);
|        log("p2", p2);
|        log (p1.equals(p2) ? "p1 equals p2" : "p1 does not equal p2");
|    }
|
|    private static void log(String name, Point p) {
|        log("Point " + name + ": (x,y)=(" + p.x + "," +
|            p.y + "); hash=" + p.hashCode());
|                          
|    }
|
|    private static void log(String msg) {
|        System.out.println(msg);
|    }
|}
|$ javac Main.java
|$ java -cp . Main
|Point p1: (x,y)=(0,0); hash=0
|Point p2: (x,y)=(1,1); hash=-2116026368
|p1 does not equal p2
|Mutating p2
|Point p2: (x,y)=(0,0); hash=0
|p1 equals p2


So, using hash codes the way CacheTable does it is not entirely
brain-dead.  If two objects are equal, their hash codes are equal
(within the same JVM within the overlap between lifetimes of the
respective objects).  Restated another way, if two objects have
non-matching hash codes, then we know for a fact that they are not
equal (within the same JVM).

So far, so good.  Now, if you drop the caveat about the JVM being the
same, you arrive at the rationale for CacheTable's current behavior.
The question is, _can_ you drop the caveat?  If you run identical JVM
versions across nodes, then you can, for a carefully selected subset
of objects.  I'm 99.9% sure the above reasoning holds true for Strings.



Comment 2 Vadim Nasardinov 2004-02-06 13:51:48 UTC
Xref to a related earlier ticket: bug 111560.

Comment 3 Jon Orris 2004-02-06 16:05:03 UTC
I both agree and disagree with Vadim's comment :)

My original statement was, in a sense, unrelated to the problem at
hand. On reflection, using hashCode as the cache value makes some
sense after all, given the relation between the equals() and
hashCode() contracts. This assumes, of course, that both are computed
against a meaningful set of state parameters. (more on this in comment
4 ).

The part that I disagree on, in a sense, is:
>So, the question becomes, should "equals" be computed against mutable 
> variables? More often than not, the answer seems to be a 
> resounding yes.

On one hand, yes, if the state parameters change, then equals should
obviously change, as two objects will no longer be equal. Due to the
contract of hashCode, it should change as well. This causes a problem
, however. If the object is a key in a HashMap, changing it's hashCode
will likely result in evil. 

This is why, among other reasons, a key Java idiom is to have
everything be immutable unless it is absolutely neccessary for it to
mutate state. In the example above, a Point class should be an
immutable object, and a new one constructed instead of changing the
existing object. 

This is really more a problem with Java, wherein important contracts
conflict, and the only way they are maintained is by programmer
discipline, not the language.


Comment 4 Jon Orris 2004-02-06 16:05:35 UTC
The question, then, is what hashCode should we be using? For SiteNode,
what is the meaning of it's state? Why is it being cached at all? What
attributes of SiteNode should change to cause it to be flused from the
cache?


Comment 5 Jon Orris 2004-02-06 16:10:31 UTC
>If you run identical JVM versions across nodes, then you can, for a 
>carefully selected subset of objects.  I'm 99.9% sure the above 
>reasoning holds true for Strings.

This is, I believe, true. In sun 1.4.2, at least, String.hashCode() is
a deterministic function of the characters in the String. I tested a
'solution' to this problem by having SiteNode.hashCode() call
getOID().toString().hashCode(), and similarly for ApplicationSpec.
This eliminated the 'constant flush' problem.

This isn't a full solution, at least for SiteNode, as OID represents
identity, not state. ApplicationSpec.toString(), though, should
represent state.

Comment 6 Vadim Nasardinov 2004-02-06 17:34:15 UTC
> This isn't a full solution, at least for SiteNode, as OID represents 
> identity, not state. 
 
I think it's fine for this to be a partial solution.  As I pointed out 
in bug 111560, the main reason this thing kinda works is because it 
implements an LRU expulsion policy, which is independent of this whole 
hash-code hokeyness issue. 
 
I don't see a quick way to fix the flakiness of CacheTable. 

Comment 7 Vadim Nasardinov 2004-02-10 14:02:12 UTC
To quote Graydon Hoare
http://post-office.corp.redhat.com/archives/java-project/2004-February/msg00027.html
Message-ID: <4028217D.4080701>

> overly-distributed programs usually lead to application programmers
> trying to solve, in amongst their application code, research-grade
> "hard" networking problems


Comment 8 Richard Li 2004-02-11 19:48:24 UTC
My conclusion:

1. The general problem is too hard to solve in this timeframe, so we
won't do it.
2. Changing hashcode in SiteNode seems risky given the fact that
SiteNodes might be hashed elsewhere. Moreover, I don't think the
performance gain will be significant (and certainly doesn't exist on
single JVM systems).
3. Since ApplicationSpec is a new class and isn't used anywhere else,
we can go ahead and change hashCode on ApplicationSpec only.

So, we're only going to change ApplicationSpec, unless anyone has
strenuous objections...

Comment 9 Daniel Berrangé 2004-02-11 19:59:36 UTC
Reading through the comments I gather that the problem with
SiteNode.hashCode() is that it varies across machines. Since we don't
want to change the hashCode method on SiteNode, the obvious solution
is to simply not store SiteNodes in the cache table directly. Create a
then wrapper class that generates a hashcode based on the URL of the
site node rather than its OID. Since URL is a string this should be
stable acrosss JVMs.

class SiteNodeWrapper {
    int m_hashcode;
    SiteNode m_node;

    SiteNodeWrapper(SiteNode node) {
       m_node  =node;
       m_hashcode = node.getURL().hashcode();
     }

    int hashcode() {
       return m_hashcode;
    }
    SiteNode getNode() {
        return m_node;
    }
}

This should enable us to fix richard's point 2. fairly simply, no ?

Comment 10 Vadim Nasardinov 2004-02-11 20:08:42 UTC
If you implement hashCode() for SiteNodeWrapper, you also want
to implement equals().  Otherwise, you'll pull half of your hair
out trying to debug why you can't find entries that you know
for a fact you put into CacheTable.

Other than that, it looks like it'll work.


Comment 11 Richard Li 2004-02-11 20:09:51 UTC
Do we think it's worth implementing to reduce the number of cache
flushes in multi-JVM situations?

Comment 12 Jon Orris 2004-02-11 21:12:59 UTC
I think so. Remember that as it stands, the cache flushes will occur
for every page hit, propagating to every node. This not only defeats
caching altogether, but increases network traffic.

Fixing SiteNode.hashCode() itself shouldn't be a problem, unless
values are stored in the database. In that case, we'd have a much
bigger MJVM problem which would require us to fix it anyhow.


Comment 13 Vadim Nasardinov 2004-02-11 21:16:12 UTC
I'm not sure I understand the part about cache flushes occurring
"for every page hit."



Comment 14 Jon Orris 2004-02-11 21:21:58 UTC
To clariify, this is 'every page hit on a given node'.
What's happening in my original ticket:

1) Go to content section in Node 1
2) Node 1 in web.PathMapCache fetches a site node & puts it in its cache.
3) Node 1 sends 'invalidate' messages to every other node.
4) Node 2-N, if it has the SiteNode, has a different hash code, and
flushes it.
5) Goto content section in Node P, and the same 2-4 sequence occurs again.



Comment 15 Vadim Nasardinov 2004-02-11 21:30:35 UTC
Ah, ok. So, under heavy load, we defeat our own caching efforts like
so (I removed the term "site node" from your description, because I
have trouble telling it apart from the notion of "node in a cluster"):

    1) Go to content section in Node 1
    2) Node 1 in web.PathMapCache fetches the content section and
       caches it.
    3) Node 1 sends an 'invalidate' message to Node 2.
    4) Node 2 evicts the content section, if it's been cached already.
    5) Go to content section in Node 2.  Since the content section has
       been evicted in step 4, Node 2 refetches and caches it.
    6) Node 2 sends "invalidate" to Node 1.  Lather, rinse, repeat.

Is that correct?


Comment 16 Jon Orris 2004-02-11 22:25:06 UTC
yes, that's it.

Comment 17 Vadim Nasardinov 2004-02-16 20:56:33 UTC
Fixed on the trunk in changes 40430 and 40442.

Comment 18 David Lawrence 2006-07-18 03:27:29 UTC
QA_READY has been deprecated in favor of ON_QA. Please use ON_QA in the future.
Moving to ON_QA.

Comment 19 Jon Orris 2006-09-05 17:33:42 UTC
Closing old tickets