Monday, January 23, 2012

EF EntityObject Value Hashing

Another "problem" I've run into recently: how do you efficiently compare entity objects to determine whether changes have been made, when the objects are based on views constructed from multiple external tables, without hashing the entire object and its values? This one took a bit of digging and testing. Hashing each row was my solution.

I started off trying to just serialize the EntityObject to XML or binary. XML was an issue due to sanitization of the data, and it was memory heavy. Binary didn't have those issues, but the serialized values could grow quite large, so comparing them was inefficient. Writing the initial hash of 200,000 rows took only 15 minutes, but comparing the binary between two sets (identical, in fact) took an hour.

I ended up using a custom GetValueHashCode method and an associated Hash field in a partial class of the view object. While this takes considerably longer to code and maintain, the payoffs are immense and it's more than feasible. I'll let you know how it pans out in a month or so...

Watch your arithmetic overflow! I replaced 17/23 with smaller primes, but if you're storing the hashes in a database, check your data types. I'd also suggest using a HashSet when comparing hashes (StackOverflow, MSDN).
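
To make the HashSet suggestion concrete, here is a minimal sketch of how the comparison could look. The HashComparison class, the FindChanged method, and the sample values are made up for illustration; they are not part of my actual import code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashComparison
{
    // Hypothetical helper: returns the hashes found in `current` that are
    // missing from `stored`, i.e. rows that are new or whose values changed.
    public static List<int> FindChanged(IEnumerable<int> stored, IEnumerable<int> current)
    {
        // HashSet<int>.Contains is O(1) on average, so this is a single pass
        // over `current` rather than comparing every row against every row.
        var known = new HashSet<int>(stored);
        return current.Where(h => !known.Contains(h)).ToList();
    }
}

// Example with made-up hashes:
//   FindChanged(new[] { 101, 202, 303 }, new[] { 101, 999, 303 })
// returns a list containing only 999.
```

Keep in mind that two different rows can end up with the same int hash, so a match only means "probably unchanged". Pairing each hash with the row's key would be one way to tighten this up.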

public partial class myview
{
  private int _hash;

  // Hash is always recomputed from the current field values;
  // the value stored by the setter is never read back.
  public int Hash
  {
    get { return this.GetValueHashCode(); }
    set { this._hash = value; }
  }

  // Combines the hash codes of the fields that make up the row's value.
  public int GetValueHashCode()
  {
    var hash = 2;

    unchecked // Overflow is fine, just wrap
    {
      if (this.field1 != null)
        hash = hash * 3 + this.field1.GetHashCode();
      if (this.field2 != null)
        hash = hash * 3 + this.field2.GetHashCode();
      if (this.field3 != null)
        hash = hash * 3 + this.field3.GetHashCode();
    }

    return hash;
  }
}

Source again, StackOverflow, full of peeps smarter than me :P
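
If you're wondering what the unchecked block above actually buys you, here is a small sketch. OverflowDemo, Combine, and CombineChecked are names I made up for the demo; only the multiply-and-add step comes from the code above.

```csharp
using System;

public static class OverflowDemo
{
    // The same multiply-and-add step as above, applied to plain ints.
    public static int Combine(int hash, int fieldHash)
    {
        unchecked // wrap around silently instead of throwing
        {
            return hash * 3 + fieldHash;
        }
    }

    // The same arithmetic in a checked context, for comparison.
    public static int CombineChecked(int hash, int fieldHash)
    {
        checked // throw OverflowException if the result doesn't fit in an int
        {
            return hash * 3 + fieldHash;
        }
    }
}

// Combine(int.MaxValue, 0) wraps to 2147483645, and does so on every call.
// CombineChecked(int.MaxValue, 0) throws OverflowException.
```

The wrapped result is still a 32-bit int and the same inputs always produce the same value, which is all a hash needs. A 32-bit integer column should be enough to store it, but check your own schema.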

7 comments:

  1. I noticed you are not overriding the System.Object version of GetHashCode in your code.

    Is there a reason for this? It appears even in the StackOverflow version they are overriding that method.

    I guess it really does not matter, but it seems like if there is already a function for this purpose and it's marked as virtual, then changing the behavior by overriding it would be a better way to go. That way users of your class who may have only received a binary would not accidentally call the wrong one. I know in this case it's a fairly small class, but it's good to get into the habit of thinking about what the outside world sees.

    Just my 2 cents.

  2. Thanks for the feedback, man! I opted not to override the existing GetHashCode as I wasn't sure if the override was "type" specific. Makes perfect sense now that I think about it. I'll update the post, and my code.

  3. OK, now I remember why I didn't override. I have a particular view that is made up of about 50 columns, which generates a hash code larger than all numeric types except BigInteger. I can't override if the return type isn't int.

  4. But isn't that the point of having the unchecked keyword? I think it would be OK to "overflow" the int; the main thing is that the hash code you generate is the same number when the object is the same.

  5. Thanks again Ryan, really appreciate the feedback...

  6. With the code above and some serious experimentation, I was able to construct a change-sync process for a typical 3-hour full import that can determine whether any data has changed in less than 5 minutes, and process the changes in a fraction of the full import time.
