Tuesday, March 29, 2011

Silverlight: Adding Google Streets

This post was moved to my real blog: Silverlight: Adding Google Streets

Monday, March 28, 2011

Linq2Sql: Changing the Database Schema at Runtime (without XMLs)

Our company has too many databases. Not only do we have a database server for development and another ~5 servers per version for QA/integration, we also have a production database server for each country we deal with.

I don’t really know why, but on our development server we have a database per country (which is valid) with different database schemas: BOB for our regular schema and BOB_JPN for Japan. In our DAL (Data Access Layer), the code that uses IWorkspace takes the DB Schema from an application config file, but we also have code that uses Linq2Sql (since it is faster), and that code has the DB Schema hard coded in the designer.cs file:

    [global::System.Data.Linq.Mapping.TableAttribute(Name="BOB.SOME_TABLE")]
    public partial class SOME_TABLE
    {

        private int _OBJECTID;

        private short _TypeId;

Until today we handled this by deleting the schema from the designer code, so that the query runs against the user's default schema, which in Japan is BOB_JPN. My team leader did this “simply” by editing the designer.cs file by hand.

 

I decided I can’t allow this to continue. I had four options:

  1. Change the DB Schema in Linq2Sql
  2. Write the entity code by hand, without the DB Schema
  3. Use some other technology (such as ADO.Net) instead of Linq2Sql
  4. Go my team leader's way and change the designer code

I of course preferred using Linq2Sql, simply because the code works and we are going to production soon.

My first Google search, “linq2sql config db schema”, was a bust.

My second search, “change linq mapping in runtime”, found these:

External Mapping Reference (LINQ to SQL) – using external XML files, nothing on runtime changes

LINQ to SQL - Tailoring the Mapping at Runtime – again too complicated; it was something like building the XML at runtime and tailoring it in…

On my third search I decided to think outside the box (actually I decided to run away from the box): “reflection change attribute at runtime”

Change Attribute's parameter at runtime

My first trial was with the code that was marked as not working (I hoped the bug was fixed since 2008):

    private void ChangeSchema()
    {
        // Nothing to do when the configured schema is already the hard-coded one.
        if (string.Equals(DefaultGisConfigSection.Instance.SchemaName, "BOB", StringComparison.OrdinalIgnoreCase))
            return;

        ChangeTableAttribute(typeof (STREET));
    }

    private void ChangeTableAttribute(Type table)
    {
        // Note: GetCustomAttributes builds new attribute instances on every call,
        // so changing this instance is not seen by later callers such as Linq2Sql.
        var tableAttributes = (TableAttribute[])
                              table.GetCustomAttributes(typeof (TableAttribute), false);
        tableAttributes[0].Name = DefaultGisConfigSection.Instance.SchemaName + "." + TableName;
    }

Didn’t work.

My second trial didn’t work either:

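    private void ChangeTableAttribute(Type table)
    {
        // Attributes added through TypeDescriptor are only visible to TypeDescriptor
        // lookups, not to Type.GetCustomAttributes.
        TypeDescriptor.AddAttributes(table,
                                     new TableAttribute
                                         {Name = DefaultGisConfigSection.Instance.SchemaName + "." + TableName});
    }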

But I think this time the problem is more my reflection code than anything else. The exception in both cases was:

Test method Shepherd.Core.Dal.Tests.SomeTest threw exception:
System.Data.SqlClient.SqlException: Invalid object name 'BOB.SOME_TABLE'.

at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning()
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlDataReader.ConsumeMetaData()
at System.Data.SqlClient.SqlDataReader.get_MetaData()
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
at System.Data.Common.DbCommand.ExecuteReader()
at System.Data.Linq.SqlClient.SqlProvider.Execute(Expression query, QueryInfo queryInfo, IObjectReaderFactory factory, Object[] parentArgs, Object[] userArgs, ICompiledSubQuery[] subQueries, Object lastResult)
at System.Data.Linq.SqlClient.SqlProvider.ExecuteAll(Expression query, QueryInfo[] queryInfos, IObjectReaderFactory factory, Object[] userArguments, ICompiledSubQuery[] subQueries)
at System.Data.Linq.SqlClient.SqlProvider.System.Data.Linq.Provider.IProvider.Execute(Expression query)
at System.Data.Linq.DataQuery`1.System.Collections.Generic.IEnumerable<T>.GetEnumerator()
at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
at System.Linq.Enumerable.ToList(IEnumerable`1 source)

For the last try before calling it quits I decided to see what Microsoft is doing behind the scenes, so I activated the Debug Symbols option (something I don’t encourage anyone to try: the last time I did this, with the ESRI symbols, it wouldn’t turn itself off for several days and the performance was bad!).

Going over the code I reached this method in C:\Projects\SymbolCache\src\source\.NET\4\DEVDIV_TFS\Dev10\Releases\RTMRel\ndp\fx\src\DLinq\Dlinq\Mapping\AttributedMetaModel.cs\1599186\AttributedMetaModel.cs:

    internal MetaTable GetTableNoLocks(Type rowType) {

It did:

    TableAttribute[] attrs = (TableAttribute[])root.GetCustomAttributes(typeof(TableAttribute), true);

And got the original table name – Now I know what must be changed!

But I can’t, since that is exactly what my first trial did. Going into the framework code here just got me into more trouble. At the end of a very long (~200 lines) method:

    [System.Security.SecurityCritical]  // auto-generated
    private unsafe static object[] GetCustomAttributes(
        RuntimeModule decoratedModule, int decoratedMetadataToken, int pcaCount,
        RuntimeType attributeFilterType, bool mustBeInheritable, IList derivedAttributes, bool isDecoratedTargetSecurityTransparent)

in C:\Projects\SymbolCache\src\source\.NET\4\DEVDIV_TFS\Dev10\Releases\RTMRel\ndp\clr\src\BCL\System\Reflection\CustomAttribute.cs\1305376\CustomAttribute.cs I found that the attribute is being built by:

 

Going into that got me this:

vs2010-step-into-error

in text:

---------------------------
Microsoft Visual Studio
---------------------------
File Load

Some bytes have been replaced with the Unicode substitution character while loading file C:\Projects\SymbolCache\src\source\.NET\4\DEVDIV_TFS\Dev10\Releases\RTMRel\ndp\fx\src\DLinq\Dlinq\Mapping\Attributes.cs\1305376\Attributes.cs with Unicode (UTF-8) encoding. Saving the file will not preserve the original file contents.
---------------------------
OK  
---------------------------

The result is an empty TableAttribute; the value for the attribute comes from an unsafe method:

    [System.Security.SecurityCritical]  // auto-generated
    [ResourceExposure(ResourceScope.None)]
    [MethodImplAttribute(MethodImplOptions.InternalCall)]
    private unsafe extern static void _GetPropertyOrFieldData(
        RuntimeModule pModule, byte** ppBlobStart, byte* pBlobEnd, out string name, out bool bIsProperty, out RuntimeType type, out object value);

I give up. I should have given up when Google failed me, but… I decided to post it as a question on Stack Overflow:

Modifying Class Attribute on Runtime

 

I next decided to change the attribute by inheritance:

    [TableAttribute(Name = "SOME_TABLE")]
    public class SomeTable : SOME_TABLE

That got me the Exception:

System.InvalidOperationException: Data member 'Int32 OBJECTID' of type 'Project.Dal.SOME_TABLE' is not part of the mapping for type 'SomeTable'. Is the member above the root of an inheritance hierarchy?

Basically that is not possible because of a limitation in Linq2Sql (damn!).

 

In the end I chose option 2, writing the code by hand. But I learned that going into .Net inner code is a trial of sanity: you either find what you are looking for and lose your sanity, or you give up…

 

//TODO: Missing some code in the middle, given the previous paragraph do I really want to look for it?!?

 

Keywords: Linq2Sql, reflection, DB Schema


Wednesday, March 16, 2011

File GeoDatabase: Getting the Workspace

I have created FileWorkspaceUtils, which inherits from WorkspaceUtils and adds the functions GetRows and GetFeatures that return the raw IRow and IFeature data. In WorkspaceUtils I preferred that the low-level programmer wouldn’t even know he has something called IRow or IFeature.

    public class FileWorkspaceUtils : WorkspaceUtils
    {
        public FileWorkspaceUtils(IFeatureWorkspace workspace) : base(workspace)
        {
        }

        public List<IRow> GetRows(string tableName)
        {
            var result = new List<IRow>();
            DoActionOnSelectRows(tableName, null, row => result.Add(row.Clone()));
            return result;
        }

        public List<IFeature> GetFeatures(string layerName)
        {
            var result = new List<IFeature>();
            DoActionOnSelectFeatures(layerName, null, feature => result.Add(feature.Clone()));
            return result;
        }
    }

//TODO: Post on the wonder of Extension Methods (row.Clone())

I have added code to WorkspaceProvider so that it will return the FileWorkspaceUtils (independent of File/Personal GeoDatabase):

    private const string PersonalGeoDatabaseFileExtension = ".MDB";
    private const string FileGeoDatabaseFileExtension = ".GDB";

    /// <summary>
    /// Get a File WorkspaceUtils for Personal and File GeoDatabase
    /// </summary>
    /// <param name="filePath"></param>
    /// <returns></returns>
    public FileWorkspaceUtils GetFileWorkspace(string filePath)
    {
        var extension = (Path.GetExtension(filePath) ?? String.Empty).ToUpper();
        if (extension.CompareTo(PersonalGeoDatabaseFileExtension) == 0)
            return CreatePersonalGeoDatabaseWorkspace(filePath);
        if (extension.CompareTo(FileGeoDatabaseFileExtension) == 0)
            return CreateFileGeoDatabaseWorkspace(filePath);

        throw new NotImplementedException("The only supported file types are mdb and gdb. Not: " + extension);
    }

    private FileWorkspaceUtils CreatePersonalGeoDatabaseWorkspace(string filePath)
    {
        AccessWorkspaceFactory workspaceFactory = new AccessWorkspaceFactoryClass();

        var workspace = workspaceFactory.OpenFromFile(filePath, 0);
        return new FileWorkspaceUtils((IFeatureWorkspace)workspace);
    }

    private FileWorkspaceUtils CreateFileGeoDatabaseWorkspace(string filePath)
    {
        FileGDBWorkspaceFactory workspaceFactory = new FileGDBWorkspaceFactoryClass();

        var workspace = workspaceFactory.OpenFromFile(filePath, 0);
        return new FileWorkspaceUtils((IFeatureWorkspace)workspace);
    }

The only problem is that it doesn’t work: my unit test that just checks GetFileWorkspace throws a COMException:

Test method CompanyName.GIS.Core.Esri.Tests.WorkspaceProviderTests.GetWorkspace_ValidPersonalGeoDB_GetFileWorkspaceUtils threw exception:
System.Runtime.InteropServices.COMException: Exception from HRESULT: 0x80040228
at ESRI.ArcGIS.DataSourcesGDB.AccessWorkspaceFactoryClass.OpenFromFile(String fileName, Int32 hWnd)
at Core.Esri.WorkspaceProvider.CreatePersonalGeoDatabaseWorkspace(String filePath) in WorkspaceProvider.cs: line 200
at Core.Esri.WorkspaceProvider.GetFileWorkspace(String filePath) in WorkspaceProvider.cs: line 189
at Core.Esri.Tests.WorkspaceProviderTests.GetWorkspace_ValidPersonalGeoDB_GetFileWorkspaceUtils() in WorkspaceProviderTests.cs: line 55

The problem was caused by licensing. I changed EsriInitilization to contain the old-style licensing as well (the one with IAoInitialize; the new stuff uses RuntimeManager). All my unit tests (427 tests) pass, so it works:

    public class EsriInitilization
    {
        private static bool _isStarted = false;

        public static void Start()
        {
            if (_isStarted)
                return;

            if (!Initialize(ProductCode.Server, esriLicenseProductCode.esriLicenseProductCodeArcServer))
            {
                if(!Initialize(ProductCode.Engine, esriLicenseProductCode.esriLicenseProductCodeEngineGeoDB))
                {
                    throw new ApplicationException(
                        "Unable to bind to ArcGIS license Server nor to Engine. Please check your licenses.");
                }
            }
            _isStarted = true;
        }

        private static bool Initialize(ProductCode product, esriLicenseProductCode esriLicenseProduct)
        {
            if (RuntimeManager.Bind(product))
            {
                IAoInitialize aoInit = new AoInitializeClass();
                aoInit.Initialize(esriLicenseProduct);
                return true;
            }
            return false;
        }
    }

That still threw an exception, this time simply because IFeature refused to be cloned, though it implements ESRI’s IClone interface. The error I got was:

//TODO: Write error and new code

//TODO: Post after writing about Extension Method (TODO above)

Resources:

Esri Forum: COM Exception 0x80040228 When Opening a Personal Geodatabase

 

Keywords: License, COM, exception, IWorkspace, engine, Server, ArcGis, ESRI, Unit tests, MDB, GDB

Saturday, March 12, 2011

Semantic Similarities

For the last year I have been working on my final project for my Master's degree in Computer Science. My college, the Academic College of Tel-Aviv-Yaffo, doesn’t require a thesis but uses a combination of a final test (covering the core subjects from both the Bachelor and the M.Sc. degrees) and a final project done with one of the Doctors/Professors in the college. My project, supervised by Professor Gideon Dror, is on semantic similarities, and I am nearly done: all that is left is to present my work in front of my professor and a faculty member.

I have decided to first present my work here and then actually do the presentation.

The project was done mostly in Python (which I had no knowledge of beforehand) and its first part was done as a possible contribution to the NLTK library.

The first part of the project was about implementing methods to find semantically similar words using input triplets of Context Relations. A Context Relation triplet is two words and their relation to each other, extracted from a sentence. It was shocking to find in the end that NLTK hasn’t implemented a way to extract Context Relations from a text (they have a few demos done by hand), and it seems that implementing this requires a knowledge of linguistics that I just don’t possess.

The second part of the project was to extract the Semantic Similarities of words from the web site Yahoo Answers. The idea is that with enough data extracted from different categories an algorithm can be used to determine the distance of the words.

 

On to the presentation:

Similar0

Similar1

For this discussion we will ignore the part “without being identical”. In this project identical is included in similar.

Similar2

Are Horse and Butterfly similar? The first response should of course be NO, but it depends: comparing horse and butterfly to house reveals that horse and butterfly are similar. It all depends on the context…

Similar3

Likewise comparing a horse to a zebra the response would be YES. But looking at a sentence such as:

The nomad has ridden the ____

and looking at horse, zebra and camel which is more similar in this context?

Similar4

This time the only similarity between these words is the way they are written and pronounced. Their context relation should be very dissimilar no matter the text. But imagine using a naive algorithm that only counts the number of occurrences of each word in a text: is it really inconceivable that these words would have a close number of occurrences?

Similar5

Humans use similarities to learn new things: a zebra is similar to a horse with stripes. Similarity is also used as a tool for our memory: when learning new names, it helps to associate the name with the person by using something similar. For example, to remember the name Olivia it could be useful to imagine that person with olive-like eyes.

In software, search engines use similar words to get better results. For example, a few days ago I searched for a driver chair and one of the top results was a video of a driver seat.

Possible future uses for similar words could be in AI software. There is a yearly contest named the Loebner Prize for “computer whose responses were indistinguishable from a human's”. If we could teach a computer a baseline of sentences and then advance it by using similar words (like the learning of humans) it could theoretically be “indistinguishable from a human's”.

Imagine having the AI memorize chats, simply by extracting chats in Facebook or Twitter. Then have the AI extend those sentences with similar words. For example, in a real chat:

- Have you fed the dog?

Could be extended to:

- Have you fed the snake?

(some people do have pet snakes… and I can imagine a judge trying to trip an AI with this kind of question…)

Similar6

A simple definition: if we have a sentence containing word, and we replace word with word’ and the sentence is still correct, then the words are Semantically Similar. From now on Similarity actually means Semantic Similarity.

Similar7

From the examples we can see that Similarity is all about Context, Algorithm and Text.

Similar8

As we could see in the examples the Context of the words makes a large difference whether or not two words are similar. Unlike Algorithm and Text, it has nothing to do with the implementation of finding the similarity.

Similar9

Some Algorithms use Context Relations to give value to the context the words are in. Extracting Context Relations from text is a very complicated task and has yet to be implemented in NLTK; the library does have a couple of examples that were created by hand.

Similar10

Here we look at all the words within a distance of 4 words from the word Horse. One of the Algorithms we will examine uses this as a simpler Context aspect.
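The idea of collecting the words around Horse can be sketched in a few lines of Python. This is only a minimal illustration, not the project code: the function name `context_window`, the `radius` parameter and the sample sentence are all made up for the example.

```python
def context_window(tokens, target, radius=4):
    """Collect every word within `radius` positions of each occurrence of `target`."""
    neighbours = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - radius)
        # Words before and after the target, excluding the target itself.
        neighbours.extend(tokens[start:i])
        neighbours.extend(tokens[i + 1:i + 1 + radius])
    return neighbours


# A made-up sentence, split on whitespace (a real tokenizer would do better).
sentence = "the old nomad has ridden the brown horse across the wide desert at night"
tokens = sentence.split()
window = context_window(tokens, "horse", radius=4)
```

Here `window` holds the four words before and the four words after "horse"; counting those neighbours per word is what gives the word its context.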

Similar11

Another form of Context extraction is separating the text based on category. Then each category adds a different Similarity value and those can be added together.

Similar12

Algorithms that ignore the Context of the word are less accurate than those that use it, but those that use it are also more complex. The complexity comes either from using Context Relations (with their complex extraction) or from using a words radius, which just means individual work for each word.

All the Algorithms use some form of counting mechanism to determine the Similarity/Distance between the words.

Similar13

Depending on the Algorithm a different scoring is done for each word. Then the Algorithm determines how to convert that score into the Distance between the words, which just means calculating the Similarity.

Similar14

Text is a bit misplaced here because it is a part of the Context and is used inside the Algorithms. Choosing the right text therefore is as essential a part as choosing the right Algorithm.

But imagine a text that contains only the words:

This is my life. This is my life…

All the practical Algorithms shown here will tell you that “this” and “life” are Similar words – based on this text alone.

Similar15

In my second implementation of Similarity Algorithms I used extracted text from several categories of Yahoo Answers. Yahoo Answers is a question+answer repository that contains thousands of questions and answers. For my Algorithms I had to extract 2GB of data from the site (just so I had enough starting data).

Similar16

The Algorithms can be separated into two groups: those that use Context Relation (and therefore until an extractor for Context Relation is implemented are purely theoretical), and those that use Category Vector as a form of Context for the words.

Similar17

All the Context Relation Algorithms use these two inner classes: Weight and Measure. Weight is the inner class that gives a score to a Context Relation; the Weight is important since a Context Relation that appears only once in a text should not have the same score as one that appears ten times. The Measure inner class calculates the distance between two words using the Weight inner class. Using only these classes the user can be given a Similarity value for two words.

The Algorithms in this section implement near-neighbor searches. We use them to find the K most similar words in the text, not just how similar two words are.

Similar18

Taken from James R. Curran (2004)-From Distributional to Semantic Similarity

In my theoretical work I implemented some of the Weight and Measure inner classes from James R. Curran’s paper From Distributional to Semantic Similarity.

Similar19

I am not going to go into lengthy discussion on how they work because the paper discusses all of this.

I am going to say that the Similarities turn out different for each combination of Weight X Measure and that it is fairly easy to set a combination up or to implement a new Weight/Measure class.
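To give a feel for how a Weight and a Measure plug together, here is a minimal sketch. It is not the code from my project: the class names `FrequencyWeight` and `JaccardMeasure`, the relation labels and the counts are all invented for the example, and the weighted Jaccard overlap is just one simple choice of Measure.

```python
from collections import Counter


class FrequencyWeight:
    """Hypothetical Weight: scores a context relation by how often it was seen."""

    def __call__(self, count):
        return float(count)


class JaccardMeasure:
    """Hypothetical Measure: weighted Jaccard overlap of two words' contexts."""

    def __init__(self, weight):
        self.weight = weight

    def similarity(self, contexts_a, contexts_b):
        shared_min = 0.0
        total_max = 0.0
        for context in set(contexts_a) | set(contexts_b):
            a = self.weight(contexts_a.get(context, 0))
            b = self.weight(contexts_b.get(context, 0))
            shared_min += min(a, b)
            total_max += max(a, b)
        return shared_min / total_max if total_max else 0.0


# Context relations stored as (relation, other word) -> count; counts are made up.
horse = Counter({("object-of", "ride"): 10, ("object-of", "feed"): 2})
zebra = Counter({("object-of", "ride"): 1, ("object-of", "feed"): 2,
                 ("object-of", "photograph"): 3})

measure = JaccardMeasure(FrequencyWeight())
score = measure.similarity(horse, zebra)
```

Swapping in a different Weight (or a different Measure) only means passing another object to the constructor, which is why each combination can give a different Similarity.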

Similar20

The classes I chose to implement are taken from Scaling Distributional Similarity to Large Corpora, James Gorman and James R. Curran (2006). These classes are used to find the K most similar words to a given word.

Similar21

The simplest algorithm is a brute force one. First we calculate the Distance Matrix between our word and all the rest of the words in the text and then we search for the K most Similar words.

The disadvantage of this algorithm is that the calculations done to find the K nearest words for “hello” can’t be reused for the word “goodbye” (actually only one calculation can be reused here, the one between “hello” and “goodbye”).
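The brute force search can be sketched like this. It is a minimal illustration under my own assumptions, not the implementation from the paper: `overlap_similarity` is a stand-in for a real Weight/Measure pair, and the words and their context sets are made up.

```python
import heapq


def overlap_similarity(contexts_a, contexts_b):
    """Stand-in similarity: shared contexts divided by all contexts seen."""
    union = contexts_a | contexts_b
    return len(contexts_a & contexts_b) / len(union) if union else 0.0


def k_most_similar(word, contexts_by_word, k, similarity=overlap_similarity):
    """Brute force: score `word` against every other word, then keep the top k."""
    scores = [
        (similarity(contexts_by_word[word], contexts), other)
        for other, contexts in contexts_by_word.items()
        if other != word
    ]
    # nlargest compares the (score, word) tuples, so the highest scores come first.
    return [other for _, other in heapq.nlargest(k, scores)]


# Made-up context sets for four words.
contexts_by_word = {
    "horse": {"ride", "feed", "saddle"},
    "zebra": {"feed", "stripe"},
    "camel": {"ride", "feed", "desert"},
    "house": {"build", "paint"},
}
nearest = k_most_similar("horse", contexts_by_word, k=2)
```

Every call scores the query word against the whole vocabulary again, which is exactly the cost that the more complex implementations try to avoid.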

I am not going to go into the other implementations here since they are more complex. I might write another post in the future about those algorithms.

If you are interested, the Python implementation can be found here (or you can just read Scaling Distributional Similarity to Large Corpora).

Similar22

There are two practical Algorithms that I have implemented.

Similar23

This simple algorithm is very fast and can be preprocessed for even faster performance. By simply saving the count of each word per category, the Algorithm can be made as fast as reading the preprocessed file. On small examples of just 50MB of data the Algorithm took only a few seconds to produce a result. Using the full 2GB of data it takes ~10 minutes to produce a result for ~350 word pairs. Because of the large amount of data, though, it must be opened in chunks (a chunk per category) or an Out of Memory Exception is thrown.
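A rough sketch of the count-per-category idea follows. It is an illustration only: the function names, the tiny sample texts and the use of cosine similarity to compare the two vectors are my assumptions for the example, not necessarily the scoring used in the project.

```python
import math
import re
from collections import Counter


def count_words_per_category(texts_by_category):
    """Preprocessing step: one word-count table per category."""
    return {
        category: Counter(re.findall(r"[a-z']+", text.lower()))
        for category, text in texts_by_category.items()
    }


def category_vector(word, counts_by_category):
    """The word's count in every category, in a fixed (sorted) category order."""
    return [counts_by_category[c][word] for c in sorted(counts_by_category)]


def cosine_similarity(vector_a, vector_b):
    """Assumed scoring: cosine of the angle between two category vectors."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up miniature "categories" standing in for the extracted data.
texts = {
    "food": "bread and butter, bread with butter and jam",
    "sports": "football match, football fans",
    "pets": "butter the cat",
}
counts = count_words_per_category(texts)
bread = category_vector("bread", counts)
butter = category_vector("butter", counts)
football = category_vector("football", counts)
similarity = cosine_similarity(bread, butter)
```

Once `counts` is saved, comparing any pair of words only needs two lookups per category, which is why the preprocessed version is so fast.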

Similar24

The end of the Algorithm is identical to the first Algorithm, but the Words Radius Algorithm clearly has more vectors. Preprocessing this data is both time consuming (it takes ~5 days) and space consuming (from 2GB to 15GB). Just preprocessing the data caused at least 10 Out of Memory Exceptions (Python’s automatic garbage collection did not free the memory soon enough, so after every category I had to call gc.collect() manually).

The calculation time for ~350 word pairs was ~25 hours, which of course can’t be used in real-time AI conversations. With the preprocessing, though, it hardly matters whether there are 350 or 35k words to compare; it will take approximately the same time. For example, on three categories of ~120MB, ~350 pairs take ~56 minutes but 3 pairs still take ~30 minutes.
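The chunk-per-category processing with a manual collection after each chunk can be sketched as below. The function name `process_by_category` and the toy loader and handler are placeholders I made up for the example; only `gc.collect()` is the real standard-library call.

```python
import gc


def process_by_category(category_names, load_category, handle_category):
    """Load one category at a time so only one chunk is in memory at once."""
    for name in category_names:
        data = load_category(name)   # e.g. read one preprocessed file
        handle_category(name, data)
        del data                     # drop the last reference to the chunk
        gc.collect()                 # also reclaim any reference cycles right away


# Toy stand-ins for the real loader and the per-category work.
totals = {}
fake_store = {"pets": ["dog", "cat", "dog"], "sports": ["football"]}
process_by_category(
    fake_store,
    load_category=lambda name: list(fake_store[name]),
    handle_category=lambda name, words: totals.update({name: len(words)}),
)
```

The point of the design is that nothing from the previous category is still referenced when the next one is loaded, so the peak memory is roughly the size of the largest category.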

Similar25

It’s important to note that both Algorithms give close results; for example bread,butter has a Similarity of 0.56, which is pretty high.

Similar26

As can be seen, Basic almost always gives a greater result than Words Radius. Basic also has some weird results: for example, Maradona is more Similar to football than soccer is, even though many places (not the USA) use the two as synonyms, whereas Words Radius seems to think soccer is more similar to football.

Since Words Radius actually uses a form of Context Relation (though not a very lexical one) it is considerably more accurate.

Similar27

Remember I claimed it was all about the text? Well, these results were computed with just a few categories and suddenly Arafat is similar to Jackson. How weird is that?

Another difference is the calculation time: the Simple Algorithm takes ~17 seconds whereas the Words Radius Algorithm takes ~56 minutes.

BTW remember night and knight? Well, the simple algorithm returned a Similarity of 0.79 for those 3 categories… and Words Radius returned a Similarity of 0.48.

Similar28

So do you have any questions? Suggestions? Too long? Too short?

Tell me what you think…

 

Keywords: similarity, NLTK, search, AI
