The 3 Most Important Things in Text Mining: Duplication, Duplication, Duplication

The classic adage about the 3 most important things in real estate (location, location, location) takes on new life in investigation on the Internet. Without controls on duplication, syndication, feeds, spam, plagiarism, retweets, and so on, the problem of information overload grows explosively. A systematic investigation requires eliminating redundancy and duplication, and investigating curbstoning is no different. Take, for example, a recent analysis of Craigslist Autos “For sale by owner” listings. Craigslist does try, to some extent, to control ad spamming by preventing the same ad from being reposted within 24 hours, but for the most part people can get around that relatively easily. The goal is to separate the true curbstoners from the “eager beavers” who are trying to sell their car by posting 20 ads with the same information. The results are as follows:

Reducing Duplicate Ads

Using Harmari Reports can reduce the duplicates to be reviewed to less than 3%.

With Harmari Reports, what used to be 2,000 suspicious ads is now only 1,136, thanks to the proprietary de-duplication algorithms used in the Harmari Engine. Because the same set of postings is now covered by reviewing 1,136 ads by hand instead of 2,000, review throughput rises by a factor of 2,000 / 1,136 ≈ 1.76, which is a productivity increase of 76%.
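
The Harmari Engine's de-duplication algorithms are proprietary and are not described here, so the following is only a minimal sketch of one common, generic approach to collapsing near-identical ads: compare ads by the overlap of their word shingles (Jaccard similarity) and keep an ad only if it is not too similar to one already kept. The function names, the shingle size, the 0.6 threshold, and the sample ads are all assumptions made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text.

    The text is lowercased and punctuation is dropped, so cosmetic edits
    (extra dashes, exclamation marks, capitalization) do not matter.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(ads, threshold=0.6):
    """Greedy de-duplication: keep an ad only if its similarity to every
    previously kept ad is below the (hypothetical) threshold."""
    kept = []  # pairs of (ad text, shingle set)
    for ad in ads:
        s = shingles(ad)
        if all(jaccard(s, t) < threshold for _, t in kept):
            kept.append((ad, s))
    return [ad for ad, _ in kept]


# Hypothetical sample ads: the first two are the same car reposted with
# cosmetic changes; the third is a different vehicle from the same seller.
ads = [
    "2004 Honda Civic, 120k miles, clean title, call 555-0100",
    "2004 Honda Civic - 120k miles. Clean title! Call 555-0100",
    "2011 Ford F-150 pickup, low miles, one owner, call 555-0100",
]
unique_ads = deduplicate(ads)
for ad in unique_ads:
    print(ad)
```

In this sketch the reposted Civic ad collapses into a single entry while the F-150 ad survives, which mirrors the distinction drawn above: an “eager beaver” shrinks to one listing after de-duplication, whereas a seller with many genuinely different vehicles does not. The pairwise comparison is quadratic in the number of ads, so a larger collection would likely need an indexing technique such as MinHash; that is outside the scope of this illustration.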