I have been an avid reader of Hacker News for a long time, and I strongly feel it should be on the reading list of every serious developer out there. As a voracious consumer of the site, I think a few features could make Hacker News even more awesome. I have seen the value of small but useful features like dead-video-link removal on other sites such as Quora and YouTube in the past, and have tried my hand at hacking up simple solutions for them. One feature Hacker News needs, in my opinion, is duplicate detection for submitted posts. Let’s delve deeper into it.
Hacker News is a crowd-driven community where users submit relevant links and others vote on them. This organic voting system automatically surfaces the best content to the front page of Hacker News. Unfortunately, Hacker News doesn’t seem to have a quality check at the entry point, when a user submits content to the site. Different users often submit slightly different versions, URLs, or titles of the same logical article, and these show up as duplicate posts on Hacker News. This leads to the following issues:
- Duplicate content is, in general, a bad user experience. Users want to see unique, interesting content on Hacker News.
- Similar versions of the same article often end up splitting the community’s votes. The main article can lose votes this way and may never surface on the front page of Hacker News.
The process of finding duplicates in a collection of records is called deduplication. I quickly hacked up a simple script to see whether this issue can be easily solved on Hacker News. The script can be found here, and sample outputs for 3 days are here.
I followed a very simple approach for de-duplicating Hacker News, with immediate positive results. Here is the methodology:
- Crawl the top 25 pages of Hacker News to get the URL and title of every article. Each page has roughly 30 articles, so the script ended up extracting around 750 URLs per run.
- Compute the pairwise Jaccard similarity of all articles and output the pairs whose title similarity is greater than 0.5.
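The pairwise comparison step above can be sketched in a few lines of Python. This is a minimal version, not the script itself: it assumes titles are tokenized into lowercase word sets (no stemming or stopword removal yet), and uses the 0.5 threshold mentioned above.

```python
def tokenize(title):
    """Turn a title into a set of lowercase word tokens."""
    return set(title.lower().split())

def jaccard(a, b):
    """Jaccard index of two token sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_duplicates(titles, threshold=0.5):
    """Return index pairs of titles whose similarity exceeds the threshold."""
    tokens = [tokenize(t) for t in titles]
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if jaccard(tokens[i], tokens[j]) > threshold:
                pairs.append((i, j))
    return pairs
```

With ~750 titles the O(n²) loop is only ~280,000 comparisons, so brute force is perfectly fine here.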
Here is what I deduced from the duplicate results:
- On average, between 2-5% of the articles were duplicates each day.
- Most duplicate postings concern some important event in the community that day, such as an acquisition. For example, Cisco acquiring OpenDNS was posted thrice (1, 2, 3), and the Greece crisis was submitted multiple times too.
- Many of these duplicate posts can be spotted with simple URL normalization, e.g. https vs. http. Others can be found with data pre-processing like stopword removal and stemming, followed by a string similarity measure such as the Jaccard index.
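The URL normalization mentioned above can be sketched with the standard library. This is an illustrative version, not the post's actual script: it drops the scheme (so https vs. http variants compare equal), lowercases the host, strips a leading "www.", and removes trailing slashes.

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Canonicalize a URL so trivially different variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]          # treat www.example.com and example.com as one
    path = parts.path.rstrip("/")  # ignore trailing-slash differences
    return host + path           # scheme deliberately dropped: http == https
```

A real deduplicator might also strip tracking query parameters (utm_source and friends), but the simple version already catches the http/https and www duplicates described above.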
Ideally, whenever a user submits a new post, a simple de-duplication check would run, and if a duplicate is found, the user would be prompted with a “Did you mean” feature of the kind Stack Overflow and Quora have. If the user chooses one of the shown results, their vote should be used to increase the score of the main article. If the user thinks their article is different, they should still be allowed to submit it.
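A submission-time check of this kind could look like the hypothetical sketch below: compare the new title against existing ones and return likely duplicates to show in the “Did you mean” prompt. The function name and threshold are illustrative, not from the actual script.

```python
def likely_duplicates(new_title, existing_titles, threshold=0.5):
    """Return existing titles similar enough to prompt 'Did you mean ...?'."""
    new_tokens = set(new_title.lower().split())
    hits = []
    for title in existing_titles:
        tokens = set(title.lower().split())
        union = new_tokens | tokens
        # Jaccard similarity between the new title and each existing one
        if union and len(new_tokens & tokens) / len(union) > threshold:
            hits.append(title)
    return hits
```

If the list is non-empty, the site shows the candidates; if the user picks one, their submission becomes an upvote on it instead of a new post.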
I believe the duplicate-content problem on Hacker News can be solved with little effort. It would be awesome if this feature could be implemented on our beloved site in the future.
The post De-duplicating Hacker News appeared first on Everyone has a story to tell !!.