
Machine Learning: The Crawler

Building the crawler was the easiest part of this project.

All this crawler does is take a seed blog URL (my blog), run through all the links on its front page, and store the ones that look like a blog-post URL. I assume that every blog in the world is linked to from at least one other blog, so all of them will eventually get indexed if the spider is given enough time and memory.

This is the code for the crawler. It's in Python and is quite simple. Please run through it and let me know if there is any way to optimise it further:

import re
import urllib2
import urlparse
from pysqlite2 import dbapi2 as sqlite

conn = sqlite.connect('/home/spider/blogSearch.db')
cur = conn.cursor()

# Fetch the seed blog URL (the row with key=1) to start the crawl.
cur.execute('SELECT * FROM blogList WHERE key=1')
tocrawl = set(row[1] for row in cur.fetchall())

# Matches the href attribute of anchor tags.
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
    except KeyError:
        break  # frontier is empty, nothing left to crawl

    url = urlparse.urlparse(crawling)

    try:
        response = urllib2.urlopen(crawling)
    except Exception:
        continue  # skip pages that cannot be fetched

    msg = response.read()
    links = linkregex.findall(msg)

    for link in links:
        if link.endswith('.blogspot.com/'):
            # Resolve relative and fragment links against the current page.
            if link.startswith('/'):
                link = 'http://' + url[1] + link
            elif link.startswith('#'):
                link = 'http://' + url[1] + url[2] + link
            elif not link.startswith('http'):
                link = 'http://' + url[1] + '/' + link

            # Parameterised queries keep crawled URLs from breaking the SQL.
            cur.execute('SELECT * FROM blogList WHERE url=?', (link,))
            if cur.fetchone() is None:
                tocrawl.add(link)
                cur.execute('INSERT INTO blogList (url) VALUES (?)', (link,))
                conn.commit()
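The script assumes blogSearch.db already exists with a seeded blogList table. The post doesn't show the schema, but from how the code reads it (a key column, with the URL as the second column), a minimal one-off setup sketch might look like this; the table layout and the seed URL are assumptions, not something from the original post:

from pysqlite2 import dbapi2 as sqlite

conn = sqlite.connect('/home/spider/blogSearch.db')
cur = conn.cursor()

# Assumed schema: 'key' is the row id and 'url' is the second column,
# so that row[1] in the crawler yields the URL.
cur.execute('CREATE TABLE IF NOT EXISTS blogList '
            '(key INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT)')

# Seed the frontier with one blog's front page; the crawler starts
# from whatever row has key=1. example.blogspot.com is a placeholder.
cur.execute('INSERT INTO blogList (url) VALUES (?)',
            ('http://example.blogspot.com/',))
conn.commit()
conn.close()

With one seed row in place, the crawler can be started directly; every new blogspot front page it discovers is both queued in memory and persisted to the table.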

