In part 1 of this post I described a solution to two problems I have: first, for some reason I want to participate in the world of Facebook (because someone, somewhere must care what I think about things), and second, there is a lot of information out there and only a very small percentage is useful to me. See below (the useful information is magnified 1 billion times so you can see it on the chart):

The solution proposed was an artificial intelligence algorithm that gathers articles from the massive troves of Internet crap, filters out the gems, and then posts those to FB on my behalf. This post will describe the program that I use to collect and classify articles from the Internet for me.

The moving parts for the whole system

There are 5 pieces.

  • The news aggregator
  • The trainer
  • The classifier
  • The reader
  • The poster

In this post we will describe the aggregator, the trainer, and the classifier.

The Aggregator

This is a simple RSS reader written in ruby. I use the Feedzirra gem to pull in articles from a list of feeds contained in a text file, one feed URL per line.

I then insert certain attributes from each feed item into a sqlite db.

Here is some of the code. I am sure that some ruby guy out there can show me how to do this whole thing in one line. Until then, I will write my code so I can read it. Sometimes I even use comments!

require 'feedzirra'
require 'nokogiri'
require 'sqlite3'

db = SQLite3::Database.new("articles.db")
counter = 0

File.open("rss_feeds.txt", "r").each_line do |line|
	feed = Feedzirra::Feed.fetch_and_parse(line.strip)
	next if feed.nil? || feed.kind_of?(Fixnum) || !feed.entries
	feed.entries.each do |entry|
		#check to make sure the link is not already in there
		rows = db.execute("select id from articles where link = ?", entry.url)
		if rows.empty?
			counter = counter + 1
			clean_summary = Nokogiri::HTML(entry.summary).inner_text
			clean_title = entry.title.sanitize
			#insert the columns known at aggregation time (a subset of the schema)
			db.execute("insert into articles (title, summary_simple, published, link, source)
			            values (?, ?, ?, ?, ?)",
			           clean_title, clean_summary, entry.published.to_s,
			           entry.url, line.strip)
		end
	end
end

The Database

The sqlite database schema looks like this:

CREATE TABLE articles(id integer primary key,
actual_rank int,
estimated_rank int,
read int,
title varchar(200),
summary_simple varchar(2000),
summary_complete varchar(2000),
published datetime,
categories varchar(300),
raw_entry varchar(4000),
source varchar(200),
link varchar(200),
posted integer,          -- if the article has been posted to FB
post_identifier integer  -- a number unique to the post
);

The table includes some metadata that will be used by the classifier. The “actual rank” is the rank that I, the human reader, gave the article. The “estimated rank” is the rank that the classifier gives the article.

The Classifier

I tried a bunch of different ways to do this. I tried the python nltk naive Bayesian classifier. It worked well. I also tried the ruby classifier gem. It is not nearly as robust as the nltk, but for my purposes it performed just as accurately and quickly as the python classifier. It was also simpler to use.

The method is straightforward: I programmatically read all the ranked articles in the DB and use them as a training set, then classify the new, unranked articles.
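The classifier gem hides the Bayes math, but the idea is small enough to sketch from scratch. The class below is purely illustrative (the names are mine, not the gem's API): count words per category during training, then pick the category with the best smoothed log-probability when classifying.

```ruby
# Illustrative from-scratch naive Bayes; TinyBayes is NOT the classifier gem's API.
class TinyBayes
  def initialize(*categories)
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => {word => count}
    @doc_counts  = Hash.new(0)                            # category => docs trained
    @categories  = categories
  end

  def train(category, text)
    @doc_counts[category] += 1
    text.downcase.scan(/\w+/).each { |w| @word_counts[category][w] += 1 }
  end

  def classify(text)
    words = text.downcase.scan(/\w+/)
    total_docs = @doc_counts.values.sum.to_f
    @categories.max_by do |cat|
      # log prior plus log likelihood, with add-one smoothing for unseen words
      vocab = @word_counts[cat].values.sum + 1
      prior = Math.log((@doc_counts[cat] + 1) / (total_docs + @categories.size))
      words.sum(prior) { |w| Math.log((@word_counts[cat][w] + 1.0) / vocab) }
    end
  end
end
```

Train it on a couple of throwaway examples and it behaves the way you would expect: `TinyBayes.new('0', '5')` trained with gossip as '0' and science as '5' will put a new science-flavored summary in '5'.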

Building up the training data

A Bayesian classifier needs training data. I created this data myself since I would be using it to classify future articles for myself. It needed to learn what I like. To do this, I created a simple Sinatra app and threw in some Bootstrap stuff to serve me the articles. I spent a couple days reading and ranking. Here are the rules:

0: Don’t ever show me anything like this again.

1: This is pretty stupid, but I acknowledge that it exists. NBC’s fall lineup falls into this category.

2: This is not worth clicking on. Reading the headline was enough information. Celebrity gossip goes here.

3: I clicked the link. I read part of the article. Meh. Politics go here.

4: This was a cool article.

5: This was an amazing article. I want to share it with my friends and they better love it as much as I did. I cannot tell you what qualifies as a five, but I know it when I see it.
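The Sinatra app itself is not shown here; stripped of the routing and Bootstrap, all it has to do when I submit a rank is validate it and write it back. A hypothetical sketch (`record_rank` and the hash stand in for the real route handler and the sqlite update):

```ruby
# Hypothetical sketch of the rank-recording step behind the ranking app.
# A real version would run "update articles set actual_rank = ? where id = ?".
VALID_RANKS = (0..5)

def record_rank(store, article_id, rank)
  rank = Integer(rank) # web form params arrive as strings
  raise ArgumentError, "rank must be 0-5" unless VALID_RANKS.cover?(rank)
  store[article_id] = rank
  rank
end
```

Keeping the 0–5 check in one place means a fat-fingered form submission can never poison the training data with an out-of-range rank.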

After building the training data set, I was ready to set my creation loose on some real data. There are six total classifications, one for each rank. Below is the example of the “Zero” classifier.

require 'stemmer'
require 'classifier'
require 'sqlite3'

@db = SQLite3::Database.new("articles.db")

def get_ranked_articles(rank)
	rows = @db.execute("select summary_simple from articles where actual_rank = ?", rank)
	puts "returning articles with rank #{rank}: #{rows.count}" #I like to see what is going on...
	return rows
end
# Load previous classifications
rank_zero = get_ranked_articles(0)

# Create our Bayes classifier
classifier = Classifier::Bayes.new('0', '1', '2', '3', '4', '5')

# Train the classifier
rank_zero.each { |zero| classifier.train_0 zero.join }

#do the same for the other ranks...
#now rank the new ones
a = get_all_articles #a similar helper that does "select * from articles" (not shown)

a.each { |text|
	if text[5] #column 5 is summary_simple when selecting all columns
		estimated_rank = classifier.classify text[5]
		@db.execute("update articles set estimated_rank = ? where id = ?", estimated_rank, text[0])
	end
}
That is pretty much it for the classifier. I added some things to make it reclassify older articles that I have not read (so that the rankings change as my tastes change). I also have it classify the articles that I have already ranked and then compare the estimated rank with the actual rank, just to see how well it is doing. I have achieved accuracy greater than 60%, and in some cases (depending on how much new data there is) greater than 80%. In the next segment, I will discuss the reader and the poster. Lest you think this was confusing, just wait until you try to use Facebook’s API.
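That accuracy check boils down to counting how often the estimated rank matches the actual rank. A minimal sketch of the comparison, with an array of pairs standing in for rows pulled from the DB:

```ruby
# Sketch of the accuracy check: compare estimated_rank against actual_rank
# for articles that have both. pairs = [[actual, estimated], ...]
def rank_accuracy(pairs)
  return 0.0 if pairs.empty?
  hits = pairs.count { |actual, estimated| actual == estimated }
  hits.to_f / pairs.size
end
```

So `rank_accuracy([[5, 5], [3, 2], [0, 0], [4, 4], [2, 2]])` gives `0.8`: four of five estimates matched.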
