Blogging Thesis Progress


After discussing it with my advisor, I've decided to start blogging about my work on my master's thesis. I'll start things off with a post about my research questions.


The Internet, particularly the world wide web, is an increasingly important part of how people seek out political information. According to results from a 2004 Pew/Michigan survey, 53% of Internet users had gotten news about the Iraq war online, 35% of Internet users had gotten news about gay marriage online, and 26% of Internet users had gotten news about the debate over free trade online. Early theorists of the Internet championed it as an egalitarian medium; since the cost of producing a web site is much lower than traditional publishing, and the potential reach of that web site is much greater, the Internet would expand the political voice and knowledge of the average citizen. As Howard Dean's campaign manager Joe Trippi effused, “The Internet is the most democratizing innovation we’ve ever seen, more so even than the printing press.” Others have taken a more pessimistic view of the same phenomenon. Sunstein and Putnam, for example, fear that with public attention diffused across millions of web sites political discourse will become more polarized.


Another possibility is that the Internet might not be so egalitarian after all. To understand why this would be, it's necessary to reflect on the structure of the web. The element tying one web page to another is the hyperlink. Clicking a hyperlink is what allows an Internet user to “browse” from one web page to another. Across the web, hyperlinks follow a power law distribution. A power law distribution is highly inegalitarian: a small number of web sites are the destination of the vast majority of hyperlinks.


The distribution of traffic to web sites also follows a power law. To understand why this should be related to the hyperlink structure, it's necessary to think about the ways Internet users discover web sites. If a user already knows about a web site, they can visit it directly. If they don't, they can discover it via a hyperlink from a site they already know about or by using a search engine like Google. Both of these methods favor the discovery of highly linked-to sites. When browsing the web, the more hyperlinks there are to a site, the more likely a user is to come across one of them. When using a search engine, most users only visit web sites on the first page of results. The release of search data for over 600,000 AOL users showed that 90% of clicks went to results on the first page, 74% of clicks went to the first 5 results, and 42% of clicks went to the first result. This is significant because search engines' ranking algorithms give heavy weight to the number of hyperlinks a site receives. Although the exact algorithms vary from search engine to search engine and are often secret, search engine result ordering is barely distinguishable from simply ordering web sites based on the number of hyperlinks to them.


Using a data set that meshed data from an Internet service provider about the sites their users visited with data on the number of hyperlinks to those sites, Matthew Hindman found a .704 correlation between the amount of traffic a site received and the number of hyperlinks to it. Hindman also found that the power-law distribution of hyperlinks on the web as a whole applies to political content. Using techniques I'll discuss in future posts, Hindman examined communities of web sites dealing with abortion, the death penalty, gun control, the presidency, the congress, and politics in general. In all of these cases, a power law fit the distribution of hyperlinks with an R² greater than .90.
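To make the idea of "a power law fit with an R²" concrete, here is a minimal sketch in Python. It is not Hindman's method (I'll cover his techniques in later posts); it simply regresses log(link count) on log(rank) with ordinary least squares, which is a common but crude way to check for a power law. The function name and the link counts are made up for illustration.

```python
import math

def fit_power_law(counts):
    """Least-squares line through (log rank, log count).

    Returns (slope, r_squared). All counts must be positive.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                       # estimated power-law exponent
    r_squared = (sxy * sxy) / (sxx * syy)   # squared correlation of the logs
    return slope, r_squared

# Synthetic inbound-link counts (hypothetical data): count = 10000 / rank.
links = [10000 / rank for rank in range(1, 201)]
slope, r2 = fit_power_law(links)
print(round(slope, 3), round(r2, 3))
```

On real link data the points will not fall exactly on a line, and the R² measures how close they come.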


Despite the Internet's importance, little research has been done examining the sources of political information to which Internet users are most readily exposed. Hindman's research tells us that the visibility of web sites on at least some political issues follows a power law, but it does not tell us anything about the characteristics of the most visible web sites relative to the rest. What kinds of organizations are behind the most visible web sites on an issue? What kinds of information are presented by the most visible web sites? Are the viewpoints of the most visible web sites representative of the entire set of web sites on an issue? Do the web sites about an issue cluster together based on ideology, type of source, or some other factor? These are the questions my thesis is designed to address.

OpenSecrets API

While playing with the OpenSecrets web service, I've run into some puzzling discrepancies between the data it returns and the data listed on the website. I'm interested in the PAC contributions from a given sector to a given candidate during a given cycle.

When I view Senator Clinton's 2006 PAC contributions on the website, I see $282,600 in Finance, Insurance & Real Estate PAC contributions. When I request the same data through the candSector method of their API, I get a response that lists $363,464 in Finance, Insurance & Real Estate PAC contributions.

I tried using the candIndByInd method to gather data on the specific industries that make up the Finance, Insurance & Real Estate sector. When the API returns these numbers, they match the website, but I sometimes get a “Resource not Found” error instead. For example, the website's listing of Senator Shelby's 2006 Finance, Insurance & Real Estate PAC contributions shows $4,500 in misc finance contributions, but the API request just returns the error.
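For anyone who wants to reproduce the comparison, here is a rough sketch of how the two requests could be built in Python. The endpoint URL and the parameter names (method, cid, cycle, ind, apikey, output) are written from memory and should be treated as assumptions to check against the OpenSecrets API documentation; the key, candidate ID, and industry code are placeholders, not real values.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the OpenSecrets API documentation.
API_BASE = "https://www.opensecrets.org/api/"

def opensecrets_url(method, api_key, **params):
    """Build a request URL for one API method (hypothetical helper)."""
    query = {"method": method, "apikey": api_key, "output": "json"}
    query.update(params)
    return API_BASE + "?" + urlencode(query)

# Sector totals for one candidate and cycle (placeholder values).
sector_url = opensecrets_url(
    "candSector", "YOUR_API_KEY", cid="CANDIDATE_ID", cycle="2006"
)

# One industry within that sector (placeholder industry code).
industry_url = opensecrets_url(
    "candIndByInd", "YOUR_API_KEY",
    cid="CANDIDATE_ID", cycle="2006", ind="INDUSTRY_CODE",
)

print(sector_url)
print(industry_url)
```

Fetching each URL and comparing the totals against the website by hand is how I found the discrepancies described above.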

Reddit's API

You can find discussions of Reddit's “secret” API on a few blogs. Just as interesting, but not discussed on those blogs, is that Reddit will return any page in JSON format if you append .json to the URL. For example:
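As a rough sketch of that URL rewrite in Python (the helper name and the sample URL are mine, chosen only for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def reddit_json_url(page_url):
    """Return the page URL with .json appended to its path."""
    parts = urlsplit(page_url)
    # Drop any trailing slash, then add the .json suffix to the path only,
    # so a query string or fragment stays where it was.
    path = parts.path.rstrip("/") + ".json"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )

print(reddit_json_url("https://www.reddit.com/r/programming/"))
```

The result can then be fetched with any HTTP client and parsed with the standard json module.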

· 2009/01/20 19:39 · Brian Pitts · 0 Comments

Calendar of FLOSS Events in GA

I've started maintaining a calendar of events and volunteer opportunities around GA involving Linux and other free software. I'm currently subscribed to the ALE, CHUGALUG, LUG@GT, GA State's Students for Open Source, GA Ubuntu LoCo, Atlanta “Pragmatic” Linux Meetup Group, and Free IT Athens mailing lists. I'm also subscribed to the LCLUE, MGALUG, SAVLUG, ATLOSUG, and OSSAtlanta RSS feeds. Please leave a comment if you know of other places I should monitor for events or have an event you want publicized. Here are links to the calendar in HTML, XML, and iCal formats.

This calendar is not a comprehensive list of Free IT Athens events; for that, visit freeitathens.org.

LibX for Firefox 3

The LibX Firefox extension for the University of Georgia is now available for Firefox 3. You can install it from here.

UGA Proxy Bookmarklet

Along with my recent Ubuntu upgrade came the Firefox 3 beta, but LibX has not yet released a version of their extension that works with Firefox 3. If you're like me, all links lead to JSTOR, and you're really missing the ease with which LibX allowed you to reload a page through your institution's proxy. Luckily, this is as easy as adding a bookmark to Firefox with the following as the location. If you're not at UGA, you should replace the text inside the quotation marks with the URL of the proxy you use.

 javascript:void(location.href="http://proxy-remote.galib.uga.edu:2048/login?url="+location.href);
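The bookmarklet does nothing more than put the proxy's login URL in front of the current page's address. If you want the same rewrite in a script, a minimal Python sketch looks like this (the function name and the sample JSTOR address are illustrative; the prefix is the UGA one from the bookmarklet):

```python
# Proxy login prefix taken from the bookmarklet above; swap in your own.
PROXY_PREFIX = "http://proxy-remote.galib.uga.edu:2048/login?url="

def proxied(url, prefix=PROXY_PREFIX):
    """Return the URL routed through the proxy, leaving proxied URLs alone."""
    if url.startswith(prefix):
        return url
    return prefix + url

print(proxied("http://www.jstor.org/"))
```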

Now if only del.icio.us would update their extension.

Change filetypes cached by Apt-Cacher

I've been learning how to use debian-installer's preseed functionality in order to automate some of the installations we do at Free IT Athens. Among other things, I wanted to set it to use apt-cacher, our caching proxy server for software, and to install some additional packages, including msttcorefonts. Msttcorefonts downloads each font as an exe file, which isn't in apt-cacher's whitelist of filetypes to accept. If you try, you'll receive a 403 error and the message “Sorry, not allowed to fetch that type of file.” Since apt-cacher is written in Perl, this was an easy fix; I modified line 646 to read

 if ($filename =~ /(\.deb|\.rpm|\.dsc|\.tar\.gz|\.diff\.gz|\.udeb|\.exe)$/) {
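To see what the extended whitelist accepts without touching apt-cacher itself, the same pattern can be tried out in Python. This is only a sketch of the matching logic, not apt-cacher's code; the function name and the sample filenames are made up.

```python
import re

# Same alternation as the modified Perl line, including the added \.exe.
ALLOWED = re.compile(
    r"(\.deb|\.rpm|\.dsc|\.tar\.gz|\.diff\.gz|\.udeb|\.exe)$"
)

def is_cacheable(filename):
    """True if the filename ends in one of the whitelisted extensions."""
    return ALLOWED.search(filename) is not None

for name in ["arial32.exe", "hello_1.0.deb", "source.tar.gz", "notes.txt"]:
    print(name, is_cacheable(name))
```

Note that the pattern is anchored at the end of the name, so a file such as setup.exe.txt would still be rejected.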

LibX for UGA

I created a version of the LibX Firefox extension for the University of Georgia. It provides nice browser integration with the catalog (Voyager), OpenURL resolver (SFX), and proxy (EZ Proxy). You can install it from here.

· 2008/02/08 01:04 · Brian Pitts · 0 Comments

DokuWiki HTTPClient Fix

For a while now, I had an annoying error every time I created a new page in DokuWiki.

Warning: Invalid argument supplied for foreach() in inc/HTTPClient.php on line 427
Warning: Cannot modify header information - headers already sent by (output started at inc/HTTPClient.php:427) in inc/actions.php on line 296

The problem may have been caused by the blog plugin. The fix is to cast $data to an array on line 427 of HTTPClient.php, like so:

foreach((array) $data as $key => $val){


Macworld Rails Presentation Uploaded

I've uploaded the PDF and LaTeX source from my Macworld presentation about deploying Ruby on Rails here.


blog.txt · Last modified: 2007/09/27 01:01 by brian