The top google result for the search terms mysql, python, and unicode is a blog post that recommends using the init_command option to the construct a MySQLdb connection to run the SQL 'SET NAMES utf8'. A better way is to set the charset option to 'utf8'. This way MySQLdb realizes the connection is using utf8 and encodes things properly. For more explanation see MySQLDB's user guide.
If you've ever found yourself trying to log the commands ran by your bash scripts (and cursed your failures to quote and escape them properly), then try using bash's built in history functionality instead. Below is in example of how to turn it on within a script.
#!/bin/bash HISTFILE="$0.log" set -o history history -c echo 'Those who cannot remember the past are condemned to write their own logging functions.' history -w
If anyone is using the following mod to a phpbb installation,
/***************************************************************************
* rss.php
* -------------------
* begin : Monday, July 7, 2003
* notes : This code is based on the work of the original
* developer below. Portions of this code
* 'borrowed' from phpbb_fetch_posts, an
* untitled rdf content syndicator posted at
* phpbb.com, and phpbb itself.
* email : rss@wickedwisdom.com
*
*
* $Id: rss.php,v 2.0.1 2003/07/17 10:11:00 nschindler Exp $
*
*
***************************************************************************/
I've found that the following patch which splits the query into two greatly increases the performance. MySQL's join plan for the original query starts with the post text table but we have to sort by the post table, necessitating a temporary table and filesort over a potentially huge number of records. Selecting the id's of the 10 posts we actually care about before loading the related information avoids that.
--- rss.php.original 2009-08-24 20:58:07.000000000 -0400 +++ rss.php 2009-08-24 22:11:54.000000000 -0400 @@ -1,4 +1,17 @@ <?php + +/*************************************************************************** +* rss.php +* ------------------- +* begin : Monday, August 24, 2009 +* notes : This code is based on the work of the original +* developers below. The SQL was modified +* to improve performance +* email : brian@polibyte.com +* +* +***************************************************************************/ + /*************************************************************************** * rss.php * ------------------- @@ -141,6 +154,17 @@ // // BEGIN SQL statement to fetch active posts of public forums // +$sql = "SELECT post_id FROM " . POSTS_TABLE . " ORDER BY post_time DESC LIMIT $count;"; +$id_query = $db->sql_query($sql); +$ids_array = array(); + +while ($id = $db->sql_fetchrow($id_query)) +{ + $ids_array[] = $id['post_id']; +} + +$ids = implode(',', $ids_array); + $sql = "SELECT f.forum_name, t.topic_title, u.user_id, u.username, u.user_sig, u.user_sig_bbcode_uid, p.post_id, pt.post_text, pt.post_subject, pt.bbcode_uid, p.post_time, t.topic_replies, t.topic_first_post_id FROM " . FORUMS_TABLE . " AS f, " . TOPICS_TABLE . " AS t, " . USERS_TABLE . " AS u, " . POSTS_TABLE . " AS p, " . POSTS_TEXT_TABLE . " as pt WHERE @@ -151,7 +175,8 @@ AND p.topic_id = t.topic_id $sql_topics_only_where $sql_forum_where - ORDER BY p.post_time DESC LIMIT $count"; + AND p.post_id in ($ids) + ORDER BY p.post_time DESC"; // could take this out and reverse the sorting using php instead $posts_query = $db->sql_query($sql); // // END SQL statement to fetch active posts of public forums
After discussing it with my advisor, I've decided to start blogging about my work on my master's thesis. I'll start things off with an post about my research questions.
The Internet, particularly the world wide web, is an increasingly important part of how people seek out political information. According to results from a 2004 Pew/Michigan survey, 53% of Internet users had gotten news about the Iraq war online, 35% of Internet users had gotten news about gay marriage online, and 26% of Internet users had gotten news about the debate over free trade online. Early theorists of the Internet championed it as an egalitarian medium; since the cost of producing a web site is much lower than traditional publishing, and the potential reach of that web site is much greater, the Internet would expand the political voice and knowledge of the average citizen. As Howard Dean's campaign manager Joe Trippi effused, “The Internet is the most democratizing innovation we’ve ever seen, more so even than the printing press.” Others have taken a more pessimistic view of the same phenomenon. Sunstein and Putnam, for example, fear that with public attention diffused across millions of web sites political discourse will become more polarized.
Another possibility is that the Internet might not be so egalitarian after all. To understand why this would be, it's necessary to reflect on the structure of the web. The element tying one web page to another is the hyperlink. Clicking a hyperlink is what allows an Internet user to “browse” from one web page to another. Across the web, hyperlinks follow a power law distribution . A power law distribution is highly inegalitarian; this means that a small number of web sites are the destination of the vast majority of hyperlinks.
The distribution of traffic to web sites also follows a power law. To understand why this should related to the hyperlink structure, it's necessary to think about the ways Internet users discover web sites. If a user already knows about a web site, they can visit it directly. If they don't, they can discover it via a hyperlink from a site they already know about or by using a search engine like Google. Both of these methods favor the discovery of highly linked-to sites. When browsing the web, the more hyperlinks there are to a site the more likely a user is to come across one of them. When using a search engine, most users only visit web sites on the first page of results. The release of search data for over 600,000 AOL users showed that 90% of clicks went to the results from the first page, 74% of clicks went to the first 5 results, and 42% of clicks went to the first result. This is significant because search engines' rating algorithms give heavy weight to the number ofhyperlinks a site receives. Although the exact algorithms vary from search engine to search engine and are often secret, search engine result ordering is barely distinguishable from simply ordering web sites based on the number of hyperlinks to them.
Using a data set that meshed data from an Internet service provider about the sites their users visited with data on the number of hyperlinks to those sites, Matthew Hindman found a .704 correlation between the amount of traffic a site received and the number of hyperlinks to it. Hindman also found that the power-law distribution of hyperlinks on the web as whole also applies to political content. Using techniques I'll discuss in future posts, Hindman examined communities of web sites dealing with abortion, the death penalty, gun control, the presidency, the congress, and politics in general. In all of these cases, a power law fit the distribution of hyperlinks with an R2 greater than .90.
Despite the Internet's importance, little research has been done examining the sources of political information to which Internet users are most readily exposed. Hindman's research tell us that the visibility of web sites on at least some political issues follows a power law, but it does not tell us anything about the characteristics of the most visible web sites relative to the rest. What kinds of organizations are behind the most visible web sites on an issue? What kinds of information is presented by the most visible web sites? Are the viewpoints of the most visible web sites representative of the entire set of web sites on an issue? Do the web sites about an issue cluster together based on ideology, type of source, or some other factor? These are the questions my thesis is designed to address.
While playing with OpenSecrets web service, I've run into some puzzling discrepancies between the data it returns and the data listed on the website. I'm interested in the PAC contributions from a given sector to a given candidate during a given cycle.
When I access Senator Clintons 2006 PAC contributions, I see $282,600 in Finance, Insurance & Real Estate PAC contributions. When I access this data through the candSector method of their API I get a response that lists $363,464 in Finance, Insurance & Real Estate PAC contributions.
I tried using the CandIndByInd method to gather data on the specific industries that make up the Finance, Insurance & Real Estate sector. These numbers match when I receive them, but I sometimes get a “Resource not Found” error. For example Senator Shelby's 2006 Finance, Insurance & Real Estate PAC contributions shows $4,500 in misc finance contributions, but the API request just returns the error.
I've started maintaining a calendar of events and volunteer opportunities around GA involving linux and other free software. I'm currently subscribed to the ALE, CHUGALUG, LUG@GT, GA State's Students for Open Source, GA Ubuntu LoCo, Atlanta “Pragmatic” Linux Meetup Group, and Free IT Athens mailing lists. I'm also subscribed to the LCLUE, MGALUG, SAVLUG, ATLOSUG, and OSSAtlanta rss feeds. Please leave a comment if you know of other places I should monitor for events or have an event you want publicized. Here are links to the calendar html, xml, and ical formats.
This calendar is not a comprehensive list of Free IT Athens events, for that visit freeitathens.org
The LibX Firefox extension for the University of Georgia is now available for Firefox 3. You can install it from here.
Along with my recent Ubuntu upgrade came the Firefox 3 beta, but LibX has not released a version of their extension that works with Firefox 3 yet. If you're like me, all links lead to JSTOR, and you're really missing the ease with which LibX allowed you to reload a page through your institution's proxy. Luckily, this is as easy as adding a bookmark to firefox with the following as the location. If you're not at UGA, you should replace the text inside the quotation marks with the URL of the proxy you use.
javascript:void(location.href="http://proxy-remote.galib.uga.edu:2048/login?url="+location.href);
Now if only del.icio.us would update their extension.
I've been learning how to use debian-installer's preseed functionality in order to automate some of the installations we do at Free IT Athens. Among other things, I wanted to set it to use apt-cacher, our caching proxy server for software, and to install some additional packages, including msttcorefonts. Msttcorefonts downloads each font as an exe file, which isn't in apt-cacher's whitelist of filetypes to accept. If you try, you'll receive a 403 error and the message Sorry, not allowed to fetch that type of file. Since apt-cacher is written in perl, this was an easy fix; I modified line 646 to read
if ($filename =~ /(\.deb|\.rpm|\.dsc|\.tar\.gz|\.diff\.gz|\.udeb|\.exe)$/) {