Amazon Web Services for Research

Within the last year, I’ve had a chance to use a number of tools within the Amazon Web Services (AWS) platform. This was for a number of different research projects, the latest of which actually focuses on using AWS to host a prototype for a business I’d like to start. Unfortunately for you readers, I’m still keeping very quiet on this work.

As I use the service to build my prototype, I keep thinking about how useful AWS is for researchers. This blog focuses on social science research, and it’s really a shame how difficult it is to develop software for the social scientific community. Unfortunately, AWS is nowhere near ready for most social science researchers, but if you have access to a research assistant or a computer science student, or have a pre-existing technical background, this service may prove very useful.

A relatively easy-to-use tool is the Mechanical Turk. The tool is designed to allow people to “outsource” their work to a distributed network of workers. You submit a set of tasks or questions, and offer to pay people to complete or answer them. The amazing thing is you can pay people as little as $0.01 per answer, and hundreds still flock to do the work for you. There are many papers that illustrate how to use the tool for various research projects, many of which focus on building training sets for machine learning tasks… But with features like filtering users by geographical area and pre-screening people for specific tasks, I think there’s a great opportunity to run other interesting experiments. For example, do people from different geographical regions answer political questions differently? What about providing a colour-blindness test in the pre-screening questions and then running experiments on vision or perception?

The idea behind the Mechanical Turk is called crowdsourcing, and you can even read more focused academic studies on this idea.

The second tool is Amazon’s Elastic Compute Cloud (EC2). If you can’t stop reading about cloud computing in the news, then this is part of the reason. For those scientists who don’t have access to dedicated hardware for experiments or data analysis, whether it’s a regular desktop computer or a supercomputer, this is a great tool to use. I currently pay about $2 / day to have access to the equivalent of a fully dedicated web server, and it’s been great for building prototypes and running test ideas without depending on my university’s infrastructure. This method is also much, much cheaper than setting up a server or buying a second computer.

All I can say at this stage is to check these out. If you’re unfamiliar with the tools or need help, feel free to e-mail me.

Data Visualization in Development

Lately I’ve been very interested in various forms of art, from painting to comic books to writing short stories. I think expression is important, and this is especially true when it comes to visualizing data and statistics. The age of pie charts and bar graphs is long gone, and making something more beautiful and useful is important. This is especially true as data becomes more complex: representing percentages in a pie chart is a great way to make comparisons between different categories, but how do you usefully show a social network with ten thousand nodes and information on income levels and gender? Graphic design and modern art can teach us a lot about this.

Aside from promoting research findings and academic discourse, data visualization can be used as an advocacy tool. The Tactical Technology Collective has a great guide for organizations interested in visualizing data.

The guide itself focuses on static data visualization — this can include posters, images, or advertisements. A further approach to visualizing data is through animations over time. One of my favourite examples of this is Hans Rosling’s Gapminder talk at the TED Conference. While I normally shy away from video lectures, this is definitely a must-see for anyone interested in data visualization and international development. Another example is an interactive animation of Iraq War casualties.

Thanks to the Internet, data visualization is becoming much more mainstream, both because the technology is easier to use and because people are forced to deal with more data (more e-mails, more websites, more social networking sites) on a regular basis. The Pandemic Preparedness Capacity Map is a simple example where USAID and InterAction visualize data focusing on international health projects.

Not surprisingly, the US Presidential Election has been the focus of many designers and data analysts. Sites like Everymoment Now show the number of articles referencing specific candidates, while pitch interactive shows campaign donation data publicly available at the Federal Election Commission (which also has a nice map).

The proliferation and availability of data is a great opportunity for political and development-focused organizations. A great opportunity exists in being able to combine multiple data sources. For example, mixing information from the US census, Department of Labor, Federal Election Commission, and other public governmental data sets can allow one to build a comprehensive view of regional political trends. Using data on sites like Nation Master or OECD and World Bank data archives can allow you to explore and visualize global economic trends, wars, or related topics.

Multilevel Models (in R)

Multi-level models, also known as hierarchical linear models or mixed models, are a fairly complex but useful approach to analysing data drawn from multiple, disparate groups. The reason one would use such models is that group-level models don’t allow you to predict variables for an individual within the group, while individual-level models might miss important effects caused by group membership.

Suppose, for example, that you are modeling a soccer (football) league. You have statistics about each player, such as age, years playing, health information, and so on. You also have team-level information, such as number of wins, age of the team, and maybe information about the coach and manager of the team. If you want to model the number of games won in the coming season, it’s important to include information about each player as well as team-level information.

While it’s technically possible to build a model like this using regressions, a multi-level model takes care of a great many details for you and allows you to formally test whether certain group-level variables affect individuals in the groups, or if there are important interaction effects taking place between the two levels. You can also abstract to multiple levels — for example, if you want to model voter behaviour in cities but want to build models at regional and provincial (state) levels.
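To make the two-level idea a bit more concrete, here is a minimal Python sketch of the "partial pooling" that a simple random-intercept model performs: each team's average is pulled toward the league-wide average, and teams with fewer players are pulled harder. The team names, the numbers, and both variance values are made up for illustration; a real multilevel package would estimate the variances from the data rather than take them as inputs.

```python
from statistics import mean

def partial_pooling(groups, sigma_y2, sigma_a2):
    """Shrink each group's mean toward the grand mean.

    groups   : dict mapping group name -> list of individual observations
    sigma_y2 : assumed within-group (individual-level) variance
    sigma_a2 : assumed between-group variance
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    estimates = {}
    for name, values in groups.items():
        w_group = len(values) / sigma_y2   # precision of the group's own mean
        w_grand = 1.0 / sigma_a2           # precision supplied by the group level
        estimates[name] = (
            (w_group * mean(values) + w_grand * grand_mean) / (w_group + w_grand)
        )
    return estimates

# Hypothetical goals scored per player on three teams.
teams = {
    "A": [2, 3, 4, 3, 3, 2, 4, 3],
    "B": [9],
    "C": [1, 2, 1, 2],
}
estimates = partial_pooling(teams, sigma_y2=4.0, sigma_a2=1.0)
for name, value in estimates.items():
    print(name, round(value, 2))
```

Team B has a single player with an unusually high score, so its estimate moves most of the way back toward the league average, while team A, with eight players, barely moves. This weighting is only the textbook approximation for a model with no predictors; adding player-level and team-level variables is where the dedicated packages earn their keep.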

The reason I’m writing all this is I’ve been exploring some options with multilevel data this past week, one set that includes personal social networks and labels therein, and another that includes data on literacy projects for over a thousand schools in India. I spent this morning trying to see if there are open source tools that allow one to build multilevel models, and two R packages specifically stand out: nlme and lme4.

nlme has a great manual that provides a quick introduction to R itself, along with a detailed introduction to multilevel models and how to build and test them using the package. It’s a great read for anyone interested in learning more.

Human Rights and Network Analysis

Several months ago, Skye Bender-deMoll wrote Potential Human Rights Uses of Network Analysis and Mapping, a report for the American Association for the Advancement of Science. The report is a great introduction to social network analysis, especially with regard to applications in international development and the social sciences.

Specifically, the Existing practice: current human rights-related use of network techniques chapter provides a great list of applications to politics and public policy. I won’t go into all the details here because you can simply get the PDF file above. What I did want to mention, however, are two projects listed in the report which I haven’t come across yet and are really worth checking out.

The first of these is Presidential Watch ‘08, a site that tracks just over 500 political blogs. I’m not sure if the data is still being updated, but the user interface is definitely worth a look. It is very neat and tidy, and a great place to start for anyone interested in making their own political blog applications and tools.

The Net-Map Toolbox is the second project worth visiting. Run by Eva Schiffer, the toolbox is designed to learn about influence and power networks in communities (mainly within an international development context). By using board game pieces, height modifiers (for said pieces), and various tokens, interviewees build their perceived network of actors in their community and how these actors relate to each other. This is extremely important when running humanitarian and development-focused projects, and the International Food Policy Research Institute has a paper by Schiffer that discusses the importance of power relations, and the difficulties of data collection.

Tikhonov Regularization (in Graph Theory)

I’ve been fooling around with research papers lately, and have implemented the Tikhonov Regularization algorithm described in Belkin, Matveeva, and Niyogi’s Regularization and Semi-supervised Learning on Large Graphs. It’s a fairly straightforward algorithm, at least in terms of implementation in Python… Part of the praise should definitely go to NumPy, which really made life easy for me.

A few words of warning — the code hasn’t been checked aside from a few rough sample analyses. If you do find a mistake, please do let me know. I’m just sharing it with the world in case anyone wants a good starting point. Also, there’s been no optimizing of code at all!

So get the code here, and you can test on data files #1 and #2. You’ll need both to run the script.

In terms of the theory behind this algorithm, one important assumption it makes is that the labels (the data points associated with each node) are distributed in a homophilous manner… That is, nodes linking to each other are likely to have similar values. This isn’t always the case, so be careful where you apply this code. I won’t rant about this here, but I’ve had a few long discussions about when such an algorithm can actually be applied to data, and it’s trickier than it sounds!
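For readers who want to see the homophily assumption at work, here is a small, self-contained Python sketch of graph smoothing in the Tikhonov spirit. It is not the code linked above, and it is a simplified variant rather than the exact formulation in the paper: it balances a squared-error term on the labelled nodes against a penalty on differences across edges, and solves the resulting linear system with plain Gauss-Seidel sweeps instead of NumPy. The function name, the `gamma` parameter, and the toy graph are all my own assumptions.

```python
def smooth_labels(edges, labels, n_nodes, gamma=1.0, n_iter=200):
    """Estimate a value for every node of an undirected graph.

    Minimises  sum over labelled i of (f[i] - y[i])**2
             + gamma * sum over edges (i, j) of (f[i] - f[j])**2
    by repeatedly setting each f[i] to the value that zeroes its gradient.
    """
    neighbours = [[] for _ in range(n_nodes)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    f = [0.0] * n_nodes
    for _ in range(n_iter):
        for i in range(n_nodes):
            has_label = 1.0 if i in labels else 0.0
            denom = has_label + gamma * len(neighbours[i])
            if denom == 0.0:
                continue  # isolated, unlabelled node: nothing to infer
            numer = (has_label * labels.get(i, 0.0)
                     + gamma * sum(f[j] for j in neighbours[i]))
            f[i] = numer / denom
    return f

# Toy example: a chain of five nodes with labels known only at the two ends.
chain_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
known = {0: 1.0, 4: -1.0}
values = smooth_labels(chain_edges, known, n_nodes=5)
print([round(v, 3) for v in values])
```

On this chain the unlabelled nodes end up on a straight line between the two labelled ends, with the middle node at zero. That is exactly what you want if neighbours really do resemble each other, and exactly what will mislead you if they don't.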

Scanning the Web for Diseases

Let’s start with two over-simplifications. There’s a lot of information on the Web… And, it’s hard to analyze it. Indeed, regardless of which field of study you work in, the Web probably applies to you — whether you’re an economist studying auction systems, a computer scientist looking at technical infrastructure, or pretty much anything in between.

One very relevant application that tries to solve this problem for health practitioners is HealthMap. Funded by Google, it’s a perfect example of the intersection between data mining and public policy.

HealthMap: An Overview

HealthMap scans various health sites and news directories, constantly looking for news related to health and diseases. It does this by scanning the actual text of the articles and, using a text classifier, tries to categorize every article into (1) a specific disease, and (2) a specific region. This is much harder than it sounds, as the software needs to know the difference between a team of American doctors studying a new outbreak in England, and a team of English doctors studying a new outbreak in America. While this is easy for humans to do, computers are often quite terrible at telling the difference.

Once this information is collected and synthesized, the site displays information on outbreaks as a Google Maps mashup, making it easy to check where outbreaks are happening and what is going on in specific regions.

An overview of the technology is provided in a recent paper in the Journal of the American Medical Informatics Association.

The great thing about this is that using open source tools like Nutch for web crawling, WVTool for text analysis, and Weka for model generation, you can build prototypes of HealthMap-like tools in several weeks. Of course, getting your classifiers to accurately map articles to specific regions and diseases is often the hardest (and most important!) part.
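To give a feel for the classification step, here is a toy Python sketch of a naive Bayes text classifier that assigns a headline to a disease. It is only an illustration of the general technique: I don't know which classifier HealthMap actually uses, the four training headlines are invented, and a real system would need far more data plus a separate step for working out the region.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """examples: list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training headlines, for illustration only.
training = [
    ("cholera outbreak spreads through contaminated water", "cholera"),
    ("doctors report cholera cases after flooding", "cholera"),
    ("influenza season brings fever and cough", "influenza"),
    ("new influenza strain detected in poultry workers", "influenza"),
]
model = train(training)
print(classify("Flooding leads to contaminated water", model))
```

Notice that a bag-of-words model like this has no idea who is studying what where, which is why the American-doctors-in-England problem needs more than word counts.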

The Challenge of Unstructured Data

HealthMap is a great response to the deluge of data that people and organizations have to deal with on the Web. Collecting data, analyzing it, and making it accessible is a major challenge in almost every field of study. Another example of a response to this is Issue Crawler, which allows one to explore political discussions online.

The main difference between these approaches and tools like Wikipedia and Who Is Sick? is that the latter use distributed networks of people to collect and organize information. I imagine that one great opportunity in the next few years will be combining the use of such “people power” with machine learning to build web services that help us deal with all this information and data.

Data Inaccuracies in Polls and Surveys

Salon.com published an interesting article yesterday by Paul Maslin and Jonathan Brown, discussing an inaccuracy in the standard approach to political polling. They say that phone surveys only focus on landlines, which ignores people who only have cell phones. They have a fairly detailed discussion on why this is the case, and how much this can affect polls: essentially, as the number of people who only use cell phones increases, polls can become less and less accurate. This is especially true since a specific demographic owns cell phones and avoids landlines (younger, more technical people), meaning the polls can become quite biased (and thus inaccurate).
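The size of this bias is easy to work out on the back of an envelope. The Python sketch below does the arithmetic for a poll that can only reach landline owners; the 20% cell-only share and the two support levels are numbers I made up, not figures from the Salon article.

```python
def coverage_bias(share_excluded, support_covered, support_excluded):
    """Return (true_support, bias) for a poll that only reaches the covered group.

    share_excluded   : fraction of voters the poll cannot reach (cell-only)
    support_covered  : candidate support among reachable voters (landline)
    support_excluded : candidate support among unreachable voters
    """
    true_support = ((1 - share_excluded) * support_covered
                    + share_excluded * support_excluded)
    bias = support_covered - true_support  # what the poll reports minus the truth
    return true_support, bias

# Hypothetical numbers: 20% cell-only, 45% support on landlines, 60% on cell phones.
true_support, bias = coverage_bias(0.20, 0.45, 0.60)
print(f"true support: {true_support:.2f}, poll bias: {bias:+.2f}")
```

With these made-up numbers the poll reads three points low. The bias is simply the excluded share multiplied by the gap between the two groups, so it grows both as more people drop their landlines and as the two groups drift apart politically.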

Alternatives to Political Polling

So the first question that comes up is, “Are there alternatives to phone calls?” Even with the rise of the Internet, political polling is still very dependent on random phone calls. The basic problem is getting a random sample — you can’t do that with e-mails or site visits. So never trust those CNN or Fox News polls.

One alternative is using a prediction market. These act just like stock markets, but people buy and sell shares in a specific event — you then make a profit if the event takes place, and lose money if it does not.

What is exciting about prediction markets is that, with enough people participating, they aggregate individuals’ knowledge and can provide a reasonably accurate probability around a specific event. In fact, markets have been known to provide better predictions than those of experts. A lot of major companies, such as Google, HP, Best Buy, and many others, are using these now, and one can get a good overview by exploring Google’s work, and Wolfers’ and Zitzewitz’s paper.
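Here is a small Python sketch of why a market price can be read as a probability. It assumes the simplest kind of contract, one that pays a fixed amount (I use $1) if the event happens and nothing otherwise; real markets differ in payout size and fees, and the prices below are invented.

```python
def implied_probability(price, payout=1.0):
    """Read the price of a pays-`payout`-if-true contract as a probability."""
    return price / payout

def expected_profit(price, believed_probability, payout=1.0):
    """Expected profit per contract for a buyer with a given personal belief."""
    profit_if_true = payout - price
    loss_if_false = -price
    return (believed_probability * profit_if_true
            + (1 - believed_probability) * loss_if_false)

# Hypothetical contract trading at 66 cents.
price = 0.66
print(implied_probability(price))                    # the market's estimate
print(round(expected_profit(price, 0.75), 2))        # buyer who believes 75%
print(round(expected_profit(price, 0.66), 2))        # buyer who agrees with the market
```

A trader who thinks the event is more likely than the price suggests expects to make money by buying, which pushes the price up; one who thinks it is less likely sells. The price settles where the expected profit is roughly zero for the marginal trader, and that is the sense in which it aggregates everyone's knowledge.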

So how accurate are these markets for the upcoming elections? Well, the Iowa Electronic Market pretty much shows a 50-50 split on the 2008 Presidential race, while Intrade.com gives Obama a 2-to-1 lead. Of course, not everyone participates in these markets, and I’m sure it is easy to argue that Obama supporters (read: younger, more technology-friendly people) are more likely to use sites like this.

Is This Cell Phone Problem An Isolated One?

When reading newspapers or magazines, people often feel more comfortable with numbers than they do with qualitative or subjective discussions. This is a major problem — yes, numbers do not lie, but the definitions used to get those numbers can often be misleading. The way surveys are designed, and the way “random” samples are chosen, can often bias results quite a bit.

One area where this is a very big problem is poverty measurements. Poverty is often defined with regards to how much of a family’s income is spent on food and shelter. International comparisons, however, are murky — the way you define baskets of goods (e.g. nutritional requirements, staple foods, etc.) can change quite a bit between countries. One of the biggest criticisms of surveys focusing on poverty has been that they are household surveys — people without homes are often missed. Indeed, finding such people can be very tricky in the first place.

Oftentimes, running surveys and collecting data is extremely difficult. A great overview of this, in an international development context, is Martin Ravallion’s “How Well Can Method Substitute for Data? Five Experiments in Poverty Analysis”. Statisticians, mathematicians, and other researchers are constantly trying to find new analytical tools to make models and analysis more accurate, but bad data can rarely be fixed after it has been collected.

In general the important thing is to critically analyze the definitions and methods used in surveys and polls. The best piece of advice I ever got on this issue was that numbers and methodologies tell stories just like words do, and it is important to read between the lines.

YouTube, Viacom, and Data Concerns

Over the last few days, Viacom and Google have been in the news quite a bit due to their trial. Viacom is suing Google for $1 billion due to the amount of copyrighted material being posted on YouTube (which Google owns).

The Associated Press reports, “U.S. District Judge Louis L. Stanton authorized full access to the YouTube logs after Viacom Inc. and other copyright holders argued that they needed the data to show whether their copyright-protected videos are more heavily watched than amateur clips.”

The EFF has statements from both sides.

While such legal issues aren’t my specialty, I wanted to write about it because of the political ramifications of such a release of data. Also, the limitations behind data anonymisation are concerning here, even though Viacom says this data will be “anonymised” and not used to target specific individuals or users.

As far as I understand, Google will be handing over approximately 12 terabytes of data, in a database that includes when a video is played, each viewer’s user name, and also their IP address. At this point, both sides have argued that a user’s IP address cannot lead to identification of a specific person.

IP Addresses and User Names

If nothing is changed within the database, Viacom’s lawyers will be able to see individual user names that watched videos. I imagine it will be easy for them to also find out who posted the videos, either by simply visiting the site or making a few basic assumptions about the data (e.g. “The first person to watch a video is likely to be the one who posted it.”) that can be empirically tested.

While this process will not compromise everyone’s privacy, there will be users who can be tracked down through the information above. There are a number of ways to do this:

People with obvious user names. Some users use their real names, while others are building brands around their user names. For example, if you know which videos lonelygirl15 watched, you can probably guess who watched them. The same is true for many other YouTube users. Furthermore, a great many users also post personal videos, and made accounts without ever expecting to have their viewing patterns analyzed by lawyers. Now if I see that User123 watches Colbert Report videos on YouTube, I can check if he or she posted personal videos (say, from a trip to Costa Rica or playing beer pong) and easily track that person down.

IP addresses contain geographic information. In sparsely populated areas, your IP address can’t be connected to you specifically, but it can narrow the list of potential people by quite a bit. If you live in New York City, you might be okay.

Usage patterns. Remember when AOL released “anonymised” search queries of a few hundred thousand users? Based on the terms that were put into the search engine, a news agency was able to track down specific users based on this data. A similar issue occurred when researchers de-anonymised Netflix data by comparisons to publicly available information on IMDB. The same can be done on YouTube. What’s worse, one can easily use social networks based on comments, favourites, feeds, etc. to build a community-level view. Using this information, you can find clusters of users who are sharing or watching illegal content.

In my final year of university, I did a project focusing on community identification on YouTube and was able to build a basic crawler that found communities of users that would share anime clips. These were communities that were not formally organized, but could be found by analyzing comment patterns under each YouTube video. I won’t go into the math, but it’s very easy to do. This method might not find a specific individual, but can definitely find groups of friends (say, from your local high school) or fan clubs… Viacom could easily use a similar approach to track down groups of people who regularly view illegally uploaded content.
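To show how little is needed, here is a rough Python sketch of the general idea. It is not my original crawler or the method I used back then: it simply links two users when they have commented on at least a couple of the same videos and then reads off the connected groups. The user names, video IDs, and the `min_shared` threshold are all invented.

```python
from collections import defaultdict
from itertools import combinations

def find_communities(comments, min_shared=2):
    """comments: iterable of (user, video_id) pairs, one per comment."""
    videos_by_user = defaultdict(set)
    for user, video in comments:
        videos_by_user[user].add(video)

    # Link two users when they commented on at least `min_shared` common videos.
    graph = defaultdict(set)
    for a, b in combinations(sorted(videos_by_user), 2):
        if len(videos_by_user[a] & videos_by_user[b]) >= min_shared:
            graph[a].add(b)
            graph[b].add(a)

    # Connected components of that graph are the candidate communities.
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            user = stack.pop()
            if user in component:
                continue
            component.add(user)
            stack.extend(graph[user] - component)
        seen |= component
        communities.append(component)
    return communities

# Invented comment log.
comments = [
    ("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"),
    ("u3", "v2"), ("u3", "v1"), ("u4", "v9"), ("u5", "v9"),
    ("u5", "v8"), ("u4", "v8"), ("u6", "v1"),
]
groups = find_communities(comments)
print(groups)
```

Here u1, u2, and u3 fall into one group and u4 and u5 into another, while u6, who shares only a single video with anyone, is left out. Comparing every pair of users is fine for a toy example but would need a smarter approach on a log of YouTube's size.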

The Anonymity Myth

Suppose Google modified its database so that: (1) user names became a set of random characters, (2) so did IP addresses. This is often what people do when releasing data — it’s what AOL and Netflix did, for example.

Unfortunately, the last two methods above can still be used to track individuals down, because they depend on the underlying social network of the website, and the video content as well.
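A quick Python sketch makes the point. It replaces every user name and IP address with a random token, then checks what survives. The log entries are invented and the field layout is my assumption about what such a database might look like, not a description of YouTube's actual logs.

```python
import secrets

def pseudonymise(log):
    """Replace user names and IP addresses with random tokens.

    log: list of (user_name, ip_address, video_id) tuples.
    """
    user_alias, ip_alias = {}, {}
    scrubbed = []
    for user, ip, video in log:
        if user not in user_alias:
            user_alias[user] = secrets.token_hex(8)
        if ip not in ip_alias:
            ip_alias[ip] = secrets.token_hex(8)
        scrubbed.append((user_alias[user], ip_alias[ip], video))
    return scrubbed

def viewing_profiles(log):
    """Return the set of videos watched by each account, ignoring who they are."""
    profiles = {}
    for user, _ip, video in log:
        profiles.setdefault(user, set()).add(video)
    return sorted(sorted(videos) for videos in profiles.values())

# Invented log entries.
log = [
    ("User123", "10.0.0.1", "colbert_clip"),
    ("User123", "10.0.0.1", "costa_rica_trip"),
    ("anime_fan", "10.0.0.2", "anime_ep1"),
    ("anime_fan", "10.0.0.2", "anime_ep2"),
]
scrubbed = pseudonymise(log)
print(viewing_profiles(scrubbed) == viewing_profiles(log))
```

The names are gone, but every random token still carries exactly the same set of videos as the account it replaced. Anyone who can match a viewing profile against public information (a personal video, a comment thread) is back where they started.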

To really anonymise the data, one would need to randomize the underlying network structure as well. This would get rid of community structure and make it harder to track down groups of people with similar interests or backgrounds. Since Viacom is allegedly not interested in such information (and by the judge’s ruling, is not allowed to search for it even if it wants to), getting rid of links and scrambling the network structure is fair game.

Modifying the video content would be trickier. Viacom is trying to make the argument that illegally posted videos are more popular on YouTube than legal videos. To label a video as “legal” or “illegal”, one would need to watch the content, and there is no technology that exists to do this automatically (if there were, then Google should just use that and get this lawsuit over with).

To get over the problem of content compromising users’ privacy, one option is to have Viacom submit a list of videos they feel contain illegal content. Google can then scramble all the video IDs and return a list showing which scrambled IDs represent illegal content, as judged by Viacom. This will ensure that Viacom cannot visit YouTube and track down who posted those videos or connect their data with other databases (say, check if users posting videos are also commenting elsewhere, or participating in other sites). While this process may sound tedious, Viacom will need to label data in such a way for the company / team / group / lawyers they hire to actually analyze the data.
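The scrambling step could look something like the Python sketch below. The function name, the log layout, and the video IDs are hypothetical, and I am assuming the viewer identifiers have already been replaced with random tokens before this step.

```python
import secrets

def scramble_video_ids(view_log, flagged_ids):
    """Swap every video ID for a random alias and report which aliases are flagged.

    view_log    : list of (viewer_token, video_id) pairs
    flagged_ids : set of video IDs the copyright holder considers infringing
    """
    alias = {}
    scrambled_log = []
    for viewer, video in view_log:
        if video not in alias:
            alias[video] = secrets.token_hex(8)
        scrambled_log.append((viewer, alias[video]))
    # Only the aliases are handed over, never the mapping back to real IDs.
    flagged_aliases = {alias[v] for v in flagged_ids if v in alias}
    return scrambled_log, flagged_aliases

# Invented data: three views of two videos, one of which is on the flagged list.
view_log = [("t1", "vidA"), ("t2", "vidA"), ("t1", "vidB")]
scrambled_log, flagged_aliases = scramble_video_ids(view_log, {"vidA"})
print(len(scrambled_log), len(flagged_aliases))
```

The recipient can still count views per video and compare flagged against unflagged content, but can no longer look a video up on the site. The mapping from real IDs to aliases has to stay with whoever does the scrambling, otherwise the exercise is pointless.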

In the End, Does It Matter?

What bothers me most about all of this, however, is that if Viacom simply wants to prove that illegal content is more popular than legal content, can’t they use simple view statistics, or have Google calculate the number of unique viewers per video? I imagine that’s more than enough information to draw such a conclusion.
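As a rough illustration of how little data that comparison needs, here is a Python sketch that counts unique viewers per video and averages them for flagged and unflagged content. The log and the flagged list are invented, and a real analysis would want more than a simple average.

```python
from collections import defaultdict

def unique_viewers(view_log):
    """Count distinct viewers for each video in a list of (viewer, video_id) pairs."""
    viewers = defaultdict(set)
    for viewer, video in view_log:
        viewers[video].add(viewer)
    return {video: len(people) for video, people in viewers.items()}

def _average(numbers):
    return sum(numbers) / len(numbers) if numbers else 0.0

def average_audience(view_log, flagged_ids):
    """Return (average unique viewers of flagged videos, of all other videos)."""
    counts = unique_viewers(view_log)
    flagged = [n for video, n in counts.items() if video in flagged_ids]
    other = [n for video, n in counts.items() if video not in flagged_ids]
    return _average(flagged), _average(other)

# Invented log: viewer "a" watches v1 twice, which should only count once.
sample_log = [("a", "v1"), ("b", "v1"), ("a", "v1"), ("c", "v2")]
print(unique_viewers(sample_log))
print(average_audience(sample_log, {"v1"}))
```

Google could run something like this in-house and hand over only the per-video counts, with no user names or IP addresses leaving the building.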

Clearly, there’s more behind this than meets the eye. Viacom initially wanted to see YouTube’s source code, arguing that YouTube might be treating illegal content in a different way than legal content. Luckily the judge didn’t feel that the company’s source code was as relevant in this discussion.

We’ll see what happens with this data, and how it is analyzed…

A new blog?

I’ve been debating with myself about this blog, and whether or not to actually start it. Sure, blogs are often easy and simple to set up, and writing is as difficult as you want to make it… Why the confusion, then? For those who know me, I already have a personal blog at hellowojo.blogspot.com, which I contribute to quite a bit. In addition, I also use del.icio.us for storing links, data sets, and other information.

So why start this? Right from the start, I want to say that, at least until August, I do not want to write daily or even a few times a week. However, after attending the Networks in Political Science conference this past week, reading a great deal about data visualization, and noticing the amount of attention my own discussions on combining mathematics with international development have gotten, I feel it may be useful to discuss my ideas, explore papers and theories, and look at various data sets in a public forum.

I can do this in my current blog, and I can continue using del.icio.us; in fact, I will continue to do both. The difference is that this blog will focus on my academic journeys and research. Rather than writing regularly and often, I will write when I run into new ideas, and this will be more like a public white brainstorm… And if anyone would like to write or get involved as well, do let me know! As always, you can contact me at wojciech@gmail.com.