Word4Word Philadelphia

I know I haven’t been great about keeping this updated, but rest assured — I did complete the project! I’ll be posting some more of the word clouds, along with some “comprehensive” word clouds (i.e. a word cloud for all of the posts in a category for the entire month). If you want to know when these are available, just leave a comment to that effect and I’ll let you know.

In my last tech post, I started explaining my process for creating these word clouds. Wordle does a lot, but I still needed a mechanism to grab the feed of entries from each day’s Craigslist ads and strip out all the code. Getting each day’s Craigslist feed was a pain — and I had to remember to do it just before midnight, every day, or I wouldn’t get the right data for the word clouds.

Enter the magic of Unix. Unix is a family of operating systems, usually driven from the command line, that runs a lot of web servers. If it makes you think of “Linux” (that free operating system that your local geek might have tried to convince you to try), it’s because Linux was modeled on Unix and works in much the same way. Additionally, Apple’s Mac OS X software is built on top of Unix, although most users won’t ever see it.

As I said, I’m working with Unix at the command line, meaning you have to type everything out: no clicking and dragging with the mouse, or relying on icons, task bars and the like. What you get in exchange are powerful ways to manipulate data.

The starting point for this project is a program called cron. Cron is a scheduler that you can set up to run commands automatically in Unix. So, to begin, I told cron to run the following command every evening at 11:55 PM:

/usr/bin/curl -s
"http://philadelphia.craigslist.org/{w4w,w4m,m4w,m4m}/index.rss"
-o "#1.xml"

There’s a lot going on there, so let’s break it down.

/usr/bin/curl is the command I’m executing. Curl is a command-line tool for fetching data from web addresses (think of a web browser with no window, or a very simple spider). So I’m telling it to go to Philadelphia Craigslist. The -s option tells curl to operate silently, without printing its progress meter or error messages.

After the Craigslist web address, you see a pair of braces filled with four items: w4w, w4m, m4w, and m4m. Craigslist helpfully generates an RSS feed for each category, and it’s in the same location in each directory (index.rss). Curl accepts what’s called URL globbing, a simpler cousin of regular expressions, which are ways to reference certain types of text. In this case, I’m telling curl to make four visits to Craigslist’s website, once for each of those directories.

Following that, you see the -o option, which tells curl to output the data to a file. (By default, Unix loads everything onto the screen unless you tell it what to do with the data.)

Because I want to make a different word cloud for each of Craigslist’s directories each day, I don’t want it to all get output to the same file, though. Helpfully, curl lets you refer back to the braces in the output filename. What #1 means is, essentially, “insert whichever item from the first set of braces is being fetched.” So curl will loop through this process four times, once per category, saving the feeds as w4w.xml, w4m.xml, m4w.xml and m4m.xml.
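To make the loop concrete, here is a rough shell sketch of what curl does with the braces and #1. It only prints each feed address next to the file it would be saved as and makes no network requests; feed_plan is a made-up helper name, not part of curl.

```shell
#!/bin/sh
# feed_plan (hypothetical helper): list each feed URL next to the file
# that curl's "#1.xml" output name would produce for it.
# Nothing is downloaded; this only prints the mapping.
feed_plan() {
    for category in w4w w4m m4w m4m; do
        printf '%s -> %s\n' \
            "http://philadelphia.craigslist.org/${category}/index.rss" \
            "${category}.xml"
    done
}

feed_plan
```

Each of the four lines it prints pairs one category’s feed with its own .xml file, which is the same one-file-per-category result the curl command produces.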

So now we have the four files, but we can’t simply leave them there, because then they’ll get overwritten each day when cron runs the command again. If anything should happen to a word cloud, I want to be able to go back to the original data. So my next two commands to cron are:

/bin/mkdir files/craigcloud/"`date +%d`"
/bin/mv *.xml files/craigcloud/"`date +%d`"/

Mkdir is a Unix command for creating a folder. I’m telling it to create a subfolder of files, where I’ve already created a folder for this project called craigcloud. For the new folder’s name, I’m using command substitution: the backquotes tell the shell to run date +%d and insert its output, the two-digit day of the month. What this means is that each day, a new folder will be created with that day’s number, e.g.:

  • files/craigcloud/01
  • files/craigcloud/02
  • files/craigcloud/03
  • etc.

The mv command then moves all of the .xml files I just created into that day’s folder.
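Put together, the archive step could be sketched as a small shell function. The name archive_feeds and its two arguments are made up for illustration; the mkdir and mv calls mirror the two commands above, with -p added (an assumption, not something from the original commands) so that mkdir does not complain if the folder already exists.

```shell
#!/bin/sh
# archive_feeds (hypothetical helper): move the .xml files from SOURCE_DIR
# into a folder under ARCHIVE_ROOT named for the day of the month.
# Usage: archive_feeds SOURCE_DIR ARCHIVE_ROOT
archive_feeds() {
    src=$1
    root=$2
    day=$(date +%d)                    # two-digit day of the month, e.g. 07
    mkdir -p "$root/$day" || return 1  # -p: no error if the folder exists
    mv "$src"/*.xml "$root/$day"/
}

# A crontab entry along these lines would run a script at 11:55 PM every
# day (the script path is invented for this example):
# 55 23 * * * /bin/sh /home/me/bin/craigcloud.sh
```

For example, archive_feeds . files/craigcloud would move the .xml files in the current folder into a folder such as files/craigcloud/07.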

Now comes the interesting part: what to do with the data itself? When I last left off, I pointed out that putting the data from Craigslist’s RSS feed straight into Wordle created word clouds full of code instead of words:

Word cloud of Craigslist RSS code

So I need to do something to the .xml files I’m pulling in to strip out all of that code, and let the text itself shine through.

It’s Unix to the rescue again! (A sidenote: One nice benefit of grabbing the RSS feed instead of the HTML posts is that the “It’s not OK to contact this poster with commercial services…” line that appears on every post is automatically generated, but doesn’t show up in the RSS feed — meaning we don’t have to strip it out later on.)

Remember above when I said regular expressions allowed you to reference different types of text? The way it’s most often used is for manipulating text — changing one string of text to something else — and that’s exactly what I’ll be doing here. Now this is going to get a little complicated, so bear with me.

/usr/bin/awk '{gsub(/<title><!\[CDATA\[/,"");print}'
files/craigcloud/"`date +%d`"/m4m.xml |
/usr/bin/awk '{gsub(/\]\]><\/title>/,"");print}' |
/usr/bin/awk '{gsub(/<description><!\[CDATA\[/,"");print}' |
/usr/bin/awk '{gsub(/\]\]><\/description>/,"");print}' |
/usr/bin/awk '{gsub(/<br>/,"");print}' |
/bin/sed '/<.*>/d' |
/usr/bin/awk '{gsub(/[[:punct:]]/,"");print}' |
/bin/sed '/ xml/d' |
/usr/bin/tr '[:upper:]' '[:lower:]'
>files/craigcloud/"`date +%d`"/m4m.txt

Yikes! You should know that I came up with this bit by bit — so let’s go through it the same way and see what’s going on.

First, the vertical bar at the end of each line is a pipe: it means “hand the output of this command to the next command as its input.” So in fact these are several commands strung together into one line. Awk, sed and tr are all ways to manipulate text in Unix, and they’re far from the only three. I’ll spare you the details of each, but suffice it to say there are LOTS of ways to push characters around in Unix!

OK. So the code above is saying the following, line by line:

  • Search for the text <title><![CDATA[ that opens each title. Substitute it with nothing (i.e. erase it).
  • Do the same with the ]]></title> that closes each title.
  • Take that same output and look for the <description><![CDATA[ that opens each description. Dump it.
  • Do the same with ]]></description>.
  • Remove all of the line break (<br>) tags.
  • Now delete every line that still contains a tag (anything between angle brackets < >). The reason we didn’t do this to begin with is that it would have thrown away the title and description lines too; by stripping their tags first, those lines survive and the lines of non-text code are dropped whole.
  • Remove all punctuation. Wordle is supposed to do this anyway, but this ensures that an instance of you’re and an instance of youre are treated as the same thing. (Craigslist personals are not a paragon of English usage.)
  • Delete any line containing “ xml” (with a leading space), since it’s code and not content.
  • Transform all the text to lowercase. Again, Wordle should do this, but, you know. Raincoat AND umbrella.
  • Dump all of this into today’s folder as a new file with a .txt extension.

(I should point out that the above command is actually run four times, once for each of the files.)
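The whole cleanup can also be sketched as one function that is run once per file. This is a condensed variant, not the exact commands above: it assumes a single sed call carrying the same substitutions in the same order, and clean_feed and clean_day are hypothetical names.

```shell
#!/bin/sh
# clean_feed (hypothetical helper): read RSS XML on standard input and
# write plain lowercase text. The sed steps, in order:
#   1-4. erase the CDATA wrappers around titles and descriptions
#   5.   erase <br> tags
#   6.   delete any line that still contains a tag
#   7.   strip punctuation
#   8.   delete any line containing " xml"
clean_feed() {
    sed -e 's|<title><!\[CDATA\[||g' \
        -e 's|]]></title>||g' \
        -e 's|<description><!\[CDATA\[||g' \
        -e 's|]]></description>||g' \
        -e 's|<br>||g' \
        -e '/<.*>/d' \
        -e 's/[[:punct:]]//g' \
        -e '/ xml/d' |
    tr '[:upper:]' '[:lower:]'
}

# clean_day (hypothetical helper): run clean_feed on each category file
# in one day's folder, writing a .txt file next to each .xml file.
# Usage: clean_day DAY_FOLDER
clean_day() {
    for category in m4m m4w w4m w4w; do
        clean_feed < "$1/$category.xml" > "$1/$category.txt"
    done
}
```

Feeding the single line <title><![CDATA[Hello, World!]]></title> through clean_feed prints hello world, and clean_day files/craigcloud/05 would process all four files in that day’s folder.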

So there we have it. Each day I get a folder full of just the text from that day’s posts in each Craigslist category. Pretty cool!

Now what would be super-cool is if we could automatically feed this to Wordle and get back a word cloud. Alas, Wordle has no API for perfectly valid reasons and … well my fun a day project probably shouldn’t be entirely automated, right? :)

I’ve been posting the “large” thumbnails of the word clouds, but now that I’m posting several days’ worth I wanted to try to give a sense of the progression and variation from day to day — thus the small thumbnails below. You can still click through to the originals. As before, these are in the order m4m, m4w, w4m, w4w.

Jan. 2 m4m

Jan. 2 m4w

Jan. 2 w4m

Jan. 2 w4w

Jan. 3 m4m

Jan. 3 m4w

Jan. 3 w4m

Jan. 3 w4w

Jan. 4 m4m

Jan. 4 m4w

Jan. 4 w4m

Jan. 4 w4w

Jan. 5 m4m

Jan. 5 m4w

Jan. 5 w4m

Jan. 5 w4w

Jan. 6 m4m

Jan. 6 m4w

Jan. 6 w4m

Jan. 6 w4w

At the Fun-A-Day exhibit, I plan to have them stretched out in four parallel lines, so you can easily compare the four for each day and then chronologically for each section — I realize this is a little harder to do in this blog format.

Looking continues to dominate, and continues to dominate more in the m4m category. Im is also popular. I’ve been filtering out all punctuation, including apostrophes, so most of those are probably from “I’m.” Some might be from IM (instant message) but spot-checking a few of these showed most of the instances as having been the former. It’s worth pointing out that Wordle is automatically removing common English words (like I, a, the) so that they don’t dominate even more than Im is.
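Wordle does the counting for the clouds, but a rough tally can also be made at the command line to double-check which words dominate. The sketch below is not part of the original workflow; top_words is a hypothetical helper, and unlike Wordle it does not filter out common English words.

```shell
#!/bin/sh
# top_words (hypothetical helper): read cleaned text on standard input and
# print the N most frequent words (default 10), one "word count" per line.
# Usage: top_words [N] < FILE
top_words() {
    tr -s '[:space:]' '\n' |   # put one word on each line
    grep -v '^$' |             # drop any empty lines
    sort | uniq -c |           # count identical words
    sort -rn |                 # most frequent first
    head -n "${1:-10}" |       # keep the top N
    awk '{print $2, $1}'       # show the word before its count
}
```

Something like top_words 5 < files/craigcloud/05/m4m.txt would list the five most frequent words in that day’s m4m file.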

Interesting word combos:

  • Jan. 3 w4m: I’m just looking
  • Jan. 3 w4w: Someone black please (black tends to rate in the top tier of words in this section; white is generally frequent in m4m)
  • Jan. 5 m4m: Stats please send pic face pics (coincidental arrangement, but probably not an uncommon ordering of words)
  • Jan. 5 m4w: (Im) know woman well

Other notes:

  • w4w seems to be the most polite section — please generally rates highest here.
  • Bottom and top tend to be about equal frequency in m4m. Neither word ranks high in any other category.
  • Conversely, someone appears relatively frequently in m4w, w4m and w4w, but rarely in m4m.
  • Good tends to rate high in m4w and somewhat less high in w4m, but almost nowhere in w4w and m4m. (Lest you think this is all about moral judgment or good looks, spot-checking seemed to indicate about as much “good time” as “good looking.”)
  • Want seems to be pretty inconsistent, but is usually most frequent in m4w.
  • Life keeps popping up in w4m, and almost nowhere else.

The order I create these in will be the same each day — the order the files are listed in my folders — m4m, m4w, w4m, w4w. Here is the first day’s batch, from January 1 (click to see the originals on Wordle):

Jan. 1 m4m

Jan. 1 m4w

Jan. 1 w4m

Jan. 1 w4w

You can already see some of the interesting comparisons between the four sections. Looking is significantly more dominant in m4m, and the word discreet has a high frequency unmatched in the other sections. Like is present in all four sections, but love is only really visible in m4w and w4m, and is much more prominent in the latter.

I think my favorite for the day is w4m — there are just so many good word clusters in there. Above looking are things: work, family, college and finally life. Below someone: dont go. On the left, guy: sports, degree, talk, spend. On the right: pretty good time.

What are your favorites?

So Wordle gives you a lot of choices when it comes to word clouds.

When you say nothing at all

The Right Left Hand

Expo 86

wildboys

These are the moments

bromine

girl anachronism

!!!!!!!!!!!!!

At first I thought, “Great! I can do a different style each day!” I also considered doing a different style for each of the sections (m4w, w4m, m4m, w4w).

But then I realized that one of the most interesting things would be the changes over time, and in comparing one section to another.

So, after playing around with the settings for a while, I came up with a standard scheme for each of the word clouds. Here’s an example, using the text from the Wikipedia entry for “love” (minus the word love itself, which came out larger than the rest of the word cloud):

Word Cloud: Love (Wikipedia)

I chose a relatively rounded word cloud, because it seemed to be able to fit the most words in. The words are mostly horizontal, for readability. After going through many of their fonts I chose a balloon or comic font. It’s not the type of font I usually choose, but I like how it echoes the overall shape and I think it will work for the voice of these word clouds.

The colors of the words themselves are the “classic” scheme from Wordle. They have some really nice color schemes, as you can see in the examples above, but what I’ve decided to do is change the color of the background each day, but leave the colors of the words the same. Given that, this particular scheme seemed to work well on a variety of backgrounds without looking too garish.

Up soon: The first day’s word clouds!

Each day I am collecting the day’s posts from Craigslist personals and running them through Wordle.net to create a word cloud. But there are dozens of posts every day. How to grab all of them easily?

The first step is getting the web feed. Many frequently updated websites like newspapers and blogs have what’s known as an RSS feed, a constantly updated “stream” of the latest information that’s been posted. This blog itself has two RSS feeds up in the upper-right corner: one for the latest entries, and one for the latest comments.

Craigslist automatically generates an RSS feed for each of its categories. For instance, here’s the Philadelphia women for men category, and here’s the women for men RSS feed.

Many people use RSS feeds to stay up to date on their favorite websites; follow a lot of feeds and you can quickly scan them to see if there are any new posts without having to go to each website to find out. Most people interact with RSS feeds through a feed reader, such as Google Reader, My Yahoo, Bloglines, Net News Reader, or your browser. Most of these parse the feed for readability. Here’s an example of a feed in Safari:

An RSS feed in Safari v3

So, it seems simple enough — just cut and paste from a feed reader, right?

Well, not exactly. In my last post I pointed out the strange word “src” being displayed off on the left. This is part of the code for displaying an image, and because there are many image tags in Craigslist personals, it pops up quite a bit. In coding language, that’s known as cruft — leftover code that’s no longer useful.

Additionally, feed readers usually insert their own additional words into posts, such as Safari’s “Read more…” links above. On top of all that, Craigslist ads have an automatically-generated sentence, “it’s NOT ok to contact this poster with services or other commercial interests,” at the bottom of nearly every post. Leaving that sentence in would cause those words to always show up as the most-frequent.

It’s true that I could copy and paste from a feed reader and then go through and find-and-replace all of these different things with empty space, but that would dramatically increase the amount of time needed to do this every day.

Luckily, there’s another option. Just like web pages, RSS feeds have a layer of code underlying them, known as XML. It looks like this:

RSS feed code

Now, at first glance this looks even worse, because there’s lots of extra stuff! And if we were to create a word cloud directly out of that, it wouldn’t look too great, because so much of the code would be treated as words:

Word cloud of Craigslist RSS code

Ick. Not what we want at all.

In my next tech post, I’ll talk about how I’m filtering all of this out — automagically! — and why it really is easier with the code version.

As I explain above, I’m spending January creating word clouds from Craigslist personals ads. I’ll be collecting data right away, but it might take me a couple of days to get the word clouds right (and I probably won’t publish every one on every day — there have to be some surprises for the exhibit!). So in the meantime, here’s a sample word cloud drawn from Craigslist “best of” postings, which come from every city in every category, not just personals ads.

Word cloud from Craigslist 'best of' posts

You can almost hear the posts in this word cloud. “Like…just…really!” On the left hand side you can see that there are a lot of wants and needs. Many people seem to have issues with big things. There’s more love (lower left) than fucking (upper center).

Off on the left end of the word cloud you can see a cryptic few letters, src. These will be the subject of a more technical post soon, on why I began this art project with some coding.