Wed, 12 Aug 2009

Fun With Unix Commands

Just for fun I decided to see if I could write a long Unix command line to produce a top-25 word cloud from my blog contents.

Here's what I came up with.

should would google those voted
know needs never server technology
seems thought people three products
country speed shuttle looks couple
through problem pretty could every


And here's how I generated it from the Unix command line:
cat *.txt | tr '[A-Z]' '[a-z]' | tr ' i ' ' I '| sed 's/ /\n/g' | grep ..... | \
egrep -v '[\><"#1234567890/|_)(!&-+=:@-]'| sed 's/[{}?;,.]//' | sort | uniq -c | \
sort -n | grep -v -f ~/stopwords.txt | sed 's/$/<\/font>/' | sed 's/[0-9] /">/'| \
sed 's/\([0-9]\)/<font size="\1/' | tail -25 | shuf -n 25 > ~/wordcloud.txt

In English, here is what those commands do:

  • Concatenate the contents of all the text files
  • Convert all the words to lower case
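  • Restore the standalone word "i" to "I" (strictly speaking, tr maps single characters rather than strings, so as written this upper-cases every i)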
  • Substitute every space in each text file with a newline (now every word is on its own line)
  • Apply my own "stop word" filter: only keep words with 5 characters or more (this drops not, me, us, she, ... boring words for a cloud)
  • Drop lines containing non-alphabetic characters (digits and symbols)
  • Strip commas, periods, and other punctuation from words
  • Sort the resulting list
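  • Collapse duplicate lines, prefixing each unique word with its count
  • Sort the counted list numerically, so the most frequent words end up last
  • Filter out any line matching a list of stopwords from Google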
  • Add a closing </font> tag to the word
  • Replace the last digit of the count and the space after it with a "> to close the font size tag (this drops that last digit from the size)
  • Insert <font size=" in front of the first remaining digit of the count to open the tag
  • Take the 25 last (most frequent) lines
  • Randomize the list of 25
  • Write the resulting <font> tags to a file called wordcloud.txt in my home directory
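
If you want to poke at the middle of that pipeline without pointing it at a whole blog, here is a minimal sketch of the counting and tagging steps run on a made-up sentence. The function names (count_words, font_tags) and the sample text are hypothetical, picked only for illustration. It also folds my three tag-building sed calls into a single substitution that uses the raw count as the font size, so treat it as a simplified variation rather than a copy of the command above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the original one-liner.

# Read text on stdin; print "count word" lines, least frequent first.
count_words() {
    tr 'A-Z' 'a-z' |        # lower-case everything
    tr -s ' ' '\n' |        # one word per line (squeezing repeated spaces)
    grep '.....' |          # keep lines with at least 5 characters
    sed 's/[{}?;,.]//g' |   # strip punctuation (same character set as above)
    sort | uniq -c |        # count identical lines
    sort -n |               # numeric sort, so the most frequent word is last
    awk '{ print $1, $2 }'  # normalize uniq's padding to "count word"
}

# Read "count word" lines; wrap each word in a <font> tag sized by its count.
font_tags() {
    sed 's/^\([0-9][0-9]*\) \(.*\)$/<font size="\1">\2<\/font>/'
}

# Made-up sample text standing in for the blog's *.txt files.
sample='The server is down. The server was slow, people said; people left the server.'

# Keep the 2 most frequent words and tag them.
printf '%s\n' "$sample" | count_words | tail -2 | font_tags
# prints:
# <font size="2">people</font>
# <font size="3">server</font>
```

Running each stage on its own like this (dropping stages off the end of the pipe one at a time) is a handy way to see exactly what every tr, grep, and sed is doing to the text.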

Another day, another (geeky) project. ;)




Khan Klatt
