Khan Klatt's blog

Fun With Unix Commands

Just for fun, I decided to see if I could write one long Unix command line to produce a word cloud of the top 25 words in my blog's contents.

Here's what I came up with.

should would google those voted
know needs never server technology
seems thought people three products
country speed shuttle looks couple
through problem pretty could every

And here's how I generated it from the Unix command line:


cat *.txt | tr 'A-Z' 'a-z' | sed 's/ i / I /g' | tr ' ' '\n' | grep ..... | \
grep -Ev '[\><"#0-9/|_)(!&-+=:@-]' | sed 's/[{}?;,.]//g' | sort | uniq -c | \
sort -n | grep -v -w -F -f ~/stopwords.txt | sed 's/$/<\/font>/' | sed 's/\([0-9]\) /\1">/' | \
sed 's/\([0-9]\)/<font size="\1/' | tail -25 | shuf -n 25 > ~/wordcloud.txt
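
The heart of that pipeline is the sort | uniq -c | sort -n counting idiom. Here's a minimal sketch of just that part, run on a made-up sentence instead of my blog files. The count_words function name and the sample text are only for illustration; they aren't part of the command above.

```shell
#!/bin/sh
# count_words: read text on stdin, print "count word" lines, least frequent first.
# (Hypothetical helper name, used here for illustration only.)
count_words() {
    tr 'A-Z' 'a-z' |   # fold everything to lower case
    tr -s ' ' '\n' |   # put each word on its own line, squeezing repeated spaces
    sort |             # bring identical words next to each other
    uniq -c |          # collapse each run of duplicates to "count word"
    sort -n            # order by count, smallest first
}

# Made-up sample text: "the" appears 3 times, "shuttle" twice, the rest once.
sample='the shuttle looks pretty and the shuttle beats the country'
counts=$(printf '%s\n' "$sample" | count_words)

# The most frequent words end up at the bottom of the list.
printf '%s\n' "$counts" | tail -2
```

The last two lines of output are the counts for shuttle (2) and the (3). The exact padding in front of each count depends on your version of uniq.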

In English, here's what those commands do:

Concatenate the contents of all the text files
Convert all the words to lower case
Restore the word i back to I
Replace every space with a newline (now all words are on their own line)
Apply my own "stop word" filter: namely, only keep words with 5 characters or more (not, me, us, she, ... boring words for a cloud)
Drop lines containing digits or symbols
Strip , and . and other punctuation from words
Sort the resulting list
Collapse duplicate lines, prefixing each word with its count
Sort the counted list numerically, least frequent first
Filter out any words that appear in a list of stopwords from Google
Add a closing </font> tag to the word
Replace the space between the last digit of the count and the word with "> to close the font size tag
Insert <font size=" in front of the first digit of the count, giving <font size="[number]
Take the 25 last (most frequent) lines
Shuffle those 25 lines into random order
Write the <font> tags to a file called wordcloud.txt
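
The three tag-building sed steps could also be collapsed into a single substitution with two capture groups. This is a sketch of an alternative, not the command I actually ran. It assumes uniq -c style input (a count, a space, then the word, with optional leading blanks) and uses the count directly as the font size. The wrap_font name and the sample lines are made up.

```shell
#!/bin/sh
# wrap_font: turn a line like "     12 google" into <font size="12">google</font>.
# (Hypothetical helper; assumes "count word" lines as produced by uniq -c.)
wrap_font() {
    # \1 captures the count, \2 captures the word; leading blanks are dropped.
    sed 's/^ *\([0-9][0-9]*\) \(.*\)$/<font size="\1">\2<\/font>/'
}

# Made-up sample input in the same shape as uniq -c output.
tagged=$(printf '%s\n' '     12 google' '      7 shuttle' | wrap_font)
printf '%s\n' "$tagged"
```

Doing it in one pass means the count and the word are matched together, so single-digit and multi-digit counts are handled the same way.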

Another day, another (geeky) project. ;)


All content and photography on this site is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.