Given the recent interest in A Song of Ice and Fire due to the HBO premiere of Game of Thrones, I did up the word cloud below for GoT using Wordle.
Wordle: Game of Thrones

This is pretty interesting, since an ASoIaF newbie can draw the following conclusions just from looking at the word cloud:

  • Lords and Sers abound in the book
  • Jon, Catelyn, Tyrion, Ned, Dany and Arya all play pretty major roles in the book
  • Pycelle? Not so much
  • etc…
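
Wordle gives more prominence to words that appear more often in the input, so conclusions like these are really just frequency counts. As a rough sketch of the idea, the Groovy below tallies and ranks a made-up list of names (not real counts from the book); `words`, `counts` and `ranked` are names chosen purely for illustration.

```groovy
// Made-up sample standing in for the extracted words; not real data from the book.
def words = ['Jon', 'Tyrion', 'Jon', 'Ned', 'Arya', 'Jon', 'Tyrion', 'Pycelle']

// countBy groups equal words together and counts the size of each group.
def counts = words.countBy { it }

// Sort the entries by descending count so the most frequent word comes first.
def ranked = counts.sort { a, b -> b.value <=> a.value }

ranked.each { word, count -> println "$word: $count" }
```

With this sample, Jon ends up at the top of the ranking and Pycelle among the words that appear only once, which is the same picture the cloud gives at a glance.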

For those who are interested in doing this to their favorite PDFs, here’s the Groovy code that I cobbled together to extract the text from a PDF (it uses the iText library):

package extract

import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.PdfTextExtractor

class Book
{
	def path
	def start
	def end
	def outputFileName

    // start - page in the .pdf you want to start extracting from. No point extracting from preface and content pages
    // end - last page to stop extracting. Not interested in the family descriptions, etc
	Book(path, start, end, outputFileName)
	{
		this.path = path
		this.start = start
		this.end = end
		this.outputFileName = outputFileName
	}
}

books = []
books.add(new Book("C:\\book1.pdf", 3, 553, "book1.txt"))
books.add(new Book("C:\\book2.pdf", 3, 596, "book2.txt"))

// Because I only want to capture entities and not ALL the text,
// the regex below is a naive method to capture only words that start with
// an uppercase letter, e.g. "Ned", and have at least 2 characters, as there's
// a good chance that it's an entity. This can be made more sophisticated with time.
def capitalisedWordRegex = /.*?([A-Z][a-zA-Z]+).*/

for (eachBook in books)
{
	def reader = new PdfReader(eachBook.path)
	def wordList = []

	for (i in eachBook.start..eachBook.end)
	{
		def page = PdfTextExtractor.getTextFromPage(reader, i)
		// split() with no arguments breaks the page text on whitespace
		for (eachWord in page.split())
		{
			def matcher = (eachWord =~ capitalisedWordRegex)
			if (matcher.matches())
				wordList.add(matcher.group(1))
		}
	}
	reader.close()

	// Open the output file once and append one word per line
	def outputFile = new File(eachBook.outputFileName)
	outputFile.withWriterAppend { writer ->
		for (eachWord in wordList)
			writer << eachWord + "\n"
	}
	println "Finished $eachBook.path"
}
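
To get a feel for what the naive regex keeps and drops, here is a small standalone sketch that runs the same pattern over a few made-up tokens of the kind `page.split()` might produce. The `extractCapitalised` closure and the sample tokens are hypothetical and exist only for this illustration.

```groovy
// Same pattern as in the extraction script.
def capitalisedWordRegex = /.*?([A-Z][a-zA-Z]+).*/

// Hypothetical helper (not part of the original script): returns the captured
// word, or null when the token has no capitalised run of at least 2 letters.
def extractCapitalised = { String token ->
    def matcher = (token =~ capitalisedWordRegex)
    matcher.matches() ? matcher.group(1) : null
}

// Made-up sample tokens, including punctuation stuck to the words.
def samples = ['"Ned,"', 'Winterfell.', 'sword', 'I', 'The', "Jon's"]
samples.each { token ->
    println "${token} -> ${extractCapitalised(token)}"
}
```

Surrounding punctuation is stripped, so `"Ned,"` comes out as `Ned`, while lowercase words and the single-letter `I` are dropped. Sentence-initial words such as `The` still get through, which is one of the places where the filter could be made more sophisticated.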

Once you have the output file, you can create your word cloud by going to Wordle, or you can download Wordle and generate a picture which you can save and use. Here are some of the clouds I generated.