The first step in this process, presented by Showk, is importing the cables. Luckily, the WikiLeaks cables follow a simple structure that makes this relatively easy. Showk based his work on the cablegate Python code by Mark Matienzo, which scrapes data from the cables in HTML form and converts it into Python objects. For the HTML scraping, the code uses Beautiful Soup, a well-known Python HTML/XML parser that automatically converts the web pages to Unicode and can cope with errors in the HTML tree. Moreover, with a SoupStrainer object, you can tell the Beautiful Soup parser to target a specific part of the document and skip all the boilerplate parts such as the header, footer, sidebars, and supporting information.
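As a rough illustration of the SoupStrainer idea, the following sketch parses only the `<pre>` elements of a page and ignores everything else. The sample HTML, the assumption that the cable text lives in `<pre>` tags, and the function name `extract_cable_text` are made up for this example and are not taken from the cablegate code.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up page: the layout (cable text inside <pre> tags) is an assumption.
SAMPLE_PAGE = """
<html><head><title>Cable viewer</title></head>
<body>
  <div id="header">Site navigation</div>
  <pre>SUBJECT: EXAMPLE CABLE</pre>
  <pre>1. (U) This is the body of a made-up cable.</pre>
  <div id="footer">Footer links</div>
</body></html>
"""


def extract_cable_text(html):
    """Return the text of every <pre> element, skipping the rest of the page."""
    # The strainer tells the parser to build a tree from <pre> elements only.
    only_pre = SoupStrainer("pre")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_pre)
    return [pre.get_text().strip() for pre in soup.find_all("pre")]
```

Calling `extract_cable_text(SAMPLE_PAGE)` yields only the two `<pre>` blocks; the header and footer never make it into the parse tree.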
After parsing, the Python Natural Language Toolkit (NLTK) is used on the text body to bring more structure to the word scramble, with the goal of extracting some topics. The first step is tokenization: NLTK makes it easy to break up a text into sentences and each sentence into its separate words. Then the stem of each word is determined, so that all words are grouped by their root. For example, to analyze the topics of the WikiLeaks cables, it doesn't matter whether the word in a text is "language" or "languages", so both are grouped under their root "languag". A SHA-256 hash value of each stem is then used as a database index.
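The tokenize, stem, and hash steps could look roughly like the sketch below. It makes several assumptions: the talk did not say which stemmer was used, so the Porter stemmer stands in here; the sentence splitter is an untrained `PunktSentenceTokenizer` and the word splitter a simple regular expression, chosen so the example runs without downloading NLTK data; and `stem_index` is a hypothetical helper name.

```python
import hashlib

from nltk.stem import PorterStemmer
from nltk.tokenize import PunktSentenceTokenizer, RegexpTokenizer

# Assumptions: untrained Punkt splitter and a letters-only word pattern,
# so that no NLTK corpora need to be downloaded for this sketch.
sentence_splitter = PunktSentenceTokenizer()
word_splitter = RegexpTokenizer(r"[A-Za-z]+")
stemmer = PorterStemmer()  # assumed stemmer; the original choice is unknown


def stem_index(text):
    """Map the SHA-256 hex digest of each stem to the stem itself."""
    index = {}
    for sentence in sentence_splitter.tokenize(text):
        for word in word_splitter.tokenize(sentence):
            stem = stemmer.stem(word.lower())
            key = hashlib.sha256(stem.encode("utf-8")).hexdigest()
            index[key] = stem
    return index
```

Because "language" and "languages" share the stem "languag", they hash to the same key and end up as a single entry in the index.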
MongoDB, a document-oriented database, is used as document storage for all this data. MongoDB allows records to be inserted and read transparently as Python dictionaries, with the objects serialized and deserialized automatically. Showk then queried the MongoDB database to extract the heaviest occurrences and co-occurrences of words, and converted the result into a graph using the Neo4j graph database.
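To make the occurrence and co-occurrence counting concrete, here is a minimal in-memory sketch that uses only the standard library. The cable records are invented, their field names (`_id`, `stems`) are assumptions, and the counting is done in plain Python rather than through a MongoDB query; with a driver such as pymongo, dictionaries of this shape could be stored with `insert_one`. The weighted pairs it produces are the kind of data that could become edges in a graph.

```python
from collections import Counter
from itertools import combinations

# Invented records, shaped like the dictionaries a document store would hold.
cables = [
    {"_id": "cable-1", "stems": ["embassi", "languag", "visa"]},
    {"_id": "cable-2", "stems": ["embassi", "visa"]},
    {"_id": "cable-3", "stems": ["embassi", "languag"]},
]


def count_cooccurrences(documents):
    """Count in how many documents each stem, and each pair of stems, appears."""
    occurrences = Counter()
    pairs = Counter()
    for doc in documents:
        # Sort the unique stems so each pair has one canonical ordering.
        unique_stems = sorted(set(doc["stems"]))
        occurrences.update(unique_stems)
        pairs.update(combinations(unique_stems, 2))
    return occurrences, pairs


def heaviest_edges(pairs, limit):
    """Return the `limit` most frequent stem pairs as ((a, b), weight) tuples."""
    return pairs.most_common(limit)
```

Each pair count acts as an edge weight between two stems, so keeping only the heaviest pairs is one simple way to limit the size of the resulting graph.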
For the final step, visualizing and analyzing the data, Bilcke used Gephi, an open source desktop application for the visualization of complex networks. Gephi, to which Bilcke is an active contributor, is a research-oriented graph visualization tool that has been used in the past to visualize some interesting graphs, such as open source communities and social networks on LinkedIn. It's based on Java and OpenGL, but it also has a headless library, the Gephi Toolkit.