This lesson covers one of the best tools for getting started in text analysis: Voyant. I am going to use the website version, voyant-tools.org, which at this point is running Voyant 1.0. Version 2.0 is currently available in pre-release. The UI is different, and when finished it promises to be a good replacement for the current version. But for now, we should work with the stable version.
The Voyant website was down for a short period the other day, and that raised an important issue. How do you deal with the dangers of using web tools if you can’t be certain that they will always be up? In the case of open-source (or public license) software, one solution is to download and set up your own copies. We have already used three very useful tools for DH: Voyant, Overview, and Raw. They can all be installed locally, though Overview and Raw are tougher to install than Voyant.
For now, though, we will use the version available on the website. So, head to voyant-tools.org and let’s get started.
- Loading Texts into Voyant
- The Basic User Interface
- Controlling Your Analysis: Stopwords
- Saving and Exporting Data & Graphics
- Keywords: Close and Distant Reading
- Main Tools: Frequencies and Comparisons
When you first head to the website, you will see the logo and an interface for selecting or uploading the texts that you want to analyze. It will look something like the first screen below. You have several options. If you hit ‘Open’ you can choose from a couple of sample corpora (e.g. Shakespeare), though mostly you will not use this option. Next to that is ‘Upload.’ I often use this one to analyze custom texts.
If you hit ‘Upload,’ a new menu will pop up, as in the second picture above. From there, hit the ‘Add’ button, and then select the first text file you want to upload. Voyant can work on a single document, but it is most powerful when you give it a larger corpus of texts (such as the works of Shakespeare), or a larger text broken down into individual files by some organizing principle (such as chapters). Here is the one annoying thing about Voyant’s interface: you can only select one file to upload at a time. So you need to hit the ‘Add’ button, then choose the file, then click okay for every file you want to upload. Or (one of my students realized this) you can compress them into a .zip file and upload that. Voyant will automatically unzip and process the individual .txt files.
But for now, I used the last option for providing texts for analysis: entering the URLs into the main field. You can see that I have copied this list into the field in the last picture above.
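If you have a folder full of .txt files, you do not have to build the .zip by hand. Here is a minimal sketch in Python, assuming all of your texts sit in one folder. The function name `bundle_texts` and the folder and file names in the usage note are my own placeholders, not anything Voyant requires.

```python
import zipfile
from pathlib import Path


def bundle_texts(source_dir, zip_path):
    """Collect every .txt file in source_dir into a single .zip archive.

    Returns the list of file names that were added.
    """
    source = Path(source_dir)
    # Sort so the archive lists the files in a predictable order.
    txt_files = sorted(source.glob("*.txt"))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for txt_file in txt_files:
            # arcname drops the folder path, so the archive holds bare file names.
            archive.write(txt_file, arcname=txt_file.name)
    return [f.name for f in txt_files]
```

Calling something like `bundle_texts("speeches", "speeches.zip")` (hypothetical names) should leave you with one archive that you can hand to the ‘Upload’ dialog.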
For our investigation, I was inspired by a colleague who shared an article from <em>The Atlantic</em> showing a visualization of the State of the Union Addresses ( 1 | 2 ). The authors did a great job highlighting some interesting topics, and having very clear interactive displays. In this tutorial we will conduct some similar analyses in Voyant, and in later tutorials we will even try to add a few new things using other tools.
For expediency, I created a .zip containing the .txt files of every speech, which you can download here.
For a complete list of Voyant tools, and full documentation and help, see Voyant Tools documentation.
Once you have loaded the example in another tab, proceed to the next section and we’ll get started with the basics in Voyant.
Once you have loaded the texts by clicking the link at the end of the last section, you should see a screen like the first one below. In order to familiarize you with the basic functions of Voyant, we are going to expand all of the tool panels, though you can choose which you want open when doing your own projects. Click the expand arrows in the locations indicated in the first picture below, and after that click two additional arrows as shown in the second picture.
Now, all of the tool panels should be expanded. I have superimposed numbers on a screenshot of the fully expanded panel below. I will quickly summarize the function of each section here.
1) Corpus Reader: This is the main window where you can actually see the body of your text displayed. Color-coded bars to the left of the main panel allow you to select between individual documents. If you hover over any word, a pop-up will tell you its frequency count, and if you click on it, it will load that word for analysis in several other windows. There is a search bar and navigation buttons underneath text area 1.
2) Corpus Window: Here, you see a list of the documents that are included in this analysis. ‘Tokens’ denotes the total word count of each document, and ‘types’ is the total number of unique words. ‘Density’ is the ratio of unique words to total words (meaning how much a document varies its word usage). You can sort the documents by any of these categories, and export this data (more on that below).
3) Cirrus Word Cloud: This is one of those famous Word Clouds that have become so ubiquitous. For now, the words appearing are not very interesting, but we’ll fix that in the next section. This can often be a nice visual starting point. If you hover over any word in the Cloud it will give you a frequency count, and if you click it, it will load that word in other panels. *Note, since there are over 200 documents in this corpus, words like ‘the’ appear way too many times (over 14k) for this tool to load all instances. So, you will likely get an error message if you click on any word at this point.
4) Summary View: This panel gives you an overview of the whole corpus. It points out particular documents, words, or other patterns that stand out. This is another good place to start your search into a text.
5) Words in the Entire Corpus: Here you can see the numerical statistics behind every word in the corpus, and you can sort by various categories. You can see the words, their total frequency, and their frequency over the course of the corpus. There are a lot of pages here (616), but we will probably only care about the most common words. There are two heart-shaped buttons at the bottom of this panel (partially cut off in our picture). One heart has a plus sign; this will add all the words you have checked to your ‘Favorites’ list. The plain-looking heart simply switches you back and forth between your favorites list and the list of all words. The favorites view is a great way to pick out words that you want to come back to and track. We will come back to the arrow highlighted in the picture for this section.
6) Word Trends: This isn’t showing much right now, nor are the next two. They will once we clean up our results with a stopword list and start exploring some words. This window is where you see the frequency graph. If you click on a word elsewhere, you will see its frequency here. If you check multiple words in the Words in the Entire Corpus list, or in the Favorites list, they will appear here.
7) Keywords in Context: We’ll return to this, but when you choose a word, you will see every appearance here, and you will also see the text that appears to the right and to the left of that word. This really helps you quickly scan for important or strange uses of a word.
8) Words in Documents: This tool tells you how often a word you have chosen appears in each document. It will also show you a miniature frequency chart. Thus, you can see the frequency breakdown of individual words, document by document.
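To make the tokens, types, and density figures from the Corpus Window (item 2) concrete, here is a rough sketch of how they could be computed by hand. The tokenizer is a deliberately simple assumption of mine (lowercase letters and apostrophes only); Voyant's own tokenizer is likely more careful, so your numbers may not match its numbers exactly.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_stats(text):
    """Return token count, type count, and density (types / tokens)."""
    tokens = tokenize(text)
    types = set(tokens)  # each distinct word counted once
    # Guard against empty documents to avoid dividing by zero.
    density = len(types) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": len(types), "density": density}
```

For the phrase "the cat and the dog" this gives 5 tokens, 4 types, and a density of 0.8, since ‘the’ is counted twice as a token but only once as a type.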
So that is a quick rundown of the user interface. We’ll explore how to use these tools in a moment, but let’s first try to clean up our results by using stopwords.
Stopwords are just lists of words that you want excluded from your results. You will probably want to use one no matter what you are analyzing. Unless you are trying to apply some very specific kind of grammatical analysis, you probably don’t care how many times ‘the’ or ‘of’ appear. There might be other reasons why you want to add words to a stop list. Perhaps you aren’t interested in the appearance of character names, and so you want to add them to filter the characters from your results. To access the stopword list, click the gear icon in the toolbar menu of the “Words in the Entire Corpus” tool, as indicated by the arrow in the first picture below (which is a repeat of the last image).
You will then see the options menu appear, as in the second image above. If you click the drop-down menu icon, you will see a list of pre-designed stopword lists. These will suit many of your needs in the available languages. For us, we are going to choose English (Taporware), as seen in the last image above. Now, make sure to check Apply Stop Words Globally.
You can also customize the list by clicking the ‘Edit Stop Words’ button, which will bring up a menu much like the one below. You can then type in your own words, and even save the list to your local computer, so that you can load it back in for other documents. If you search on Google, you can often find stopword lists online. Once you click OK until you are back at the main screen, you should have a word cloud that looks a little more like this.
Now this is getting a little more useful. If you start clicking some words, you might get some more data. But we still have a problem: there are just too many documents for Voyant to display them all as normal in the exploratory view. So, to get better results, we can sometimes narrow which documents we want our searches to apply to. If you want, you can just CTRL + click a couple of documents to restrict your results to those. Now if you start clicking words, you should see details appear in some of the word panes. Try playing around with it. Once you are done, head to the next section to learn how to export your data.
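If you are curious what a stopword list is doing behind the scenes, the idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the tiny `STOPWORDS` set below is a stand-in I made up, not the Taporware list, and the tokenizer is a simple one of my own choosing.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are far longer.
STOPWORDS = {"the", "of", "and", "a", "to", "in"}


def word_counts(text, stopwords=STOPWORDS):
    """Count words in text, skipping anything on the stopword list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)
```

Adding a character name or any other unwanted word to the set removes it from the counts in the same way that editing the stop list removes it from the word cloud.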
Now, you will likely want to come back to the same text again. To avoid having to re-upload the text and add stopwords again, we are going to generate a URL that allows you to return. To do this, click the icon that looks like a disk in the upper right corner of the entire Voyant window, as in the first picture below.
When you do, you will see a pop-up, as in the middle picture above. For now, leave the first option selected and hit okay. You will then see something like the third screen. From there, you can copy the URL so that you can paste it into a document to save for later use. I often save them in plain .txt files, but using a .doc(x) or anything else is fine.
To try exporting some data, click the disk icon in the Words in the Entire Corpus panel. When you see a pop-up like the first image above, select ‘current data as comma-separated values’ and then click OK. When you do, you will see something like the middle image above. From here, you can simply copy the text out and paste it into a plain text file to save as a .csv. Note one problem, though: it only exported the data for the words currently visible in the panel (only 1 of hundreds of pages). You can go through page by page, but that is a huge pain. You can choose to export all types and counts as plain text, but then that is not in CSV format! This is slightly problematic, and we will deal with other options in a later lesson. Another option is to select the words in which you are most interested, adding them to your favorites list. Then, it is much easier to export the data from your favorites list in just a few clicks.
Now, to save a picture from a visual tool, click the disk icon on the relevant panel. For now, there are only two such panels: the Word Cloud and the Word Trends. Click the cloud’s icon as seen in the picture below, and then choose the last option, ‘a static PNG image.’ Then, click OK.
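If you have the original .txt files on hand, one workaround for the page-by-page problem is to count the words yourself and write the whole table out as a CSV. The sketch below does this outside of Voyant entirely. The function name, the column headers, and the simple tokenizer are all my own assumptions, so the counts may differ slightly from what Voyant reports.

```python
import csv
import re
from collections import Counter


def export_counts(text, csv_path):
    """Write every word and its count to csv_path; return the number of types."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "count"])  # header row (names are my choice)
        # most_common() orders the rows from most to least frequent.
        writer.writerows(counts.most_common())
    return len(counts)
```

The resulting file opens directly in a spreadsheet program, with one row per word rather than one page at a time.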
For this portion, we are going to switch to a different corpus for examples. This one, the Odyssey, does not have so many documents. That way, we can view the entire corpus as we try to use the tools. Here is the URL for Voyant, pre-loaded with an English translation of the Odyssey (all due warnings about using a translation!), with stopwords pre-configured. Expand all of the panels to get a view like the one below.
Okay, so, we are going to flag some keywords for analysis. For the sake of this tutorial, we will only flag words from the first few pages. The words we are going to flag, by page, are:
- House, Ulysses (note, this is Odysseus, the lamentable choice of the translator), men, man, son, home, suitors, ship, father, sea, gods, people, heaven, jove, minerva, wine, water, countr
- Penelope, stranger, Ithaca, mother, wife, land, daughter, town, return, bed, god, women
- Alcinous, sons, husband, antinous, goddess, neptune
If you click to see your favorites list from the previous section, you can select some related topics to see their trends. Through this, we will explore the other tools. If you select Ulysses and Suitors, you should see a picture like the first one below. Both terms appear in the frequency visualization tool in the upper right, and a breakdown of the appearances of those terms in different documents appears at the bottom right. At first, the middle-right panel will be empty, but if you click on any document appearing in the bottom right, it will then show you that keyword in context for all appearances in that document.
Now, note that underneath the chart there is a button that says ‘relative.’ That means that this chart is displaying relative frequencies of appearance. You can change the chart to display raw counts instead. With just two terms, there won’t be much difference, but in other cases this can make a big impact. In the middle screen, you can see what happens if you click ‘collapse terms.’ This is very useful if you have a number of terms for essentially the same thing. For example, in the Iliad, there are several terms for the Greeks (Argives, Achaeans, Danaans, etc.), but here you can combine them. There is also the ‘segments’ button. In certain cases, this allows you to view variations on that word (e.g. ‘Congress’ might yield congress-men, -women, -building, -ional deadlock).
These tools alone are extremely powerful. You can search for terms you want, and quickly look around your text for the appearance of different terms. There is no magic bullet with this, or any other tool. Some of the trick is slowly homing in on a good visualization. You might need to add words to your stopword list. You might need to think carefully about which words to cross-compare.
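The difference between raw and relative frequency, and the effect of collapsing terms, can be sketched as follows. Note the assumptions: I do not know exactly how Voyant scales its relative figures, so the per-10,000-words scaling here is my own choice, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def raw_frequency(tokens, terms):
    """Count tokens matching any of the terms, treating them as one collapsed term."""
    wanted = {t.lower() for t in terms}
    return sum(1 for tok in tokens if tok in wanted)


def relative_frequency(tokens, terms, per=10_000):
    """Scale the raw count by document length (occurrences per `per` tokens)."""
    if not tokens:
        return 0.0
    return raw_frequency(tokens, terms) * per / len(tokens)
```

Passing several names at once, such as `["Argives", "Achaeans", "Danaans"]`, treats them as a single term, which is the same idea as ‘collapse terms.’ The relative figure matters most when documents differ in length, since a long document will rack up more raw hits for almost any word.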
At this point, you’ve also got a good list of passages in which words appear. That allows you to begin the traditional process of humanistic interpretation: the close reading. I suggest using the middle-right panel to look at how these words are used in context, and to then view them in the main text panel. There are often details about how words are used that simply cannot ever be captured by pure visualization.
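The kind of listing shown in the middle-right panel, a keyword with the words to its left and right, is simple enough to sketch yourself. This is a minimal illustration under my own assumptions (a basic tokenizer and a fixed window of words on each side), not a description of how Voyant builds its display.

```python
import re


def keyword_in_context(text, keyword, width=5):
    """Return (left context, keyword, right context) for every appearance."""
    tokens = re.findall(r"[A-Za-z']+", text)
    keyword = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            # Take up to `width` words on either side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits
```

Scanning a list like this is only the first step; the close reading still has to happen in the full text.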
So, that’s it! Or rather, that’s just the end of all I can cover briefly. It’s only the start of what can be done with this suite. There is already a plugin so that you can integrate your own copy of Voyant with Omeka (a WordPress-like platform designed to create online museum-like exhibits), embedding the displays within your Omeka pages.
There are more powerful text analysis packages out there (especially R, as well as several Python modules), but all are far less user-friendly. Voyant is always a great place to start, whether you are heading to a powerhouse package, or just want to conduct a largely traditional analysis of a text. You can quickly find what things you want to look for first, and where they are located.
So, find a text or texts of your own choosing. Either upload it in .txt form, or add in a list of URLs. Try to do so in a way that either gives you a corpus of many different texts in different documents, or has one document split up into many files in numerical order, like this. That way, you can produce charts that show flow over time (or differences between documents).
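If your text is one long file, splitting it into numbered files does not have to be done by hand. Here is a minimal sketch, assuming each section begins with a heading line such as "BOOK I" or "BOOK II". That heading pattern, the function name, and the file naming are all my assumptions, so adjust the pattern to whatever your own text uses.

```python
import re
from pathlib import Path


def split_into_files(text, out_dir, heading_pattern=r"^BOOK [IVXLC]+\b.*$"):
    """Split text at each heading line and write numbered .txt files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Find the position where each heading line starts.
    starts = [m.start() for m in re.finditer(heading_pattern, text, flags=re.MULTILINE)]
    written = []
    for n, start in enumerate(starts, start=1):
        # Each section runs up to the next heading, or to the end of the text.
        end = starts[n] if n < len(starts) else len(text)
        # Zero-padded names (01.txt, 02.txt, ...) keep the files in numerical order.
        path = out / f"{n:02d}.txt"
        path.write_text(text[start:end].strip() + "\n", encoding="utf-8")
        written.append(path.name)
    return written
```

Anything before the first heading (a preface, for example) is left out, and the zero-padding means that file 10 sorts after file 09 rather than after file 1.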
Choose your text or texts however you wish (for academic purposes), and feed them to Voyant. Work at them, try to find interesting trends, or examine certain topics within your texts.