Mass Reading with Overview

  • Signing up for Overview & Upload Data
  • Using the Tools
  • Diving into the Results
  • Exporting and Preparing Data

For this tutorial, we are going to analyze every State of the Union address since 1790, the texts of which can be found on ThisNation. I created a series of .txt files, one for each address, and compressed them as a .zip file. Download the file here, place it in its own folder, and then unzip it. Directions for unzipping for Windows are here, and for OSX, here.
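If you would rather unzip the archive with a script than with your operating system's tools, a minimal Python sketch follows. The function name `unzip_corpus` and the file and folder names in the usage note are my own placeholders, not anything supplied by Overview or by the download.

```python
import zipfile
from pathlib import Path


def unzip_corpus(zip_path, dest_dir):
    """Extract everything in zip_path into dest_dir and list the .txt files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Return the extracted text files in sorted order for a quick sanity check.
    return sorted(dest.rglob("*.txt"))
```

Calling something like `unzip_corpus("addresses.zip", "Text Files")` (both names hypothetical) should leave you with one folder of .txt files ready to upload.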

Head to OverviewDocs.com, and sign up for an account. Use an alternate email if you wish to avoid any risk of spam. Once you are signed in, you should see your home screen, which should look something like figure 1 below. Click on ‘Upload files’ and you will be taken to an interface where you can upload your texts. If your .zip file is still in the same folder as your uncompressed .txt files, move it to a different folder. Now, click ‘Add all files in a folder’ (fig. 2), navigate to the folder with your texts, and choose it. You should see a flurry of action as it uploads each file in turn (fig. 3).

When it is done loading, it will ask you to set the document name and options to complete the process (fig. 4). In this case, I named it “States of the Union.” I left all other options as default. I would note that you can add fields at this stage, if you want to use this to record specific information about the documents. In hindsight, two values we end up tracking in this demonstration (president and political party) might have worked well as fields in this case (instead of tags). You will then have to wait for some time while Overview processes your texts for analysis (fig. 5). When complete, the texts will load in analytical mode.

Once the documents are loaded, you will likely see a Word Cloud (fig. 6), or perhaps some other tool. On the right side of the screen you will find every document in our corpus listed. On the lower left, you will see whichever tool is active (in this case a Word Cloud). Above that you will see tabs for any other open tools, as well as a tab to ‘Add view.’ Above that, you have a search bar. If you type anything into the search, it narrows the range of documents on the right to those matching the terms and updates the analytical tool to reflect the filtered corpus. If you click a term in the Word Cloud, you will see the range of documents containing that key word. To explore some of the other (and more useful) tools, click on the ‘Multisearch’ option. If you do not see the Multisearch or Word Cloud, you can add it by clicking ‘Add view’ and then choosing your desired tool. The Multisearch (fig. 7) is simply a variant of the general search above. Much more powerful than that is the ‘Tree’, or the keyword dendrogram (fig. 8).

The keyword tree is one of the most powerful aspects of Overview. At the top level, you see key words that are found in all, most, or some of the documents in the corpus. But, at each level below, Overview attempts to identify and cluster subgroups with increasing specificity (fig. 8). If you click a grouping, it will expand to show you the subgroups below. It will also restrict the range of documents shown on the right to only those that are within the group. At the second level you might see that there are some broad clusters of language. But, as you delve down you will increasingly find documents clustered together that have much more in common in their use of words. Combined with the ability to filter items and tag them, we can explore a series of texts very quickly. We will return to that shortly. For now, click ‘Add view’ and select ‘Xperimental: Word Co-occurrence’ (fig. 9).

When this tool loads, you will see a network of some of the most common key words (fig. 10). Overview automatically uses a stopword list to filter out unnecessary words (e.g. ‘the, and, but…’). Larger words appear more often, smaller words less so. Words are linked together if they commonly appear in the same documents. The bars connecting them are thicker (and they have a stronger gravitational pull) if they have more co-occurrences. Using this tool, you might find that words are connected in certain (and sometimes unexpected) ways. This tool visually points you to a number of strong interrelated correlations over a large set of texts. Interpreting the meaning of that correlation is another matter, of course. Sometimes manipulating the tool (you can drag and drop words) can help you identify what webs of other words are most strongly connected. If you click a keyword, Overview restricts the range of documents visible on the right. This can be one way of finding documents to tag (more on that below).
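To make the idea of document-level co-occurrence concrete, here is a small Python sketch. It is not Overview's implementation, which I have not inspected. The stopword list, the sample sentences, and the function name `cooccurrence_counts` are all invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# A tiny illustrative stopword list; real lists are far longer.
STOPWORDS = {"the", "and", "but", "of", "to", "a", "in"}


def cooccurrence_counts(documents):
    """For each pair of words, count how many documents contain both."""
    pair_counts = Counter()
    for text in documents:
        # Unique lowercase words in this document, minus stopwords.
        words = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        # Sorting gives each pair one canonical (alphabetical) order.
        for pair in combinations(sorted(words), 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy documents, not real addresses.
docs = [
    "The militia met the enemy",
    "The enemy and the militia of Spain",
    "Jobs and the budget",
]
counts = cooccurrence_counts(docs)
print(counts[("enemy", "militia")])  # 2
```

A higher count for a pair corresponds to a thicker connecting bar in the graph, under the assumption that the tool weights links by the number of shared documents.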

Next, add another view. This time, choose the ‘Entities’ tool. This tool (fig. 11) attempts to scan the texts for known ‘entities’ (cities, countries, companies, etc.) and provide statistical counts. This works best for English language documents. Playing with the options can produce some nice results, though it is far from perfect (figs. 12 & 13). Again, when you click on various entities, you can get a list of the documents containing them.

Finally, let’s look at one of Overview’s very nice exploratory features, the document view. Many of the tools on the left side of the screen have various means of restricting the range of documents visible on the right side. This is so that at any point you can easily explore all the documents pertaining to your exploration. If you click on any document on the right, you will see something like figure 14. This way, you can efficiently read documents that match your searches. You might wish to see whether they are of interest to you. Just as likely, you might want to look into why a number of documents all contain related key words. Quickly glancing at the documents can often tell you if their similarity is superficial or more serious. In addition to seeing the document itself, you can also view a plain text version of it (fig. 15), which is redundant in this case as we uploaded plain .txt files.

With these tools alone, Overview is a great way of burrowing through a gigantic amount of text to look for interesting features. All of these tools become more powerful when combined with Overview’s tagging feature, which you can access by clicking the ‘Tags’ button in the upper right corner (fig. 16). You can tag a document for any reason. Often, when you conduct a search that refines your results to a narrower set of interesting documents, you might want to tag them all, so that you can cross reference these with later searches and other tags. You might want to tag documents because they have key features or topics. Or you might want to tag them with information you already know about them, as I did in figure 16, where I tagged the document by the president who delivered it, Washington (fig. 17).

I went through the remainder of the documents and tagged them by the president. If I had imported these texts via a CSV, I could have automatically imported these tags as well. Now that I have enriched these documents with metadata, I can start to get more interesting results out of our tools. My first step was back to the document tree. The highest levels of the tree host a diverse bunch of topics. But, if you traverse down the tree, you can home in on specific issues in order to locate the relevant documents. Clicking on one of the subgroups on the bottom row (fig. 18), I found documents typified by “tonight, jobs, Americans,” and to a lesser degree, a slightly different topic, “wealth, Iraq, Iraqi, terrorists, terror.” My results were restricted to just 14 documents, and I could also see which presidents (primarily H.W. Bush) delivered them.

Narrowing the results to a further sub-branch of 6 documents, all are typified by “tonight, jobs, Americans, Afghanistan,” and most contain “Iraq, terrorists, terror, Iraqi” (fig. 19). The summary results on the right show that each speech contains both economic and military issues. Iraq and Afghanistan dominate H.W. Bush’s addresses, whereas Reagan’s and Obama’s speeches are much more heavily economic in nature. Bush and Obama talk about both Iraq and Afghanistan. Reagan mentions only Afghanistan, and only in a passing statement concerning the Soviet-Afghan War. Narrowing further continued to refine my results (pruning away first Reagan, then Obama).

Next, I jumped over to another branch clearly pertaining to economic issues: “program, economic, budget, help, today.” Going down another level, most of this group contained “economic, help, Americans, budget, today,” and some contained “tonight, jobs, programs, percent” (fig. 20). Interestingly, the documents in this group begin in 1902, with Teddy Roosevelt, and continue fairly steadily until the first year of H.W. Bush’s presidency. While economic issues have been discussed by presidents both before and after, the particular patterns of language employed in this group are strangely specific to the 20th century. Going down yet another level, I find a group where all speeches contain “economic” and “farm”, and some contain “construction work, depression, purchasing power, and veterans bank” (fig. 21). Coolidge, Hoover, FDR, and Truman each contribute one or two speeches to this group.

Now let’s explore how the word collocation network graph is enhanced by the use of tags. The first word I clicked on was “savages” (fig. 22), and this revealed a number of speeches by every president from Washington to Van Buren, and then sporadically through Grover Cleveland. By dragging the term “savages” around, I can quickly see that it is most strongly connected with “enemy,” “militia,” and “Spain.” One useful way to get at an issue is to narrow your results to only documents containing two or more terms that you see connected on the collocation graph. If I search for “Spain AND Savages” I get 10 results, including Monroe, John Quincy Adams, Jackson, Polk, Pierce, Arthur, and Cleveland. You can also use “AND NOT” to filter out some terms (e.g. “tonight”).
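The logic behind an “AND” / “AND NOT” search can be sketched in a few lines of Python. This is only a whole-word illustration, and Overview's own matching rules may well differ. The function names and the three toy documents are made up for the example.

```python
import re


def words_in(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(documents, all_of=(), none_of=()):
    """Names of documents containing every all_of term and no none_of term."""
    hits = []
    for name, text in documents.items():
        words = words_in(text)
        has_all = all(term.lower() in words for term in all_of)
        has_excluded = any(term.lower() in words for term in none_of)
        if has_all and not has_excluded:
            hits.append(name)
    return sorted(hits)


# Toy documents, not real addresses.
docs = {
    "doc_a": "Spain and the savages",
    "doc_b": "Spain tonight and the savages",
    "doc_c": "Jobs tonight",
}
print(search(docs, all_of=["Spain", "Savages"]))                      # ['doc_a', 'doc_b']
print(search(docs, all_of=["Spain", "Savages"], none_of=["tonight"]))  # ['doc_a']
```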

When I find a particular topic via this (or any other method), I might want to use the results I have gathered in order to generate new tags. Although I did not do so, I could have clicked the ‘add tags’ button at the top right, and created a new topic titled “Anti-Spanish Sentiment,” which would then have been applied to those 10 documents. Repeating this would allow me to build up a list of “topics”, my interpretations of the results, which potentially unlocks new questions.

Changing back to an economic issue, if I click on “depression,” I can now see the range of presidents who have used the word, from Benjamin Harrison through Obama (fig. 23). If I want to narrow my results further, I can cross-search depression with another tag, in this case “Bush (George).” The result nets me one speech, from 1989 (fig. 24).

Returning to my entity locator now produced much more interesting results. Clicking around gave me a taste of which presidents discussed which locales. The visual above (fig. 25) represents a later stage, when I returned after having added an additional tag representing the political party of the president who delivered the speech. Now, I could also see if different parties discussed different places.

Even the much maligned word cloud is much more interesting thanks to Overview’s tag system (fig. 28). Highlighting “war” now tells me which presidents addressed the issue (or any other information I might have chosen to tag).

Although I have not highlighted it because it is an advanced topic, Overview has a wonderful Regex (Regular Expressions) search feature. Regular Expressions allow you to run extremely complex pattern searches. For example, you could search for a term, but only if it is found inside of parentheses, no matter how far away those parentheses are. In figure 29, I have (for no reason) used a formula which limits results to documents that have somewhere within them a block of text containing numbers from four to six. I also have not discussed the ‘Custom… New Visualization’ option, as this is also advanced. But, you can embed tools from other web apps (such as your own) into your Overview project (fig. 30).
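As a rough illustration of the parentheses example, here is how such a pattern might look in Python's `re` module. Overview's regex dialect may not match Python's exactly, so treat the pattern as a sketch to adapt. The helper name `inside_parentheses` and the sample sentences are my own.

```python
import re


def inside_parentheses(term):
    """Build a pattern matching term only between '(' and the next ')'."""
    # [^()]* allows any run of non-parenthesis characters on either side,
    # so the parentheses can be arbitrarily far from the term.
    # Nested parentheses are not handled by this simple pattern.
    return re.compile(
        r"\([^()]*\b" + re.escape(term) + r"\b[^()]*\)",
        re.IGNORECASE,
    )


pattern = inside_parentheses("treaty")

# Toy sentences, not quotations from any address.
inside = pattern.search("The agreement (see the treaty of last year) holds.")
outside = pattern.search("The treaty holds (as noted).")
print(inside is not None)  # True
print(outside is None)     # True
```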

Overview has a mixed exporting feature (fig. 31). It does offer you a nice range of options. In addition to using CSV format, Overview embeds tags in your Excel file that can overcome many issues caused by Excel’s default interpretation method. You can select if you want to export the entire corpus, or just your currently found documents. The ability to export subsets of your collection based upon searches makes it easy to create different data sets for visualization purposes. The downside is that Overview’s tagging export formatting does not work well with many visualization packages. For most purposes, you will want to choose the ‘Spreadsheet (one row per document)’ option. There may be times when you want to use the “each tag in its own column” option, but I’ve found that its formatting can cause issues.

As I mentioned above, I later went back and marked every speech with the affiliation of the president who delivered it. So, I decided to put every tag in one column (fig. 32). Some visualization options (such as Palladio) can handle atomized lists (multiple items in one column). But, many do not. As I stated above, Overview’s attempt to export each tag to its own column has issues. In order to do it successfully, I needed to split the column into multiple columns manually. But first, I had to clean up the “id” field.

Overview uses each file’s name and path to generate the “id” value of each document. For us, this is unnecessary, and turns useful IDs (e.g. “1940”) into clutter (“Text Files/1940.txt”). This is especially a barrier if I should want to join this table to another table which contains further data about these documents. Figure 33 shows the first steps I used to clean the file. First, I clicked on a particular cell in the “id” column and copied the “Text Files/” portion of the cell. Then, I clicked column header key “A” (right above “id”), which auto-selects the entire column. This restricted my next action to only that column. Then, I ran find/replace, clicked over to the ‘replace’ tab, and pasted “Text Files/” into the “Find what” field. Most importantly, I left the “Replace with” field completely blank. When I hit “Replace All” (fig. 34), Excel auto-deleted “Text Files/” from every line by replacing it with nothing. Then, I proceeded to repeat the routine in order to remove the “.txt” suffix at the end of every “id.”
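If you prefer to script this cleanup rather than use Excel, the same find-and-replace can be sketched in Python. The function name `clean_id` is hypothetical, and the default prefix assumes your folder was named “Text Files” as mine was.

```python
def clean_id(raw_id, prefix="Text Files/", suffix=".txt"):
    """Strip the folder prefix and file extension from an exported id."""
    if raw_id.startswith(prefix):
        raw_id = raw_id[len(prefix):]   # drop the leading folder path
    if raw_id.endswith(suffix):
        raw_id = raw_id[:-len(suffix)]  # drop the trailing extension
    return raw_id


print(clean_id("Text Files/1940.txt"))   # 1940
print(clean_id("Text Files/1790b.txt"))  # 1790b
```

Unlike Excel's “Replace All,” which removes the text wherever it appears in a cell, this version only strips it from the start and end of the value, which is a little safer.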

In the far right column of figure 34, I deleted the “text” column. This was merely because I did not have use for it in this particular case. But, you may certainly want to keep the text. Each tag in the “tags” column is separated by a comma (known as a delimiter). I used this delimiter to break this column up. First, I clicked the “B” column key (above “tags”) in order to select the entire column. Then, I clicked over to the “Data” menu and selected the “Text to Columns” option (fig. 35).

Subsequently, the “Convert Text to Columns Wizard” pops up, taking you through the next few steps (fig. 36). As my tags are separated by a delimiter, I wanted to use “delimited” as my first option, and then “comma” in the second screen. Finally, in this case, I left “General” as the data type for every field and simply clicked “finish.” Once completed, the party affiliation of the president was directly underneath “tags,” while their name was in the next column over. At that time, I manually changed the column headers to more appropriately describe the data (fig. 37, cols. 1 & 2).
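The same comma split can be sketched in Python for anyone working outside Excel. The function name `split_tags`, the new column names, and the placeholder tag values are assumptions for illustration. The sketch also assumes every row lists its tags in the same order (party first, then president), as mine did.

```python
def split_tags(row, tag_field="tags", new_fields=("party", "president")):
    """Replace one comma-delimited tags value with one column per tag."""
    parts = [part.strip() for part in row[tag_field].split(",")]
    # Pad with empty strings so rows with fewer tags still fill every column.
    parts += [""] * (len(new_fields) - len(parts))
    # Copy every other column unchanged, then add the new ones.
    # Any tags beyond len(new_fields) are dropped by zip().
    result = {key: value for key, value in row.items() if key != tag_field}
    result.update(zip(new_fields, parts))
    return result


# Placeholder values, not real export data.
row = {"id": "1940", "tags": "Party A, President A"}
print(split_tags(row))
# {'id': '1940', 'party': 'Party A', 'president': 'President A'}
```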

I also decided to add another column, “Year,” which I would create from the values in the ID column. Nearly every speech’s value is simply the numeric value of the year. But Washington’s 1790 speech is in two parts. While fine for an ID, ‘1790b’ is not a year any graphing application would understand. As most speeches’ year matches its ID, I began by writing the formula (“=A2”) in the first row of the “Year” column so that it equaled the value of the corresponding “ID” cell (fig. 37, col. 3). By double-clicking the small green box at the bottom-right of the cell in which I wrote the formula, I auto-extended the formula down the rest of the column.

Now, every cell in “Year” contains a formula which displays the value located in the “ID” column (fig. 38, col. 1). However, I could not make manual changes to any of the “Year” cells, since they are merely mirrors of the “ID” cell (the resulting values are not truly present in the cells). What I wanted to do was to copy out the resulting values (rather than the functions which generated them) and paste them right back over the formulae. This replaces the functions with their own hard results. In order to do this I right-clicked the “D” column key (above “Year”) and chose copy. I then right-clicked the “D” column again, and instead of choosing paste, I selected one of the options under “Paste Options” labeled “values.” On the PC, the icon looks like a little clipboard with “123” at bottom (fig. 38, cols. 2 & 3). On OSX, I had to choose “Paste Special” and then select “Unformatted Text.” Now that the “Year” column is filled with editable values, I could change “1790b” into “1790” so that if I tried to put any data into a timeline, my results would be ordered correctly.
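For comparison, deriving the year from a cleaned ID takes only a few lines in Python. This sketch assumes every ID begins with a four-digit year, as mine did. The function name `year_from_id` is hypothetical.

```python
import re


def year_from_id(doc_id):
    """Return the leading four-digit year of an id, or None if there is none."""
    match = re.match(r"\d{4}", doc_id)  # match only at the start of the id
    return int(match.group()) if match else None


print(year_from_id("1940"))   # 1940
print(year_from_id("1790b"))  # 1790
```

Because the result is a plain number rather than a formula, there is no need for the copy and paste-values step.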

The tags that I ended up encoding were not the most exciting for this stage of the process (to demonstrate data cleaning), because I did not add many texts in Overview. Had I done so, this would prove more useful. Should you do so, learning how to transform your data to visualize it effectively will be important.

That’s it! Using Overview is a non-linear process of exploration. I used only a few methods by which I could explore a corpus, and shallowly at that. As you come to be familiar with Overview, you will develop more strategies for putting your finger on the pulse of important semantic relationships. As with any digital method, you will also get more out of Overview as you come to know more about the corpora you are analyzing.

Good luck!

 


