Text Analysis

Alan Liu
April 8, 2012
A red star indicates tools that combine power (advanced, multiple, or flexible features) with ease of use. The star tends to be reserved for tools so capacious in their multiple or add-on uses that they might also be called "platforms" (e.g., TAPoR, Many Eyes, or Google Earth). (Other tools may be more powerful or easier to use, but they do not balance the two.)


CLAWS Part-of-Speech Tagger (Part-of-Speech [POS] Analysis Tool; online Web service)


Results of CLAWS POS Tagger as applied to Wallace Stevens's "The Idea of Order at Key West"

Tag set key for CLAWS

CLAWS is a grammatical tagger that analyzes words in a text by part of speech. Based on the approximately 100 million words of the British National Corpus (BNC), CLAWS returns each word in a text passage listed by line number with a tag identifying its part of speech. (The Web site makes available keys to the tag sets.) Punctuation is also identified. The free CLAWS online Web service limits the user to submitting 300 words of text at a time for analysis; the full version requires an institutional or individual license.

Crawdad (Text and Content Analysis Tool; downloadable for Windows system only)



Text-analysis software that models the content and relations of material not just by equating word frequency with importance but through "natural language processing" and "a network model of text" reflecting "linguistic theory concerning how people create coherence in their communication." Interfaces for analyzed content include a tree-flow visualizer, a browser that highlights keywords, a "comparator" that finds common and unique material in two texts, a classifier that clusters texts according to similarity, and a sequencer that "exports keyword metrics for further secondary analysis." (This is a high-cost software tool, but there is a free 30-day trial.)
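
The idea behind a "comparator" that finds common and unique material in two texts can be illustrated at the simplest level with set operations on vocabulary. This is only a minimal sketch under stated assumptions: it is not Crawdad's network-based method, and the function name `compare_vocabulary` and the sample sentences are hypothetical, invented for illustration.

```python
import re

def compare_vocabulary(text_a, text_b):
    """Return the words shared by both texts and the words unique to each."""
    # Lowercase and keep alphabetic tokens (with apostrophes) only.
    words_a = set(re.findall(r"[a-z']+", text_a.lower()))
    words_b = set(re.findall(r"[a-z']+", text_b.lower()))
    return {
        "common": sorted(words_a & words_b),
        "only_a": sorted(words_a - words_b),
        "only_b": sorted(words_b - words_a),
    }

# Hypothetical sample texts, invented for illustration only.
comparison = compare_vocabulary("the sea and the sky", "the sky and the shore")
print(comparison)
```

A real comparator would work on phrases and weighted keywords rather than bare word sets; the sketch shows only the common/unique partition.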

Google Ngram Viewer (Statistical graphing of word occurrences over time from the Google Books corpus)


Statistical occurrences of the words Book, Newspaper, Film, and Television from 1800 to 2000

Tool for exploring the comparative occurrence frequencies over time of words and phrases in selected corpora from Google Books (e.g., American English, British English, British fiction, French, German, Spanish).  Users can also download the raw datasets.  For an article explaining the tool and its implications, see Jean-Baptiste Michel, Erez Lieberman Aiden, et al., "Quantitative Analysis of Culture Using Millions of Digitized Books" (2011).
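
For readers who download the raw datasets, the kind of per-year relative frequency that such a graph plots can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: the function name `relative_frequency_by_year` and the toy corpus of (year, text) pairs are hypothetical, and the real datasets use their own file format.

```python
import re
from collections import Counter

def relative_frequency_by_year(documents, word):
    """Return {year: occurrences of `word` / total tokens in that year}."""
    hits = Counter()
    totals = Counter()
    target = word.lower()
    for year, text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token == target)
    # Skip years with no tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Hypothetical toy corpus, invented for illustration only.
toy_corpus = [
    (1900, "the book and the newspaper"),
    (1900, "a book"),
    (1950, "the film and the television and the book"),
]
frequencies = relative_frequency_by_year(toy_corpus, "book")
print(frequencies)
```

Dividing by the yearly token total is what makes years comparable, since the number of digitized books differs greatly from year to year.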

iResearch Reporter (creates digests organized by topic from Web or local text searches)


iResearch Reporter digest of a Web search related to President Obama, showing topic classifications, first lines of articles, and links to full documents

Tool for searching the Web or local files that produces topically organized digests (first lines of documents plus links to the full documents) within selectable methods, perspectives, and areas of search. Free to try for one day; pricing is based on period of use, ranging from 2 to 30 days.

Linguistic Inquiry and Word Count (Text Analysis Tool; online)


LIWC text analysis

"LIWC calculates the degree to which people use different categories of words across a wide array of texts, including emails, speeches, poems, or transcribed daily speech. With a click of a button, you can determine the degree any text uses positive or negative emotions, self-references, causal words, and 70 other language dimensions." This is an expensive software tool, but there is a simple online tool ("Try LIWC Online") where a user can submit a piece of text and see a count of self-references, social words, emotions, etc. in that text compared to counts of those words in personal texts and formal texts.

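The dictionary-based counting that LIWC describes (the degree to which a text uses words from different categories) can be sketched as follows. This is a minimal illustration under stated assumptions: the three categories and their word lists are a hypothetical mini-dictionary, not the actual LIWC dictionary, and `category_percentages` is an invented name.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; NOT the real LIWC categories or word lists.
CATEGORIES = {
    "self_reference": {"i", "me", "my", "mine", "myself"},
    "positive_emotion": {"happy", "love", "good", "glad"},
    "negative_emotion": {"sad", "hate", "bad", "afraid"},
}

def category_percentages(text):
    """Return each category's share of all tokens in the text, as a percentage."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, words in CATEGORIES.items():
            if token in words:
                counts[name] += 1
    total = len(tokens)
    return {name: (100.0 * counts[name] / total if total else 0.0) for name in CATEGORIES}

result = category_percentages("I love my good dog, but I hate bad weather.")
print(result)
```

Reporting percentages rather than raw counts is what allows a short submitted text to be compared against reference figures for personal and formal texts.
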
MONK Project (text analysis tools used on a defined corpus of literary works)


MONK analysis of adjective frequencies in poetry vs. prose

"This instance of the MONK Project includes approximately 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare....  MONK provides these texts along with tools to enable literary research through the discovery, exploration, and visualization of patterns. Users typically start a project with one of the toolsets that has been predefined by the MONK team. Each toolset is made up of individual tools (e.g. a search tool, a browsing tool, a rating tool, and a visualization), and these tools are applied to worksets of texts selected by the user from the MONK datastore. Worksets and results can be saved for later use or modification, and results can be exported in some standard formats (e.g., CSV files)."  Tutorials.

TAPoR - Text Analysis Portal (Collection of Online Text Analysis Tools with "Recipes" for Use)  Also try TAPoR 2 (beta next generation TAPoR)


TAPoR is a collection of online text-analysis tools--ranging from the basic to the sophisticated--that allows users to run search, statistical, collocation, extraction, aggregation, visualization, hypergraph, transformation, and other "tools" on texts. (The site comes seeded with prepared texts, but users can sign up for a free account and input their own.) TAPoR allows tools to be mixed and matched in a mashup-style "workbench." Particularly impressive is the "recipes" page, which in step-by-step fashion suggests ways that tools can be combined for particular purposes--e.g., identify themes, analyse colloquial word use, visualize text, explore changes in language use by a writer, create an online interactive bibliography, build a social network map from text, create a chronological timeline from bibliographical text, etc.
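
As one example of what a collocation tool of this kind computes, the following sketch counts the words that appear within a fixed window around a keyword. It is a generic illustration, not TAPoR's own code: the function name `collocates`, the `window` parameter, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

def collocates(text, keyword, window=2):
    """Count words appearing within `window` tokens of each occurrence of `keyword`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    keyword = keyword.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        # Neighbors on the left and right, excluding the keyword itself.
        neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbors)
    return counts

# Hypothetical sample sentence, invented for illustration only.
sample = "the sea was calm and the sea was wide"
sea_collocates = collocates(sample, "sea", window=1)
print(sea_collocates.most_common())
```

In a workbench setting, the output of a step like this would typically feed a second tool, such as a visualization.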

WordCount (Language Exploration Tool; online)

WordCount tool for exploring the frequency of words in English


Enter an English word and see it visualized in a series of words ranked by their frequency of use or "commonness." "An artistic experiment in the way we use language, [WordCount] presents the 86,800 most frequently used English words, ranked in order of commonness. Each word is scaled to reflect its frequency relative to the words that precede and follow it, giving a visual barometer of relevance. The larger the word, the more we use it. The smaller the word, the more uncommon it is. WordCount data currently comes from the British National Corpus."
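
The ranking by "commonness" described above amounts to sorting words by how often they occur. A minimal sketch, assuming a plain-text sample rather than the British National Corpus, with the hypothetical function name `rank_words`:

```python
import re
from collections import Counter

def rank_words(text):
    """Return (rank, word, count) tuples, most frequent first; ties broken alphabetically."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]

# Hypothetical sample text, invented for illustration only.
ranking = rank_words("the cat and the dog and the bird")
for rank, word, count in ranking:
    print(rank, word, count)
```

A visualization like WordCount's would then scale each word's display size by its count relative to its neighbors in this ranking.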

WordHoard (Text Analysis Tool; online)


WordHoard concordance and analyses of Shakespeare

Example from WordHoard manual of the kind of information that an experienced user of the system can generate from a body of literary texts (in this case: the different frequencies with which men and women use the word "love" in verse and prose)

Powerful text-analysis tool for a select group of "highly canonical literary texts"--currently, all of early Greek epic (in original and translation), all of Chaucer and Shakespeare, and Edmund Spenser's Faerie Queene and Shepheardes Calendar. WordHoard applies corpus-scale textual analysis (corpus linguistics) to specific authors/texts. It is based on "deep tagging" (fine-grained application of the underlying computational description or encoding) of texts, so as to allow users to constrain their searching, collocating, concordance-building, time-charting of word use, and other text analyses not just to word forms but to specific works, speakers, the gender of speakers, publication year, prose vs. verse, metrical shape, parts of speech, etc. "It is a basic assumption of WordHoard that new kinds of historical, literary, or broadly cultural analysis will be supported through the forms of data access that are made possible when literary texts are treated in the manner of linguistic corpora. Deeply tagged corpora of course support more finely grained inquiries at a verbal or stylistic level. But more importantly, access to the words of a text at such microscopic levels also lets you look in new ways at the imaginative worlds created by those words." WordHoard is a Java-based application that starts on a user's local computer and interacts through the Internet with the literary texts and text-analysis programs on WordHoard's servers. [Note: clicking on the "Download and Run WordHoard" link on the homepage starts the Java application. No manual installation of a program on the user's computer (the usual implication of "download" in such a context) is required.]
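
The kind of constrained query that deep tagging makes possible, such as comparing how often the word "love" is used by speakers of each gender in verse and in prose, can be sketched as below. This is a minimal illustration and not WordHoard's data model or API: the `TaggedToken` structure, its field names, and the toy tokens are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple

class TaggedToken(NamedTuple):
    """One word occurrence with hypothetical 'deep tagging' attributes."""
    lemma: str
    speaker_gender: str  # e.g. "male" or "female"
    form: str            # "verse" or "prose"

def lemma_rate(tokens, lemma):
    """Return {(gender, form): occurrences of `lemma` per 1,000 tokens in that group}."""
    totals = Counter()
    hits = Counter()
    for token in tokens:
        group = (token.speaker_gender, token.form)
        totals[group] += 1
        if token.lemma == lemma:
            hits[group] += 1
    return {group: 1000.0 * hits[group] / totals[group] for group in totals}

# Hypothetical hand-tagged tokens, invented for illustration only.
toy_tokens = [
    TaggedToken("love", "female", "verse"),
    TaggedToken("be", "female", "verse"),
    TaggedToken("love", "male", "verse"),
    TaggedToken("go", "male", "verse"),
    TaggedToken("come", "male", "verse"),
    TaggedToken("see", "male", "verse"),
    TaggedToken("say", "male", "prose"),
    TaggedToken("know", "male", "prose"),
]
rates = lemma_rate(toy_tokens, "love")
print(rates)
```

Normalizing per 1,000 tokens matters here because the groups differ in size; raw counts alone would mostly reflect how much each group speaks.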

WordNet (Concept and Language Exploration Tool; can be used online or downloaded)


WordNet conceptual analysis of the word "book" (partial)

Accessed online or in downloadable form, WordNet allows users to tap intelligently into "a large lexical database of English" for the purpose of exploring concepts and their interrelations. "Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated.... WordNet's structure makes it a useful tool for computational linguistics and natural language processing." In essence, WordNet can be conceived of as an extremely high-powered, interactive thesaurus that facilitates the rapid pursuit of conceptual relations and affiliations--a kind of "rapid prototyping" of language-based concepts. While reading a poem, for instance, one might use WordNet to explore the author's choice of a particular word by seeing the word cocooned within a structured universe of alternative and related "synsets."


WordNet 3.0 Vocabulary Helper (also known as EVA) (Language Exploration and Hyper-Thesaurus Tool; online)

WordNet Vocabulary Helper results for search on the word "book" (partial)

Accessed online, WordNet Vocabulary Helper is a powerful means of discovering and following the radiating network of signifying relations around a word. Entering "book," for example, produces a hyper-thesaurus/glossary of definitions and usages in the following categories:


   * Overview of noun book
   * Overview of verb book
   * Hyponyms of noun book
   * Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun book
   * Part Holonyms of noun book
   * Member Meronyms of noun book
   * Part Meronyms of noun book
   * Meronyms of noun book
   * Holonyms of noun book
   * Derived Forms of noun book
   * Domain of noun book
   * Domain Terms of noun book
   * Coordinate Terms (sisters) of noun book
   * Meronyms of noun book
   * Holonyms of noun book
   * Troponyms (hyponyms) of verb book
   * Entailment of verb book
   * Synonyms/Hypernyms (Ordered by Estimated Frequency) of verb book
   * Derived Forms of verb book
   * Sample Sentences of verb book
   * Coordinate Terms (sisters) of verb book




