Cyber Infrastructure for the Digital Humanities (CyberDH) is part of UITS Research Technologies, Visualization & Analytics at Indiana University Bloomington. The CyberDH team has been developing an open instructional workflow for text analysis that aims to build understanding and basic coding skills before scaling up analyses. We chose to bootstrap in R because of its statistical and graphical capabilities and its wealth of domain-specific packages. We have also included some Python notebooks because of Python's ease of use and its popularity for text mining and text analysis. Moreover, because R and Python are open-source scripting languages, they allow for methods that are repeatable, extensible, scalable, and sustainable. The aim is to provide code templates that can be adapted, remixed, and scaled to fit a wide range of text analysis tasks.

What is in this repo?

  1. R Notebooks: heavily annotated to explain each line of code.
  2. R Scripts: lightly annotated to allow the user to experiment.
  3. Data: needed to replicate our results.*

Getting Started

The suggested workflow is to fork the repository for your own use. Read the R Notebook first; it explains how a given script works, line by line. Then load the lightly annotated script that goes along with the notebook and try it out for yourself in RStudio. Suggestions for alterations and basic parameter tweaks are provided in the script.

*So that you can replicate our work, we have provided all the data used in our examples. For plain text notebooks and scripts, we use the Shakespearean corpus from the Visualizing English Print Project, in which speaker names and stage directions have been removed; for Twitter notebooks and scripts, we provide Twitter data harvested by the team.

Text Preparation

Word Clouds

Word clouds may seem simplistic, but they offer a wealth of information that is easy to take in at a glance.
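A word cloud sizes each word by how often it appears, so the underlying data is a frequency table. The Python sketch below shows that counting step only. It is an illustration, not the code in this repository: the stop-word list and the function name `word_frequencies` are hypothetical, and a real analysis would use a fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "is", "that", "it", "or", "not", "be"}

def word_frequencies(text, top_n=10):
    """Return the top_n most frequent non-stop words as (word, count) pairs."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

if __name__ == "__main__":
    sample = "love love love hate hate the the the the"
    for word, count in word_frequencies(sample, top_n=2):
        print(word, count)
```

The resulting (word, count) pairs are the kind of input a plotting library would turn into a cloud, with larger counts drawn in larger type.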

Word Co-occurrence

The co-occurrence script aims to discover the semantic proximity of words throughout the Shakespeare Drama Corpus. At the end, it takes a word of the user's choice and finds the closest terms by proximity.
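One simple way to approximate proximity is to count which words fall within a fixed window around a chosen word. The Python sketch below does that; it is not necessarily the method the repository's script uses. The function name `cooccurring_terms` and the `window` and `top_n` parameters are assumptions made for illustration.

```python
import re
from collections import Counter

def cooccurring_terms(text, target, window=2, top_n=5):
    """Count words appearing within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    neighbors = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Clamp the window to the bounds of the token list.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                neighbors[tokens[j]] += 1
    return neighbors.most_common(top_n)

if __name__ == "__main__":
    text = "the king loves the queen and the queen loves the king"
    print(cooccurring_terms(text, "queen", window=1))
```

A wider window captures looser associations at the cost of more noise, so it is worth trying several values on the same word.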

Word Correlation

The word correlation scripts aim to discover whether two words are related, beyond random chance, throughout the Shakespeare Drama Corpus or within a single play. At the end, they take a word of the user's choice and find the words whose correlation with it exceeds a chosen threshold, using Pearson's correlation coefficient.
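As a rough illustration of the idea, the Python sketch below counts how often each word appears in every document (or segment of a play), computes Pearson's correlation coefficient between the count vectors, and keeps the candidates above a threshold. The helper names (`term_counts`, `pearson`, `correlated_words`), the default threshold, and the choice to treat a constant count vector as having zero correlation are all assumptions for this example, not the repository's implementation.

```python
import math
import re

def term_counts(docs, word):
    """Return how many times `word` occurs in each document."""
    return [re.findall(r"[a-z']+", doc.lower()).count(word) for doc in docs]

def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined for a constant series; this sketch returns 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlated_words(docs, target, candidates, threshold=0.5):
    """Return {word: r} for candidates whose correlation with `target` exceeds threshold."""
    target_counts = term_counts(docs, target)
    results = {}
    for word in candidates:
        r = pearson(target_counts, term_counts(docs, word))
        if r > threshold:
            results[word] = r
    return results

if __name__ == "__main__":
    docs = ["love love heart", "love heart sword", "sword sword", "sword"]
    print(correlated_words(docs, "love", ["heart", "sword"]))
```

With only a handful of documents the coefficient is very unstable, so a high value on a small sample should be read as a prompt for closer reading rather than as evidence of a relationship.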

Sentiment Analysis

Sentiment analysis estimates whether a tweeter feels negatively or positively about a topic by comparing the words in a tweet to a lexicon of words with positive or negative valences. By analyzing sentiment scores, we can gauge how English-language Twitter users feel about a topic.
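The lexicon comparison described above can be sketched in a few lines of Python. The six-word `LEXICON` here is a made-up stand-in for a real sentiment lexicon, and the names `sentiment_score` and `label` are hypothetical; the sketch simply sums the valence of every word in a tweet that appears in the lexicon.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"love": 1, "great": 1, "happy": 1, "hate": -1, "awful": -1, "sad": -1}

def sentiment_score(tweet, lexicon=LEXICON):
    """Sum the valences of all lexicon words found in the tweet."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

if __name__ == "__main__":
    for tweet in ["I love this great show", "I hate this awful, sad ending"]:
        score = sentiment_score(tweet)
        print(tweet, score, label(score))
```

A word-by-word approach like this ignores negation and sarcasm ("not great" still scores as positive), which is one reason scores are more informative in aggregate than for any single tweet.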

Additional Resources

Link to register for our Friday Workshops:

Scholars' Commons Workshops for Digital Tools and Visualization Methods for Humanists

CyberDH and Advanced Visualization Lab Presentations:

View All Current Presentations on Box

Presentations include: