Data Is Plural Visualization Challenge: Project Dialogism Novel Corpus

This article originally appeared in Issue 3 of Nightingale magazine.


The challenge 

We’re a community of data visualizers, so let’s do what we do best—visualize data! Explore the selected dataset, find an interesting angle or insight, and create a visualization using the tool of your choice. Infographics, data stories, data art… you have permission to get creative. 

Submission 

There are no prizes here—this is simply an opportunity to practice our craft—but we would love to see what you create! 

Send in one high-resolution image plus a 50-word description by AUGUST 4, 2023 (deadline extended!). 

The dataset 

In Nightingale Magazine Issue 1, we launched the Data Is Plural visualization challenge with Jeremy Singer-Vine, creator of Data Is Plural, a weekly newsletter that highlights useful and interesting datasets. This issue’s dataset was selected from the Data Is Plural archive and is described here by Nightingale editor Kyle Dent. 

Jane Austen fans will especially delight in this iteration of the Data Is Plural Visualization Challenge featuring the Project Dialogism Novel Corpus, but there is still something for everyone. In addition to literary fiction, the dataset includes children’s literature, detective fiction, and science fiction. The Project Dialogism Novel Corpus captures every piece of dialogue from 22 novels from Project Gutenberg. The selected titles span a variety of genres and overlap with previous datasets for comparison. Collectively, they add up to more than 35,000 quotations. Each row in the dataset represents a single quotation and includes the character speaking, who can “hear” what’s being said, other characters mentioned, plus several other attributes. Besides the main collection effort, the researchers also created a comprehensive set of annotation guidelines (which we thought fitting for this issue’s Guidelines theme!). The guidelines are invaluable for annotators and any future contributors to dialogistic corpora.

Perhaps because of the carefully crafted guidelines, the dataset is quite clean. It’s well organized and normalized, and it comes with a helpful README.md file that explains the data layout, the details of the annotations, and each of the fields across the three data files produced for each book (quotations.csv, character_info.csv, and text.txt). For this challenge, you’ll probably focus primarily on the quotation data itself, but you may want to consider information about characters, too. If you fancy working with Python, the data repository also includes a handy Jupyter notebook demonstrating how to load each file to get you started, but you can easily work with this data using your tool of choice.
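If you'd like a feel for what a first pass over one book's quotations.csv might look like, here is a minimal sketch using only Python's standard library. It is not the notebook from the repository. The column names `speaker` and `quoteText` are assumptions made for illustration (check the README.md for the real field names), and the three sample rows are made-up placeholders rather than actual corpus data.

```python
import csv
import io
from collections import Counter


def count_quotes_by_speaker(csv_file, speaker_field="speaker"):
    """Count how many rows (quotations) each speaker has in an open CSV file.

    `speaker_field` is an assumed column name; adjust it to match the README.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[speaker_field] for row in reader)


# Made-up placeholder rows standing in for one book's quotations.csv.
sample = io.StringIO(
    "speaker,quoteText\n"
    "Character A,First line.\n"
    "Character B,Second line.\n"
    "Character A,Third line.\n"
)

counts = count_quotes_by_speaker(sample)
print(counts.most_common(1))
```

With a real file, you would pass in something like `open(path, newline="", encoding="utf-8")` instead of the in-memory sample, where `path` points at a book's quotations.csv in your copy of the repository.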

While Jane Austen is the most represented author in the corpus, Margaret Schlegel from Howards End is the most loquacious character. Other notably chatty personages are Jake Barnes from The Sun Also Rises, Katharine Hilbery from Night and Day, and Anne Shirley from Anne of Green Gables.

For this challenge, tap into your most literary and creative impulses to consider the narrative structure, events, and characters. You might also think about dialogism (the variation in the speaking styles of characters compared to one another and to the narrator), which was the main goal of the researchers who compiled the data. You can look at quotations within a single text or across multiple works. You might also look at social networks or consider the different ways characters are referred to as the stories progress. 
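If the social network angle appeals to you, one possible starting point is to count who speaks to whom and use those counts as edge weights. The sketch below assumes each row carries a `speaker` column and an `addressees` column, and that the addressees cell holds either a single name or a Python-style list written as text. Both the field names and that cell format are assumptions, so confirm them against the README.md before relying on this. The sample rows are made-up placeholders.

```python
import ast
from collections import Counter


def parse_names(cell):
    """Turn a cell such as "['A', 'B']" or "A" into a list of names."""
    cell = cell.strip()
    if not cell:
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal, so treat the whole cell as one name.
        return [cell]
    if isinstance(value, (list, tuple, set)):
        return [str(v) for v in value]
    return [str(value)]


def speaker_addressee_edges(rows, speaker_field="speaker",
                            addressee_field="addressees"):
    """Count (speaker, listener) pairs; both field names are assumed."""
    edges = Counter()
    for row in rows:
        for listener in parse_names(row[addressee_field]):
            edges[(row[speaker_field], listener)] += 1
    return edges


# Made-up placeholder rows, shaped like csv.DictReader output.
rows = [
    {"speaker": "Character A", "addressees": "['Character B']"},
    {"speaker": "Character A", "addressees": "['Character B', 'Character C']"},
    {"speaker": "Character B", "addressees": "Character A"},
]

edges = speaker_addressee_edges(rows)
for (speaker, listener), weight in edges.most_common():
    print(f"{speaker} -> {listener}: {weight}")
```

The resulting weighted pairs can be exported to whatever network or charting tool you prefer.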

There’s nothing like digging into a good book, so enjoy! “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” —Henry Tilney to Eleanor Tilney and Catherine Morland in Northanger Abbey 

Illustration of a woman in a long, old-fashioned brown dress reading a book in the autumn woods. She is surrounded by literary quotes.
Illustration by Ghazal Qadri 
Nightingale Editors

Our Nightingale editorial team currently consists of Alejandra Arevalo, William Careri, Jason Forrest, Elijah Meeks, and Teo Popescu. Reach us at Nightingale(at)Datavisualizationsociety.org