Text Mining - The How To Screenshots


By: Pascal Lieblich

In "The Five Remarkable Differences Between Fiction And Non-Fiction", I analyzed thirty fiction and non-fictions books to find differences in form and substance between these types of literature.

This article provides some behind-the-scenes snapshots and brief explanations of the text mining methodology and work performed. The text preprocessing steps shown are used in both text mining and machine learning analyses.


Preprocessing The Texts

Importing The Books

The following 30 books were first imported into Python's Natural Language Toolkit (NLTK) in .epub format. The .epub format is very popular for electronic publications meant to be read on tablets, e-readers and computers. Incidentally, it is also the format of my book collection.

[Screenshot: the 30 imported books]
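For anyone who wants to try this at home, here is a minimal sketch of the import step. It assumes the third-party ebooklib and BeautifulSoup packages and uses a placeholder file name; it illustrates the general approach rather than the exact script used for the project.

```python
# Sketch: extracting plain text from an .epub so NLTK can work with it.
# Assumes the third-party packages ebooklib and beautifulsoup4;
# "book_thief.epub" is a placeholder file name, not the actual file used.
from ebooklib import epub, ITEM_DOCUMENT
from bs4 import BeautifulSoup

def epub_to_text(path):
    book = epub.read_epub(path)
    chapters = []
    for item in book.get_items_of_type(ITEM_DOCUMENT):
        # Each document item is XHTML; strip the markup and keep the prose.
        soup = BeautifulSoup(item.get_content(), "html.parser")
        chapters.append(soup.get_text(separator=" "))
    return "\n".join(chapters)

raw_text = epub_to_text("book_thief.epub")
print(raw_text[:300])  # peek at the opening of the book
```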

In the examples below, we use a paragraph from "The Book Thief", a novel by Markus Zusak, to illustrate the process.

There are several basic steps taken to prepare text for text mining or machine learning analysis. The imported paragraph initially looks as follows:

[Screenshot: the imported paragraph from "The Book Thief"]


Tokenization

[Screenshot: the paragraph after tokenization]

The text is first split into sentences, then into individual words and punctuation marks, a process called tokenization.

A lot of information was recorded during the preprocessing stage for our analysis. For instance, the number of words per sentence and the type and number of punctuation marks used were logged so that textual style differences between fiction and non-fiction could later be computed and compared.
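As a minimal sketch of this step, the snippet below tokenizes a sentence from our paragraph with NLTK's standard tokenizers and logs word and punctuation counts; the exact statistics collected for the project may have differed.

```python
# Sketch: sentence and word tokenization with NLTK, plus the kind of
# per-sentence statistics (word count, punctuation counts) logged above.
import string
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models

text = "There might be a discovery; a scream will dribble down the air."

for sentence in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sentence)
    words = [t for t in tokens if t not in string.punctuation]
    punctuation = Counter(t for t in tokens if t in string.punctuation)
    print(tokens)
    print(len(words), "words,", dict(punctuation))
```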




Stop-word Removal

Any English-language text contains many common words, called stop-words, which are grammatically necessary but can typically be removed for purposes of analysis. Examples of such words are the, or, it, to, from, and, by.

I selected a list containing 181 stop-words for removal from the corpus of texts. Below you can see how light our "Book Thief" paragraph looks once the stop-words have been removed.

[Screenshot: the paragraph with stop-words removed]

Note how half the words in the sentence "There might be a discovery; a scream will dribble down the air" were stop-words and have been removed. What is left of that sentence, [might, discovery, scream, dribble, air], is all we need for our analysis and is generally sufficient for text mining and machine learning purposes.

Over the entire 30-book collection, approximately 25% of all words were stop-words. They were removed from the texts.
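NLTK ships its own English stop-word list, which is close in size to the 181-word list used here; a minimal removal sketch looks like this.

```python
# Sketch: stop-word removal using NLTK's built-in English list
# (similar in size to, but not identical with, the 181-word list used here).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

tokens = ["there", "might", "be", "a", "discovery", "a", "scream",
          "will", "dribble", "down", "the", "air"]
content_words = [t for t in tokens if t not in stop_words]
print(content_words)  # -> ['might', 'discovery', 'scream', 'dribble', 'air']
```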


Stemming

[Screenshot: the stemmed words from the paragraph]

In order to analyze text properly, it is usually necessary to group together words that derive from the same root. We want to make sure that different forms of a word are recognized and counted as such in any analysis.

For example, if a document contains the words "body", "bodies" and "bodily", we want these to be recognized as expressions of a related concept rather than as three separate words. As the list of stemmed words from our paragraph shows, the word body was stemmed to its root "bodi", which functions as the category that captures all related forms of that word.
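A stemmer such as NLTK's Porter stemmer does exactly this kind of grouping. The sketch below is illustrative; a different stemmer may group slightly more or fewer word forms under each root.

```python
# Sketch: grouping related word forms under a common root with NLTK's
# Porter stemmer (other stemmers may group forms slightly differently).
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["body", "bodies", "connect", "connected", "connection", "connecting"]

groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for root, forms in groups.items():
    print(root, "<-", forms)
# "body" and "bodies" both reduce to the root "bodi", as in the example above.
```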


[Screenshot: stemming examples from the non-fiction books]

Here are a few more interesting stemming examples from our non-fiction books.

Quite a few distinct words are captured by the stemmed form "philosoph". How many can you think of?





Content Analysis

[Screenshot: stemmed word frequencies across the collection]

Word Frequency

Now that the text has been preprocessed, we can start going a bit deeper. Here, we did a quick count to see how often the stemmed words appear across our collection. Note how stemming is sophisticated enough to distinguish terms rooted in gene, such as gene and genes, from words rooted in genet, such as genetics, genetically and their derivations.
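A frequency count over the stemmed tokens can be done with NLTK's FreqDist; the sketch below uses a small placeholder list of stems in place of the full 30-book collection.

```python
# Sketch: frequency counts over the preprocessed, stemmed tokens.
# "stemmed_tokens" is a small placeholder standing in for the 30 books.
from nltk import FreqDist

stemmed_tokens = ["gene", "genet", "gene", "bodi", "genet", "gene", "philosoph"]

freq = FreqDist(stemmed_tokens)
for stem, count in freq.most_common(5):
    print(stem, count)
# Note that "gene" and "genet" remain distinct stems and are counted separately.
```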


Parts of Speech and Tense Analysis

[Screenshot: part-of-speech tagging output]

In order to determine what the grammatical differences are between fiction and non-fiction literature, each sentence was analyzed for its parts of speech (POS) components.

The first-cut POS categorization shown contains many categories, which I later combined to make the data easier to understand and compare.

The POS data was also used to compare how tenses (past, present and future) are used in fiction and non-fiction literature and what the differences in tense usage are.
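NLTK's off-the-shelf tagger produces the first-cut POS labels, and a small lookup then collapses the verb tags into tense buckets. The sketch below shows the general idea; the grouping used for the article itself may have been more refined.

```python
# Sketch: part-of-speech tagging with NLTK, then collapsing Penn Treebank
# verb tags into rough past / present / future buckets.
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "There might be a discovery; a scream will dribble down the air."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged)

tense_counts = Counter()
for i, (word, tag) in enumerate(tagged):
    if tag in ("VBD", "VBN"):                      # simple past, past participle
        tense_counts["past"] += 1
    elif tag in ("VBZ", "VBP", "VBG"):             # present-tense forms
        tense_counts["present"] += 1
    elif tag == "VB" and i > 0 and tagged[i - 1][0].lower() in ("will", "shall"):
        tense_counts["future"] += 1                # "will" + base verb
print(dict(tense_counts))
```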






Substance - Emotional Sensitivity Comparison

After testing out some existing Python sentiment analysis tools, I built my own framework to gauge differences in emotional expression between fiction and non-fiction literature. Terms and categories of emotion were derived from a list used in psychology.

[Screenshot: sentences matched against the emotion categories]

Book by book, every sentence was analyzed to check whether it contained any expressions associated with the 17 broad categories of emotion, some of which you can see in the third column.


Each occurrence was logged and weighted in order to determine how frequently particular categories of emotion were expressed in fiction and non-fiction literature.
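Although the framework itself is not reproduced here, its core mechanism can be sketched as a lexicon lookup: each emotion category maps to a set of terms, and every sentence is scanned for matches. The category names and term lists below are illustrative placeholders, not the actual 17-category, psychology-derived list.

```python
# Sketch: logging sentence-level matches against an emotion lexicon.
# The categories and terms below are illustrative placeholders, not the
# 17-category, psychology-derived list used for the actual analysis.
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)

emotion_lexicon = {
    "fear":    {"scream", "terror", "afraid", "dread"},
    "joy":     {"delight", "laugh", "smile", "glee"},
    "sadness": {"weep", "grief", "mourn", "sorrow"},
}

def emotion_counts(text):
    counts = Counter()
    for sentence in nltk.sent_tokenize(text):
        words = {w.lower() for w in nltk.word_tokenize(sentence)}
        for category, terms in emotion_lexicon.items():
            if words & terms:          # the sentence contains at least one term
                counts[category] += 1  # log one occurrence per sentence
    return counts

sample = "There might be a discovery; a scream will dribble down the air."
print(emotion_counts(sample))  # -> Counter({'fear': 1})
```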


[Screenshot: emotional expression compared across fiction and non-fiction]

The analysis allowed me to evaluate the emotions expressed in each book and then to compare the emotional expressions in fiction and non-fiction books.

As shown in "The Five Remarkable Differences Between Fiction And Non-Fiction", there are interesting differences between the types of emotions expressed in fiction and non-fiction.


Conclusion

These were some behind-the-scenes snapshots of the work involved in writing "The Five Remarkable Differences Between Fiction And Non-Fiction". The article itself contains many more graphs showing substantive differences between the styles.

Much of the work was done in NLTK, the Natural Language Toolkit, a suite of programs for natural language processing initially developed at the University of Pennsylvania (incidentally, my alma mater). All the programming was done by me in Python.

I find it amazing that just a year after starting to learn data analysis, text mining and machine learning, I can write programs in Python capable of answering sophisticated questions.

Lawyers can learn to code too! Feel free to reach out or drop a line in the comments.

Pascal Lieblich


