The Stylometry of Baen Books
Science Fiction Authors
Have you ever thought about the stylistic choices of authors? What makes an author’s writing unique and easily recognizable to those familiar with their works? How does one author’s work compare to another’s? And how might gender factor into the equation? It is possible to examine all of these questions through a process called stylometry. Stylometry is the study of variations in literary style, a method that can be approached computationally to analyze a large number of texts. It is essentially the study of written tics and cues in an author’s style, such as how many times a certain word is used in a text. You would not think it at first, but those subtle cues can be used to distinguish author style, legitimacy, and veracity. I first came across this method in Ben Blatt’s 2017 book Nabokov’s Favorite Word is Mauve, which prompted me to explore the stylometrics of authors who write for a particular genre. I took it a step further, examining how authors who write genre fiction for the same publishing house compare to one another, all while taking gender into account.
The purpose of this essay is to conduct a computational stylometric analysis of science fiction authors from the Baen Books Publishing House Free Online Library. This essay will provide a brief overview of stylometry, citing a famous example so that readers unfamiliar with the topic may gain an idea of the type of literary analysis being conducted. A discussion of literature related to the overarching topic of science fiction, as well as findings from previous stylometric analyses, will be provided before I share my own findings in the form of code snippets and interactive graphs. I encourage anyone who wishes to try this type of analysis to visit my GitHub repository or Colaboratory Notebook and either use the dataset I have compiled or one of your own.
What Is Stylometry?
According to The Programming Historian, “Stylometry is the quantitative study of literary style through computational distant reading methods. It is based on the observation that authors tend to write in relatively consistent, recognizable and unique ways.”1 The practice of stylometry grew out of techniques used for analyzing texts for evidence of authenticity and author identity. In the past, stylometry has been used to attribute authorship of disputed documents. One of the most famous examples comes from The Federalist Papers controversy. In the 1780s, James Madison, Alexander Hamilton, and John Jay each wrote essays that appeared in New York newspapers under the pseudonym “Publius.” No one took credit for any individual essay until decades later, when Madison and Hamilton provided contradictory claims about twelve of the eighty-five essays published. In 1963, statisticians Frederick Mosteller and David Wallace provided evidence in Inference in an Authorship Problem that the stylometrics of Madison and Hamilton differed. Their findings were crucial to solving a mystery nearly two hundred years in the making, as well as advancing the use and legitimacy of stylometric analysis.
In his book Nabokov’s Favorite Word is Mauve, Ben Blatt refers to his examples of stylometry as “literary fingerprints.” Blatt explores Wallace and Mosteller’s assumption that word choice is constant, and that literary style does not change from book to book.2 Blatt tests Wallace and Mosteller’s fingerprint idea on popular fiction by high-profile authors, using their method to identify books purposefully removed from the sample as well as those written under known pseudonyms, such as Stephen King’s “Richard Bachman” and J.K. Rowling’s “Robert Galbraith.”3 4 Just as people were able to identify the intricacies in these authors’ writing, so could a computer program.
After looking more into Blatt’s method, I was curious to see whether a similar analysis could be conducted using authors who write for a particular genre, specifically science fiction. As a genre, science fiction explores material that is as broad as it is controversial. Everything from utopias to dystopias, space travel, time travel, and more can fit under this one genre. I am most familiar with feminist science fiction, which is science fiction written by female authors, but I am also aware it is a predominantly male genre. Furthermore, I have always been fascinated by science fiction, and some of the most thought-provoking works of the twentieth century have come out of the genre. Ursula K. Le Guin’s The Left Hand of Darkness (1969) explores gender norms through an anthropological look at an androgynous alien society, while The Dispossessed (1974) features a societal struggle between capitalism, socialism, and anarchism. Orson Scott Card’s Ender’s Game (1985) features an intergalactic conflict between humans and an alien race. George Orwell’s 1984 (1949) explores the concept of surveillance, and the idea that someone is always watching your every move. There are many, many more I could name, but I believe I have made my point that science fiction is a genre filled with wondrous ideas and unimaginable horrors.
When I started looking more into this, I was frustrated by the lack of material on my topic. There were not many scholarly articles on the specific topic I wanted, and even fewer related to computational literary analysis. This prompted me to take a step outside traditional research to see if anyone had written a thesis, dissertation, or blog post about it. After all, Blatt’s book introducing his fingerprint method was only published in 2017, so this is a topic that may not have been explored much in the past. Once I refocused my search, I was happy to find I was not the only one who thought this genre held potential for more in-depth stylometric analysis.
Two of the sources I found were doctoral theses, both of which used a combination of computational and statistical analysis. The first was written by Zofia Wąchocka and was titled “The Left Hand of Genre: A Stylometric Examination of (the Absence of) Genre Signal in Science-fiction and Fantasy Works of Ursula K. Le Guin, C. J. Cherryh and Orson Scott Card in English and Polish.” Wąchocka examined the stylometric differences between Le Guin, Cherryh, and Card’s science fiction and fantasy works in English and Polish, finding that “generic, cycle and chronological signals become blurred in translation.”7 This study was a step in the right direction because it featured a comparison of female and male science fiction authors, yet it did not really emphasize the importance of gender. Three years later, Naomi K. Fraser conducted a stylometric analysis titled “Style in Science Fiction and Fantasy” (2017). According to Fraser, science fiction authors were “traditionally analysed for the potency of their themes and tropes rather than for their language and style.”8 However, Fraser found that there was more variance between texts, and that author style could fluctuate depending on the frequency of words used.9 Fraser’s study was much more focused on the variances found in texts, yet did not emphasize author characteristics, only style. In 2021, a slightly different take on stylometric analysis appeared in a peer-reviewed journal using nine specific stories from early and modern science fiction prose. Michal Místecký and Tomi S. Melka’s “Literary ‘higher dimensions’ quantified: a stylometric study of nine stories” found that there are nuanced differences between sub-genres of science fiction as well as in an author’s personal style.10 Each of these studies provides excellent examples of stylometrics, yet none goes completely in-depth concerning the intricacies of author style and gender.
Each of these studies is vastly different from the others. In most cases they are meant to study the philosophy or linguistics behind the texts, rather than conducting a computational literary analysis as I intend. I would have loved to use some of the more popular titles in science fiction, but I had to come to terms with the fact that I do not have access to the same datasets. Originally I intended to conduct an analysis using Bethanie Maples, Srini Kadamati, and Eric Berlow’s “100 Years of Science Fiction”; however, the dataset used to create that project is massive and not readily available online. After some searching through Google, I decided to simply make my own dataset using the Baen Books Free Online Library.
Baen Books is an American publishing house for science fiction that specializes in space opera, hard science fiction, and military science fiction. They also publish fantasy books, but the number of fantasy books found on their site pales in comparison to the sheer number of science fiction books. Baen Books offers free ebook downloads of specific books from their catalogue. Most of those books are compilations written by the authors who most commonly publish their works with Baen Books, but there is also a substantial number of books written by standalone authors that they offer for free.
I spent the better part of a day looking through the Baen Books Free Library before downloading every book they offered. As of April 2022, Baen Books has seventy-five free ebooks available. Twenty-nine of those books are compilations of short stories, free fiction, and free nonfiction. The other forty-six books are comprehensive stories written by standalone authors or in collaboration with only one other author. After looking through the dataset, I determined the compilations would have to be excluded, since they included too many authors to properly denote. Books with more than two authors could not be quantified like those with only one or two, since any one author’s contribution was minimal at best. Those twenty-nine books were therefore excluded from further tests, leaving me with forty-six books to analyze.
It is important to note that I have not read any of the books found in this dataset. I have heard of many of the authors whose books I downloaded to conduct this analysis, yet have never read any of their works. The only exception to this would be Timothy Zahn, though I have not read any of his books found in this dataset.
To run these tests, I was inspired by my professor, Dr. Zach Whalen, who created his own version of Ben Blatt’s fingerprint method using Google’s Colaboratory Notebook. This version of the fingerprint method uses Python 3 to import and analyze the text files of each book, and imports Plotly so that the data for each book can be represented visually on an interactive graph. Each of the Baen Books titles I worked with was downloaded as an EPUB file, which was then converted to TXT using CloudConvert. While it may be possible to use a regular ebook file, it’s best to convert any text to a TXT file to avoid OCR errors and reduce any confusion that may occur if Python cannot recognize a character. TXT simply makes the process of working with these files easier. Since I had forty-six different TXT files, I created a compressed ZIP folder to easily upload all of them to the runtime in my notebook. This ZIP folder was organized by author: a single folder was created for each author found in my dataset, containing the files for the books they wrote. Two of these files appear under more than one author, because they are books co-written by two authors who each have a presence in the dataset. As a result, I included those files twice, once under each appropriate author, so that I could compare each co-written work against that author’s other works rather than having it sit as an outlier for both.
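As a rough illustration of how a notebook can read that folder structure, the sketch below walks a corpus directory organized as corpus/<Author Name>/<Title>.txt and collects each book’s text by author. The folder and book names here are stand-ins, not the actual dataset.

```python
# A minimal sketch of loading a ZIP-extracted corpus organized by author
# folders (corpus/<Author Name>/<Title>.txt). Names here are stand-ins.
from pathlib import Path
import tempfile

def load_corpus(root):
    """Return {author: {title: text}} for every TXT file under root."""
    corpus = {}
    for author_dir in sorted(Path(root).iterdir()):
        if author_dir.is_dir():
            corpus[author_dir.name] = {
                txt.stem: txt.read_text(encoding="utf-8", errors="ignore")
                for txt in sorted(author_dir.glob("*.txt"))
            }
    return corpus

# Demo on a throwaway folder standing in for the real ZIP contents
root = Path(tempfile.mkdtemp())
(root / "Author A").mkdir()
(root / "Author A" / "Sample Book.txt").write_text("He said hello.")
corpus = load_corpus(root)
print(sorted(corpus))  # the author folders become the top-level keys
```

A co-written book can simply be copied into both authors’ folders, which is how the duplicated files described above end up counted under each author.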
However, before conducting the official stylometric analysis in a Python environment, I uploaded this dataset to Voyant Tools. Voyant Tools is an open-source tool for textual analysis. There are many different tools within it, but for this analysis I was most interested in Summary and Trends. Summary provides a textual overview of a corpus, examining document length, vocabulary density, the average number of words per sentence, readability index, and distinctive words. Trends, on the other hand, displays the frequency of the most common words in a graphical setting. Neither tool is perfect, and neither allows one to distinguish texts by author name unless the TXT files are named accordingly. However, these tools do provide interesting insight into which words appear most often throughout all of the texts in this dataset, allowing me to use those words for stylometric analysis in Python.
Analyzing The Stylometry of Baen Books Authors
The questions I sought to answer in conducting this analysis are the following: How do the stylometrics of authors who publish their works with Baen Books compare and contrast with one another? More specifically, how do these stylometrics compare when taking gender, either of the authors or the characters, into account?
Of the forty-six total books collected in this dataset, there are eighteen different author categories. Fifteen of these authors are male, two are female, and one category consists of a male and female writing team that consistently writes together. This dataset is very much male-dominated, with a much higher percentage of Baen Books titles written by standalone male authors. The two female authors, Ellen Guon and Rosemary Edgehill, have a much smaller presence in the dataset. The remaining category, Sharon Lee and Steve Miller, is a male and female team that continuously co-writes books. Since all of the books written by them carry shared credit, Lee and Miller have their own category as a team rather than as individual authors.
There are two texts that will appear twice in this dataset: Boundary by Eric Flint and Ryk Spoor, and In the Heart of Darkness by Eric Flint and David Drake. These two texts will appear twice in any given analysis associated with this dataset because I have the file listed under each author. The reason for this is that Eric Flint, Ryk Spoor, and David Drake all have a rather large presence in the Baen Books free library, and I wanted to be able to compare those books individually with the rest of the books written by those authors. If any author co-wrote a book with an author who does not have a presence in the Baen Books free library, I have excluded that co-author’s name. However, I do have a spreadsheet listing each book with its corresponding authors. If you would like to view it, you can download my Baen-Books-List.xlsx here or in my GitHub repository for this project.
I have provided embedded links to the Summary and Trends tools from Voyant Tools. If you wish to view these tools in a less compressed space, simply hover over the top right corner and select the “Export URL” option, which will open the tool in a new tab. Feel free to click around and play with the tools, as Voyant Tools is meant to be interactive. Similarly, if you would like to explore this corpus with Voyant Tools, click here. Otherwise, I will be exploring my findings using Summary, Trends, and later Plotly graphs generated using Python.
Summary is a pretty self-explanatory tool. It provides a basic rundown of the corpus. The features I was most interested in, however, are Most Frequent Words and Distinctive Words. The most frequent words found in this corpus help me decide which words to use in a stylometric analysis. These are words that appear across all of the texts and can be used to determine how similar or different each text is. For this corpus, “said,” “like,” “just,” “time,” and “know” are the most frequent words found in all of the Baen Books titles. Distinctive Words provides the opposite. Distinctive words are uncommon, typically appearing in a single text. I know to avoid these words, because they will not provide good data if they are not present throughout all of the books. For this particular corpus, the distinctive words appear to be names. This makes sense, and I will not be using these words to conduct my analysis.
The above graph represents Trends. Trends simply takes the most frequent words and arranges them graphically. This provides a useful visual of how frequently these most frequent words appear in each document. For instance, Starliner has the most uses of the word said, while Prime Palaver has the fewest. This is useful data, because I can use some of these words to examine the stylometric differences between the books. I can plot the differences in the frequency of said versus like, like versus just, and so on to see how similar or different the writing styles in these books are. However, this analysis aims to examine the stylometrics of gender in science fiction, meaning the word said will be especially valuable when examining this dataset. Said is often used in conjunction with gender identifiers and pronouns, such as he, she, and they. These identifiers can also be used in connection with names; while individual names cannot be used in this analysis, pronouns will appear throughout these books.
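For readers following along in Python, a frequency count like the ones Voyant reports can be reproduced with the standard library alone. The sample sentence below stands in for a book’s TXT contents.

```python
# Reproducing a "most frequent words" count, as in Voyant's Summary tool.
import re
from collections import Counter

def most_frequent_words(text, n=5):
    """Lowercase, tokenize on letters/apostrophes, count occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

sample = "He said it. She said it too. They all said it, he knew."
print(most_frequent_words(sample, 3))
# "said" tops this toy sample, just as it tops the Baen corpus
```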
The following table contains the information generated using Python when this corpus was uploaded to the Colab Notebook in which I performed this analysis. It was generated using a DataFrame in Python, mainly to ensure each of the files was loading correctly and tagged with the appropriate author. I have provided the data in table format, but if you wish to view it and the full code in its original Colab environment, click here.
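The sketch below shows the kind of table that approach produces: one row per book, tagged with the author and counts of the target words. The authors, titles, and texts are illustrative placeholders, not values from the actual corpus.

```python
# Building a per-book table of target-word counts, one row per book.
# Authors, titles, and texts here are illustrative placeholders.
import re
import pandas as pd

def word_counts(text, targets):
    """Count how many times each target word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {t: words.count(t) for t in targets}

books = [
    ("Author A", "Book One", "He said hello. She said nothing."),
    ("Author B", "Book Two", "They said they would return, he said."),
]
rows = [
    {"author": a, "title": t, **word_counts(text, ["said", "he", "she", "they"])}
    for a, t, text in books
]
df = pd.DataFrame(rows)
print(df)
```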
I used the data generated by the DataFrame to catalogue common words found across each of the texts. Using the data generated from Voyant Tools, I can then compare two words against each other for stylometric analysis. Since said is the most common word appearing across every text in this corpus, I decided to use it for this analysis. I generated three separate graphs that plot speech, using the word said, against the pronouns he, she, and they.
The following graphs were generated using Plotly, an interactive plotting and graphing library. The data points featured on these graphs correspond to the titles of the books found in the table above, and each data point is categorized by author name. You can hover over each data point on the graph to view more information, including the text name, author, and the printed values for the words being compared. If you would like to limit the number of data points displayed on the graph, simply click on the author names in the legend to turn them on or off. You can also double-click on any of the author names in the legend to isolate those data points on the graph.
Figure 1. This graph shows a comparison of the words “he” and “said” across all the texts in the corpus.
Figure 1 shows the comparison of the words he and said. By looking at the graph we can see there is a fairly consistent usage of the word he alongside the word said. David Drake uses the word he most often in his books, with the majority of his works clustered toward the top and right of the graph. There is one outlier for him: the book he co-wrote with Eric Flint. Flint’s works are vastly different from Drake’s, leaning more toward the bottom and left of the graph. This indicates Flint’s work does not have as many speaking roles for male characters, while male characters clearly dominate speaking roles in Drake’s books. Tom Kratman and David Weber’s works closely resemble Flint’s use of male pronouns, yet neither co-wrote a book with Flint in this dataset.
I also isolated the two female authors, Ellen Guon and Rosemary Edgehill, as well as the writing team of Sharon Lee and Steve Miller. Their works differ greatly from Drake’s and Flint’s, yet have a wide range across the graph. Since there are very few works present in the dataset for these authors, I can only infer that their writing styles tend to vary. Lee and Miller’s use of the word he varies, which is to be expected since they commonly write together. However, Guon and Edgehill use the pronoun he much less than the male authors in this dataset, indicating their works do not feature male characters as prominently.
Figure 2. This graph shows a comparison of the words “she” and “said” across all the texts in the corpus.
Figure 2 shows the comparison of the words she and said. Looking at the graph, Rosemary Edgehill has the most prominent usage of the pronoun she. Ellen Guon, Sharon Lee, and Steve Miller’s works come in a close second, with Michael Williamson’s works a close follow-up. The rest of the novels feature the word she a moderate amount or not at all. Tom Kratman’s Training for War noticeably has zero uses of the word she. Eric Flint, Tom Kratman, Andre Norton, John Ringo, and John Joseph Adams all have very low uses of the word she. In comparison, they use the word said quite a bit, which is reflected in their use of the word he in Figure 1.
In many ways Figure 2 looks as if it could be the inverse of Figure 1, though there are interesting similarities between the two. For instance, John Ringo’s works occupy similar positions in both Figure 1 and Figure 2. There is obviously more emphasis on the word he than she, but this still hints at a sort of inclusiveness not found in other authors’ works. In contrast, while Drake and Flint’s works were more widespread in Figure 1, they are much more clustered together here, hinting that their works are predominantly male in style.
Figure 3. This graph shows a comparison of the words “they” and “said” across all the texts in the corpus.
Figure 3 shows the comparison of the words they and said. I wanted to conduct an analysis of the word they because it can represent a gender-neutral pronoun while also identifying a group. This analysis does not take the authors themselves into account as much, because to my knowledge none of them identify as gender neutral. However, this graph does provide insight into whether groups exist in the context of these books. Andre Norton uses they the most in her works, yet does not use said in a similar manner. This hints that they is likely used in the context of action and thought rather than speech. Tom Kratman and David Weber also appear to use they more in their works, but the difference is not as pronounced as in the two previous examples. Interestingly enough, John Ringo’s works occupy similar positions in this graph, hinting that his works are some of the most consistent and well-rounded in this dataset.
Conclusions and Discussion of Further Research
After examining the graphs, it is plain to see that feminine pronouns are lacking from this male-centric dataset. These results are not altogether surprising; however, they do indicate a lack of representation of the sexes in Baen Books science fiction. The graphs testing he and she versus said are meant to distinguish gender within the corpus while excluding specific names that this analysis would not have been able to pick up on. Using the word they produced less favorable results, as there appear to be fewer instances of they in the corpus than of he and she. This indicates male science fiction authors from Baen Books prefer to focus on male characters. Author writing styles can differ greatly from one another; however, when examining gender, many of these authors have similar stylometrics.
Even though there is some movement, David Drake and Eric Flint, the two authors with the most works in this dataset, appear to remain in the same general area on all three graphs. Their writing styles do not differ with respect to gender, indicating a degree of uniformity when writing about gender. A similar phenomenon occurs with John Ringo, though it is not as noticeable since Baen Books does not offer as many free books by him. In contrast, Andre Norton’s works are scattered throughout the three graphs and do not really indicate any particular style concerning gender.
Since I have not read any of these books, it is hard to say why these authors have a particular predilection toward one gender or another. Still, it is plain to see that male science fiction authors tend to gravitate toward a masculine style with male characters. One of the limitations I was acutely aware of while conducting this analysis was the lack of female science fiction authors. Originally I had a list of well-known female science fiction authors, including Ursula K. Le Guin, Margaret Atwood, Joanna Russ, and more, but their works are under copyright and I had no way to obtain proper digital copies in time to complete this project. This is why I went with Baen Books, since their books are free. Should a similar study be conducted in the future, I would recommend that more female authors be examined. I would also recommend looking outside the Baen Books corpus for a much broader dataset. As it stands, I consider this project a case study of a much wider and far more interesting topic. But for now, I am happy with how this project turned out.
If you are interested, you can learn more about stylometric analysis using Python from The Programming Historian, which has an excellent introductory lesson that goes more in depth with The Federalist Papers as an example. This version of the stylometric analysis was inspired by Ben Blatt’s fingerprint method from Nabokov’s Favorite Word Is Mauve. Many thanks as well to Dr. Zach Whalen from the University of Mary Washington for creating the working example on which this project is based. All documentation related to this project can be found in my Stylometry GitHub repository, including links to the Colab Notebook used to generate all of the examples found in this post.
- François Dominic Laramée, “Introduction to Stylometry with Python,” ed. Adam Crymble, The Programming Historian 147, no. 7 (April 21, 2018), https://doi.org/10.46430/phen0078.
- Ben Blatt, Nabokov’s Favorite Word Is Mauve: What the Numbers Reveal about the Classics, Bestsellers, and Our Own Writing (New York: Simon and Schuster, 2017), 67.
- Blatt, Nabokov’s Favorite Word Is Mauve, 67.
- Patrick Juola, “How a Computer Program Helped Show J.K. Rowling Write a Cuckoo’s Calling,” Scientific American, August 20, 2013, https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/.
- Signet Books, The Running Man Book Cover, 2022, Wikipedia Commons, 2022, https://en.wikipedia.org/wiki/The_Running_Man_%28King_novel%29#/media/File:Runningmanbachman.jpg.
- J.K. Rowling, The Cuckoo’s Calling Book Cover, 2018, Wikipedia Commons, 2018, https://en.wikipedia.org/wiki/The_Cuckoo%27s_Calling#/media/File:TheCuckoo’sCalling(first_UK_edition)cover.jpg.
- Zofia Wąchocka, “The Left Hand of Genre: A Stylometric Examination of (the Absence Of) Genre Signal in Science-Fiction and Fantasy Works of Ursula K. Le Guin, C. J. Cherryh and Orson Scott Card,” Jagiellonian University (PhD Thesis, 2014), https://ruj.uj.edu.pl/xmlui/handle/item/200802.
- Naomi K. Fraser, “Style in Science Fiction and Fantasy,” (PhD Thesis, 2017), https://nova.newcastle.edu.au/vital/access/services/Download/uon:32285/ATTACHMENT01, 9.
- Fraser, “Style in Science Fiction and Fantasy,” 87.
- Michal Místecký and Tomi S. Melka, “Literary ‘Higher Dimensions’ Quantified: A Stylometric Study of Nine Stories,” Glottotheory 12, no. 2 (September 27, 2021): 129–57, https://doi.org/10.1515/glot-2021-2021.
- Baen Books, Baen Books Logo, 1983, Baen Books Publishing House, 1983, https://www.baen.com/.
- Stéfan Sinclair and Geoffrey Rockwell, “Summary”, Voyant Tools, accessed April 18, 2022, https://voyant-tools.org/?view=Summary&corpus=99c84493753ec8ee6b3d4fa0496a11c3.
- Stéfan Sinclair and Geoffrey Rockwell, “Trends”, Voyant Tools, accessed April 18, 2022, https://voyant-tools.org/?view=Trends&query=said&query=like&query=just&query=time&query=know&corpus=99c84493753ec8ee6b3d4fa0496a11c3.