
Reporting in from my second iteration


[Image] Fox Rug Claraloo Variation 2, submitted for copyright registration by Daniela Jean Robbins in 2017. Rejected for the second and final time in March 2019.


To kick off our second iterations, students have been asked to produce a deliverable using a digital tool and to use that deliverable to "put forward a thesis." In the weeks that have elapsed since my last update, I have successfully "pushed the button" that is MALLET and generated a few iterations of results. Considering these has enabled me to offer the following thesis to summarize my process thus far: I employed MALLET, a topic modeling tool, hoping to find the philosophical values embedded in decisions of the U.S. Copyright Office Review Board. What I found were inferences. That said, my original assumption, that digital tools and methods can be used to identify such values in determination letters from the Review Board, still appears valid. If it does indeed prove to be, however, it will only be so through human involvement, investigation and judgement.


Reflecting on my Mindfulness Practice Journal (MPJ) has allowed me to retrace my steps and better interpret my process as I moved through this iteration. After struggling at length with my command line, I received invaluable help from the University of Pittsburgh IT Help Desk as well as from Hillman Library's Humanities Data Librarian Tyrica Terry Kapral, each of whom helped me to install and configure not just MALLET but also the Topic Modeling Tool, which served as a friendlier graphical user interface for interacting with the software. My first run of the entire get-up was on October 16, 2021, revealing in about fifteen seconds that some honing of my data was necessary to produce relevant results. Here's the list of topics that it generated:


[Screenshot: topic list from the October 16 run]

How did I know that my results weren't relevant? As is evident in my MPJ, I didn't, but prior to that first run I had the good fortune of reading both Dr. Matthew Burton's "The Joy of Topic Modeling" and the Quickstart Guide from the Topic Modeling Tool Blog. I took to heart their recommendations (respectively) to seek a "reasonable representation" of the corpus in my results and to use my intuition to judge whether that had been achieved. My intuitive interpretation of the above was that it had not.


I noted in my MPJ that the next steps were to "remove numbers and product names," the latter being a reference to the inclusion of such terms as "claraloo," "balloon" and "donut". These each refer to the submitted works themselves, some of the more-or-less "tangible nouns" of the corpus. I sought more "intangible nouns" (if not downright adjectives), so my first step the following day was to clip my documents down to just the "Discussion" sections of each response letter. These were the paragraphs in which the qualities qualifying or disqualifying submitted works from copyright protection were discussed (see below for an example). My thinking was that if I cut out everything else, including signatures, salutations, administrative histories and conclusions, I could winnow the text analyzed down to that which was most relevant to my query. The collection of .txt files that resulted is appended at the end of this post, and formed the corpus used in each subsequent run of MALLET performed as part of my second iteration.
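I did this clipping by hand, but the step lends itself to a sketch in code. The following is a hypothetical Python version, assuming (as I cannot verify for every letter) that each document marks its sections with headings like "DISCUSSION" and "CONCLUSION" on their own lines:

```python
import re

def discussion_only(letter_text):
    """Return just the 'Discussion' section of a Review Board letter.

    Assumes, hypothetically, that sections are labeled with headings
    like 'DISCUSSION' and 'CONCLUSION'; real letters may vary and
    would still need manual checking.
    """
    match = re.search(
        r"DISCUSSION\s*\n(.*?)(?:\nCONCLUSION|\Z)",
        letter_text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    return match.group(1).strip() if match else ""
```

Run over a folder of .txt files, something like this could produce the trimmed corpus in one pass rather than letter by letter.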


It wasn't until several days later that I was able to attend to my stop words, the means by which I eliminated numbers from the tokens analyzed by MALLET. In this first effort at editing my stop words, I included all numbers from one to a randomly chosen 2260, as well as other regularly utilized terms such as "copyright," "protectable" and "work". It's notable to me, especially given a conversation that I had with Ms. Terry Kapral days later, that I didn't note this decision in my MPJ at all. I just included them because they were common, and it was my presumption that the common words would "throw" the statistics.
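A stop list like that needn't be typed out by hand. Here is a minimal Python sketch of the idea, assuming only that the tool accepts a plain text file with one stop word per line (the file name stopwords.txt is my own placeholder, not the file I actually used):

```python
# Build a stop-word list: the numbers 1-2260 plus a few common
# corpus terms, written one token per line for MALLET / the
# Topic Modeling Tool to consume.
extra_terms = ["copyright", "protectable", "work"]
stop_words = [str(n) for n in range(1, 2261)] + extra_terms

with open("stopwords.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(stop_words))
```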


Now, to be fully honest, I don't yet entirely understand whether they do or not, and I certainly don't understand how they do if they do. But through the aforementioned conversation with Ms. Terry Kapral, and another a day later with Dr. Alison Langmead, I began to understand that the "meaningless" or "throw-away" words in a corpus can become some of the most interesting if considered thoughtfully enough. One of my next steps, therefore, is to remove all actual words from my stop words list, leaving only numbers to be excluded.


The last MALLET run performed as part of this second iteration was done on October 21, 2021, and did exclude all of the stop words I referenced above. It was during this run that I also began to experiment with the number of topics returned. What you'll see below are the topics returned from runs requesting three, five and ten topics (these are also uploaded at the end of this post):


Three topics:
[Screenshot: three-topic results]

Five topics:
[Screenshot: five-topic results]

Ten topics:
[Screenshot: ten-topic results]

Were I pressed to analyze these right now, I would be inclined to investigate either the last in the list of five topics or the sixth (Topic id 5) in the list of ten, but it was again due to input from Ms. Terry Kapral and Dr. Alison Langmead that I know I'm not yet ready for in-depth analysis. The reason for this is that my corpus only contains 48 documents, which is too small a collection (of documents that themselves are rather short) to generate statistically insightful results. This was suggested by Dr. Langmead in her comments on my first iteration, felt intuitively true to me as I considered the results of my first MALLET run, and was effectively confirmed by Ms. Terry Kapral.


As a result, my next steps are to work with a bit of web scraping code provided by Ms. Terry Kapral, and to use it to download en masse all of the Review Board decision letters available through their dedicated database. Ms. Terry Kapral was also generous enough to share a short Python script that I hope to have the skill to leverage in converting the entire cache of Review Board decision letters from PDF to .txt. As mentioned in a prior post, I have been using Adobe to perform these conversions one-by-one. However, the process is "expensive" in terms of both time and computer bandwidth. Having the means to perform more expedient conversions will be essential as I move forward.
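I haven't yet seen how that script works, but a batch conversion of this kind might be sketched as follows. This is my own guess at its shape, not Ms. Terry Kapral's code; it assumes the third-party pypdf library, and any text it extracts would still need spot-checking against the originals:

```python
from pathlib import Path

def txt_path_for(pdf_path, out_dir):
    """Map a source PDF path to its destination .txt path."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")

def convert_all(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir to a plain-text file in out_dir."""
    # pypdf is imported lazily so txt_path_for stays usable without it.
    from pypdf import PdfReader

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        reader = PdfReader(pdf)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        txt_path_for(pdf, out_dir).write_text(text, encoding="utf-8")
```

Even a rough version of this would replace the one-by-one Adobe workflow with a single command over the whole cache.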


An additional "next step" for me is to begin considering the additional files returned after a "run" of MALLET. In addition to the lists of topics and words such as those captured above, MALLET also returns CSV data that include lists of the topics "contained" in each document, a breakdown of documents by topic, and topic proportions from the "topics-in-docs" file. This last sentence is a tortured paraphrase of the more finely detailed descriptions offered by the Topic Modeling Tool Blog's Quickstart Guide, necessitated by the fact that I don't yet understand those files well enough to describe them on my own. Spending enough time with them to change that will be an essential step in cultivating the judgement I will need to analyze my results.
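As a way of building that judgement, even a toy example helps me see what a per-document topic breakdown could look like. The CSV layout below is invented for illustration (one row per document, one proportion column per topic); the Tool's real files may be arranged differently, per the Quickstart Guide:

```python
import csv
import io

# A toy stand-in for a "topics-in-docs"-style CSV. The column layout
# here is assumed for illustration, not verified against the Tool.
sample = io.StringIO(
    "doc,topic0,topic1,topic2\n"
    "fox-rug.txt,0.10,0.70,0.20\n"
    "balloon.txt,0.55,0.15,0.30\n"
)

reader = csv.reader(sample)
header = next(reader)

dominant = {}
for row in reader:
    doc, proportions = row[0], [float(p) for p in row[1:]]
    # The "dominant" topic is simply the one with the largest proportion.
    dominant[doc] = proportions.index(max(proportions))

print(dominant)  # {'fox-rug.txt': 1, 'balloon.txt': 0}
```

Reading the real files this way, once I understand their columns, should make the topic proportions far less opaque than they are to me now.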


My Corpus:


My October 21 Results:

When you unpack the zipped file, you'll find that it contains one folder per run; the runs were differentiated by the number of topics requested. To find results in terms of topics alone (as depicted in the screenshots above), look for the files whose names contain the phrase "




©2021 by Rejects. Proudly created with Wix.com
