OpenRefine for Librarians

On October 24th, 2018, I gave a half-hour online presentation as part of a virtual conference from NISO called That Cutting Edge: Technology’s Impact on Scholarly Research Processes in the Library.

My presentation was called:

And here are the script and the slides that I presented:

Good afternoon. Thank you for the opportunity to introduce you to OpenRefine.

Even if you have heard of OpenRefine before, I hope you will still find this session useful, as I have tried to make an argument for why librarians should investigate technologies like OpenRefine for scholarly research purposes.

This talk has three parts.

I like to call OpenRefine the most popular library tool that you’ve never heard of.

After this introduction, I hope that statement will become just a little less true.

You can download OpenRefine from its official website, OpenRefine.org.

OpenRefine began as Google Refine and before that it was Freebase Gridworks, created by David François Huynh in January 2010 while he was working at Metaweb, the company that was responsible for Freebase.

In July 2010, Freebase and Freebase Gridworks were bought by Google, which adopted the technology and rebranded Freebase Gridworks as Google Refine. While the code was always open source, Google supported the project until 2012. From that point on, the project became a community-supported open source product and, as such, was renamed OpenRefine.

As an aside, Freebase was officially shut down by Google in 2016 and the data from that project was transferred to Wikidata, Wikimedia’s structured data project.  


OpenRefine is software written in Java. The program is downloaded onto your computer and accessed through your browser. OpenRefine includes its own web server software, so it is not necessary to be connected to the internet in order to make use of OpenRefine.

At the top of the slide is a screen capture of what you first see when you start the program. The dark black window is what opens behind the scenes if you are interested in monitoring the various processes that you are putting OpenRefine through. And in the corner of the slide, you can see a typical view of OpenRefine with data in it.

OpenRefine has been described by its creator as “a power tool for working with messy data”.

Wikipedia calls OpenRefine “a standalone open source desktop application for data cleanup and transformation to other formats, the activity known as data wrangling”.

OpenRefine is used to standardize and clean data across your file or spreadsheet.

This slide is a screenshot from the Library Carpentry set of lessons on OpenRefine. I like how they introduce you to the software by explaining some common scenarios in which you might use it.

These scenarios include

  • When you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
  • When you want to know how values are distributed across your whole data set
  • When you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format (a small GREL sketch of this follows the list)
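As a small illustration of that last scenario (a hedged sketch, assuming the column’s values can be parsed as dates), a single GREL expression applied to the date column via Edit cells > Transform… will normalize everything to one ISO-style format:

    value.toDate().toString("yyyy-MM-dd")

toDate() will also accept one or more format hints, for example value.toDate("dd MMM yyyy", "MM/dd/yyyy"), when the parser needs help with ambiguous entries.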

The software developers who maintain and extend OpenRefine run regular surveys to learn more about their fellow users, and in 2018 they found that their largest community was from libraries – a group that did not even register as its own category in the original 2012 survey.

So we know that the largest OpenRefine user group is librarians. Do we know how OpenRefine use measures up *within* the population of librarians? Unfortunately, we don’t.

While we wouldn’t expect such a specialized tool to be widely used across all the different types of librarians and library work, we can see from this recent survey of metadata managers at OCLC Research Library Partners that OpenRefine is the second most popular tool used, after MarcEdit.

That being said, once you know the power of OpenRefine, you will, like me, see all sorts of other potential uses for the tool outside of metadata cleanup. In August of this year, I read this tweet from fellow Canadian scholarly communications librarian Ryan Reiger and sent him some links with instructions that illustrated how OpenRefine could help with his research question.


When introducing new technology to others, it’s very important not to oversell it and to manage expectations.

LINK

But I’m not the only one who feels strongly about the power of OpenRefine, and for good reasons, which we will explore in the second section of this talk.

If you asked me what is the most popular technology used by librarians in their work and support of scholarship, I would say that one answer could be Microsoft Excel. Many librarians I know do their collections work, their electronic resources work, and their data work in Excel and they are very good at it.

But there are some very good reasons to reconsider using Excel for our work.

This slide outlines what I consider some of the strongest reasons to consider using OpenRefine. First, the software is able to handle more kinds of data than Excel can: Excel handles rows of data, while OpenRefine handles both rows and records (groups of related rows treated as a single unit).

For many day-to-day uses of Excel it is unlikely you will run into the maximum capacity of the software, but for those who work with large data sets, Excel’s limit of roughly one million rows (1,048,576, to be exact) can be a problem.

But the most important reason why we should consider OpenRefine is the same reason why it’s fundamentally different from Excel: unlike spreadsheet software such as Excel, OpenRefine stores no formulas in its cells.

Instead, formulas are used to transform the data, and those transformations are tracked as scripts.
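For example (a minimal sketch, using a hypothetical column of publisher names), a cleanup step applied through Edit cells > Transform… might use a GREL expression like this:

    value.trim().toTitlecase()

The expression runs across the whole column in one step, and what OpenRefine records in the project history is the transformation itself rather than a formula sitting inside each cell.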

Don’t use Excel / Genome Biology

Not only do the cells of Excel contain formulas that transform the data presented in ways that are not always clear, Excel sometimes transforms your data without clearly showing that it is doing so. According to a paper published in Genome Biology in 2016, roughly one fifth of genomics papers with supplementary Excel gene lists contained errors caused by Excel converting gene names such as SEPT10 into dates.

I want to be clear: I am not saying that Excel is bad, or that people who use Excel are bad.

We can all employ good data practices whether we use spreadsheets or data wrangling tools such as OpenRefine. I believe we have to meet people where they are with their data tool choices. This is part of the approach taken by the good people responsible for this series of lessons, part of the Ecology curriculum of Data Carpentry.

And with that, I just want to take the briefest of moments to thank the good people behind Software Carpentry and Data Carpentry – collectively now known as The Carpentries – as I am pretty sure it was their work that introduced me to the world of OpenRefine.

This slide is taken from the Library Carpentry OpenRefine lesson. There is too much text on the slide to read, but the gist of the message is this: OpenRefine saves every change you make to your dataset, and these changes are saved as scripts. After you clean up a set of messy data, you can export the script of the transformations you made and then apply that script to other, similarly messy data sets.

Not only does this save the wrangler time, the ability to save scripts separately from the data itself also lends itself to reproducible science.

Here is a screenshot of a script captured in OpenRefine, in both English and JSON.
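To give a flavour of the JSON side (a minimal, made-up sketch rather than the script in the screenshot, with a hypothetical Publisher column), an exported operation history is simply a list of operations that can be pasted into another project via Undo/Redo > Apply…:

    [
      {
        "op": "core/text-transform",
        "engineConfig": { "facets": [], "mode": "row-based" },
        "columnName": "Publisher",
        "expression": "grel:value.trim()",
        "onError": "keep-original",
        "repeat": false,
        "repeatCount": 10,
        "description": "Text transform on cells in column Publisher using expression grel:value.trim()"
      }
    ]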

It is difficult for me to express how important and how useful it is for OpenRefine to separate the work from the data in this way.

This is the means by which librarians can share workflows outside of their organizations without worrying about accidentally sharing private data.

link

As more librarians start using more complicated data sets and data tools, both to support research and in their own research, there will be more opportunities for embodying, demonstrating, and teaching good data practices.

I remember the instance in which I personally benefited from someone sharing their work with OpenRefine. It was this blog post from Angela Galvan, which walked me through the process of taking a list of ISSNs, running that list through the Sherpa Romeo API, and using the formula on the screen to quickly and clearly show whether or not a particular journal allowed publisher PDFs to be added to the institutional repository.
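The rough pattern, for anyone curious (the endpoint and field name below are placeholders for illustration, not the actual ones from Angela’s post), is to use Add column by fetching URLs on the ISSN column with a GREL expression that builds the request URL:

    "https://romeo-api.example.org/lookup?issn=" + escape(value, "url")

and then to add a second column that parses the fetched response, for instance with value.parseJson(), to pull out the archiving-policy field of interest.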

And with that, here’s a brief tour of how libraries are using OpenRefine in their work.

This is from a walkthrough by Owen Stephens in which he used both MarcEdit and OpenRefine to find errors in 50,000 bibliographic records and make the corrections necessary so that they could be loaded into a new library management system.

I haven’t spent much time highlighting it, but some of the most appreciated features of OpenRefine are its data visualizations, which allow the wrangler to find the differences that make a difference in the data.

The slide features two screen captures. In the lower one, OpenRefine has used fuzzy matching algorithms to discover variant entries that are statistically likely to be meant as the same value.

link

I mentioned previously that I had used OpenRefine to work with the Sherpa/Romeo API. OpenRefine’s ability to give access to APIs to users who may not be entirely comfortable with command-line scripting or programming should not be underestimated. That’s why lesson plans that use OpenRefine to perform such tasks as web scraping, as pictured here, are appreciated.
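The web-scraping case works the same way: fetch a column of HTML pages with Add column by fetching URLs, then extract what you need with GREL’s HTML functions. A hedged sketch (the CSS selector here is invented for illustration):

    value.parseHtml().select("div.article-title")[0].htmlText()

Because parseHtml(), select(), and htmlText() are built into GREL, no command-line scripting is required.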

link

With OpenRefine, libraries are finding ways to use reconciliation services for local projects. I am just going to read the last bit of the last line of this abstract for emphasis: a hack using OpenRefine yielded a 99.6% authority reconciliation and a stable process for monthly data verification. And as you now know, this was likely done through OpenRefine scripts.

OpenRefine has proved useful in preparing linked data…

link

And if staff feel more comfortable using spreadsheets, OpenRefine can be used to convert those spreadsheets into formats such as MODS XML.
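The piece of OpenRefine that typically does this is the templating export (Export > Templating…), where a row template mixes literal markup with GREL expressions that pull values from each row. A minimal, hypothetical sketch for a MODS-ish title element (the "Title" column name is invented):

    <mods:titleInfo>
      <mods:title>{{escape(cells["Title"].value, "xml")}}</mods:title>
    </mods:titleInfo>

The prefix, suffix, and row separator boxes in the templating dialog then supply the surrounding wrapper elements.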

link

Knowing the history of OpenRefine, you might not be surprised to learn that it has built-in capabilities to reconcile data against controlled vocabularies…

link

But you might be pleasantly surprised to learn that OpenRefine can reconcile data against VIVO, JournalTOC, VIAF, and FAST from OCLC.

link

But the data reconciliation service that I’m particularly following is the one from Wikidata.

In this video example, John Little uses data from VIAF and Wikidata to gather authoritative versions of author names plus additional information including their place of birth.

I think it’s only appropriate that OpenRefine connects to Wikidata when you remember that both projects had their origins in the Freebase project.

link

Wikidata is worthy of its own talk – and maybe even its own conference – but since we are very close to the end of this presentation, let me introduce you to Wikidata as structured linked data that anyone can use and improve.

link

I was introduced to the power of Wikidata, and how its work could extend our library work, by librarians such as Dan Scott and Stacy Allison Cassin. In this slide, you can see a screen capture from Dan’s presentation that highlights that the power of Wikidata is that it doesn’t just collect formal institutional identifiers, such as those from LC or ISNI, but also identifiers from sources such as AllMusic.

link

And this is the example that I would like to end my presentation on: the combination of OpenRefine and Wikidata, working together, allows the librarian not only to explore, clean up, and normalize their data sets, but also to extend that data and connect it to the world.

It really is magic.

Back to the Future of Libraries

I am in the process of reading Clive Thompson’s Coders: The Making of a New Tribe and the Remaking of the World and I have to say that I am, so far, disappointed with the book. I am a fan of Thompson’s technology journalism and I really enjoyed his earlier work, Smarter Than You Think: How Technology Is Changing Our Minds for the Better, so I thought I would be a good reader and order the book from my local as soon as it came out. And it’s not a bad book. The book does what it says it’s going to do on the tin: it is a book about the tribe of coders.

The trouble is mine: I am not interested in a measured account of the lives of coders in America. I think the status quo for computing is dismal.

The way that we require people to think like a computer in order to make use of computers is many things. It is de-humanizing. It is an unnecessary hardship. It feels wrong.

This is why Bret Victor’s Inventing on Principle (2012) presentation was (and remains) so incredible to me. Victor sets out to demonstrate that creators need (computer) tools that provide them with the most direct and immediately responsive connection to their creation as possible:

If we look to the past, we can find alternative futures for computing that might have served us better, if only we had gone down a different path. Here’s Bret Victor’s The Future of Programming (a 2013 talk delivered as if it were 1973), which you should watch at least a few minutes of, if only to appreciate his alternative to PowerPoint:

Here’s another video that looks to the past to see what other futures have been forsaken, and it is definitely worth your time. It is of and from Claire Evans – author of Broad Band: The Untold Story of the Women Who Made the Internet – who spoke at XOXO 2018.

At around the 11-minute mark, Evans sets the scene for the first unveiling of Tim Berners-Lee’s World Wide Web, and it’s a great story, because when Berners-Lee first demonstrated his work at Hypertext ’91, the other attendees were not particularly impressed. Evans explains why.

So why am I telling you all about the history of computing on my library-themed blog? Well, one reason is that our profession has not done a great job of knowing our own (female) history of computing.

Case in point: until this twitter exchange, I had not had the pleasure of knowing Linda Smith or her work:

(There is a now-deleted post from a librarian blog from 2012 that comes to mind. I’m not entirely sure of the etiquette of quoting deleted posts, so I will paraphrase the post as the following text…)

Despite librarianship being a feminized and predominantly female profession, [author of aforementioned blog post] remarked that she was never introduced to the following women in library school, despite their accomplishments: Suzanne Briet, Karen Spärck Jones, Henriette Avram, and Elaine Svenonius. And if my memory can be trusted, I believe the same was true for me.

Is there a connection between the more human(e) type of computing that Bret Victor advocates for, the computing innovations from women that Claire Evans wants us to learn from, and these lesser-known women of librarianship and its adjacent fields in computing? I think there might be.

From the Overlooked series of obituaries from The New York Times, for Karen Spärck Jones:

When most scientists were trying to make people use code to talk to computers, Karen Sparck Jones taught computers to understand human language instead.

In so doing, her technology established the basis of search engines like Google.

A self-taught programmer with a focus on natural language processing, and an advocate for women in the field, Sparck Jones also foreshadowed by decades Silicon Valley’s current reckoning, warning about the risks of technology being led by computer scientists who were not attuned to its social implications

“A lot of the stuff she was working on until five or 10 years ago seemed like mad nonsense, and now we take it for granted,” said John Tait, a longtime friend who works with the British Computer Society.

“Overlooked No More: Karen Sparck Jones, Who Established the Basis for Search Engines” by Nellie Bowles, The New York Times, Jan. 2, 2019

I have already given my fair share of future-of-the-library talks, so I think it is for the best if someone else takes up the challenge of looking into the work of librarians past to see if we can’t refactor our present into a better future.