Humanities Data Cleaning

A soft introduction to Data Cleaning with OpenRefine

Sebastian Barzaghi | sebastian.barzaghi2@unibo.it | https://orcid.org/0000-0002-0799-1527

What is humanities data?

Any value assigned to something that can be quantified, qualified, or interpreted in some way and used as informative evidence

Numbers, words, images, videos, photos, audio records, interviews, manuscripts, notes, collections…

To be useful, data should be organized

To be understood by yourself in the future

To be machine-readable (Interoperable in FAIR)

To be understood by others (Reusable in FAIR)

All data is messy

Never take your data at face value

As much as 80% of data analysis time is spent cleaning data and preparing it for further manipulation and analysis

Cleaning and preparation must be iterative

Necessary for working with datasets

A dataset is a collection of data

Every value belongs to a variable and an observation

Every variable forms a column

Every observation forms a row

Data cleaning is hard (but worth it)!

  • Prepare the data
  • Eliminate redundancy
  • Separate or combine values
  • Fix errors and inconsistencies (e.g. duplicates, empty values, inconsistent spelling or formatting, etc.)
  • Standardize when possible
  • Treat NULL values
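
The steps above can be sketched with pandas. This is a minimal illustration under assumed column names and data, not what OpenRefine does internally:

```python
import pandas as pd

# A small messy table (hypothetical data for illustration)
df = pd.DataFrame({
    "author": ["Manzoni, Alessandro", "Manzoni, Alessandro", "  A. Manzoni "],
    "year": ["1827", "1827", None],
})

df["author"] = df["author"].str.strip()  # fix stray whitespace
df = df.drop_duplicates()                # eliminate redundancy
df["year"] = df["year"].fillna("NA")     # treat NULL values consistently
```

Note that separating or combining values and standardizing against an authority file are further steps on top of this skeleton.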

Common mistakes

Never modify your original data

Always make a copy before making changes

Back up your files

Keep track of all the steps

Save your files in the UTF-8 encoding

Always describe your data

Document everything necessary to understand what is in the dataset and how to use it

Consider:

  • The who, what, when, where, and how of data collection
  • How to find and access the data
  • Suggestions on the suitability of the data for answering specific questions
  • Warnings about known problems or inconsistencies in the data
  • Information to check that the data are properly imported

Use standard table formats

https://r4ds.had.co.nz/tidy-data.html

  • Every variable must have a separate column
  • Every observation must have a separate row
  • Every cell should contain a single data value
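
A quick sketch of the difference, using pandas and invented data: in the "untidy" table the year variable is spread across column headers; `melt` reshapes it so every variable has its own column and every observation its own row:

```python
import pandas as pd

# Untidy: the year variable is spread across the column headers (hypothetical data)
wide = pd.DataFrame({
    "site": ["Bologna", "Ravenna"],
    "1999": [10, 3],
    "2000": [12, 5],
})

# Tidy: one column per variable (site, year, count), one row per observation
tidy = wide.melt(id_vars="site", var_name="year", value_name="count")
```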

Avoid aggregating comments and other information with data

https://datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html

Examples: comments placed within data cells, measurement units included in data cells, entering more than one type of information in a cell, including metadata in the table, etc.

Instead:

  • Add units to the column title or into a separate column
  • Add extra information to a separate column
  • Add metadata in a separate document

Do not use formatting to convey information

https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes.html

Examples: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.

Instead, record that information explicitly, e.g. in a dedicated column that flags the rows to exclude.

Do not use multiple tables or tabs

https://datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html

Multiple data tables or tabs within a single spreadsheet

If possible, combine everything into one table, or store each table in a separate file

Do not use special characters

https://datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html

Whenever possible, avoid special characters in column headings and text: / \ : * . ? ' < > [ ] ( ) & $ æÆøØåÅäÄ

Always avoid using spaces in column headings - instead, use underscores or CamelCase

Be consistent

  • Avoid dates displayed in many different formats (e.g. “12 July 2024”, “12/07/2024”, “12-07-2024”, etc.)
  • Avoid names displayed in many different variants (e.g. “Alessandro Manzoni”, “A. Manzoni”, “Manzoni”, “Manzoni, Alessandro”, etc.)
  • Use international standard formats (e.g. YYYY-MM-DD for dates)
  • Use authority records for names (e.g. VIAF)
  • Be consistent in your capitalization of words, choice of delimiters, and naming conventions for variables
  • Avoid using your delimiter in the data itself
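
Date normalization like the example above can be scripted. A minimal sketch covering only the three formats listed (assuming an English locale for month names):

```python
from datetime import datetime

def normalize_date(text: str) -> str:
    """Try a few known layouts and emit the ISO 8601 form YYYY-MM-DD."""
    for fmt in ("%d %B %Y", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print(normalize_date("12 July 2024"))  # 2024-07-12
```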

Eliminate redundancy (with caution!)

Includes:

  • Duplicates
  • Irrelevant observations
  • Incomplete data
  • Invalid data
  • Conflicting data

It’s all interpretation: consider whether an observation or a column should be removed or not!

Always treat NULL values

Represent missing values in one consistent way that downstream tools can handle and that is unlikely to cause errors

Consider that:

  • It can be difficult to know if a value is missing or was overlooked during data entry
  • Blanks can be confusing when spaces or tabs are used as delimiters in text files

NA and NULL are reasonable null values
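The point about consistency can be shown with pandas, which lets you map all the null spellings in a file onto a single missing-value marker at import time (file contents are invented):

```python
import io
import pandas as pd

# A tiny CSV mixing three null conventions: "NA", an empty field, and "NULL"
csv = io.StringIO("title,year\nOedipus,NA\nSphinx,\nHeracles,NULL\n")

# Treat all three spellings as the same missing-value marker
df = pd.read_csv(csv, na_values=["NA", "NULL"])
print(df["year"].isna().sum())  # 3 missing values
```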

Perform basic quality control

Check for:

  • Grammatical errors
  • Inconsistent use of upper and lowercase
  • Inconsistent titles for columns
  • Inconsistent units of measurement, data types (e.g., numeric, character), naming schemes, etc.
  • Out-of-range values
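
Some of these checks are easy to automate. A sketch with invented column names and thresholds:

```python
import pandas as pd

# Hypothetical cleaned table to check
df = pd.DataFrame({
    "cho_type": ["statue", "Statue", "vase"],
    "cho_year": [1705, 1706, 20240],  # one out-of-range value
})

# Inconsistent casing: more distinct raw values than case-folded ones
case_issues = df["cho_type"].nunique() - df["cho_type"].str.lower().nunique()

# Out-of-range values (thresholds are arbitrary examples)
suspicious = df[(df["cho_year"] < -3000) | (df["cho_year"] > 2025)]
print(case_issues, len(suspicious))
```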

Export and share

Use an established and open license to let others know exactly what they can and cannot do with the data (e.g. Creative Commons Zero (CC0))

Use open formats to make the data interoperable (e.g. CSV), when possible

Consider sharing your data on GitHub and/or publishing it on a repository (e.g. Zenodo, Figshare, etc.) to make it visible, accessible and stable

OpenRefine

A tool for tabular data cleaning

  • Easy to identify and fix errors
  • Easy to combine data from different sources
  • Does not change the original file
  • All actions are reversible and tracked
  • Documentation can be published alongside the data
  • Workflow can be saved and applied to new datasets

Some common scenarios

You want to:

  • Know how many times a particular value appears in a column in your data
  • Know how values are distributed across your whole data set
  • Normalize dates that are formatted in different ways, changing them all to a single common date format
  • Normalize names or terms that differ from each other but refer to the same people, places or concepts
  • Split values that have been combined in a single cell into individual pieces of data
  • Align your data to an external data source

Installing OpenRefine

Creating a project

  • Download the Mythologiae dataset (link in chat)
  • Create Project > This computer
  • Click on Choose files > mythologiae-dataset.csv
  • Click on Next
  • Parse data as > CSV / TSV / separator-based files
  • Character encoding > UTF-8
  • Select Columns are separated by > commas (CSV)
  • Flag Use character " to enclose cells containing column separators
  • Flag Trim leading & trailing whitespace from strings
  • Select Parse next 1 line(s) as column headers
  • Flag Store blank rows
  • Flag Store blank cells as nulls
  • Click on Create project

The project screen

  • Permalink allows you to return to a project at a specific view state, with facets and filters applied, by putting view-specific information directly into the URL
  • Open opens a new browser tab showing the Create Project screen, where you can change settings, start a new project, or open an existing project
  • Export allows you to pick a format for exporting a dataset
  • Help opens the OpenRefine user manual in a new browser tab

The grid screen

  • Where the data of your project is displayed in a tabular format
  • Number of total rows
  • Rows mode vs. records mode
  • Number of rows / records visible on screen at one time

Facet/Filter

Main ways to explore your data, by displaying patterns, trends and subgroups

  • Refresh updates each facet with the latest information
  • Reset all resets all facets without removing them
  • Remove all removes all facets

Undo/Redo

  • Project's history as a list of changes in order
  • Autosaves every 5 minutes and when OpenRefine is closed (e.g. with CTRL + C in the terminal)
  • Any activity that changes the data can be undone
  • The change history of each project is saved with the project itself

Row mode, record mode

  • Row: a series of cells, one cell per column
  • Record: a collection of one or more rows
  • Sometimes there are multiple pieces of information in one cell (e.g. dcho_keyword)
  • If you split these values, each one is shown in a different row, but they all belong to the same record, and you can use OpenRefine's records mode to visualize them correctly

Explore data with Facets

Facet: an aspect of data variance in a column

Gives a big picture look at the data

Allows further operations, like filtering and clustering

  • Typically, you create a facet on a particular column
  • Click on the triangle in front of the column name (e.g. dcho_theme)
  • Select Facet
  • Select a Facet of your choice (e.g. Text facet)

Text facet

  • Takes the total contents of the cells of the column in question and matches them up
  • Sort by name or count
  • Mass-edit every identical cell in the column
  • Great way to have a look at the data and fix typos, whitespace, etc.
  • Multiple facets are stacked on top of each other (e.g. when cho_author is added)

Filter

  • Filters can be added on a column
  • Useful for identifying precise pieces of data
  • Click on the arrow and select Text filter
  • The filter shares the same space with the facets
  • Type in the text you are searching for (e.g. "edipo")
  • Case-insensitive by default

Filter with Facets

  • You can also filter through facets
  • Click on any entry in a facet (e.g. "Allston, Washington" in the cho_author facet)
  • Only the rows with that value are shown
  • You can use multiple facets to enhance filtering
  • You can use customized facets for certain effects (e.g. filtering in rows with blank cells)

Sorting

  • Determines the order in which rows are shown based on certain conditions
  • Click on the arrow and select Sort
  • Select how to treat the cell values during sorting (e.g. if they are dates, you most likely want to select dates)
  • You can also select the sorting order
  • Click on OK

Clustering

  • A great way to fix inconsistencies found with facets
  • Uses a variety of comparison methods to find text entries that are similar but not exact, then shows those results so that you can merge the cells that should match
  • Create a facet, then click on Cluster
  • Experiment with different algorithms to detect different clusters
  • For each cluster, you can merge the values into a single consistent value by flagging Merge?
  • By default, the most common value in the cluster is used as the new value, but you can change it
  • Click on Merge selected & re-cluster to check, and Merge selected & close to finish

Cell editing

OpenRefine offers a number of features to edit and improve the contents of cells automatically and efficiently

  • Editing through a text facet: click on Edit to the right of the facet, and type in a new value
  • Using a find/replace function: select Edit cells > Replace to input a string to search and a string to replace it with
  • Editing individual cells: hover over a cell and click on Edit

Splitting

  • Sometimes a cell can contain multiple values (e.g. the dcho_keyword column: multiple values separated by commas)
  • Click on the arrow and select Edit cells > Split multi-valued cells
  • You can decide how to split the cells; ideally, you should always use a separator (e.g. ,, |, ;, etc.)
  • Click on OK
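
The same split-then-one-row-per-value operation can be sketched in pandas (invented data; in OpenRefine the split rows stay grouped as one record):

```python
import pandas as pd

df = pd.DataFrame({
    "cho_title": ["Oedipus and the Sphinx"],
    "dcho_keyword": ["oedipus, sphinx, riddle"],
})

# Split on the separator, then give each value its own row
df["dcho_keyword"] = df["dcho_keyword"].str.split(",")
long = df.explode("dcho_keyword")
long["dcho_keyword"] = long["dcho_keyword"].str.strip()
```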

Transforming

A powerful way to enforce changes that cannot be achieved through simple facets, filters or clusters

  • trimming leading and trailing whitespaces
  • splitting data into multiple columns
  • removing punctuation
  • standardising a data format
  • extracting a particular type of data from a textual string

Transformations can either be chosen from presets or written ad hoc in OpenRefine's expression language, GREL
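
A few of the listed transformations, sketched in Python for illustration (in OpenRefine you would write GREL expressions instead, e.g. value.trim() for trimming; the input value is invented):

```python
import re

value = "  Price: 12,50 EUR!!  "

trimmed = value.strip()                           # trim leading/trailing whitespace
no_punct = re.sub(r"[!?.]", "", trimmed)          # remove punctuation
number = re.search(r"\d+,\d+", no_punct).group()  # extract a number from the string
print(number)  # 12,50
```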

Reconciliation

Semi-automated process of matching data with external sources

External sources must offer a compliant web service in order to work

Useful for:

  • Fixing spelling or variations in proper names
  • Cleaning up manually-entered values against authority files
  • Linking your data to an existing dataset

Human judgment is required to review and approve the results

Happens by default through string searching

Clean and cluster data before reconciling

Reconciliation

  • Click on the arrow and select Reconcile > Start reconciling
  • Select the reconciliation service (e.g. VIAF) and click on Next
  • Depending on the column and the reconciliation service selected, choose the type of entity you want to reconcile (e.g. Person)
  • Click on Start reconciling
  • Some cell values are reconciled directly (e.g. Houdon, Jean Antoine)
  • Others need manual validation (e.g. Gérard, Francois Baron)
  • Click on the single check to reconcile only that cell, or the double check to extend the match to all identical cells

What to do

dcho_theme

  • Set a common separator (|)
  • Split
  • Cluster
  • Edit (e.g. keep only the last value if there is a hierarchy, delete “Non categoria”, etc.)
  • Standardize when possible (see Iconclass)
    • sign in to Iconclass
    • search for the theme
    • choose the most appropriate option
    • replace the existing theme in the dataset with the ID of the option found
    • e.g. L’Enigma della Sfinge → 94T33
    • e.g. Eracle cattura il cinghiale di Erimanto → 94L324
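
Once the themes are split and clustered, the replacement step can be a simple lookup. The two Iconclass IDs below come from the examples above; the fallback behaviour is an illustrative choice:

```python
# Iconclass IDs taken from the examples above
ICONCLASS = {
    "L'Enigma della Sfinge": "94T33",
    "Eracle cattura il cinghiale di Erimanto": "94L324",
}

def to_iconclass(theme: str) -> str:
    # Keep the original label when no ID has been found yet
    return ICONCLASS.get(theme, theme)

print(to_iconclass("L'Enigma della Sfinge"))  # 94T33
```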

cho_century

  • Set a common separator (|)
  • Edit
  • Standardize (see EDTF)
    • e.g. II century → 0100-01-01/0199-12-31
    • e.g. II century B.C. → -0199-01-01/-0100-12-31
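
The century-to-EDTF conversion follows a fixed arithmetic pattern, so it can be scripted. A minimal sketch that covers only labels like the two examples above (Roman numeral plus "century", optionally "B.C."):

```python
# Roman numeral values needed for century labels (partial table)
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
         "VII": 7, "VIII": 8, "IX": 9, "X": 10}

def century_to_edtf(label: str) -> str:
    """Turn e.g. 'II century' / 'II century B.C.' into an EDTF interval."""
    n = ROMAN[label.split()[0]]
    if label.endswith("B.C."):
        # The 2nd century B.C. runs from 199 B.C. to 100 B.C.
        return f"-{n * 100 - 1:04d}-01-01/-{(n - 1) * 100:04d}-12-31"
    # The 2nd century A.D. runs from year 100 to year 199
    return f"{(n - 1) * 100:04d}-01-01/{n * 100 - 1:04d}-12-31"

print(century_to_edtf("II century"))  # 0100-01-01/0199-12-31
```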

cho_date

  • Set a common separator (|)
  • Edit (e.g. remove “ca.”, “circa”, etc.)
    • Uncertainty (“ca.”, “circa”, etc.) has to be expressed in a separate column cho_date_uncertainty which only accepts True (yes) or False (no)
  • Standardize (see EDTF)
    • e.g. 460 B.C. → -0460
    • e.g. 550-530 B.C. → -0550/-0530
    • e.g. 1705-1706 → 1705/1706
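
These year conversions can also be scripted. A minimal sketch that handles only the three patterns shown above (single years and year ranges, optionally "B.C."):

```python
def date_to_edtf(text: str) -> str:
    """Convert '460 B.C.', '550-530 B.C.', '1705-1706' to EDTF."""
    bc = text.endswith("B.C.")
    years = text.replace("B.C.", "").strip().split("-")
    sign = "-" if bc else ""  # EDTF uses a leading minus for B.C. years
    return "/".join(f"{sign}{int(y):04d}" for y in years)

print(date_to_edtf("460 B.C."))  # -0460
```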

cho_sources_classic

  • Set a common separator (|)
  • Edit (e.g. remove citations that are not in canonical form)
    • Non-canonical citations should be saved in a separate column cho_sources_other
  • Standardize
    • Look for the citation on Perseus
    • e.g. Apollodoro, Biblioteca, II 5 → Apollod. 2.5
    • e.g. Omero, Odissea, XI vv.601-604 → Hom. Od. 11.601

Other data

  • All: check for NULL values, fix typos, replace separators with |
  • dcho_keyword: Split, Cluster, Edit
  • cho_title: Edit
  • cho_author: Cluster, Edit, Reconcile with VIAF (Person)
  • cho_period: Cluster, Edit
  • cho_type: Cluster, Edit
  • cho_location: Cluster, Edit, Reconcile with VIAF (Organization)

Conclusion

There is already a lot you can do to increase data quality

  • Explore data by using facets, filters and sorting
  • Transform data by single- and mass-editing, clustering and replacing
  • Reconcile data with external sources

Try and experiment! You are working on a copy, and you can always trace your steps back