Humanities Data Cleaning

A soft introduction to Data Cleaning with OpenRefine

Sebastian Barzaghi | sebastian.barzaghi2@unibo.it | https://orcid.org/0000-0002-0799-1527

What is humanities data?

Any value assigned to something that can be quantified, qualified, or interpreted in some way and used as informative evidence

Numbers, words, images, videos, photos, audio records, interviews, manuscripts, notes, collections…

To be useful, data should be organized

To be understood by yourself in the future

To be machine-readable (Interoperable in FAIR)

To be understood by others (Reusable in FAIR)

All data is messy

Never take your data at face value

As much as 80% of data analysis time is spent cleaning data and preparing it for further manipulation and analysis

Cleaning and preparation must be iterative

Necessary for working with datasets

A dataset is a collection of data

Every value belongs to a variable and an observation

Every variable forms a column

Every observation forms a row

Data cleaning is hard (but worth it)!

  • Prepare the data
  • Eliminate redundancy
  • Separate or combine values
  • Fix errors and inconsistencies (e.g. duplicates, empty values, inconsistent spelling or formatting, etc.)
  • Standardize when possible
  • Treat NULL values
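
The steps above can be sketched with pandas. This is a minimal illustration under assumed column names and data, not what OpenRefine does internally:

```python
import pandas as pd

# A small messy table (hypothetical data for illustration)
df = pd.DataFrame({
    "author": ["Manzoni, Alessandro", "Manzoni, Alessandro", "  A. Manzoni "],
    "year": ["1827", "1827", None],
})

df["author"] = df["author"].str.strip()  # fix stray whitespace
df = df.drop_duplicates()                # eliminate redundancy
df["year"] = df["year"].fillna("NA")     # treat NULL values consistently
```

Note that separating or combining values and standardizing against an authority file are further steps on top of this skeleton.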

Common mistakes

Never modify your original data

Always make a copy before making changes

Back up your files

Keep track of all the steps

Save your files in the UTF-8 encoding

Always describe your data

Document everything necessary to understand what is in the dataset and how to use it

Consider:

  • The who, what, when, where, and how of data collection
  • How to find and access the data
  • Suggestions on the suitability of the data for answering specific questions
  • Warnings about known problems or inconsistencies in the data
  • Information to check that the data are properly imported

Use standard table formats

https://r4ds.had.co.nz/tidy-data.html

  • Every variable must have a separate column
  • Every observation must have a separate row
  • Every cell should contain a single data value
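
A quick sketch of the difference, using pandas and invented data: in the "untidy" table the year variable is spread across column headers; `melt` reshapes it so every variable has its own column and every observation its own row:

```python
import pandas as pd

# Untidy: the year variable is spread across the column headers (hypothetical data)
wide = pd.DataFrame({
    "site": ["Bologna", "Ravenna"],
    "1999": [10, 3],
    "2000": [12, 5],
})

# Tidy: one column per variable (site, year, count), one row per observation
tidy = wide.melt(id_vars="site", var_name="year", value_name="count")
```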

Avoid aggregating comments and other information with data

https://datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html

Examples: comments placed within data cells, measurement units included in data cells, entering more than one type of information in a cell, including metadata in the table, etc.

Instead:

  • Add units to the column title or into a separate column
  • Add extra information to a separate column
  • Add metadata in a separate document

Do not use formatting to convey information

https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes.html

Examples: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.

Instead, record that information explicitly, e.g. in a dedicated column that flags the rows to exclude.

Do not use multiple tables or tabs

https://datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html

Multiple data tables or tabs within a single spreadsheet

If possible, combine everything into one table, or store each table in a separate file

Do not use special characters

https://datacarpentry.org/spreadsheet-ecology-lesson/01-format-data.html

Whenever possible, avoid special characters in column headings and text: / \ : * . ? ' < > [ ] ( ) & $ æÆøØåÅäÄ

Always avoid using spaces in column headings - instead, use underscores or CamelCase

Be consistent

  • Avoid dates displayed in many different formats (e.g. “12 July 2024”, “12/07/2024”, “12-07-2024”, etc.)
  • Avoid names displayed in many different variants (e.g. “Alessandro Manzoni”, “A. Manzoni”, “Manzoni”, “Manzoni, Alessandro”, etc.)
  • Use international standard formats (e.g. YYYY-MM-DD for dates)
  • Use authority records for names (e.g. VIAF)
  • Be consistent in your capitalization of words, choice of delimiters, and naming conventions for variables
  • Avoid using your delimiter in the data itself
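
Date normalization like the example above can be scripted. A minimal sketch covering only the three formats listed (assuming an English locale for month names):

```python
from datetime import datetime

def normalize_date(text: str) -> str:
    """Try a few known layouts and emit the ISO 8601 form YYYY-MM-DD."""
    for fmt in ("%d %B %Y", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print(normalize_date("12 July 2024"))  # 2024-07-12
```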

Eliminate redundancy (with caution!)

Includes:

  • Duplicates
  • Irrelevant observations
  • Incomplete data
  • Invalid data
  • Conflicting data

It’s all interpretation: consider whether an observation or a column should be removed or not!

Always treat NULL values

Represent missing values in one consistent way that downstream tools can handle and that is unlikely to cause errors

Consider that:

  • It can be difficult to know if a value is missing or was overlooked during data entry
  • Blanks can be confusing when spaces or tabs are used as delimiters in text files

NA and NULL are reasonable null values
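The point about consistency can be shown with pandas, which lets you map all the null spellings in a file onto a single missing-value marker at import time (file contents are invented):

```python
import io
import pandas as pd

# A tiny CSV mixing three null conventions: "NA", an empty field, and "NULL"
csv = io.StringIO("title,year\nOedipus,NA\nSphinx,\nHeracles,NULL\n")

# Treat all three spellings as the same missing-value marker
df = pd.read_csv(csv, na_values=["NA", "NULL"])
print(df["year"].isna().sum())  # 3 missing values
```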

Perform basic quality control

Check for:

  • Grammatical errors
  • Inconsistent use of upper and lowercase
  • Inconsistent titles for columns
  • Inconsistent units of measurement, data types (e.g., numeric, character), naming schemes, etc.
  • Out-of-range values
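
Some of these checks are easy to automate. A sketch with invented column names and thresholds:

```python
import pandas as pd

# Hypothetical cleaned table to check
df = pd.DataFrame({
    "cho_type": ["statue", "Statue", "vase"],
    "cho_year": [1705, 1706, 20240],  # one out-of-range value
})

# Inconsistent casing: more distinct raw values than case-folded ones
case_issues = df["cho_type"].nunique() - df["cho_type"].str.lower().nunique()

# Out-of-range values (thresholds are arbitrary examples)
suspicious = df[(df["cho_year"] < -3000) | (df["cho_year"] > 2025)]
print(case_issues, len(suspicious))
```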

Export and share

Use an established and open license to let others know exactly what they can and cannot do with the data (e.g. Creative Commons Zero (CC0))

Use open formats to make the data interoperable (e.g. CSV), when possible

Consider sharing your data on GitHub and/or publishing it on a repository (e.g. Zenodo, Figshare, etc.) to make it visible, accessible and stable

OpenRefine

A tool for tabular data cleaning

  • Easy to identify and fix errors
  • Easy to combine data from different sources
  • Does not change the original file
  • All actions are reversible and tracked
  • Documentation can be published alongside the data
  • Workflow can be saved and applied to new datasets

Some common scenarios

You want to:

  • Know how many times a particular value appears in a column in your data
  • Know how values are distributed across your whole data set
  • Normalize dates that are formatted in different ways, changing them all to a single common date format
  • Normalize names or terms that differ from each other but refer to the same people, places or concepts
  • Split values that have been combined in a single cell into individual pieces of data
  • Align your data to an external data source

Installing OpenRefine

Creating a project

  • Download the Mythologiae dataset (link in chat)
  • Create Project > This computer
  • Click on Choose files > mythologiae-dataset.csv
  • Click on Next
  • Parse data as > CSV / TSV / separator-based files
  • Character encoding > UTF-8
  • Select Columns are separated by > commas (CSV)
  • Flag Use character " to enclose cells containing column separators
  • Flag Trim leading & trailing whitespace from strings
  • Select Parse next 1 line(s) as column headers
  • Flag Store blank rows
  • Flag Store blank cells as nulls
  • Click on Create project

The project screen

  • Permalink allows you to return to a project at a specific view state, with facets and filters applied, by putting view-specific information directly into the URL
  • Open opens a new browser tab showing the Create Project screen, where you can change settings, start a new project, or open an existing project
  • Export allows you to pick a format for exporting a dataset
  • Help opens the OpenRefine user manual in a new browser tab

The grid screen

  • Where the data of your project is displayed in a tabular format
  • Number of total rows
  • Rows mode vs. records mode
  • Number of rows / records visible on screen at one time

Facet/Filter

Main ways to explore your data, by displaying patterns, trends and subgroups

  • Refresh updates each facet with the latest information
  • Reset all resets all facets without removing them
  • Remove all removes all facets

Undo/Redo

  • Project's history as a list of changes in order
  • Autosaves every 5 minutes and when OpenRefine is closed (e.g. with CTRL + C in the terminal)
  • Any activity that changes the data can be undone
  • The change history of each project is saved with the project itself

Row mode, record mode

  • Row: a series of cells, one cell per column
  • Record: a collection of one or more rows
  • Sometimes there are multiple pieces of information in one cell (e.g. dcho_keyword)
  • If you split these values, each one is shown in a different row, but they all belong to the same record, and you can use OpenRefine's records mode to visualize them correctly

Explore data with Facets

Facet: an aspect of data variance in a column

Gives a big picture look at the data

Allows further operations, like filtering and clustering

  • Typically, you create a facet on a particular column
  • Click on the triangle in front of the column name (e.g. dcho_theme)
  • Select Facet
  • Select a Facet of your choice (e.g. Text facet)

Text facet

  • Takes the total contents of the cells of the column in question and matches them up
  • Sort by name or count
  • Mass-edit every identical cell in the column
  • Great way to have a look at the data and fix typos, whitespace, etc.
  • Multiple facets are stacked on top of each other (e.g. when cho_author is added)

Filter

  • Filters can be added on a column
  • Useful for identifying precise pieces of data
  • Click on the arrow and select Text filter
  • The filter shares the same space with the facets
  • Type in the text you are searching for (e.g. "edipo")
  • Case-insensitive by default

Filter with Facets

  • You can also filter through facets
  • Click on any entry in a facet (e.g. "Allston, Washington" in the cho_author facet)
  • Only the rows with that value are shown
  • You can use multiple facets to enhance filtering
  • You can use customized facets for certain effects (e.g. filtering in rows with blank cells)

Sorting

  • Determines the order in which rows are shown based on certain conditions
  • Click on the arrow and select Sort
  • Select how to treat the cell values during sorting (e.g. if they are dates, you most likely want to select dates)
  • You can also select the sorting order
  • Click on OK

Clustering

  • A great way to fix inconsistencies found with facets
  • Uses a variety of comparison methods to find text entries that are similar but not exact, then shows those results so that you can merge the cells that should match
  • Create a facet, then click on Cluster
  • Experiment with different algorithms to detect different clusters
  • For each cluster, you can merge the values into a single consistent value by flagging Merge?
  • By default, the most common value in the cluster is used as the new value, but you can change it
  • Click on Merge selected & re-cluster to check, and Merge selected & close to finish

Cell editing

OpenRefine offers a number of features to edit and improve the contents of cells automatically and efficiently

  • Editing through a text facet: click on Edit to the right of the facet, and type in a new value
  • Using a find/replace function: select Edit cells > Replace to input a string to search and a string to replace it with
  • Editing individual cells: hover over a cell and click on Edit

Splitting

  • Sometimes a cell can contain multiple values (e.g. the dcho_keyword column: multiple values separated by commas)
  • Click on the arrow and select Edit cells > Split multi-valued cells
  • You can decide how to split the cells; ideally, you should always use a separator (e.g. ,, |, ;, etc.)
  • Click on OK
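
The same split-then-one-row-per-value operation can be sketched in pandas (invented data; in OpenRefine the split rows stay grouped as one record):

```python
import pandas as pd

df = pd.DataFrame({
    "cho_title": ["Oedipus and the Sphinx"],
    "dcho_keyword": ["oedipus, sphinx, riddle"],
})

# Split on the separator, then give each value its own row
df["dcho_keyword"] = df["dcho_keyword"].str.split(",")
long = df.explode("dcho_keyword")
long["dcho_keyword"] = long["dcho_keyword"].str.strip()
```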

Transforming

A powerful way to enforce changes that cannot be achieved through simple facets, filters or clusters

  • trimming leading and trailing whitespaces
  • splitting data into multiple columns
  • removing punctuation
  • standardising a data format
  • extracting a particular type of data from a textual string

Transformations can either be chosen from presets or written ad hoc in OpenRefine's expression language, GREL
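
A few of the listed transformations, sketched in Python for illustration (in OpenRefine you would write GREL expressions instead, e.g. value.trim() for trimming; the input value is invented):

```python
import re

value = "  Price: 12,50 EUR!!  "

trimmed = value.strip()                           # trim leading/trailing whitespace
no_punct = re.sub(r"[!?.]", "", trimmed)          # remove punctuation
number = re.search(r"\d+,\d+", no_punct).group()  # extract a number from the string
print(number)  # 12,50
```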

Reconciliation

Semi-automated process of matching data with external sources

External sources must offer a compliant web service in order to work

Useful for:

  • Fixing spelling or variations in proper names
  • Cleaning up manually-entered values against authority files
  • Linking your data to an existing dataset

Human judgment is required to review and approve the results

Happens by default through string searching

Clean and cluster data before reconciling

Reconciliation

  • Click on the arrow and select Reconcile > Start reconciling
  • Select the reconciliation service (e.g. VIAF) and click on Next
  • Depending on the column and the reconciliation service selected, choose the type of entity you want to reconcile (e.g. Person)
  • Click on Start reconciling
  • Some cell values are reconciled directly (e.g. Houdon, Jean Antoine)
  • Others need manual validation (e.g. Gérard, Francois Baron)
  • Click on the single check to reconcile only that cell, or the double check to extend the match to all identical cells

What to do

dcho_theme

  • Set a common separator (|)
  • Split
  • Cluster
  • Edit (e.g. keep only the last value if there is a hierarchy, delete “Non categoria”, etc.)
  • Standardize when possible (see Iconclass)
    • sign in to Iconclass
    • search for the theme
    • choose the most appropriate option
    • replace the existing theme in the dataset with the ID of the option found
    • e.g. L’Enigma della Sfinge → 94T33
    • e.g. Eracle cattura il cinghiale di Erimanto → 94L324
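
Once the themes are split and clustered, the replacement step can be a simple lookup. The two Iconclass IDs below come from the examples above; the fallback behaviour is an illustrative choice:

```python
# Iconclass IDs taken from the examples above
ICONCLASS = {
    "L'Enigma della Sfinge": "94T33",
    "Eracle cattura il cinghiale di Erimanto": "94L324",
}

def to_iconclass(theme: str) -> str:
    # Keep the original label when no ID has been found yet
    return ICONCLASS.get(theme, theme)

print(to_iconclass("L'Enigma della Sfinge"))  # 94T33
```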

cho_century

  • Set a common separator (|)
  • Edit
  • Standardize (see EDTF)
    • e.g. II century → 0100-01-01/0199-12-31
    • e.g. II century B.C. → -0199-01-01/-0100-12-31
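
The century-to-EDTF conversion follows a fixed arithmetic pattern, so it can be scripted. A minimal sketch that covers only labels like the two examples above (Roman numeral plus "century", optionally "B.C."):

```python
# Roman numeral values needed for century labels (partial table)
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
         "VII": 7, "VIII": 8, "IX": 9, "X": 10}

def century_to_edtf(label: str) -> str:
    """Turn e.g. 'II century' / 'II century B.C.' into an EDTF interval."""
    n = ROMAN[label.split()[0]]
    if label.endswith("B.C."):
        # The 2nd century B.C. runs from 199 B.C. to 100 B.C.
        return f"-{n * 100 - 1:04d}-01-01/-{(n - 1) * 100:04d}-12-31"
    # The 2nd century A.D. runs from year 100 to year 199
    return f"{(n - 1) * 100:04d}-01-01/{n * 100 - 1:04d}-12-31"

print(century_to_edtf("II century"))  # 0100-01-01/0199-12-31
```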

cho_date

  • Set a common separator (|)
  • Edit (e.g. remove “ca.”, “circa”, etc.)
    • Uncertainty (“ca.”, “circa”, etc.) has to be expressed in a separate column cho_date_uncertainty which only accepts True (yes) or False (no)
  • Standardize (see EDTF)
    • e.g. 460 B.C. → -0460
    • e.g. 550-530 B.C. → -0550/-0530
    • e.g. 1705-1706 → 1705/1706
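
These year conversions can also be scripted. A minimal sketch that handles only the three patterns shown above (single years and year ranges, optionally "B.C."):

```python
def date_to_edtf(text: str) -> str:
    """Convert '460 B.C.', '550-530 B.C.', '1705-1706' to EDTF."""
    bc = text.endswith("B.C.")
    years = text.replace("B.C.", "").strip().split("-")
    sign = "-" if bc else ""  # EDTF uses a leading minus for B.C. years
    return "/".join(f"{sign}{int(y):04d}" for y in years)

print(date_to_edtf("460 B.C."))  # -0460
```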

cho_sources_classic

  • Set a common separator (|)
  • Edit (e.g. remove citations that are not in canonical form)
    • Non-canonical citations should be saved in a separate column cho_sources_other
  • Standardize
    • Look for the citation on Perseus
    • e.g. Apollodoro, Biblioteca, II 5 → Apollod. 2.5
    • e.g. Omero, Odissea, XI vv.601-604 → Hom. Od. 11.601

Other data

  • All: check for NULL values, fix typos, replace separators with |
  • dcho_keyword: Split, Cluster, Edit
  • cho_title: Edit
  • cho_author: Cluster, Edit, Reconcile with VIAF (Person)
  • cho_period: Cluster, Edit
  • cho_type: Cluster, Edit
  • cho_location: Cluster, Edit, Reconcile with VIAF (Organization)

Conclusion

There is already a lot you can do to increase data quality

  • Explore data by using facets, filters and sorting
  • Transform data by single- and mass-editing, clustering and replacing
  • Reconcile data with external sources

Try and experiment! You are working on a copy, and you can always trace your steps back