What is humanities data?
Any value assigned to something that can be quantified, qualified or interpreted in some way to be used as an informative evidence
Numbers, words, images, videos, photos, audio records, interviews, manuscripts, notes, collections…
To be useful, data should be organized
To be understood by yourself in the future
To be machine-readable (Interoperable in FAIR)
To be understood by others (Reusable in FAIR)
All data is messy
Never take your data at face value
80% of data analysis is spent on the process of cleaning data and preparing it for further manipulation and analysis
Cleaning and preparation must be iterative
Necessary for working with datasets
A dataset is a collection of data
Every value belongs to a variable and an observation
Every variable forms a column
Every observation forms a row
Data cleaning is hard (but worth it)!
- Prepare the data
- Eliminate redundancy
- Separate or combine values
- Fix errors and inconsistencies (e.g. duplicates, empty values, inconsistent spelling or formatting, etc.)
- Standardize when possible
- Treat NULL values
Never modify your original data
Always make a copy before making changes
Back up your files
Keep track of all the steps
Save your files in the UTF-8 encoding
Always describe your data
Document everything necessary to understand what is in the dataset and how to use it
Consider:
- The who, what, when, where, and how of data collection
- How to find and access the data
- Suggestions on the suitability of the data for answering specific questions
- Warnings about known problems or inconsistencies in the data
- Information to check that the data are properly imported
Every variable must have a separate column
Every observation must have a separate row
Every cell should contain a single data value
Examples: comments placed within data cells, measurement units included in data cells, entering more than one type of information in a cell, including metadata in the table, etc.
Add units to the column title or into a separate column
Add information to a separate column
Add metadata in a separate document
Example: highlighting cells, rows or columns that should be excluded from an analysis, leaving blank rows to indicate separations in data.
Add units to the column title or into a separate column
Add information to a separate column
Add metadata in a separate document
Do not use multiple tables nor tabs
Multiple data tables or tabs within a single spreadsheet
If possible, combine everything into one table, or store each table in a separate file
Do not use special characters
Whenever possible, avoid special characters in column headings and text: / \: * . ? ‘ < > [ ] ( ) & $ æÆøØåÅäÄ
Always avoid using spaces in column headings - instead, use underscores or CamelCase
Be consistent
- Dates displayed in many different formats (e.g. “12 July 2024”, “12/07/2024”, “12-07-2024”, etc.)
- Names displayed in many different variants (e.g. “Alessandro Manzoni”, “A. Manzoni”, “Manzoni”, “Manzoni, Alessandro”, etc.)
- Use international standard formats (e.g.
YYYY-MM-DD for dates) - Use authority records for names (e.g. VIAF)
- Be consistent in your capitalization of words, choice of delimiters, and naming conventions for variables
- Avoid using your delimiter in the data itself
Eliminate redundancy (with caution!)
Includes:
- Duplicates
- Irrelevant observations
- Incomplete data
- Invalid data
- Conflicting data
It’s all interpretation: consider whether an observation or a column should be removed or not!
Always treat NULL values
Use a consistent way that is compatible and unlikely to cause errors (e.g. blank)
Consider that:
- It can be difficult to know if a value is missing or was overlooked during data entry
- Blanks can be confusing when spaces or tabs are used as delimiters in text files
NA and NULL are reasonable null values
Check for:
- Grammatical errors
- Inconsistent use of upper and lowercase
- Inconsistent titles for columns
- Inconsistent units of measurement, data types (e.g., numeric, character), naming schemes, etc.
- Out-of-range values
Export and share
Use an established and open license to let others know exactly what they can and cannot do with the data (e.g. Creative Commons Zero (CC0))
Use open formats to make the data interoperable (e.g. CSV), when possible
Consider sharing your data on GitHub and/or publishing it on a repository (e.g. Zenodo, Figshare, etc.) to make it visible, accessible and stable
- Easy to identify and fix errors
- Easy to combine data from different sources
- Does not change the original file
- All actions are reversible and tracked
- Documentation can be published alongside the data
- Workflow can be saved and applied to new datasets
Some common scenarios
You want to:
- Know how many times a particular value appears in a column in your data
- Know how values are distributed across your whole data set
- Normalize dates which are formatted in different ways, and want to change all the dates in the list to a single common date format
- Normalize names or terms that differ from each other but refer to the same people, places or concepts
- Separate bits of data combined together into individual data
- Align your data to an external data source
Creating a project
- Download the Mythologiae dataset (link in chat)
- Create Project > This computer
- Click on Choose files >
mythologiae-dataset.csv - Click on Next
Creating a project
- Parse data as > CSV / TSV / separator-based files
- Character encoding >
UTF-8 - Select Columns are separated by > commas (CSV)
- Flag Use character
" to enclose cells containing column separators - Flag Trim leading & trailing whitespace from strings
Creating a project
- Select Parse next
1 line(s) as column headers - Flag Store blank rows
- Flag Store blank cells as nulls
- Click on Create project
The project screen
Permalink allows you to return to a project at a specific view state, with facets and filters applied, by putting view-specific information directly into the URL
The project screen
Open opens up a new browser tab showing the Create Project screen to change settings, start a new project, or open an existing projectExport allows you to pick a format for exporting a datasetHelp will open up a new browser tab and bring you to this user manual on the web
The grid screen
- Where the data of your project is displayed in a tabular format
- Number of total rows
- Rows mode vs. records mode
- Number of rows / records visible on screen at one time
Facet/Filter
Main ways to explore your data, by displaying patterns, trends and subgroups
Refresh updates each facet with the latest informationReset all resets all facets without removing themRemove all removes all facets
Undo/Redo
- Project's history as a list of changes in order
- Autosave every 5 minutes and when closing with
CTRL + C - Any activity that changes the data can be undone
- The change history of each project is saved with the project itself
Row mode, record mode
- Row: a series of cells, one cell per column
- Sometimes, there are multiple pieces of information in one cell (e.g.
dcho_keyword) - In those cases, if you successfully split these values, you can use OpenRefine's records mode to visualize them correctly
Row mode, record mode
- Record: a collection of one or more rows
- The multiple pieces of information in one cell (e.g.
dcho_keyword), once split, are visualized each in a different row, but belong to the same record
Explore data with Facets
Facet: an aspect of data variance in a column
Gives a big picture look at the data
Allows further operations, like filtering and clustering
Explore data with Facets
- Typically, you create a facet on a particular column
- Click on the triangle in front of the column name (e.g.
dcho_theme) - Select Facet
- Select a Facet of your choice (e.g. Text facet)
Text facet
- Takes the total contents of the cells of the column in question and matches them up
- Sort by name or count
- Mass-edit every identical cell in the column
- Great way to have a look at the data and fix typos, whitespace, etc.
Text facet
- Multiple facets are stacked on the top of each other (e.g.
cho_author is added)
Filter
- Filters can be added on a column
- Useful for identifying precise pieces of data
- Click on the arrow and select Text filter
Filter
- The filter share the same space with the facets
- Type in the text you are searching for (e.g. "edipo")
- Case-insensitive by default
Filter with Facets
- You can also filter through facets
- Click on any entry in a filter (e.g. "Allston, Washington" in the
cho_author facet) - Only the rows with that value are shown
Filter with Facets
- You can use multiple facets to enhance filtering
Filter with facets
- You can use customized facets for certain effects (e.g. filter in rows with blank cells)
Sorting
- Determine the order in which rows are shown based on certain conditions
- Click on the arrow and select Sort
Sorting
- Select how to treat the cell values during sorting (e.g. if they are dates, you most likely want to select dates)
- You can also select the sorting order
- Click on OK
Clustering
- A great way to fix inconsistencies found with facets
- Uses a variety of comparison methods to find text entries that are similar but not exact, then shares those results with you so that you can merge the cells that should match
- Create a facet
- Click on Cluster
Clustering
- Click on Cluster
- Experiment with different algorithms to detect different clusters
- For each cluster, you can merge the values by replacing them with a single consistent value
- By default, the most common value in the cluster is used as the new value, but you can change it
Clustering
- Experiment with different algorithms to detect different clusters
- For each cluster, you can merge the values by flagging Merge?
- Click on Merge selected & re-cluster to check, and Merge selected & Close to finish
Cell editing
OpenRefine offers a number of features to edit and improve the contents of cells automatically and efficiently
- Editing through a text facet: click on Edit to the right of the facet, and type in a new value
- Using a find/replace function: select Edit cells > Replace to input a string to search and a string to replace it with
- Editing individual cells: hover over a cell and click on Edit
Splitting
- Sometimes a cell can contain multiple values (e.g.
dcho_keyword column: multiple values separated by commas ,) - Click on the arrow and select Edit cells
- Select Split multi-valued cells
Splitting
- You can decide how to split the cells
- Ideally, you should always use a separator (e.g.
,, |, ;, etc.) - Click on OK
A powerful way to enforce changes that cannot be achieved through simple facets, filters or clusters
- trimming leading and trailing whitespaces
- splitting data into multiple columns
- removing punctuation
- standardising a data format
- extracting a particular type of data from a textual string
Can be either preset or written ad-hoc in a language called GREL
Reconciliation
Semi-automated process of matching data with external sources
External sources must offer a compliant web service in order to work
Useful for:
- Fixing spelling or variations in proper names
- Cleaning up manually-entered values against authority files
- Linking your data to an existing dataset
Human judgment is required to review and approve the results
Happens by default through string searching
Clean and cluster data before reconciling
Reconciliation
- Click on the arrow and select Reconcile
- Select Start reconciling
Reconciliation
- Select the reconciliation service (e.g.
VIAF) - Click on Next
Reconciliation
- Depending on the column and the reconciliation service selected, choose the type of entity you want to reconcile (e.g.
Person) - Click on Start reconciling
Reconciliation
- Some cell values are reconciled directly (e.g. Houdon, Jean Antoine)
- Others need manual validation (e.g. Gérard, Francois Baron)
- Click on the single check to reconciliate only in that cell, or the double check to extend it to all identical cells
dcho_theme
- Set a common separator (
|) - Split
- Cluster
- Edit (e.g. keep only the last value if there is a hierarchy, delete “Non categoria”, etc.)
- Standardize when possible (see Iconclass)
- sign in Iconclass
- search for the theme
- choose the most appropriate option
- replace the existing theme in the dataset with the ID of the option found
- e.g. L’Enigma della Sfinge →
94T33 - e.g. Eracle cattura il cinghiale di Erimanto →
94L324
cho_century
- Set a common separator (
|) - Edit
- Standardize (see EDTF)
- e.g. II century →
0100-01-01/0199-12-31 - e.g. II century B.C. →
-0199-01-01/-0100-12-31
cho_date
- Set a common separator (
|) - Edit (e.g. remove “ca.”, “circa”, etc.)
- Uncertainty (“ca.”, “circa”, etc.) has to be expressed in a separate column
cho_date_uncertainty which only accepts True (yes) or False (no)
- Standardize (see EDTF)
- e.g. 460 B.C. →
-0460 - e.g. 550-530 B.C. →
-0550/-0530 - e.g. 1705-1706 →
1705/1706
cho_sources_classic
- Set a common separator (
|) - Edit (e.g. remove citations that are not in canonical form)
- Non-canonical citations should be saved in a separate column
cho_sources_other
- Standardize
- Look for the citation on Perseus
- e.g. Apollodoro, Biblioteca, II 5 →
Apollod. 2.5 - e.g. Omero, Odissea, XI vv.601-604 →
Hom. Od. 11.601
Other data
- All: check for NULL values, fix typos, replace separators with
| dcho_keyword: Split, Cluster, Editcho_title: Editcho_author: Cluster, Edit, Reconcile with VIAF (Person)cho_period: Cluster, Editcho_type: Cluster, Editcho_location: Cluster, Edit, Reconcile with VIAF (Organization)
Conclusion
There is already a lot you can do to increase data quality
- Explore data by using facets, filters and sorting
- Transform data by single- and mass-editing, clustering and replacing
- Reconcile data with external sources
Try and experiment! You are working on a copy, and you can alyaws trace back