Open Refine With Powerhouse Museum collection data
Based on this video
Discover how to prep data using OpenRefine.
- Install OpenRefine and open in browser: http://127.0.0.1:3333/
- Create project and import data (phm-collection.tsv) -
either download and upload, or add via Get Data From > Web addresses URL > https://data.freeyourmetadata.org/powerhouse-museum/phm-collection.tsv
Tip: go to https://data.freeyourmetadata.org/powerhouse-museum/ and right click on filephm-collection
and choosecopy link address
option. - Notice we are essentially dealing with a giant spreadsheet with rows and columns, each column has a menu with options for manipulating and filtering the data for that column. On the left hand side options for facet and filter - in data science, faceting allows different views of your data by breaking it up, defined by a specific characteristic - like many sides of a gem - hence the OpenRefine logo. concept of viewing is interesting here - we have a copy of data, different views etc
facet /ˈfasɪt,ˈfasɛt/ noun one side of something many-sided, especially of a cut gem. “a blue and green jewel that shines from a million facets”
a particular aspect or feature of something. “a philosophy that extends to all facets of the business”
Also think about the meaning/s of refined - both removing impurities, but elegant, and “improved to be precise or subtle”:
Two main types in OpenRefine text facet and numeric facet
- Create
numeric facet
onRecord ID
column - “No numeric value present” error message is presented- you need to tell OpenRefine to treat this information as numeric data by going to
Edit cells > Common transforms > To number
You can now see a histogram that shows the value range of the Record ID (from 0 - 500 000) tip - you can drag the handles to select a specific range - Remove blank rows
Note that there are two non-numeric values - this indicates there are blank rows in the data.To remove them, uncheck the
numeric
check box in the histogram, then click on the menu (triangle) in the all column and go toEdit rows > Remove all matching rows
- Remove duplicate rows
- Choose column in your dataset with a unique identifier, in this case we’re going to choose Record ID (tip: click x on top left corner of histogram to remove from view)
- Sort duplicate IDs so they are next to each other by going to
Record ID
menu, thensort
selectnumbers
and checksmallest first
thenOK
- Note - *In OpenRefine sorting is merely a visual aid for helping you understand and clean data. to make changes made permanent, you need to click on a new menu
Sort
that should have now appeared in the top centre of the interface. To make changes permanent, click onSort>Reorder rows permanently
. (note yellow highlighter action alert that momentarily pops up - denoting an action has been performed)- Identical rows should now be next to each other. We can now use a feature called
blank down
. This leaves the first instance of a duplicate record intact, and turns all subsequent duplicate entries into blank cells. To do this, Go toRecord ID > Edit cells >Blank down
Alert: 20 duplicate record IDs (cells) have now been blanked (Running total 41915 rows) - Now we want to remove all rows where the Record ID has been blanked. Go to
Record ID > Facet > Customized facets > Facet by blank
then in the left hand Facet/Filter menu, click ontrue
to select the 20 rows where the Record IDs have been blanked. - To remove them go to
All > Edit rows > Remove all matching rows
. The alert should tell us 20 rows have been removed. (running total 41895) tip note undo-redo functionality - go back to any previous point in time
- Fix categories
- Scroll to
Categories
column. Note each record has multiple categories separated by a pipe (|) operator (Training note: possible place to comment on file types, and the ’s' in csv and tsv) and that there are numerous spelling mistakes and inconsistencies (e.g. capitalisations) in the data that we want to clean. - To fix, we first want to split each value into its own cell, this is a process called Atomisation. To do this, under the
Categories
menu and chooseEdit cells > Split multi-valued cells
then change the separator value from,
to|
(important - make sure there are no extra spaces) then clickOK
. Note - we’ve gone from 41895 rows to 93181 rows. We will clean and remerge later. - Now go to
Catgories > Facet > Text facet
(note if there’s a warning too many choices to display, up the limit viaSet choice count limit
) - you should now be able to see a list of every unique category. Scroll to the bottom to see
(blank)
with 821 records (this means no category was specified). Scroll through the items to see problems with the data - multiple entries due to inconsistent spelling and capitalisation when we really only want one category rather than unwanted variants (in most cases). Items which should be in the same category are separated. - To fix this we use a process called clustering. In the
Facet/Filter
left menu, click on theCluster
button. This automates the process of finding similar items, and gives different options to collate and combine these - Under Method, leave this as
key collision
but it’s also useful to explore the other optionnearest neighbour
to get a sense of how clustering works. The same goes for theKeying Function
. For this demonstration we’ll usengram-fingerprint
- Set
Ngram Size
to4
to accomodate longer phrases than the default2
- Click on
Select All
in the bottom left of the window, but uncheck the . On the right hand side you can change to the correct value of a cell manually if required. - Once happy, choose
Merge Selected & Re-Cluster
. Only the unchecked first entry should remain. Click on theclose
button. - Now we’ve removed inconsistences from categories, we’re going to merge them back into a list for each record. Under
Categories
selectEdit cells > Join multi-valued cells
. Set the separate to|
(a pipe). (important - make sure there are no extra white spaces!!) We should now be back to 41895 records (rows)
- Remove repeated categories from each row using GREL
- GREL stands for General Refine Expression Language, similar to Regular Expressions (regex). It allows you to apply programmatic transformation to cells and do custom faceting.
- Under the
Categories
menu selectEdit cells > Transform
to bring up the GREL editor. - In the
Expression
field typevalue.split('|').uniques().join('|')
then clickOK
. Explainwhat each step does
. Notice we’ve updated x (1461) number of cells in theCategories
column.
- Export the cleaned dataset
- Click on
Export
menu in the top right of the interface and selecttab-separated value
. This downloads the cleaned data file as a .tsv file on your local computer.