Open Refine With Powerhouse Museum collection data
Based on this video
Discover how to prep data using OpenRefine.
- Install OpenRefine and open in browser: http://127.0.0.1:3333/
- Create project and import data (phm-collection.tsv) -
either download and upload, or add via Get Data From > Web addresses URL > https://data.freeyourmetadata.org/powerhouse-museum/phm-collection.tsvTip: go to https://data.freeyourmetadata.org/powerhouse-museum/ and right click on file
copy link addressoption.
- Notice we are essentially dealing with a giant spreadsheet with rows and columns, each column has a menu with options for manipulating and filtering the data for that column. On the left hand side options for facet and filter - in data science, faceting allows different views of your data by breaking it up, defined by a specific characteristic - like many sides of a gem - hence the OpenRefine logo. concept of viewing is interesting here - we have a copy of data, different views etc
facet /ˈfasɪt,ˈfasɛt/ noun one side of something many-sided, especially of a cut gem. “a blue and green jewel that shines from a million facets”
a particular aspect or feature of something. “a philosophy that extends to all facets of the business”
Also think about the meaning/s of refined - both removing impurities, but elegant, and “improved to be precise or subtle”:
Two main types in OpenRefine text facet and numeric facet
- “No numeric value present” error message is presented- you need to tell OpenRefine to treat this information as numeric data by going to
Edit cells > Common transforms > To numberYou can now see a histogram that shows the value range of the Record ID (from 0 - 500 000) tip - you can drag the handles to select a specific range
- Remove blank rows
Note that there are two non-numeric values - this indicates there are blank rows in the data.To remove them, uncheck the
numericcheck box in the histogram, then click on the menu (triangle) in the all column and go to
Edit rows > Remove all matching rows
- Remove duplicate rows
- Choose column in your dataset with a unique identifier, in this case we’re going to choose Record ID (tip: click x on top left corner of histogram to remove from view)
- Sort duplicate IDs so they are next to each other by going to
Record IDmenu, then
- Note - *In OpenRefine sorting is merely a visual aid for helping you understand and clean data. to make changes made permanent, you need to click on a new menu
Sortthat should have now appeared in the top centre of the interface. To make changes permanent, click on
Sort>Reorder rows permanently. (note yellow highlighter action alert that momentarily pops up - denoting an action has been performed)
- Identical rows should now be next to each other. We can now use a feature called
blank down. This leaves the first instance of a duplicate record intact, and turns all subsequent duplicate entries into blank cells. To do this, Go to
Record ID > Edit cells >Blank downAlert: 20 duplicate record IDs (cells) have now been blanked (Running total 41915 rows)
- Now we want to remove all rows where the Record ID has been blanked. Go to
Record ID > Facet > Customized facets > Facet by blankthen in the left hand Facet/Filter menu, click on
trueto select the 20 rows where the Record IDs have been blanked.
- To remove them go to
All > Edit rows > Remove all matching rows. The alert should tell us 20 rows have been removed. (running total 41895) tip note undo-redo functionality - go back to any previous point in time
- Fix categories
- Scroll to
Categoriescolumn. Note each record has multiple categories separated by a pipe (|) operator (Training note: possible place to comment on file types, and the ’s' in csv and tsv) and that there are numerous spelling mistakes and inconsistencies (e.g. capitalisations) in the data that we want to clean.
- To fix, we first want to split each value into its own cell, this is a process called Atomisation. To do this, under the
Categoriesmenu and choose
Edit cells > Split multi-valued cellsthen change the separator value from
|(important - make sure there are no extra spaces) then click
OK. Note - we’ve gone from 41895 rows to 93181 rows. We will clean and remerge later.
- Now go to
Catgories > Facet > Text facet(note if there’s a warning too many choices to display, up the limit via
Set choice count limit)
- you should now be able to see a list of every unique category. Scroll to the bottom to see
(blank)with 821 records (this means no category was specified). Scroll through the items to see problems with the data - multiple entries due to inconsistent spelling and capitalisation when we really only want one category rather than unwanted variants (in most cases). Items which should be in the same category are separated.
- To fix this we use a process called clustering. In the
Facet/Filterleft menu, click on the
Clusterbutton. This automates the process of finding similar items, and gives different options to collate and combine these
- Under Method, leave this as
key collisionbut it’s also useful to explore the other option
nearest neighbourto get a sense of how clustering works. The same goes for the
Keying Function. For this demonstration we’ll use
4to accomodate longer phrases than the default
- Click on
Select Allin the bottom left of the window, but uncheck the . On the right hand side you can change to the correct value of a cell manually if required.
- Once happy, choose
Merge Selected & Re-Cluster. Only the unchecked first entry should remain. Click on the
- Now we’ve removed inconsistences from categories, we’re going to merge them back into a list for each record. Under
Edit cells > Join multi-valued cells. Set the separate to
|(a pipe). (important - make sure there are no extra white spaces!!) We should now be back to 41895 records (rows)
- Remove repeated categories from each row using GREL
- GREL stands for General Refine Expression Language, similar to Regular Expressions (regex). It allows you to apply programmatic transformation to cells and do custom faceting.
- Under the
Edit cells > Transformto bring up the GREL editor.
- In the
what each step does. Notice we’ve updated x (1461) number of cells in the
- Export the cleaned dataset
- Click on
Exportmenu in the top right of the interface and select
tab-separated value. This downloads the cleaned data file as a .tsv file on your local computer.