OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. It makes large data sets easy to explore.

Type of content: Assets
Type of asset:
Big data potential
Policy domains: Innovation, Science & Technology
Phase in the policy cycle:
Policy Design and Analysis
Open license availability
Tags: Big data IT IT processes
SWOT Analysis for
Helpful Harmful
Strengths
• Excellent tool to clean, transform and explore data
• Reconcile and match data: OpenRefine can link and extend your dataset with various web services. Some services also allow OpenRefine to upload your cleaned data to a central database, such as Wikidata.
• More powerful than Excel with large sets of data
• Platform independent
• Great history tracking
• Can export commonly used functions for reuse
• Powerful Undo/Redo functionality
• Excellent support for UTF-8 and other character sets
• GREL (General Refine Expression Language), which makes it possible, for example, to “join” columns from different datasets
• Interactive templating export tool
• Available in English, Chinese, Spanish, French, Russian, Portuguese (Brazil), German, Japanese, Italian, Hungarian, Hebrew, Filipino, Cebuano, Tagalog
• Online courses available
Weaknesses
• Low TRL (technology readiness level)
• Low ease of use
• Some frequent operations on data are more complicated than necessary (e.g. five steps are required to remove duplicate rows when identical values are found in a column).
• Some functions require light programming knowledge
• Some queries run slowly
• Relies on many external services that may no longer be supported
• More serious is the tool's lack of stability: it degrades after a while and can introduce inconsistencies into the data (for example, facets return wrong terms and omit some relevant ones). The only solution in this case is to restart OpenRefine and, in the worst case, when this is not enough, to start the project over.

Opportunities
• Getting a better understanding of the data before automating the processing of the full dataset using Python or Java on Hadoop.
• High need for tools that help extract valuable information from large volumes of complex data.
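
As an illustration of the first opportunity, a cleaning step explored interactively could later be automated in a script. The following is a minimal sketch, not OpenRefine's own method: it removes rows whose value in one column exactly repeats an earlier row, the operation cited under Weaknesses. The function name `drop_duplicate_rows`, the column names and the sample data are hypothetical.

```python
import csv
import io


def drop_duplicate_rows(rows, key):
    """Keep the first row seen for each exact value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value in seen:
            continue  # an earlier row already carries this exact value
        seen.add(value)
        unique.append(row)
    return unique


# Hypothetical sample data standing in for an exported project.
SAMPLE = """name,city
Ada,London
Ada,Paris
Grace,New York
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
deduped = drop_duplicate_rows(rows, "name")
print([row["name"] for row in deduped])
```

The sketch matches values exactly; near-duplicates such as differing case or trailing spaces would need a normalization step first.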

Threats
• Competition: emergence of other self-service data preparation tools such as Trifacta and Talend Data Preparation.
• Software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder.
• Data privacy

Open data - Download the Knowledge base

You are free to download the data of this Knowledge base.


All the data are licensed under Creative Commons CC-BY 4.0.