Data is everywhere; it’s inescapable. The minute you load a web page in your browser, you are exposed to data. The problem is that this data, although really useful, is trapped and not usable. How do we gain access to it and use it for good? I’m going to show you the power of extracting and structuring data from the web, and how to use that data for good, using import.io.
What is import.io?
import.io is a set of free tools that allow you to pull data from a website and have this data update in real time - without having to code anything. You can then download, analyse and share this data. Put simply, it allows you to structure the web.
Why would it be used by library and information professionals?
To begin with I would like to describe a problem. The internet is a huge, huge place. It’s the fastest expanding entity in the world (with the exception of the universe of course!) and is the single biggest storage unit of information.
Since 2000, the internet has grown by a staggering 741.0%. It has a global user base of 3,035,749,340 and doesn’t seem to be slowing down. This means that there is data everywhere, which is both a good and a bad thing.
Knowing how to use tools like import.io would let library and information professionals consolidate this data into usable chunks and contextualise it into useful information. This could help with:
- Library marketing and analytics - using public website data and library datasets to plan services and measure impact before and after a particular activity or event
- Research - gathering supplementary data for academic or more casual research projects
- Education - running sessions to train patrons in how to find, extract and use data from the web in a meaningful way
How do we structure the web?
I’m going to use the example of the Amazon bestsellers lists. At this link you can see a list of the top-selling books since 1995. On the web it’s not much use, but if we take this information and structure it, we open ourselves up to a whole host of options.
Using our crawler tool, you point to where the data is on the page. After you train the program on five similar pages, the crawler goes through the website and extracts data only from pages that match the training, giving you hundreds of bestselling books in seconds. On its own this isn’t much use, so you should create more crawlers for different websites until you have a whole host of books to compare and contrast. Combining APIs into one dataset is called federation. The final dataset can then be used in lots of different applications (Excel and Google Sheets are the most popular), which let you manipulate and analyse the content to help you choose the correct books for the correct people.
Once the APIs have been created and are all working nicely, it’s then just a case of hitting a button to refresh all of the data and get the most up-to-date information from those websites. In other words, when the data changes on the web, a refresh brings that change into your dataset.
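To make federation a little more concrete, here is a minimal Python sketch of the idea. It is not import.io’s own method or API. It simply assumes that the output of each extractor has been downloaded as CSV, and the source names, column names and book titles are all hypothetical placeholders.

```python
import csv
import io

# Hypothetical exports from two separate extractors. The titles, authors and
# column names are placeholders, not real bestseller data.
SOURCE_A = """title,author,rank
Example Book One,A. Author,1
Example Book Two,B. Writer,2
"""

SOURCE_B = """title,author,rank
Example Book Two,B. Writer,1
Example Book Three,C. Novelist,2
"""


def federate(sources):
    """Merge rows from several named CSV sources into one list of dicts.

    Each row is tagged with the source it came from, so the combined
    dataset can still be traced back to where it was extracted.
    """
    combined = []
    for name, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = name
            row["rank"] = int(row["rank"])  # CSV values arrive as strings
            combined.append(row)
    return combined


def titles_on_every_list(rows, source_count):
    """Return the titles that appear in all of the federated sources."""
    seen = {}
    for row in rows:
        seen.setdefault(row["title"], set()).add(row["source"])
    return sorted(t for t, s in seen.items() if len(s) == source_count)


dataset = federate({"site_a": SOURCE_A, "site_b": SOURCE_B})
common = titles_on_every_list(dataset, source_count=2)
print(len(dataset), "rows; on every list:", common)
```

Tagging each row with its source keeps the combined dataset traceable, which also makes it easier to reference where each figure came from later on.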
Data extraction for good
Another common reason people use web extraction is to visualise data. Data visualisations let us understand a whole problem, not just a single part of it. Tables are great for computers to analyse but not so great for readers to look at. To visualise data, you need to either create it or extract it.
I recently started a project to extract health data by state in the United States of America, with a view to plotting it in a chart at the end of the extraction to highlight certain trends in health figures and funding.
Using our magic tool, I was able to extract the population count for each state in the US without any interaction with the page at all - no need to even click on the data that you want. Our algorithms work out which data on the page you need and extract it. You’re more than welcome to try this: simply go to import.io, insert this link into the bar on the webpage and click get data, and you will see how easy it is to structure data.
To demonstrate health issues, I extracted cancer, HIV and AIDS figures for each state and inserted them into a Google Sheet. After this, I found a website containing the funding for HIV and AIDS, extracted that in the same way and inserted it into the sheet. After all the extraction was complete (it took only five minutes), I could then plot this information in a graph.
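For anyone who prefers a script to a spreadsheet, the same combining step can be sketched in a few lines of Python. This is only an illustration of joining per-state tables before charting, not the sheet I actually built: the state names and every number below are invented placeholders, and the two derived columns are my own assumption about what might be worth plotting.

```python
# Placeholder figures, invented purely for illustration. They are not real
# population, case or funding numbers for any state.
population = {"State A": 1_000_000, "State B": 4_000_000}
cases = {"State A": 500, "State B": 1_000}
funding_usd = {"State A": 2_000_000, "State B": 3_000_000}


def build_rows(population, cases, funding_usd):
    """Join the three per-state tables and derive comparable rates."""
    rows = []
    # Keep only states present in every table, so no rate is computed
    # from a missing value.
    for state in sorted(set(population) & set(cases) & set(funding_usd)):
        rows.append({
            "state": state,
            "cases_per_100k": cases[state] * 100_000 / population[state],
            "funding_per_case": funding_usd[state] / cases[state],
        })
    return rows


rows = build_rows(population, cases, funding_usd)
for row in rows:
    print(row["state"], row["cases_per_100k"], row["funding_per_case"])
```

Normalising by population makes states of very different sizes comparable on one chart, which raw counts would not.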
Reference your sources
Like any good author, when you use someone’s data it’s always best practice to reference it. You will notice a source page in my Google Sheet; this is a list of all of the sources I used for each bit of information. Using a site’s data can have a positive effect for the webmaster of that site: you will often find that you actually drive more traffic to the site by using its data. But remember that the webmaster’s decision is final, and if they would like you to stop extracting data from their site, you should stop.
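One common way a webmaster states their wishes is a robots.txt file, although it is not the only way and a site’s terms of use may say more. As a rough sketch, Python’s standard library can check a URL against such a file before you extract anything. The robots.txt content, the bot name and the example.org URLs below are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration. A real one is published by the
# site itself and should be read before any extraction starts.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_extract(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical bot name and URLs.
allowed = may_extract(
    EXAMPLE_ROBOTS_TXT, "my-research-bot", "https://example.org/stats/states.html"
)
blocked = may_extract(
    EXAMPLE_ROBOTS_TXT, "my-research-bot", "https://example.org/private/report.html"
)
print(allowed, blocked)
```

A check like this only covers what the file says. If a webmaster asks you directly to stop, that request still takes priority.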
A tool to drive the future
If data is the future, then import.io is the tool for the future. Everyone should have access to the data that sits on the internet, and the more people that use it, the better. Imagine a world with a perfectly structured web, allowing anyone access to the data no matter what their background, wealth or geographical location.
There are lots of other free data mining and extraction tools available. Here are some examples:
- Using ImportXML in Google Sheets to crawl the web
- KimonoLabs, similar to import.io, allows you to turn websites into structured APIs (there is a free option available)
- ParseHub provides a similar service with a free option
- Screaming Frog is more focussed on extracting data from a website that is relevant to search engine marketers (e.g. page title, metadata) but can also be used more generally
- Tableau Public provides free data visualisation software
A note on copyright and data mining
A note on copyright and data mining from CILIP Policy Officer Yvonne Morris:
“As well as mining the open web – the largest single database the world has ever known - researchers can now legally make copies of any copyright material they have lawful access to for the purpose of computational analysis thanks to a new text and data mining (TDM) exception to UK copyright law. Under the TDM exception, which was implemented in June 2014, pure facts and data can then be shared, or if the material to be shared contains copyright materials, the new quotation exception can be used to quote extracts. This TDM exception only applies when the research is for non-commercial purposes, however. More information on this and other exceptions to copyright law can be found at www.cilip.org.uk/copying”