When you work through data science tutorials, the dataset is usually perfect: no missing data (unless the tutorial is about dealing with NAs), no weird outliers (unless the focus is on anomaly detection), and everything available in a neat CSV file (or some similar format).

On the other hand, when you have to deal with real data the scenario is quite different. Indeed, as I wrote in Lessons of 5 Years in the Industry, "data don't live in tidy files".

In many cases this means having to combine data from different sources, fix inconsistencies, or perform other similar tasks.

But there are times when you have the data, it is even almost clean, but it comes in a format not exactly devised for analysis. Two examples from my experience: PDF tables and plots.

[Image: gru-extract-data]

In this post I want to outline how I dealt with them, so that in six months I will not have to find the solution again on Stack Overflow or ChatGPT-6.

PDF Tables

It has actually happened a couple of times that colleagues asked me for help because the data they needed to work on in Excel was trapped in PDF tables. Copying and pasting the table seldom yields the expected result. A bunch of online tools can extract tables from PDFs, but in both cases I did not want to share the data online, and I needed something that could deal with several tables at once.

Fear not: as often happens in the Python ecosystem, there is a package just for this: Camelot. It extracts the tables from the pages (even several per page). By specifying the flavor argument in read_pdf, it can deal both with tables that have clear lines dividing the cells (lattice) and with fuzzier ones (stream). It usually does a really good job and also provides accuracy reports. The results are pandas DataFrames that can be manipulated to fix any inconsistencies, for instance trailing characters or empty cells on the borders. Exporting them to Excel is just a matter of calling to_excel. For instance, extracting all the tables in a PDF and saving each one in a different sheet of an Excel file boils down to:

import camelot
import pandas as pd

path = "your-trapped-data.pdf"
# flavor="stream" for tables without clear cell borders, "lattice" otherwise
tables = camelot.read_pdf(path, pages="all", flavor="stream")

# here some cleaning of the tables (see the sketch below)

# write each extracted table to its own sheet
with pd.ExcelWriter("your-free-data.xlsx") as writer:
    for index, table in enumerate(tables):
        table.df.to_excel(writer, sheet_name=str(index), index=False)
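As for that cleaning step, here is a minimal sketch of what it might look like, continuing from the snippet above. The report fields come from Camelot's parsing_report; the accuracy threshold and the whitespace stripping are just assumptions of mine to adapt to your tables:

# inspect Camelot's report for each table and strip stray whitespace
cleaned = []
for table in tables:
    report = table.parsing_report  # dict with accuracy, whitespace, order, page
    if report["accuracy"] < 90:  # arbitrary threshold, tune for your PDFs
        print(f"Check the table on page {report['page']}: {report}")
    # drop leading/trailing characters left over by the extraction
    cleaned.append(table.df.apply(lambda col: col.str.strip()))

The Excel loop would then write cleaned[index] instead of table.df.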

A web interface is also available for people who don't want to spend hours in bash and the REPL. Keeping up with the great names, this tool is called Excalibur.

Plots

If you need to visualize trends, plotting them is the way to go. The problem comes when you only have a JPG of the plot (or even a screenshot of it) but need to work on the data underpinning it. That's the case with some papers, or when a colleague generated the plot but the underlying data is not available anymore.

[Image: reverse-engineer-dataset]

It looks like I am not the only one with this problem, and fortunately a nice tool to solve it has already been implemented: WebPlotDigitizer. You can either use it online or download the app.

It's enough to upload a plot, specify the x values of two points on the x axis and the y values of two points on the y axis, and the algorithm extracts the values along the line (of the selected color). You can also fix the detected points before exporting the dataset. It works with multiple series too, but you have to watch the tutorial to understand how! The output may not be perfect (mostly because of inaccuracies in selecting the four initial points), but for my use case it worked great.
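Once the points are exported, you typically end up with a CSV of x,y pairs. A minimal sketch of how I would load them back into pandas, assuming a headerless two-column file (the file name, column names, and the resampling onto a regular grid are my assumptions, not something the tool mandates):

import numpy as np
import pandas as pd

# hypothetical export from WebPlotDigitizer: a headerless CSV of x,y points
points = pd.read_csv("extracted-points.csv", names=["x", "y"])
points = points.sort_values("x").reset_index(drop=True)

# the clicked points are rarely evenly spaced: interpolate them onto a
# regular x grid before comparing with other series
grid = np.linspace(points["x"].min(), points["x"].max(), num=50)
resampled = pd.DataFrame({"x": grid, "y": np.interp(grid, points["x"], points["y"])})
print(resampled.head())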

Having Camelot and WebPlotDigitizer in your toolbox can unlock many interesting datasets that are hidden in plain sight, online or in your organization. Now it's time to go and save some of them!