How to tap into the potential of your unstructured data
Recently we were approached with a rather unusual request, well outside the BI tasks we traditionally take on. The customer asked for a solution that would let them tap into and structure invoice data stored as images, so that it could be used for various analyses and cost optimization.
So far, the customer had been outsourcing this task to external partners, who would manually type all the relevant fields into Excel. As you can imagine, this is a tedious, expensive and time-consuming task that is very error-prone and that greatly delays the analytical results.
A preliminary search quickly turned up a number of OCR tools on the market. In financial controlling especially, many solutions focus on facilitating the posting of aggregated financial transactions to various charts of accounts. However, a tool that could convert a large batch of invoices to a classical database format was hard to find.
The task itself aligns quite well with a trend that data practitioners will have to adapt to one way or another: the sheer volume of unstructured data has exploded in recent years, and it will be an even more important asset in the years to come. Big companies have taken on the challenge of harvesting value from this data, but many companies seem to be lagging behind on the in-house capabilities and skill set needed to release the power of their unstructured data. Some have started to subscribe to third-party solutions that may help them leapfrog their most time-consuming processes and gain a technological edge, but few have found a way to tame unstructured data and bring it into their way of working with BI.
Spending a few weeks on a draft solution that could take several invoices in image format and convert them to a structured database format was definitely an eye-opener. Not only did it show how seamlessly the SQL suite can be linked with third-party OCR scraping tools, allowing you to tap into the unclaimed territory of image data; it also showed how misleading interpretations of raw data can end up in your chart of accounts, giving you false conclusions about your expenditures simply because information was lost in the aggregation or classification process.
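To make the idea a little more concrete, here is a minimal sketch of the step that follows OCR: turning the recognized text of one invoice into line-level rows in a database table. It is not the solution we built for the customer. The sample text, the field labels, the regular expressions and the table name invoice_line are all assumptions made up for illustration, and the OCR step itself is taken as already done. SQLite from the Python standard library stands in for whatever database you actually use.

```python
import re
import sqlite3

# Hypothetical OCR output for a single invoice. The layout and the
# field labels are assumptions; real OCR text will be far messier.
SAMPLE_OCR_TEXT = """\
Invoice No: INV-1001
Date: 2020-01-15
Vendor: Example Supplies
2 x Printer paper @ 4.50
1 x Toner cartridge @ 60.00
"""

# Header fields look like "Label: value" on their own line.
HEADER_RE = re.compile(r"^(Invoice No|Date|Vendor):\s*(.+)$", re.MULTILINE)
# Line items look like "<quantity> x <description> @ <unit price>".
LINE_RE = re.compile(r"^(\d+)\s*x\s*(.+?)\s*@\s*(\d+\.\d{2})$", re.MULTILINE)


def parse_invoice(text):
    """Return (header dict, list of (quantity, description, unit_price))."""
    header = {label: value.strip() for label, value in HEADER_RE.findall(text)}
    lines = [(int(q), desc, float(price)) for q, desc, price in LINE_RE.findall(text)]
    return header, lines


def load_invoice(conn, text):
    """Store one row per invoice line, so no detail is lost to aggregation."""
    header, lines = parse_invoice(text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoice_line ("
        "invoice_no TEXT, invoice_date TEXT, vendor TEXT, "
        "description TEXT, quantity INTEGER, unit_price REAL)"
    )
    conn.executemany(
        "INSERT INTO invoice_line VALUES (?, ?, ?, ?, ?, ?)",
        [
            (header["Invoice No"], header["Date"], header["Vendor"], desc, q, price)
            for q, desc, price in lines
        ],
    )
    return len(lines)


conn = sqlite3.connect(":memory:")
load_invoice(conn, SAMPLE_OCR_TEXT)

# With line-level rows in place, ordinary SQL can answer spending questions.
rows = conn.execute(
    "SELECT description, quantity * unit_price "
    "FROM invoice_line ORDER BY description"
).fetchall()
print(rows)
```

Keeping one row per invoice line, rather than one aggregated amount per account, is the design choice that matters here: any classification or roll-up can be redone later in SQL, whereas detail that is aggregated away at entry time cannot be recovered.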
Have you already started the journey toward getting structure into your unstructured data, or could this be a topic for the year to come? Please do not hesitate to start a dialogue and draw inspiration from our field observations and experience on this matter.