Artificial intelligence is popping up everywhere we look, in areas ranging from the borders of science fiction (self-driving vehicles) to the mundane (what show should I watch on Netflix?). While AI is a fairly broad area of study in computer science, most of the excitement these days is centred on an area of AI called machine learning, and in particular a technique called deep learning. Machine learning lets algorithms learn from data and make predictions on their own, rather than following explicitly programmed rules.

Machine learning already powers software you use every day: photo services from Google and Facebook, voice assistants like Siri, Alexa and Google Home, and recommendation systems from Amazon and Netflix, to name just a few. You can expect to see more and more applications leveraging this powerful technology.

Think of an AI application as a three-legged stool.

The first leg of the stool is the AI algorithm itself. Open source machine learning libraries like TensorFlow and Theano have removed much of the low-level complexity involved in designing and building AI applications. These tools are free, well documented and supported by vibrant communities, and their availability has made building machine learning applications far more accessible to developers.
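
To make "learning from data" concrete, here is a toy sketch in pure Python (no libraries, invented data points): fitting a line to noisy measurements by gradient descent. This hand-written loop is exactly the kind of low-level work that libraries like TensorFlow automate, differentiate for you, and scale up to far larger models.

```python
# Toy machine learning: fit y = w*x + b to data by gradient descent.
# The data points are invented, roughly following y = 2x + 1 with noise.

data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]

w, b = 0.0, 0.0   # start with a bad guess
lr = 0.01         # learning rate: how big a correction each step makes

for _ in range(5000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Nudge the parameters downhill.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # → 1.94 1.15, close to the 2 and 1 used to generate the data
```

The same "measure the error, adjust the parameters, repeat" loop underlies deep learning; the libraries just handle the differentiation, the hardware and models with millions of parameters.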

The second leg of the stool is computing horsepower, both in the form of raw processing power and large-scale data storage. Cloud services like Amazon Web Services, Google Cloud, Microsoft Azure and others make renting servers, virtual machines and big data tools as simple as pushing a few buttons (provided you get your credit card out first!).

The last leg of the stool is data. Before you can consider hiring data scientists, renting servers and installing open source machine learning libraries, you must have data. The quality and depth of data will determine the level of AI applications you can achieve.

Preparing your data for AI

While your organisation may not be at the stage where you are ready to start building AI applications, at a minimum you should be planning for a future where your data will be used to power smart solutions. Treat every new initiative or project as an opportunity to build a foundation for future data models.

To generate and collect well designed data, start with focusing questions.

Does your organisation have a cohesive data collection policy?

This question has become critical in light of GDPR. When a new feature or product is being developed, are there clear guidelines, followed in practice, about what data is collected and why? Does that data have a purpose, or is it being collected just because?

Is your data being collected and stored in a common format?

When you are collecting data, is it being saved in a usable format across all your data collection touchpoints? Are the field names the same? Is the same level of validation and error checking applied across products?
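
One way to picture a common format is a shared schema that every touchpoint's records are normalised into, with the same validation applied everywhere. The sketch below is hypothetical (the field names, sources and schema are invented for illustration):

```python
# Hypothetical: a web app and a mobile app record the same signup event
# under different field names. Normalising both into one common schema
# means downstream analysis and models see consistent, validated data.

COMMON_SCHEMA = ("user_id", "email", "signup_date")

def normalise(record: dict, field_map: dict) -> dict:
    """Rename source fields to the common schema and apply shared validation."""
    row = {common: record[source] for source, common in field_map.items()}
    if "@" not in row["email"]:  # the same check, regardless of source
        raise ValueError(f"invalid email: {row['email']}")
    return row

# Each touchpoint only needs a mapping from its names to the common ones.
web_map = {"uid": "user_id", "mail": "email", "created": "signup_date"}
mobile_map = {"userId": "user_id", "emailAddr": "email", "date": "signup_date"}

web_row = normalise({"uid": 1, "mail": "a@x.com", "created": "2024-01-02"}, web_map)
mobile_row = normalise({"userId": 2, "emailAddr": "b@x.com", "date": "2024-01-03"}, mobile_map)

# Both rows now share the same field names and validation guarantees.
assert set(web_row) == set(mobile_row) == set(COMMON_SCHEMA)
```

The design point is that the mapping and validation live in one place: adding a third touchpoint means writing one more field map, not another ad hoc cleanup script.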

Is your data being stored in a central location?

Data needs to flow into central data stores and be available in real time to all areas of the business. AI applications generally become better the more they can correlate different sources of information, so siloed data sets that are hard to access are an impediment to finding value in an organisation's data.
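
In practice, "correlating sources" often comes down to being able to join datasets on a shared key. A minimal, hypothetical sketch (the tables and customer ids are invented) of what central, consistent storage makes trivial:

```python
# Hypothetical: purchase history and support tickets live in one
# accessible store, keyed by the same customer id.

purchases = {101: ["laptop"], 102: ["phone", "case"]}
support_tickets = {101: 0, 102: 3}

# Correlate the two sources: relate basket size to support load.
combined = {
    cust: {"items": len(items), "tickets": support_tickets.get(cust, 0)}
    for cust, items in purchases.items()
}
# With siloed systems, even this one-line join would first require
# exports, format reconciliation and manual key matching.
```

A model predicting churn or support demand can only learn from this kind of cross-source signal if the joins are cheap and the keys agree.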

In the not-too-distant future, what we are now calling AI will be embedded in our culture and we won't call it AI any longer. It will just be how things work. What you have in your control today is your data. It's crucial that you start preparing for a future where AI applications can use your data, and that preparation starts with the quantity and quality of the data itself.