These days, the amount and diversity of data is much greater – coming from the web, mobile apps, social networks, enterprise systems, operational technology, sensors and IoT networks – and many organisations believe that, if analysed correctly, this data can provide a window on their market and operating environment.
Data science departments have sprung up everywhere in an attempt to harness the power of data. Using analytics, machine learning and artificial intelligence (AI), organisations seek to guide decision making across all aspects of their operations and market interactions, while enabling new capabilities such as automation.
Rapid technological transformation puts organisations in a tight spot, despite generating excitement and some real advances. In an evolving environment it is not always clear where limited resources should be invested. And in the absence of certainty, decision making can be motivated simply by fear of missing out (FOMO). Much of the uncertainty around data is justified.
But for organisations, the perceived opportunity combined with the fear of missing out has resulted in big investment in data and analytics infrastructure, without a clear strategy or understanding of how to derive value from those investments.
Last year ANZ organisations invested $1.3 billion in data analytics tools or software. But 60 per cent of those same organisations reported having only a 'beginner' or 'basic' level of data maturity. A 2021 Global Data report, which surveyed over 1000 companies worldwide, found that despite significant investment, many organisations are failing to capitalise on their data due to infrastructure and management practices.
In fact, 89 per cent of Australian organisations participating in the survey reported that they are struggling to manage their data. The survey reveals that organisations are awash with data that they can’t find, don’t trust and fail to understand, and that ultimately proves useless for decision making or enabling new capabilities.
Similarly, a 2020 survey of over 200 IT leaders across Australia revealed high levels of concern around data strategy and data management. For example, 90 per cent of IT leaders are concerned about how they will deal with rapidly increasing quantities of data. Figure 1 shows the reported concerns of Australian organisations.
We’re painting a worrying picture, but these statistics can be improved through a data strategy.
Slowly, organisations are recognising the need to design strategies for dealing with their data. Many have invested in all the infrastructure and software needed to collect large amounts of data, but they lack an overarching strategy. And by strategy we mean documented procedures and management practices to ensure that this data is discoverable, of high quality, and available to the data practitioners within the organisation.
A data strategy can be the mechanism that is needed to pull together investments in database management systems, data lakes, analytics applications, reporting systems and visualisation tools. Indeed, a data strategy should be independent of the underlying technology: implementing it is really about aligning the organisation on goals for its data practice and changing the way that people work with data.
Dark data is data that is collected and stored, but never used. Although some dark data is held for regulatory or compliance purposes, the amount that is held in an organisation gives us some idea of the effectiveness of its data strategy. Estimates of dark data as a proportion of the total data collected for an average organisation range from over 50 per cent to a whopping 90 per cent.
In a world that is collecting 7.5 septillion gigabytes of data every day, that is a lot of data. An even smaller proportion of data that an organisation generates or uses is actually analysed. In 2017 IBM estimated that most companies only analyse 1 per cent of their data.
Dark data represents a failed data strategy. That’s assuming someone thought it was a good idea to collect and store the data in the first place. The processes that enable data to be used as intended should be one of the outcomes of a well thought out data policy. The same applies to ceasing collection and disposing of data that is discovered to be not needed after all.
Designing a data strategy from scratch is a large undertaking. It needs to address governance issues such as who within your organisation can see what data, who is responsible for enforcing protocols and processes, which data to collect and retain, how to manage data quality and context, and how to store data securely and cost effectively.
While these are important issues, there is no point having data to secure and govern if it is not being used to improve decision making or develop new capabilities within an organisation. A data strategy needs to address certain key aspects of data management to facilitate and improve the way that data is made available and used within the organisation.
Firstly, data needs to be discoverable. An Excel spreadsheet on John from accounting's laptop is not discoverable data because it is unlikely that people who need it will find it.
The usual way of achieving discoverability is by establishing and maintaining a data catalogue. This is a centralised database of meta-data identifying the organisation’s data assets along with operational, corporate and governance details relevant to the data. A data catalogue allows users to search, filter and browse the data assets of the organisation – it enables self-service.
Although a data catalogue can be manually maintained, there are plenty of third-party products available. These include cloud-based products from the three major cloud vendors designed to create and maintain a data catalogue using automation and AI algorithms for extraction of meta-data from source data.
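To make the idea concrete, here is a minimal sketch of a manually maintained catalogue, assuming a small in-memory store. The class names, field names and storage locations are all hypothetical and are not taken from any vendor product; a real catalogue would persist its entries and harvest much of the meta-data automatically.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Meta-data describing one data asset (all field names are illustrative)."""
    name: str
    owner: str          # data steward or responsible team
    location: str       # where the data physically lives
    description: str
    tags: set[str] = field(default_factory=set)


class DataCatalogue:
    """A tiny in-memory catalogue supporting keyword search and tag filtering."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogueEntry] = {}

    def register(self, entry: CatalogueEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogueEntry]:
        """Return entries whose name or description contains the keyword."""
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower() or needle in e.description.lower()
        ]

    def filter_by_tag(self, tag: str) -> list[CatalogueEntry]:
        """Return entries carrying the given tag."""
        return [e for e in self._entries.values() if tag in e.tags]


catalogue = DataCatalogue()
catalogue.register(CatalogueEntry(
    name="monthly_invoices",
    owner="accounting",
    location="warehouse.finance.invoices",  # hypothetical location
    description="Invoices issued per month, one row per invoice",
    tags={"finance", "monthly"},
))
catalogue.register(CatalogueEntry(
    name="pump_sensor_readings",
    owner="operations",
    location="lake/ot/pumps/",  # hypothetical location
    description="Raw vibration readings from pump sensors",
    tags={"iot", "raw"},
))

invoice_hits = [e.name for e in catalogue.search("invoice")]
finance_hits = [e.name for e in catalogue.filter_by_tag("finance")]
```

Even at this scale the point carries: once the spreadsheet on a laptop is registered with an owner, a location and a description, someone who needs it can find it without knowing who created it.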
Once a data catalogue is established it can also be used as a repository for documentation about the data. There is nothing more demoralising to an analyst or data scientist than a CSV file with cryptic field headers and no documentation. A common time sink in analytics is working out exactly what the data represents.
Ambiguous or undecipherable field names are often the culprits. Are we seeing deltas or raw readings? Are the values measured or calculated? What are the allowed values?
Mandating minimum levels of documentation for any dataset recognised as a data source reinforces the concept of self-service data. It ensures users have all the information needed to interpret the data without spending valuable time tracking down and questioning a subject matter expert. Establishing minimum standards of data documentation should be a key part of a data strategy, and the data catalogue is a convenient place to keep that documentation.
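One way such a minimum standard could be enforced is with an automated check that runs before a dataset is accepted as a data source. The sketch below assumes a hypothetical set of four required documentation fields per column; the actual list would be whatever the organisation's data strategy mandates.

```python
# Hypothetical minimum documentation standard for every column in a dataset.
REQUIRED_DOC_FIELDS = ("description", "unit", "allowed_values", "derivation")


def missing_documentation(columns: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each column name to the required documentation fields it lacks."""
    gaps: dict[str, list[str]] = {}
    for column, docs in columns.items():
        # A field counts as missing if it is absent or contains only whitespace.
        missing = [f for f in REQUIRED_DOC_FIELDS if not docs.get(f, "").strip()]
        if missing:
            gaps[column] = missing
    return gaps


columns = {
    "flow_rate": {
        "description": "Flow through the inlet valve",
        "unit": "litres per second",
        "allowed_values": "0 to 500",
        "derivation": "raw reading",
    },
    # A cryptic field with almost no documentation.
    "dlt_p": {"description": "", "unit": "kPa"},
}

gaps = missing_documentation(columns)
```

Here `gaps` flags only the cryptic `dlt_p` column, listing the documentation fields that still need to be supplied before the dataset could be published.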
As we saw above, a very common issue in Australian organisations is the high level of distrust that employees have in the data available to them. An effective data strategy can address this by documenting data quality management processes.
These processes should be applied to all on-boarded data while also logging and making available the results of data quality procedures. From a technical perspective, a data profiling methodology similar to the following can be implemented:
The types and levels of quality control measures will depend on the data in question but there are a few broader ideas that can be applied when considering a data quality strategy.
Provide a mechanism for users to report data quality issues, assign each issue to an appropriate data steward for rectification, and keep a log of the discussion around it.
A data strategy can also mandate general principles to be followed during error remediation. For example, it could be mandated that data quality issues should be addressed as close to the ingestion point as possible, rather than remediation occurring as part of downstream data processing.
To this end, data capture in a structured form using software that enforces good data hygiene (for example using web pages that perform field validation and sanity checks) should be preferred over manual data entry and large amounts of free text. The goal of a data strategy is to not only produce high quality data but also ensure that people within the organisation have confidence in the data.
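The field validation and sanity checks mentioned above can be sketched as a function that runs at the point of capture and rejects a record before it is stored. The field names and the allowed range of 0 to 500 are assumptions chosen for illustration.

```python
from datetime import date


def validate_reading(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []

    # Required field: a non-blank site identifier.
    if not str(record.get("site_id", "")).strip():
        errors.append("site_id is required")

    # Format check: the date must be an ISO date such as 2024-03-01.
    try:
        date.fromisoformat(str(record.get("reading_date", "")))
    except ValueError:
        errors.append("reading_date must be an ISO date (YYYY-MM-DD)")

    # Type check, then a sanity check against a hypothetical physical range.
    try:
        value = float(record.get("value"))
    except (TypeError, ValueError):
        errors.append("value must be a number")
    else:
        if not 0 <= value <= 500:
            errors.append("value must be between 0 and 500")

    return errors


clean = validate_reading(
    {"site_id": "A1", "reading_date": "2024-03-01", "value": "12.5"}
)
rejected = validate_reading(
    {"site_id": " ", "reading_date": "01/03/2024", "value": "900"}
)
```

Returning every error at once, rather than stopping at the first, lets the person entering the data correct the whole record in one pass, which is what keeps remediation close to the ingestion point.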
Collecting data that cannot be analysed or used to enable new capabilities has little value. An organisation’s data strategy should recognise this by establishing data management processes that enable efficient and effective analysis and reporting, and which contribute to the development of new capabilities based on machine learning and AI.
For example, meaningful analysis is only possible when it is based on data that has context. Data should therefore be associated with meta-data that gives it context and locates it within an overall data architecture – that is, each data set should be annotated or “tagged” with essential meta-data.
Also, it is very rare that a single data project will deal with a single dataset. In many companies, data scientists and analysts construct their own data pipelines, integrating data from multiple sources over and over again for each new project. This takes an enormous amount of time and it can be done more efficiently during data ingestion – ideally just once, and then automated. Implementing advanced analytics, such as machine learning and AI, is also very difficult without annotated data (in fact, supervised machine learning requires that even individual records are ‘labelled’).
A data strategy should ensure that the provenance of all data used in analysis is easily traceable, that the data has context and meaning, and that new data sources can be created quickly by combining different datasets. Not only does this improve the efficiency of analytics teams but it also helps to ensure that everyone within an organisation is using the same data and that it can be understood and trusted.
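A minimal sketch of traceable integration follows, assuming each row is tagged with its source system at ingestion and that the tags are carried through when datasets are combined. The system names, the `_sources` field and the helper functions are hypothetical.

```python
def tag_rows(rows: list[dict], source: str) -> list[dict]:
    """Annotate every row with the system it came from."""
    return [{**row, "_sources": [source]} for row in rows]


def join_on(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Inner-join two tagged datasets on a key, merging their provenance tags.

    Assumes the key is unique within the right-hand dataset.
    """
    index = {row[key]: row for row in right}
    combined = []
    for row in left:
        match = index.get(row[key])
        if match is None:
            continue  # no counterpart in the right-hand dataset
        merged = {**row, **match}
        merged["_sources"] = row["_sources"] + match["_sources"]
        combined.append(merged)
    return combined


orders = tag_rows(
    [
        {"customer_id": 1, "total": 120.0},
        {"customer_id": 2, "total": 35.5},
        {"customer_id": 9, "total": 10.0},  # no matching customer record
    ],
    "billing_system",
)
customers = tag_rows(
    [
        {"customer_id": 1, "segment": "enterprise"},
        {"customer_id": 2, "segment": "retail"},
    ],
    "crm",
)

enriched = join_on(orders, customers, "customer_id")
```

Every row in `enriched` records that it was built from both the billing system and the CRM, so an analyst who questions a figure can trace it back to its origins. Doing this once at ingestion, rather than in each analyst's private pipeline, is also what keeps everyone working from the same combined data.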
The classic example of the power of meta-data is internet advertising. We have all been subjected to targeted advertising on the internet because of what we watched, searched for or talked about in an email. This happens because companies like Google and Facebook are experts in automatically collecting raw data from their systems, annotating it and combining it to form a rich behavioural profile of each person using the internet.
These profiles are constantly updated using machine learning, and are able to predict the interests, attitudes and likely future behaviour of each user. Whether you agree with this practice or not, it is highly effective. And this is precisely where many companies stumble – they are not able to integrate the raw data from their different systems to form a coherent picture of their operations or their clients. Without this integrative capability they will fail to unlock the potential of advanced analytics, machine learning and AI, even when it is clear where these technologies could be applied.
Australian organisations are clearly facing issues getting a handle on their data and drawing actionable insights from analytics. The root cause of this is often the lack of an overarching strategy for dealing with their data. Many organisations have invested in data infrastructure and an analytics capability, but have not considered the essential processes and management practices needed to bring these together.
We have touched on just a few aspects of developing a sound data strategy, but many organisations have quite a bit of work to do in order to align their data gathering activities with their analytics practices, and to facilitate better and more believable analyses. This transformation will not be easy, but a good data strategy is the critical foundation that will enable a company to take advantage of advanced analytics and derive real value from its investments in data collection and analytics.
Eric Louw is Director, Data, Risk and Analytics at Aurecon. He has twenty years' experience with leading management consulting firms and as an independent strategy consultant. He is the co-author of three business books, as well as numerous articles and academic papers.
Shahzad, M. Ahmad (January 3, 2017). "The big data challenge of transformation for the manufacturing industry". IBM Big Data & Analytics Hub.