Data Quality, Accuracy and Reliability
Big data is not immune to inaccuracies. For example, according to a recent report from Experian Data Quality, 75% of businesses believe their customer contact information is incorrect.
When organizations build their customer-relationship strategies on big data that contains bad data, big problems can follow. From small embarrassments to outright customer dissension, overconfidence in the accuracy of data can lead to:
- Overall poor business decisions
- Predicting outcomes that never come to pass
- Misreading, or failing to capitalize on, customer purchase trends and habits
- Moving a customer relationship along at an improper pace
- Conveying a wrong or misguided message to a customer
- Decreased customer loyalty and trust that, in turn, leads to customer retention issues and revenue loss
- Wasted marketing efforts
- Inaccurately assessing various risks
Not only can big data hold wrong information, it can also contain contradictions and duplicates. A database full of inaccurate data cannot provide the precise insight needed to support innovation and growth initiatives. Yet given the massive volume of data involved, drawn from so many sources, it would be surprising if big data were 100% accurate 100% of the time.
Reasons for bad data
How does big data wind up in such bad shape? The possible reasons are countless, and a specific error often results from several causes acting in combination. Human error, criminal behavior and collection mistakes are the general culprits; here are some more targeted examples:
- Incorrect conclusions about customer interests
- Usage of biased sample populations
- Lack of proper big data governance processes that would identify data inconsistencies
- Evaluative or leading survey questions that skew true opinion, behavior or belief
- Usage of outdated or incomplete information
- Multiple data sources improperly linking data sets
- Cybercrime activity that alters or corrupts data
It’s no secret that big data can be inaccurate, but that doesn’t mean you shouldn’t do everything you can to control the accuracy and reliability of your data. Eliminating or minimizing the ways inaccuracy festers within your network is key to combating the issue.
While many factors can contribute to the quality, accuracy and reliability of your data, here are a few common problem areas to consider:
Data Silos
A data silo is a warehouse of information under the control of a single department, closed off from outside visibility and isolated from the rest of an organization. It’s not unlike a farm silo. We can all see it from the road and we know it’s there, but those without a key have no idea what’s inside. Instead of grain or corn, however, a data silo houses business-critical information.
The issue with data silos is their isolation. They store data in disparate units that can’t share information with each other. There is simply no integration on the back end, and therefore the data you’ve collected can’t provide the meaningful, comprehensive insights you should gain from it.
Essentially, data silos are catalysts for inefficiency and redundancy, causing resources to be misused and productivity to drop. They’re a breeding ground for inaccurate data that prevents you from seeing the big picture.
What impact do data silos have on your organization?
Data silos generally produce one of two outcomes: the same data is stored by multiple teams, or teams store complementary but separate data. Neither situation yields positive results.
Storing data obviously carries a cost, and paying extra to store the same data in multiple places is not only inefficient, it also soaks up valuable resources that could be put to better use elsewhere in your business.
There’s also risk involved. There is the possibility that the “same” data collected in two different data silos can vary slightly. How would you decide which dataset is correct? Or more appropriately, how would you decide which dataset is the most accurate or up-to-date? If the wrong one is chosen, you risk relying on insight driven by outdated information.
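One common, if imperfect, tie-breaker is to trust whichever silo stamped its record most recently. The sketch below (Python, with entirely hypothetical records and field names) illustrates the idea; it only works if every silo reliably records when its data was last updated.

```python
from datetime import datetime

# Hypothetical records for the same customer, pulled from two different silos.
crm_record = {"customer_id": 42, "email": "j.doe@example.com",
              "last_updated": datetime(2016, 3, 1)}
billing_record = {"customer_id": 42, "email": "jane.doe@example.com",
                  "last_updated": datetime(2016, 7, 15)}

def most_recent(*records):
    """Trust whichever silo stamped its record most recently.

    This only works if every silo keeps a reliable last_updated value;
    without that metadata there is no principled way to choose.
    """
    return max(records, key=lambda r: r["last_updated"])

print(most_recent(crm_record, billing_record))  # the billing record wins here
```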
Data Silos – An overwhelming challenge
In a 2016 survey, F5 Networks, Inc. asked organizations how many applications were in their portfolios. 54% of respondents said they have as many as 200 on their networks, 23% said as many as 500, 15% said as many as 1,000, and 9% said between 1,001 and 3,000. Forbes reported, through a separate study by Netskope, that the typical enterprise has more than 500 applications in place.
With numbers that high, the thought of investigating a data problem by checking each data silo to make sense of the relevant information is overwhelming at best.
In this very real scenario, issue resolution is dreadfully slow not only because each silo must be sifted through, but also because you must determine which fragments of information are relevant to the problem at hand.
How do you solve the data silo problem?
Adding new big data initiatives typically heightens isolation, creating more data silos and more of the problems that come with them.
But an agnostic big data architecture can enable access to data across your organizational silos and provide comprehensive visibility into that segmented information. This essentially breaks down the data silos and eliminates their negative impact, while letting you effectively leverage all of your data investments across any deployment platform or technology stack.
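As a minimal sketch of that idea, the snippet below pulls related customer data out of two hypothetical, isolated stores (a CSV export and a SQLite table; the file, table and column names are all assumptions) and merges them into a single view. A real integration layer is far more involved, but the goal is the same: one comprehensive picture instead of several partial ones.

```python
import sqlite3
import pandas as pd

# Hypothetical silo #1: a CSV export from the marketing team's tool.
marketing = pd.read_csv("marketing_contacts.csv")  # assumed columns: customer_id, email, segment

# Hypothetical silo #2: a table owned by the support team.
conn = sqlite3.connect("support.db")
support = pd.read_sql_query(
    "SELECT customer_id, email, open_tickets FROM contacts", conn)

# One merged view of the customer: the comprehensive visibility
# that isolated silos cannot give you on their own.
unified = marketing.merge(support, on="customer_id",
                          how="outer", suffixes=("_mktg", "_support"))
print(unified.head())
```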
Data Cleansing
As you know, data isn’t always usable as it’s received. Preparing it for use, a process known as data cleaning or data cleansing, is normally slow and difficult.
Some estimates put the cost of poor-quality data to the U.S. economy at up to $3.1 trillion per year. That’s certainly a high number, but not necessarily a surprising one: weak data quality can lead to incorrect results from big data analytics and, in turn, to unwise decisions. It can also expose businesses to compliance issues, since the regulatory requirements of some industries demand that data be as accurate and current as possible.
Appropriate design and management of processes can help lessen the potential for poor data quality at the front end, but they can’t wipe it out. The solution is to make bad data usable through the removal or correction of errors and inconsistencies in a dataset. More specifically, the solution is data cleansing.
The Data Cleansing Challenge
Data cleansing is a tedious, time-consuming task that requires multiple complex steps. According to a survey by CrowdFlower, data scientists spend nearly 80% of their time preparing and managing data for analysis.
A detailed analysis of the data must be performed to uncover existing data errors or inconsistencies that ultimately need to be resolved. While this can be done manually, it typically requires the help of analytics tools and programs to streamline the process and make things more efficient.
Depending on the number and type of data sources, part of the data cleansing process may also include the following steps (a brief sketch follows the list):
- Steps to format the data to gain a consistent structure
- Transforming bad data into better quality, usable data
- Evaluation and testing of formatting and transformation definitions and workflows
- Repetition of analysis, design and verification steps
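To make those steps concrete, here is a minimal pandas sketch of a cleansing pass; the column names, formatting rules and verification check are illustrative assumptions rather than a prescribed workflow.

```python
import pandas as pd

# Hypothetical raw extract with the usual problems: inconsistent formatting,
# invalid values and duplicate rows. Column names are illustrative only.
raw = pd.DataFrame({
    "email":  ["J.Doe@Example.COM", "j.doe@example.com", None, "bad-address"],
    "phone":  ["(555) 010-0199", "555.010.0199", "5550100199", ""],
    "signup": ["2016-01-05", "2016-01-05", "2016-02-10", "not a date"],
})

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Formatting: impose one consistent structure per field.
    out["email"] = out["email"].str.strip().str.lower()
    out["phone"] = out["phone"].str.replace(r"\D", "", regex=True)
    out["signup"] = pd.to_datetime(out["signup"], errors="coerce")  # unparseable dates become NaT
    # Transformation: drop rows that cannot be repaired into usable records.
    out = out[out["email"].str.contains("@", na=False) & out["signup"].notna()]
    out = out.drop_duplicates(subset="email", keep="first")
    # Verification: fail loudly if the cleansed output still breaks a basic rule.
    assert out["email"].is_unique, "duplicate emails survived cleansing"
    return out

print(cleanse(raw))
```

In practice, each of these rules would come out of the detailed analysis described above, and the pass would be re-run whenever the source data changes.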
To minimize the potential for cleaning the same data twice, the cleansed data should be placed back into the original sources, replacing its inaccurate, error-ridden counterpart.
To be effective, the process of data cleansing must be repeated each time your data is accessed or anytime values change, making it far from a one-off task.
Best Practices to Clean and Preserve Your Data
While we’ve established that data cleansing is a labor-intensive process, there are some best practices you can use up front to help minimize the workload. Here are a few to consider:
Keep Your Data Updated
Set standards and policies for updating data and utilize technology to simplify this task, such as the use of parsing tools to scan incoming emails and automatically update contact information.
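As one hypothetical example of such a parsing tool, the sketch below scans an inbound message for a phone number in the signature and refreshes the matching contact record; real products handle far messier input.

```python
import re

# Hypothetical contact store keyed by email address.
contacts = {"j.doe@example.com": {"name": "Jane Doe", "phone": "555-010-0100"}}

PHONE_RE = re.compile(r"\b(\d{3})[-. ](\d{3})[-. ](\d{4})\b")

def update_from_email(sender: str, body: str) -> None:
    """Scan an incoming message for a phone number and refresh the contact record."""
    match = PHONE_RE.search(body)
    if match and sender in contacts:
        contacts[sender]["phone"] = "-".join(match.groups())

update_from_email(
    "j.doe@example.com",
    "Thanks!\n--\nJane Doe\nNew direct line: 555 010 0123",
)
print(contacts["j.doe@example.com"])  # phone is now 555-010-0123
```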
Validate Any Newly Captured Data
Set organizational standards and policies to verify all new data that is captured before it enters your database.
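One simple way to enforce that policy is a validation gate every new record must pass before it is written to the database. The sketch below uses assumed required fields and a deliberately loose email pattern purely for illustration.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_FIELDS = ("name", "email", "country")

def validate(record):
    """Return a list of problems; an empty list means the record can be stored."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if not record.get(field)]
    if record.get("email") and not EMAIL_RE.match(record["email"]):
        problems.append("email is not well formed")
    return problems

new_lead = {"name": "Jane Doe", "email": "jane.doe@example", "country": ""}
issues = validate(new_lead)
if issues:
    print("rejected before insert:", issues)  # caught before it reaches the database
```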
Ensure Reliable Data Entry
Implement policies to ensure all necessary data points are captured at the applicable time and ensure all employees are aware of these standards.
Remove Duplicate Data
Utilize tools to help remove any potential duplicate data generated by data silos or various other data sources.
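Because the “same” record rarely arrives identically formatted from two silos, exact matching often misses duplicates. The sketch below normalizes the matching key before de-duplicating; the data and normalization rules are assumptions, and real tools add fuzzy matching on names and addresses.

```python
import pandas as pd

# The "same" contact, as exported from two hypothetical silos.
combined = pd.DataFrame({
    "source": ["crm", "billing", "crm"],
    "email":  ["J.Doe@Example.com ", "j.doe@example.com", "a.smith@example.com"],
    "name":   ["Jane Doe", "Jane Doe", "Alan Smith"],
})

# Exact-match deduplication would miss the first two rows, so normalize the key first.
combined["email_key"] = combined["email"].str.strip().str.lower()
deduped = (combined.drop_duplicates(subset="email_key", keep="first")
                   .drop(columns="email_key"))
print(deduped)  # one row per real person
```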
Finding the Signal in the Noise
Every insight potentially has value, but finding the right one at the right time within a huge (and growing) mass of data often proves difficult.
If uncovered, a few bits of information could provide the invaluable business intelligence you need to push past your competitors. But those bits often get lost amongst the irrelevant information that surrounds them. Knowing that the information you need to establish dominance within your industry is right at your fingertips, yet being unable to grab it, is frustrating and maddening.
Maksim Tsvetovat, author of the book “Social Network Analysis for Startups”, points out that in order to use big data, “There has to be a discernible signal in the noise that you can detect, and sometimes there just isn’t one. You approach (big data) carefully and behave like a scientist, which means if you fail at your hypothesis, you come up with a few other hypotheses and maybe one of them turns out to be correct.”
Leaning on the expertise of a seasoned data scientist can help you discover the source of the noise within your big data ecosystem more quickly, giving you the chance to gain the actionable insight you need to make better business decisions and capitalize on growth opportunities.