How to cope with the big data variety problem
Dealing with the variety of data and data sources is becoming a greater concern for enterprises. Here are ways to attack the data variety issue.
In addition to volume and velocity, variety is fast becoming a third big data "V-factor." The problem is especially prevalent in large enterprises, which have many systems of record and an abundance of both structured and unstructured data under management. These enterprises often have multiple purchasing, manufacturing, sales, finance, and other departmental functions in separate subsidiaries and branch facilities, and they end up with "siloed" systems because of the functional duplication.
Consequently, as enterprises work on their big data and analytics initiatives, they are finding that they need to harness this variety of data and system sources, both to maximize the return from their analytics and to apply what they learn across as many areas of the enterprise as they can.
Decentralized purchasing functions with their own separate purchasing systems and data repositories are a great example.
"When procurement is decentralized, as it often is in very large enterprises, there is a risk that these different purchasing organizations are not getting all of the leverage that they could when they contract for services," said Andy Palmer, CEO of Tamr, which uses machine learning and advanced algorithms to "curate" data across multiple sources by indexing and unifying the data into a single view. "Theoretically, purchasing agents should be able to benefit from economies of scale when they buy, but they have no way to look at all of the purchasing systems throughout the enterprise to determine what the best price is for the commodity they are buying that someone in the enterprise has been able to obtain."
Palmer says Tamr addresses this with an on-premises "best price" website that purchasing agents from different corporate divisions can reference. The service uses Tamr's machine learning and algorithms to analyze different purchasing data categories across disparate purchasing systems in order to come up with best prices, which purchasing agents throughout the enterprise can then access. "We use an API (application programming interface) so the service can be instrumented into different procurement applications," said Palmer. "The results for some of our customers have been annual procurement savings in the tens of millions of dollars, since they now can get the 'best price' for goods and services when they negotiate."
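To make the "best price" idea concrete, here is a minimal Python sketch of the general approach: pull purchase records from separate systems, map their differing field names onto one shape, and keep the lowest unit price seen for each commodity. It is not Tamr's implementation or API. The division names, field names, and prices are all hypothetical, and commodity names are matched by simple normalization rather than by machine learning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    source: str        # which purchasing system the record came from
    commodity: str     # normalized commodity name
    unit_price: float


def normalize(source, row, field_map):
    """Map one raw row onto the common Purchase shape."""
    name_field, price_field = field_map
    return Purchase(
        source=source,
        commodity=row[name_field].strip().lower(),
        unit_price=float(row[price_field]),
    )


def best_prices(purchases):
    """Return the lowest-priced purchase seen for each commodity."""
    best = {}
    for p in purchases:
        current = best.get(p.commodity)
        if current is None or p.unit_price < current.unit_price:
            best[p.commodity] = p
    return best


# Hypothetical field layouts for two siloed purchasing systems.
FIELD_MAPS = {
    "division_a": ("item", "price"),
    "division_b": ("ITEM_DESC", "UNIT_COST"),
}

# Hypothetical sample data; real systems would be queried instead.
RAW = {
    "division_a": [
        {"item": "Copper Wire", "price": "4.10"},
        {"item": "Steel Bolt", "price": "0.32"},
    ],
    "division_b": [
        {"ITEM_DESC": " copper wire ", "UNIT_COST": "3.85"},
        {"ITEM_DESC": "steel bolt", "UNIT_COST": "0.35"},
    ],
}

purchases = [
    normalize(source, row, FIELD_MAPS[source])
    for source, rows in RAW.items()
    for row in rows
]
result = best_prices(purchases)

for commodity, p in sorted(result.items()):
    print(f"{commodity}: {p.unit_price:.2f} (from {p.source})")
```

In practice the hard part is the step this sketch glosses over: deciding that two differently described line items in two systems are the same commodity, which is where the curation work Palmer describes comes in.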
Purchasing is just one use case that points to the need for large enterprises to use their systems of record to drive the big data analytics they perform. "These enterprises started off by putting their big data into 'data lake' repositories, and then they ran analytics," said Palmer. Later, enterprises added query tools like Hive and Pig to help them sort through their big data. However, what they eventually discovered was that they needed to provide the right business context in order to ask the right analytical questions that would benefit the business. They could only do this by using their systems of record, and the organization of data inherent in those systems, as drivers for their big data analytics.
Palmer says that data "curation" is one way to attack the variety issue that comes with having to navigate not only multiple systems of record but also multiple big data sources. It combines machine learning and advanced algorithms that seek "high confidence levels" and data quality in cross-referencing and connecting data from a variety of sources into a condensed single source. "The end result is not a system of record, but a system of reference that can cope with the variety of data that is coming in to large organizations," said Palmer.
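As a rough illustration of cross-referencing records into a single reference under a confidence threshold, the Python sketch below groups vendor names from different sources when their string similarity is high enough. It deliberately swaps in a simple standard-library similarity score (difflib's SequenceMatcher) for the machine learning Palmer describes, so it shows the shape of the problem rather than Tamr's method. The source names, vendor names, and the 0.85 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_reference(records, threshold=0.85):
    """Group (source, name) records into reference entries.

    A record joins the existing entry it resembles most, but only when
    the similarity meets the confidence threshold. Otherwise it starts
    a new entry of its own.
    """
    reference = []
    for source, name in records:
        best_entry, best_score = None, 0.0
        for entry in reference:
            score = similarity(name, entry["canonical"])
            if score > best_score:
                best_entry, best_score = entry, score
        if best_entry is not None and best_score >= threshold:
            best_entry["members"].append((source, name))
        else:
            reference.append({"canonical": name, "members": [(source, name)]})
    return reference


# Hypothetical vendor names held in two separate systems of record.
records = [
    ("erp_east", "Acme Industrial Supply"),
    ("erp_west", "ACME Industrial Supply Inc"),
    ("erp_west", "Globex Corporation"),
]

reference = build_reference(records)

for entry in reference:
    sources = sorted({source for source, _ in entry["members"]})
    print(entry["canonical"], "<-", sources)
```

The threshold is the design choice that matters here: set it too low and distinct vendors are merged, set it too high and the same vendor stays split across silos. A production system would typically route borderline scores to a person for review rather than deciding automatically.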
Finding ways to achieve high data quality and confidence for the business by harnessing data variety is not the only thing enterprises need in their big data preparation; there are also steps like ETL (extract, transform, load) and MDM (master data management) that are part of the data prep continuum. Nevertheless, dealing with the variety of data and data sources is becoming a greater concern.
"We have seen a large growth in these projects over the past three to six months," noted Palmer. "Organizations want to take their structured data from a variety of systems of record, unify it, and then use it to drive business context into their unstructured and semi-structured big data analytics."