The Enterprise Big Data Lake and Data Catalogs

The current technology landscape is both a blessing and a curse for your company:

  • A blessing, because with access to very flexible and powerful cloud computing platforms, we can spin up virtually unlimited storage and computing power on demand and run state-of-the-art data analysis, machine learning, and AI systems on top of it.
  • A curse, because such easy access to so many practical and flexible technology solutions, as well as powerful open source data systems, creates the illusion that you can auto-magically overcome the challenges of creating more value from your company’s data assets.

In principle, no company would object to becoming more data-driven, quickly adopting flexible, smart automation solutions, and enhancing its business processes with machine learning and artificial intelligence systems.

Unfortunately, many of those companies risk underestimating the importance of well-designed data management systems, processes, platforms, and technologies, all of which are inherently tied to data quality. Unless these topics are handled properly, it is very challenging and costly to build reliable predictive analytics, machine learning, and AI systems, whose raw material is high-quality, well-managed data. At TM Data ICT Solutions, we help our customers overcome these strategic challenges and ensure that their ongoing data efforts deliver the expected ROI.

A recent book, The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science, not only provides striking examples of best practices for managing complex enterprise big data in service of creating value from data assets, but also draws attention to the pitfalls and risks involved. Its focus on topics such as the importance of smart, automated, and well-designed Data Catalogs, and the challenges of capturing the context, metadata, and data lineage surrounding Critical Data Elements, carries many lessons that apply to enterprises big and small alike. Many companies are also negatively impacted by the dark data in their data landscape. If the quotes from the book below ring a bell for you, if they make you uncomfortable because you are all too familiar with such challenges, don’t hesitate to contact us!

  • “The vision is often to eventually get rid of the data warehouse to save costs and improve performance, since big data platforms are much less expensive and much more scalable than relational databases. However, just offloading the data warehouse does not give the analysts access to the raw data. Because the rigorous architecture and governance applied to the data warehouse are still maintained, the organization cannot address all the challenges of the data warehouse, such as long and expensive change cycles, complex transformations, and manual coding as the basis for all reports. Finally, the analysts often do not like moving from a finely tuned data warehouse with lightning-fast queries to a much less predictable big data platform, where huge batch queries may run faster than in a data warehouse but more typical smaller queries may take minutes.”
  • “Why is it so difficult to find data in the enterprise? Because the variety and complexity of the available data far exceeds human ability to remember it. Imagine a very small database, with only a hundred tables (some databases have thousands or even tens of thousands of tables, so this is truly a very small real-life database). Now imagine that each table has a hundred fields—a reasonable assumption for most databases, especially the analytical ones where data tends to be denormalized. That gives us 10,000 fields. How realistic is it for anyone to remember what 10,000 fields mean and which tables these fields are in, and then to keep track of them whenever using the data for something new?”
  • “Now imagine an enterprise that has several thousand (or several hundred thousand) databases, most an order of magnitude bigger than our hypothetical 10,000-field database. I once worked with a small bank that only had 5,000 employees, but managed to create 13,000 databases. I can only imagine how many a large bank with hundreds of thousands of employees might have. The reason I say “only imagine” is because none of the hundreds of large enterprises that I have worked with over my 30-year career were able to tell me how many databases they had—much less how many tables or fields. Hopefully, this gives you some idea of the challenge analysts face when looking for data.”
  • “Ideally, the analysts should be able to request access to the data they need. However, if they cannot find the data without having access to it, we have a catch-22.”
  • “In most enterprises the knowledge about where data is, which data sets to use for what, and what data means is locked in people’s heads—this is commonly referred to as ‘tribal knowledge‘.”
  • “Without a Data Catalog, in order to find a data set to use for a specific problem, analysts have to ask around until they find someone—if they’re lucky, a subject matter expert (SME)—who can point them to the right data. SMEs can be difficult to find, though, so the analyst may instead run into someone who tells them about a data set that they used for a similar problem, and will then use that data set without really understanding what was done to it or where it came from.”
  • “The solution is a more agile approach to access control that some enterprises are beginning to adopt. They create metadata catalogs that allow the analysts to find any data set without having access to it. Once the right data sets have been identified, the analysts request access to them and the data steward or data owner decides whether to grant access, for how long, and for which portions of the data. Once the access period expires, the access can be automatically revoked or an extension requested.”
  • “It is much easier to document data sets when they are first created, because the information is fresh. Nevertheless, even at Google, while some popular data sets are well documented, there is still a vast amount of dark or undocumented data.”
  • “In traditional enterprises, the situation is much worse. There are millions of existing data sets (files and tables) that will never get documented by analysts unless they are used—but they will never be found and used unless they are documented. The only practical solution is to combine crowdsourcing with automation.”
  • “More modern data catalogs—especially catalogs with automated tagging—allow data quality specialists and data stewards to define and apply data quality rules for a specific tag.”
  • “For example, a numeric field with three-digit numbers ranging from 000 to 999 is very likely to be a credit card verification code if it is found next to a credit card number, but a field with exactly the same data is very unlikely to be a credit card verification code if found in a table where all the other tags refer to medical procedure properties.”
  • “To solve the completeness problem, create a data catalog of all the data assets, so the analysts can find and request any data set that is available in the enterprise.”
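The “find first, request access later” workflow quoted above translates naturally into code. The sketch below is a minimal illustration in Python; it is not the book’s implementation or any particular catalog product’s API, and the class and field names (CatalogEntry, AccessGrant, expires_at, and so on) are hypothetical assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical sketch of the "find first, request access later" pattern:
# everyone can search catalog metadata, but the data itself stays locked
# until a steward approves a time-limited grant.

@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: list[str]
    steward: str  # who decides on access requests

@dataclass
class AccessGrant:
    dataset: str
    analyst: str
    columns: list[str]        # a grant may cover only a subset of fields
    expires_at: datetime

    def is_active(self) -> bool:
        return datetime.utcnow() < self.expires_at

class MetadataCatalog:
    def __init__(self) -> None:
        self.entries: dict[str, CatalogEntry] = {}
        self.grants: list[AccessGrant] = []

    def register(self, entry: CatalogEntry) -> None:
        self.entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        # Metadata search is open to everyone, even without data access.
        kw = keyword.lower()
        return [e for e in self.entries.values()
                if kw in e.description.lower()
                or kw in (t.lower() for t in e.tags)]

    def request_access(self, dataset: str, analyst: str,
                       columns: list[str], days: int = 30) -> AccessGrant:
        # In a real system the steward would approve or deny the request;
        # here we grant it directly to keep the sketch short.
        grant = AccessGrant(dataset, analyst, columns,
                            expires_at=datetime.utcnow() + timedelta(days=days))
        self.grants.append(grant)
        return grant

    def can_read(self, dataset: str, analyst: str) -> bool:
        # Expired grants are ignored, so access is effectively auto-revoked.
        return any(g.dataset == dataset and g.analyst == analyst and g.is_active()
                   for g in self.grants)
```

In practice an analyst would call search() first, then request_access() for only the columns they need; a scheduled job (not shown) could prune expired grants and notify analysts to request extensions.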
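The credit card example quoted above also shows how context-aware automated tagging and tag-based data quality rules fit together. Below is a minimal, hypothetical Python sketch of that idea; the tag names, the sibling-tag heuristic, and the rule thresholds are our own illustrative assumptions, not the book’s or any vendor’s implementation.

```python
import re

# Hypothetical context-aware tagger: a three-digit numeric column is tagged as a
# card verification code only when a sibling column in the same table already
# carries the "credit_card_number" tag, as in the example quoted above.

THREE_DIGITS = re.compile(r"^\d{3}$")

def tag_column(values: list[str], sibling_tags: set[str]) -> set[str]:
    tags: set[str] = set()
    if values and all(THREE_DIGITS.match(v) for v in values):
        if "credit_card_number" in sibling_tags:
            tags.add("card_verification_code")  # context supports the guess
        # In a table full of medical-procedure tags we deliberately stay silent.
    return tags

# A data quality rule is defined once per tag, then applied to every column
# that carries that tag (the "rules for a specific tag" idea quoted above).
QUALITY_RULES = {
    "card_verification_code": lambda v: bool(THREE_DIGITS.match(v)),
}

def check_quality(values: list[str], tags: set[str]) -> dict[str, float]:
    # Returns, per tag, the fraction of values that satisfy the tag's rule.
    report = {}
    for tag in tags & QUALITY_RULES.keys():
        rule = QUALITY_RULES[tag]
        report[tag] = sum(rule(v) for v in values) / max(len(values), 1)
    return report

if __name__ == "__main__":
    tags = tag_column(["123", "456"], sibling_tags={"credit_card_number"})
    print(tags)                                 # {'card_verification_code'}
    print(check_quality(["123", "45a"], tags))  # {'card_verification_code': 0.5}
```

The design point is that the rule is attached to the tag, not to any single table, so once the tagger (automated or crowdsourced) labels a new column, the relevant quality checks follow automatically.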
