Data Lake - or Data Swamp?

Aug 18, 2022

Is your data starting to feel a little murky?

As open access (OA) drives significant changes in how content is created, promoted, and valued, research publishers have a new interest in understanding the complex customer journeys and interactions of their readers, and are ramping up their investment in data as a result.

Publishers often have multiple valuable data sources collecting reader information, content-level performance, advertising activity, and more - but because that data is often siloed and trapped within each platform, organizations can’t unlock the valuable insights and trends buried within.

It’s this recognition that you can’t benefit from data you can’t clearly see that has driven scholarly publishers to embrace the concept of a data lake.

At its simplest, a data lake stores all your data in one place. A data lake is formed when data from disparate platforms is loaded in its “natural state,” giving publishers the ability to build dashboards, harness data visualization, and even apply real-time analytics and machine learning to access deep audience and content insights across their full library of content.

But what may appear rich, fertile, and full of promise from the outside may turn out to be a data swamp in disguise. In reality, most data lakes are dense and hostile, and publishers can lose good expeditioners in there for years.

When does a data lake become a swamp?

Although organizations have begun to acknowledge how critical data is to their business and to invest in technology to capture it, a recent CMO survey found that marketing analytics is used in decision-making only about 39% of the time.

If your data management system isn’t helping to keep data from getting cloudy, you might be wading into a data swamp instead.

Keep an eye out for these common scenarios that might be “swamping” up your data lake:

Your data lake is stagnant
You don’t automatically get value from data just by collecting it. Data is only worthwhile if you can use it to make smart decisions in a timely manner, and if your team doesn’t have the resources available to dive in and extract those insights, all you’ve done by setting up a data lake is move information from one place to another.

Even with a smart data scientist on board, data lakes can be limiting because they don’t account for the ongoing, real-time actions of your readers.

Data lakes are “static,” meaning the connections between the data stored within them are not continuously or automatically updated. This archive of data houses information that scientists can retrieve and combine in a variety of ways. While old data can still surface new insights, the value of those insights deteriorates quickly over time, and they can’t be used reliably for personalization.

Your data lake is polluted
The insights that come from your data lake can only be as good as the data flowing into it. (Garbage in, garbage out!)

While many publishers are attracted to data lakes as a way to “un-silo” data, simply dumping in data from these silos can lead to even muddier waters. If data silos have existed in your organization, there’s a good chance that your data lake will contain duplicate records, name misspellings, misformatted information, and other inconsistencies - all things that can skew data analysis and take significant time and manual effort to correct.
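To make the cleanup problem concrete, here is a minimal Python sketch of normalizing and de-duplicating reader records before they land in a data lake. The field names (`name`, `email`, `source`) and the choice of email as the matching key are assumptions for illustration; real reader data usually needs fuzzier matching than this.

```python
import re


def normalize_record(record):
    """Return a copy of the record with tidy name and email fields."""
    # Collapse repeated whitespace, trim, and title-case the name.
    name = re.sub(r"\s+", " ", record.get("name", "")).strip().title()
    # Emails are compared case-insensitively in this sketch.
    email = record.get("email", "").strip().lower()
    return {**record, "name": name, "email": email}


def deduplicate(records):
    """Keep the first record seen for each normalized email address."""
    seen = {}
    unmatched = []  # records with no email cannot be matched, so keep them as-is
    for record in map(normalize_record, records):
        key = record["email"]
        if not key:
            unmatched.append(record)
        elif key not in seen:
            seen[key] = record
    return list(seen.values()) + unmatched


# Hypothetical reader records exported from two siloed platforms.
raw = [
    {"name": "  ada   lovelace", "email": "Ada@Example.org ", "source": "journal_site"},
    {"name": "Ada Lovelace", "email": "ada@example.org", "source": "email_platform"},
    {"name": "Alan Turing", "email": "alan@example.org", "source": "journal_site"},
]

clean = deduplicate(raw)
```

Here the two Ada Lovelace records collapse into one because their emails match after normalization. Catching true misspellings would take an additional fuzzy-matching step that this sketch leaves out.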

Without a plan for taxonomy and semantic consistency, a data cleaning strategy, and data security, you can poison the insights that are being pulled from your data lake.

You’re drowning in irrelevant data
Not all data is created equal. Because data lakes offer expanded capabilities for manipulating data, it’s easy to get distracted by vanity metrics or by creating numerous charts and tables, while completely missing the metrics that actually drive results.

Has your data lake improved speed to insights? Have you seen notable impacts that lead to quantifiable ROI? Are you solving the business problems you set out to address? If the answer is no, you may be wading in the weeds.

You don’t have permission to fish in the data lake
Storing all your data in a data lake without a smart data security strategy? You’re exposing your organization to immense regulatory risk.

Encryption and data security practices are constantly evolving, and can have a drastic impact on your data lake, especially when it comes to storing data that includes credit card numbers, personally identifiable information (PII), or confidential business activity data.
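One common precaution is to strip or pseudonymize sensitive fields before a record is stored. The Python sketch below assumes hypothetical field names (`email`, `credit_card_number`) and a placeholder key; it illustrates the idea of keyed hashing with the standard library and is not a compliance recipe.

```python
import hashlib
import hmac

# Hypothetical field names; a real schema will differ.
DROP_FIELDS = {"credit_card_number"}
PSEUDONYMIZE_FIELDS = {"email"}


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (HMAC), as hex text."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub_record(record, secret_key):
    """Drop fields that should never be stored and pseudonymize identifiers."""
    scrubbed = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # never written to the lake
        if field in PSEUDONYMIZE_FIELDS:
            scrubbed[field] = pseudonymize(value, secret_key)
        else:
            scrubbed[field] = value
    return scrubbed


# Placeholder key for illustration only; a real key belongs in a secrets manager.
SECRET_KEY = b"example-key-not-for-production"

event = {
    "email": "reader@example.org",
    "credit_card_number": "4111111111111111",
    "article_id": "a-102",
}
safe_event = scrub_record(event, SECRET_KEY)
```

Because the digest is keyed and deterministic, the same reader still maps to the same token across events, so engagement can be analyzed without storing the raw email address.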

Regulations like GDPR and CCPA have placed restrictions on the use of certain types of data, and with the upcoming elimination of third-party cookies, an organization’s commitment to first-party data and data compliance is more important than ever.

The value of your data is drying up
The volume and breadth of information in a data lake can make it a powerful tool in the hands of a data scientist - but when sales and marketing aren’t able to act on the insights uncovered, the value of that data evaporates.

What good does it do to un-silo your data, just to find that you’ve created a new silo in your data lake?

Data lakes don’t integrate with your MarTech, which means your team has to manually launch personalization in advertising and email marketing platforms. This doesn’t just tank the efficiency and effectiveness of your team - it can also introduce room for inconsistency and human error that blunts the power of your insights.

A CDP vs a Data Lake

If a data lake functions, for lack of a better metaphor, like a lake, a Customer Data Platform (CDP) functions more like a municipal water supply system. Your CDP is constantly ingesting new information as people engage with your content, validating it for quality and consistency, combining it with existing data to form a complete 360-degree view of each reader in your ecosystem, and packaging it for analysis and distribution to third-party platforms.
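The ingest, validate, and combine steps described above can be sketched in a few lines of Python. This is a toy illustration only, not how Hum or any particular CDP is implemented; the event fields (`reader_id`, `article_id`, `action`) and the `ProfileStore` class are hypothetical.

```python
from collections import defaultdict

# Hypothetical minimum fields an event needs before it is accepted.
REQUIRED_FIELDS = ("reader_id", "article_id", "action")


def is_valid(event):
    """An event is valid when every required field is present and non-empty."""
    return all(event.get(field) for field in REQUIRED_FIELDS)


class ProfileStore:
    """Folds each incoming event into a running per-reader profile."""

    def __init__(self):
        self.profiles = defaultdict(
            lambda: {"articles": set(), "actions": defaultdict(int)}
        )
        self.rejected = []

    def ingest(self, event):
        # Validate on the way in, rather than cleaning up afterwards.
        if not is_valid(event):
            self.rejected.append(event)
            return False
        profile = self.profiles[event["reader_id"]]
        profile["articles"].add(event["article_id"])
        profile["actions"][event["action"]] += 1
        return True


store = ProfileStore()
events = [
    {"reader_id": "r1", "article_id": "a-102", "action": "view"},
    {"reader_id": "r1", "article_id": "a-102", "action": "download"},
    {"reader_id": "r1", "article_id": "a-250", "action": "view"},
    {"reader_id": "", "article_id": "a-250", "action": "view"},  # fails validation
]
for event in events:
    store.ingest(event)
```

The point of the sketch is that each profile is updated as events arrive, instead of being reassembled later from an archive.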

A CDP can churn through new data consistently, surfacing actionable insights more quickly, without the need for advanced data processing and analytics. This means that your team can clearly see how your audience is engaging with your content and how certain pieces or types of content are performing, then act on that data directly within the systems they rely on.

Watch this four-minute video to learn more about what a CDP can do.

“Un-Swamp” Your Data

At Hum, we don’t think that publishers should need an in-house data scientist to hack their way through the swamp in order to get real, powerful insights from their data.

Our CDP was purpose-built for publishers, and designed to make it easy for your team to access and act on the right insights to identify and grow your audience, engage readers, and maximize revenue.

Grab your free copy of Hum's latest whitepaper - Turning Disparate Data into Solid Gold - to learn more about the strengths and limitations of technology solutions, such as data lakes, data warehouses, DMPs, and more.