Fishing In the Data Lake
InfoCommerce Group -- Specialized Business Information Publishing Expert
For Immediate Release:
Dateline: Philadelphia, PA
Friday, May 8, 2020


You have likely bumped into the hot new IT buzzword “data lake.” A data lake is simply a collection of data files, structured and unstructured, located in one place. Some see this as an advance over the “data warehouse,” where datasets are curated and highly organized. Fun fact: a bad data lake (one that holds too much useless data) is called a “data swamp.”

What’s the purpose of a data lake? Primarily, it’s to provide raw input to artificial intelligence and machine learning software. This new class of software is both powerful and complex, with the result that it has been credited with near-mystical qualities. As one senior executive of a successful manufacturing company told me, his company was aggressively adopting machine learning because “you just feed it the data and it gives you answers.” Yes, we now have software so powerful that it not only provides answers, but apparently formulates the questions as well.

The reality is much more mundane. This will not surprise any data publisher, but the more structure you provide to machine learning and artificial intelligence software, the better the results. That’s because while you can “feed” a bunch of disparate datasets into machine learning software, if there are no ready linkages between the datasets, your results will be, shall we say, suboptimal. And if the constituent data elements aren’t clean and normalized, you’ll get to see the axiom “garbage in, garbage out” playing out in real life.
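To make the point about linkages concrete, here is a minimal Python sketch. The records, field names and the helpers `normalize_name` and `link` are all invented for illustration and do not come from any particular product. It tries to join two small datasets on a company name, first as-is and then after a simple cleanup step.

```python
# Hypothetical records: company names and fields are invented for illustration.
customers = [
    {"company": "Acme Corp.", "city": "Philadelphia"},
    {"company": "globex  inc", "city": "Boston"},
]
orders = [
    {"company": "ACME CORP", "total": 1200},
    {"company": "Globex Inc.", "total": 450},
]


def normalize_name(name):
    """Lowercase, drop punctuation, and collapse whitespace to build a join key."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def link(left, right, key, normalizer=None):
    """Join two lists of dicts on `key`, optionally normalizing the key first."""
    norm = normalizer or (lambda value: value)
    index = {norm(row[key]): row for row in right}
    linked = []
    for row in left:
        match = index.get(norm(row[key]))
        if match is not None:
            # Fields from `left` win when both sides share a field name.
            linked.append({**match, **row})
    return linked


raw_matches = link(customers, orders, "company")
clean_matches = link(customers, orders, "company", normalize_name)
print(len(raw_matches), len(clean_matches))
```

On this toy data the raw join finds no matches, while the normalized join links both customers to their orders. Real-world matching is usually far messier than lowercasing and stripping punctuation, which is exactly why the cleanup work carries so much value.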

It’s a sad reality that highly trained and highly paid data scientists still spend the majority of their time acting as what they call “data wranglers” and “data janitors,” trying to smooth out raw data enough that machine learning will deliver useful and dependable insights. In a timely response to this, software vendor C3.ai has just launched a COVID-19 data lake. Its claimed value is that, rather than just gathering a collection of datasets in one place, C3.ai has taken the time to organize, unify and link the datasets.

The lesson here is that as data producers, we should never underestimate the value we create when we organize, normalize and clean data. Indeed, clean and organized data will be the foundation for the next wave of advances in both computing and human knowledge. Better data: better results.


News Media Interview Contact
Name: Russell Perkins
Group: InfoCommerce Report
Dateline: Philadelphia, PA, United States
Direct Phone: 610.649.1200, ext. 2