Data Lake: everything you need to know about this concept

Our goal here is to show you all the nuances of this data storage and centralization technology that is part of an undeniable technological revolution focused on data.

We know how valuable data is. Yet most companies today still need to improve the way they collect, store, organize and interpret it.

This is because basing guidelines and major business decisions on data requires increasingly qualified professionals and cutting-edge technology.

The central issue is that the volume of raw information is very large, and it has to be filtered and interpreted before it can support good decisions. What should be done with a torrent of data?

Use it or store it? Faced with such questions, the concept of the data lake emerges as an innovative alternative for data storage.

What is a data lake?

A data lake is a repository capable of storing large amounts of raw data together, each in its original (native) format.

In other words, this repository supports a data management strategy that gives the manager a more refined view of the data.

When we are faced with an overwhelming amount of data, we need to understand it holistically so that the quality of storage improves.

Thus, as the name suggests, the data lake works like a large lake of data stored together in its native form, just as water from different springs flows into the same lake without any filtering or purification.

This way it is possible to observe and select certain sets of data that may be good – or that are not so relevant – for a particular company.

The structure of the data in a data lake is defined only when the data is queried. If you work directly with raw data, the data lake lets you access it through advanced analytical tools or analytical modeling.

When we query a data lake, we choose which data sets to analyze, and only when the need arises. This selection is made by applying schemas to the raw data.

This practice of applying a schema only when data is selected for analysis is called schema on read, since the data sits in raw storage until it is analyzed.
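The idea of applying a schema only at query time can be sketched in a few lines of Python. This is a minimal illustration, not a real data lake engine: the sample records, the TEMPERATURE_SCHEMA dictionary and the read_with_schema function are all hypothetical names invented for this example.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (one JSON object per line).
RAW_EVENTS = """\
{"device": "sensor-1", "temp": "21.5", "ts": "2024-01-01T10:00:00"}
{"device": "sensor-2", "temp": "19.0", "ts": "2024-01-01T10:05:00", "battery": 87}
{"user": "ana", "action": "login"}
"""

# The schema is chosen by the analyst at query time, not at storage time.
TEMPERATURE_SCHEMA = {"device": str, "temp": float, "ts": str}


def read_with_schema(raw_text, schema):
    """Yield the records that contain every schema field, cast to the given types."""
    for line in raw_text.splitlines():
        record = json.loads(line)
        if not all(name in record for name in schema):
            continue  # belongs to another data set; it stays untouched in the lake
        # Fields outside the schema (such as "battery") are simply not projected.
        yield {name: cast(record[name]) for name, cast in schema.items()}


readings = list(read_with_schema(RAW_EVENTS, TEMPERATURE_SCHEMA))
for reading in readings:
    print(reading)
```

The stored text is never modified. A different analysis could apply a different schema to the same raw records, for example one that selects the login events instead.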

How to explore data from a data lake?

Data lake users can analyze and explore lake data however they want, without needing to move that data to another system.

Typically, reports and insights are produced from a data lake on an ad hoc basis, that is, as the need arises, so users do not have to keep extracting analytical reports from another repository or platform.

However, it is worth applying some form of automation that can reproduce a given report when necessary.

Another important aspect of the functioning of data lakes is continuous maintenance so that data sets can be accessed and, consequently, used.

Without constant monitoring and maintenance, there is a risk that the data will become useless, too heavy, too expensive and inaccessible. A data lake whose contents have turned into virtual garbage is called a data swamp.

Although data does not have a fixed schema before it is stored in a data lake, governance is essential to keep the lake from becoming a data swamp.

What is the architecture of a data lake like?

As it is possible to keep data stored in an unstructured, semi-structured or structured way, the architecture of a data lake is relatively simple.

Furthermore, the data can be collected from several sources within the same organization, whereas a data warehouse stores data in folders or files. Don’t worry, we’ll talk about the differences between a data lake and a data warehouse later on.

It is also worth adding that a data lake can be hosted in the cloud or on premises.

Traditional storage systems don’t offer the storage scale of a data lake, which can reach exabytes.

This is quite relevant because, when creating a data lake, the manager very likely does not know in advance how much data will be stored.

This type of architecture is very useful for data scientists, who can extract the data and explore it within the company. They can also share it and discover new insights by cross-referencing it with heterogeneous data from different fields.

It is also possible to use big data analysis and machine learning as ways to analyze and evaluate the data in a data lake.

Additionally, it is important to tag data with metadata before introducing it into a data lake, to ensure its later accessibility.
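As a rough sketch of what tagging data with metadata before ingestion could look like, the Python example below keeps a small in-memory catalog. Everything in it is an assumption made for illustration: the Catalog and CatalogEntry classes, the field names and the sample paths are hypothetical and do not correspond to any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata recorded about one data set before it enters the lake."""
    path: str
    source: str
    data_format: str
    tags: set = field(default_factory=set)


class Catalog:
    """Hypothetical in-memory metadata catalog."""

    def __init__(self):
        self._entries = []

    def register(self, path, source, data_format, tags):
        # Untagged data is rejected, since nobody could find it again later.
        if not tags:
            raise ValueError("refusing to ingest untagged data: " + path)
        entry = CatalogEntry(path, source, data_format, set(tags))
        self._entries.append(entry)
        return entry

    def find(self, tag):
        """Return the paths of all data sets carrying the given tag."""
        return [entry.path for entry in self._entries if tag in entry.tags]


catalog = Catalog()
catalog.register("raw/iot/2024-01.json", "iot", "json", ["sensors", "temperature"])
catalog.register("raw/app/clicks.csv", "mobile_app", "csv", ["clickstream"])
print(catalog.find("temperature"))
```

Rejecting untagged data at ingestion time is one simple governance rule that helps keep the contents of the lake discoverable.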

Distinction between data lake and data warehouse

Some people believe that a data lake and a data warehouse are the same thing, and do not see the need for a data lake when a data warehouse is already available.

Let’s deconstruct this idea: they are two different things, and the only characteristic they have in common is that both are repositories for big data.

The data warehouse is older and provides a structured and organized data model, ready to generate reports. The data warehouse makes data available for use and analysis.

A standard data warehouse has:

Relational database for storage and management;
Data mining with report generation and statistical analysis;
Sophisticated analytical applications;
Tools that enable customer analysis.

The data that constitutes a data warehouse, as a rule, derives from multiple sources, such as transaction applications and log files.

It is also possible to create a history record for business analysts and data scientists.

Data warehouses were developed with the aim of analyzing data. Their analytical processing is carried out on data that has already been prepared for analysis, that is, contextualized and converted so that it can generate information.

Furthermore, they are capable of working with large amounts of data from different sources. A data warehouse is a suitable tool for an organization that wants to perform advanced analysis of data from multiple sources, based on historical data.

Data warehouses have the following main characteristics:

Variability over time;
Consistency between different data from different sources;
Data stability: once inserted into a data warehouse, data does not change;
Possibility of analyzing data according to the subject or area of the organization.

As for architecture, common data warehouses can be:

– Sandbox: safe and protected areas that enable informal and rapid data exploration;
– Simple: a shared basic design in which raw data, summary data and metadata are stored in the central repository;
– Simple with staging area: data passes through a staging area, where it is cleaned and processed before entering the data warehouse;
– Hub and spoke: as soon as the data is ready to be used, it is moved to data marts, which sit between the central repository and end users.
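To make the staging area concrete, the sketch below cleans and converts rows before loading them into a relational table, in contrast with a data lake, which would store them as they arrive. It is only an illustration under stated assumptions: SQLite from the Python standard library stands in for the warehouse database, and the orders table, the sample rows and the stage function are hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a transaction application.
RAW_ROWS = [
    {"order_id": "1", "total": " 9.90 "},
    {"order_id": "2", "total": "n/a"},  # cannot be converted, rejected in staging
    {"order_id": "3", "total": "15"},
]


def stage(rows):
    """Staging area: clean and convert each row, dropping rows that fail."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((int(row["order_id"]), float(row["total"].strip())))
        except ValueError:
            continue  # a real pipeline would log or quarantine the rejected row
    return cleaned


# The warehouse table has a fixed schema that exists before any data is loaded.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL NOT NULL)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", stage(RAW_ROWS))

loaded = warehouse.execute(
    "SELECT order_id, total FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Only the rows that survive cleaning reach the table, which is why warehouse data is ready for reporting but takes preparation effort up front.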

Data lakes, in turn, can store a huge amount of raw (unfiltered) data that can be used in the future if there is a specific need.

This is generally data from IoT devices, mobile applications, line-of-business data, social networks and many others that are collected in raw form and stored in the data lake.

The integrity, structure, selection and format of the data sets are all defined at the time of analysis, by the professional who carries it out.

If the company needs a lower-cost form of storage for unstructured and unformatted data that comes from multiple sources and has to be kept for later use, a data lake will be the ideal option.

Data warehouses have been in use for approximately 30 years and are extremely useful to organizations, but they were not designed for the current volume and variety of data. In addition, organizations spend approximately 20% of their time analyzing data and 80% preparing it. Most of the effort therefore goes into organizing and structuring data, some of which is not even meaningful to the organization.

The data lake can resolve this issue, as it has no predefined schema or model. Because the repository stores data in its raw form, it saves the time previously spent preparing and structuring data, which can be very advantageous for the company.

With its lower cost, the volume and speed at which it gathers and stores data, the flexibility of raw data collection and the ease of access to that data, the data lake is not only an excellent option for the organization but also a great tool to use in conjunction with a data warehouse.