AWS Cloud Data Lake, or how to analyse business data effectively – Case Study
A Data Lake will solve several important problems you might face when storing and processing large volumes of data. It enables easier access, more effective data analysis (including by AI and machine learning algorithms), scalability, data integration and real-time processing. Importantly, it can also reduce your expenditure by relying on cheaper and more flexible ways of storing and processing data. In this case study, we explain, step by step, how we built an AWS cloud Data Lake for a client dealing with large volumes of unstructured data.
What is a Data Lake and why might you want it?
A Data Lake environment is a set of services for processing and storing large volumes of data from disparate sources. It can handle structured data (e.g. the tabular data of relational databases or CSV files), semi-structured data (e.g. JSON and XML files) and unstructured data (audio, video, photos). A great advantage is that data lakes cope with any data quantity, from small volumes up to petabytes.
A Data Lake also lets you pick your preferred data processing frequency, ranging from classical batch processing once a day all the way to stream processing, where results appear in near real time from the moment the input data are generated.
Data Lake and your business
From a business point of view, a Data Lake lets you tap the full value of your data. It collects data in all formats from many sources and company systems. The data do not need to be pre-processed and are not stored with any specific application in mind, so they give you a complete overview of the situation. They can feed all sorts of analyses, which in turn drive reports, dashboards, real-time analytics and even machine learning.
Conclusion: knowing how to use a Data Lake in your company allows you to make better business decisions, based on actual data rather than guesswork.
AWS Cloud Data Lake case study – our client
Let us tell you about a project we did for one of our clients. The company in question approached us because it was struggling with ineffective processing and use of unstructured data sourced from several systems (including popular relational databases, several business applications and social media). These unstructured data made it difficult to get a good overview of the company’s customers, products and business processes, and the cost of the repository kept growing.
Our client wanted to regain control over their data so they could use them for market and customer analysis. They assumed they would have to open up to other data sources in the future, so they wanted a scalable solution whose costs they could keep under control. The AWS cloud seemed the best option.
Building an AWS Cloud Data Lake
To build the cloud solution, we decided to rely on serverless services, so that our client would not have to bear any further support and maintenance costs. The company now pays only for the service time it actually uses, and the environment scales up or down automatically as needed. They no longer have to worry about data volumes or about how many users want to use the resources of the new environment at the same time.
Data are now stored in S3, which separates compute from storage and lets the data be processed independently while remaining available without interruption. The data processed in the newly built cloud solution are saved as objects in S3 buckets. They can be loaded using various ETL tools, e.g. Informatica PowerCenter, or via AWS services such as AWS Glue or DMS (Database Migration Service). However, we are not going to discuss the ETL mechanisms here; instead, we want to focus on data processing within the AWS cloud Data Lake environment.
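As an illustration of how raw extracts might land in the lake, here is a minimal boto3 sketch; the bucket name and object keys are hypothetical placeholders, not part of the client's actual setup.

```python
import boto3

# Hypothetical bucket and prefix names, used for illustration only.
RAW_BUCKET = "example-datalake-raw"

s3 = boto3.client("s3")

# Upload a raw extract (e.g. a CSV dump from a source system) into the raw zone.
s3.upload_file(
    Filename="customers_2024-01-15.csv",
    Bucket=RAW_BUCKET,
    Key="raw/crm/customers/ingest_date=2024-01-15/customers.csv",
)

# List what has landed under the raw prefix.
response = s3.list_objects_v2(Bucket=RAW_BUCKET, Prefix="raw/crm/customers/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```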
The figure below shows the overall architecture of the solution:
Data structuring and cleaning
The client’s situation was as follows: the input data were unstructured, not optimised for processing, and their contents required modification.
For this purpose, we used AWS Glue, which processed the input data: it transformed, cleaned and anonymised them and eliminated irrelevant records. The results were saved in a form with three features that matter for further processing in the Data Lake environment (a short sketch after the list shows how all three come together in a Glue job):
1. Data saved in an appropriate format
Big Data environments work well with three file formats: Parquet, Avro and ORC. All three are binary formats, organised by row (Avro) or by column (Parquet, ORC), and they differ in how the file is structured and how well it compresses.
When designing a Data Lake, you need to pick a file format. We decided in favour of Parquet, since it is a column-ordered format that works well with Spark, the data processing engine used by AWS Glue. In a column-ordered binary file, each column has its own separate structure, so you do not need to read entire rows or blocks to fetch the values of a chosen column, and the file does not require indexing. However, the data you save for further processing should be split into smaller files so that they can be processed in parallel.
2. Compressed data
Data in S3 can be compressed, e.g. with Gzip for flat CSV files. Big data file formats have dedicated compression algorithms, and the choice is up to the designer. Remember, however, that algorithms with higher compression ratios may consume more processing power during compression and decompression. We used the Snappy algorithm.
3. Key-partitioned data
We used the S3 prefix (folder and subfolder) structure to partition data in the target cloud solution. After processing in AWS Glue, we obtained the desired data structures in S3.

The data then had to be catalogued so they could be made available to users in a convenient way. For this purpose, we relied on the AWS Glue Data Catalog. We created a Glue Crawler, which scans the S3 files to build metadata (table names, columns, data types, partition lists) and saves this information to the Data Catalog. Thanks to these services, practically no manual data description is needed, which would otherwise be very time-consuming. When a new table is added to the Data Lake, or an existing table’s structure changes, all you need to do to refresh the Glue Data Catalog is run the Crawler again.
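A minimal sketch of an AWS Glue (PySpark) job that brings these three points together (column-ordered Parquet, Snappy compression and key-based partitioning, plus the cleaning and anonymisation step mentioned earlier) might look roughly like this. The database, table, column and bucket names are hypothetical, not the client's actual schema.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard AWS Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical source table in the Glue Data Catalog (created by a crawler over the raw zone).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="crm_customers"
).toDF()

# Cleaning and anonymisation: drop an irrelevant column, hash personal identifiers,
# remove duplicate records.
cleaned = (
    raw.drop("internal_notes")
       .withColumn("email", F.sha2(F.col("email"), 256))
       .dropDuplicates()
)

# Write column-ordered Parquet, compressed with Snappy and partitioned by ingest date,
# so that downstream queries can prune both columns and partitions.
(
    cleaned.write.mode("overwrite")
        .format("parquet")
        .option("compression", "snappy")
        .partitionBy("ingest_date")
        .save("s3://example-datalake-curated/crm/customers/")
)

job.commit()
```

After such a job finishes, the Crawler described above can be re-run (for example with the boto3 Glue client's start_crawler call) so that the Glue Data Catalog picks up new partitions or schema changes.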
Data ready for users
At this point, we could make the data available to users. For this purpose, we chose AWS Athena, a service that reads data directly from S3 files using the information in the AWS Glue Data Catalog. Athena uses SQL as its native query language. The service is serverless, which means that, again, our client pays only for the queries actually executed, billed according to the number of bytes of file data scanned.
To lower the cost of the environment further, we took advantage of the way the files are organised in S3 (see the sketch after this list):
- Our SQL queries read only the necessary columns (as mentioned before, Parquet files are column-ordered);
- Compression reduces the number of bytes Athena has to read, which further lowers costs;
- Data partitioning limits the number of files that need to be scanned, so our SQL queries include WHERE conditions on the partition keys wherever possible.
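As an illustration, here is a minimal boto3 sketch of an Athena query that reads only the columns it needs and filters on a partition key; the database, table, column and output-location names are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena")

# Select only the required columns and filter on the partition key (ingest_date),
# so Athena scans as few bytes as possible. All names below are illustrative only.
query = """
SELECT customer_id, country, lifetime_value
FROM curated_db.crm_customers
WHERE ingest_date = '2024-01-15'
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "curated_db"},
    ResultConfiguration={"OutputLocation": "s3://example-datalake-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows if the query succeeded.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([field.get("VarCharValue") for field in row["Data"]])
```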
Our finished Data Lake and project outcomes
The cloud solution we created meets the required data processing security standards. Objects in S3 are encrypted with KMS keys, and communication between AWS Glue and Athena takes place over an encrypted SSL/TLS channel. Athena’s access to the data is authorised and controlled by AWS Lake Formation.
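For readers who want to see what this looks like in code, the boto3 sketch below shows two of the building blocks: default SSE-KMS encryption on a bucket and a Lake Formation grant for an analyst role. The ARNs, bucket, database and table names are hypothetical placeholders, not the client's actual resources.

```python
import boto3

# Hypothetical names and ARNs, used for illustration only.
BUCKET = "example-datalake-curated"
KMS_KEY_ARN = "arn:aws:kms:eu-west-1:123456789012:key/00000000-0000-0000-0000-000000000000"
ANALYST_ROLE_ARN = "arn:aws:iam::123456789012:role/data-analyst"

# Default server-side encryption with a KMS key for every object written to the bucket.
boto3.client("s3").put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                }
            }
        ]
    },
)

# Grant an analyst role SELECT access to a catalog table through Lake Formation,
# so Athena queries are authorised centrally rather than via raw S3 permissions.
boto3.client("lakeformation").grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ANALYST_ROLE_ARN},
    Resource={"Table": {"DatabaseName": "curated_db", "Name": "crm_customers"}},
    Permissions=["SELECT"],
)
```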
All this allows the environment to deliver data to users securely and efficiently, without downtime. It is also a cheap and effective way to process large data volumes in the cloud.
What did the client get?
- A scalable solution where they only pay for the resources they actually use;
- A more secure solution that remains fully accessible to users;
- A greater awareness of where their monthly expenses come from;
- Lower data storage and processing costs;
- 30% cost savings within three months of completing the project.