Data Reply migrated the B2B vehicle manufacturer’s on-premises data lake to the AWS cloud to ensure flexible, cost-efficient analysis.
One of the main focuses of MAN Truck & Bus is on services that help fleet managers maintain, repair and manage vehicles. To make this possible, the B2B vehicle manufacturer relies on constant progress in technology, which in turn means building a firm foundation for data management.
Back in 2016, to bundle its reporting and data science initiatives, MAN Truck & Bus commissioned Data Reply to develop an on-premises data lake, which the consultants built and have run ever since. However, it wasn’t long before more innovative options emerged.
MAN Truck & Bus operates an IT landscape that is scattered across numerous departments. Data is produced and collected in a myriad of applications, databases and systems. Some of it dates back 30 years or more and comes in many different formats. The first data lake was based on Cloudera Hadoop and Apache Kafka.
But with major cloud providers in the ascendant, that approach quickly proved far less scalable and flexible than their rapidly growing services. This prompted MAN Truck & Bus to start a migration project to the AWS cloud, and once again the company relied on the expertise of Data Reply.
Data Reply received the order to set up a data lake in the cloud of Amazon Web Services (AWS). The team started by building a centralised solution for data storage and management based on Amazon’s Simple Storage Service (S3). The data stored in the Apache Hadoop Distributed File System (HDFS) was then migrated to the cloud and organised in layers, in keeping with data lake best practices. To begin with, the data was written to a landing layer, primarily by means of Kinesis and Apache NiFi and mostly in the file formats of the individual source systems. ETL pipelines then processed the data and stored it in a smaller number of selected file formats. The pipelines masked sensitive information and augmented the data with the help of a solution developed by Data Reply. The result was stored in a final layer called Datahub. Finally, various AWS accounts were given access to the individual data packets required for specific application cases. The division into multiple accounts enables costs to be matched with their respective application cases.
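The layered layout described above can be pictured as a naming convention for S3 object keys. The following is only a minimal sketch: the layer identifiers, the key structure and the function name are assumptions made for illustration and are not taken from the actual MAN Truck & Bus data lake.

```python
from datetime import date

# Hypothetical layer identifiers; the text names a landing layer and a
# final layer called Datahub, the exact spelling used in S3 is assumed.
LAYERS = ("landing", "datahub")


def object_key(layer: str, source_system: str, dataset: str,
               day: date, filename: str) -> str:
    """Build an S3 object key of the assumed form
    <layer>/<source_system>/<dataset>/<yyyy>/<mm>/<dd>/<filename>."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{layer}/{source_system}/{dataset}/{day:%Y/%m/%d}/{filename}"


if __name__ == "__main__":
    # Raw file as delivered by a (hypothetical) source system.
    print(object_key("landing", "erp", "orders", date(2021, 3, 5), "orders.csv"))
    # Masked, harmonised output of the ETL pipeline.
    print(object_key("datahub", "erp", "orders", date(2021, 3, 5), "part-0.parquet"))
```

Keeping the layer as the first key component makes it straightforward to grant access per layer, for example to the final layer only.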
MAN Truck & Bus explicitly requested that the project rely on serverless solutions, and Data Reply complied when circumstances allowed.
AWS S3 is used for data storage, together with AWS Glue for Spark-based ETL pipelines that are consolidated in Glue workflows.
Athena serves as an SQL interface. BI analysts can also use QuickSight to run SQL queries and generate reports. Data scientists are provided with their own EMR clusters, along with any other tools they require.
The infrastructure is managed via AWS CloudFormation and Sceptre.
Proprietary solutions from Data Reply calculate the necessary resources. An additional masking solution ensures that sensitive information is monitored.
Data Reply relies on a service it developed to configure Glue workflows and jobs. It starts automatically when data is uploaded to the S3 data lake. With the help of the basic configuration in AWS Systems Manager, the service calculates the optimal number of data processing units (DPUs) required to process the underlying data. This prevents a situation where too many cloud resources are requested and costs get out of hand.
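How such a sizing rule might look can be sketched in a few lines. This is not Data Reply's proprietary calculation: the function name, the throughput figure per DPU and the lower and upper bounds are assumptions chosen for illustration. In the setup described above, values like these would come from the basic configuration in AWS Systems Manager; here they are plain arguments so that the example stays self-contained.

```python
import math


def estimate_dpus(input_bytes: int,
                  bytes_per_dpu: int = 5 * 1024**3,  # assumed: 5 GiB per DPU
                  min_dpus: int = 2,                 # assumed lower bound
                  max_dpus: int = 50) -> int:        # assumed cost cap
    """Return a DPU count proportional to the input size,
    clamped to the range [min_dpus, max_dpus]."""
    if input_bytes < 0:
        raise ValueError("input_bytes must not be negative")
    needed = math.ceil(input_bytes / bytes_per_dpu)
    return max(min_dpus, min(max_dpus, needed))


if __name__ == "__main__":
    for size_gib in (1, 20, 1000):
        print(size_gib, "GiB ->", estimate_dpus(size_gib * 1024**3), "DPUs")
```

The upper bound is what keeps costs in check: even a very large upload cannot request more than the configured maximum.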
In addition, Data Reply uses AWS managed services for Redis and Elasticsearch. These systems support Data Reply’s masking solution and the functional monitoring of ETL pipelines.
The centrepiece of the data lake is the central AWS account, where the data is allocated to multiple S3 buckets based on its source system. This account is also used by the AWS Glue ETL pipelines, which prepare data packets for a myriad of application cases.
The most important preparatory step here is the masking of sensitive information, for example to comply with the GDPR. Performing these steps in the main account prevents confidentiality problems. At the same time, an additional service allows other accounts to transfer the data in plain text, provided there are legitimate reasons and appropriate authorisations.
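The text does not describe how Data Reply's masking solution works internally. As a generic illustration of field-level masking, the sketch below replaces sensitive values with a keyed hash (HMAC-SHA256), which is one common pseudonymisation technique and not necessarily the one used in this project. The field names, the key and the function names are hypothetical.

```python
import hashlib
import hmac


def mask_value(value: str, key: bytes) -> str:
    """Replace a value with its keyed SHA-256 digest (hex encoded)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, sensitive_fields: set, key: bytes) -> dict:
    """Return a copy of the record with the sensitive fields masked.
    Non-sensitive fields and missing (None) values pass through unchanged."""
    return {
        name: mask_value(str(value), key)
        if name in sensitive_fields and value is not None
        else value
        for name, value in record.items()
    }


if __name__ == "__main__":
    key = b"demo-key-do-not-use"  # hypothetical; a real key belongs in a secret store
    row = {"driver_name": "Jane Doe", "vehicle_id": "V-123", "mileage_km": 120000}
    print(mask_record(row, {"driver_name", "vehicle_id"}, key))
```

Because the same input always yields the same masked value, records can still be joined on a masked column, while the original value cannot be read without a separate, authorised lookup.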
In addition to the main account, there are also a number of accounts set up for specific application cases. In accordance with best practices for AWS cross-account access, read permission can be granted for the required data.
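One common way to grant such read access is an S3 bucket policy in the main account that names the consuming account as principal. The sketch below only builds the policy document. The bucket name, prefix and account ID are placeholders, and the text does not say whether the project uses bucket policies, IAM roles or another mechanism.

```python
import json


def read_only_bucket_policy(bucket: str, prefix: str,
                            consumer_account_id: str) -> dict:
    """Build a bucket policy that lets another AWS account list and read
    objects under a single prefix, and nothing else."""
    principal = {"AWS": f"arn:aws:iam::{consumer_account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to the shared prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values, not real resources.
    policy = read_only_bucket_policy("example-datahub", "erp/orders", "111122223333")
    print(json.dumps(policy, indent=2))
```

The consuming account still has to allow its own users or roles to call these actions; the bucket policy covers only the resource side of the grant.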
Developers of data processing applications can use their preferred technologies without making the infrastructure of the main account unnecessarily complex. That frees up time the operations team would otherwise have spent provisioning and maintaining applications such as these.
calibrate the AWS cloud services to individual standards
Serverless solutions make it possible to use the key benefits of the AWS cloud.
Data Reply keeps an eye on all the collected data and can manage access on a very granular scale. This allows end users at the company to focus on creating added value for the business, rather than on arduously collecting data from a multitude of systems in a variety of file formats.
Although Data Reply offers templates and additional support to the data scientists and analysts at MAN Truck & Bus, it is ultimately up to each user to decide which technologies they want to use for their application case. Data Reply makes the data available in modern, widespread formats such as Parquet or Avro.
Sensitive information is protected automatically.
MAN Truck & Bus is one of Europe’s leading commercial vehicle manufacturers and transport solution providers, with more than 9.5 billion euros in annual sales (2020). Its product portfolio includes transporters, trucks, buses, diesel and petrol engines and services connected with the transport of passengers and goods. MAN Truck & Bus is a company of TRATON SE.
As part of the Reply Group, Data Reply helps customers work in a data-driven manner. Data Reply operates in a variety of industries and business segments and works closely with customers to help them use their data effectively so that they can obtain substantive results. To this end, Data Reply focuses on the development of data analytics platforms, machine learning solutions and streaming applications that are automatic, efficient and scalable and that do not compromise on IT security.