Case Study: Using AWS-based machine learning for automated financial transaction classification
One of the projects we completed for a client in the financial sector involved the deployment of a fully configurable tool for managing sales processes and offering banking products. Thanks to a cloud-based machine learning model, the new, expanded system can now automatically assign categories to incoming payments in clients’ bank accounts. Below, we explain, step by step, how the solution was built and which AWS services we used.
Machine learning models are used with increasing frequency in banking software development, especially in the area of automated inflow category assignment.
They rely on classification algorithms that can be trained on large data sets to assign specific categories to specific transaction types. After a period of training, the model will be fit for use in automated category assignment to new transactions.
When its classifications are precise and accurate, a machine-learning model can streamline the process of categorising bank account inflow operations, which can save you a lot of time and resources. However, a model that is not well-trained, or is trained on a non-representative dataset, will be prone to categorisation errors.
Read on to learn about a solution we created for our client from the banking sector. The article describes the problems you need to address before the system can work smoothly and quickly, without generating unnecessary AWS costs.
What was the challenge?
Our client wanted the system to use machine learning technology to automatically assign categories to cash inflows coming into its clients’ bank accounts.
The category appropriate for each operation was to be inferred from fields describing each individual cash transfer, such as the purpose, amount or sender’s name. This information could also be supported by data from other inflow transactions recorded in the same bank account, which meant categorisation could rely on various historical data and indicators related not only to the transaction at hand, but also to the account as a whole. Our first task in the project was to divide cash inflows into 9 separate categories, such as salaries, bonuses or social insurance benefits. At the same time, we had to make sure the list of categories could be easily expanded in the future.
Machine-learning model
Category assignment is best done through batch processing: the system is fed with data on operations performed on a certain set of bank accounts over a certain period of time and applies an appropriate category tag to each record, returning a new data set at the output. The whole process should be fully automated and easy to launch by the staff. Because we wanted the decision-making core of the system to be based on machine learning, we also had to work out a practical model training process. The training would be based on data provided by the Bank, which included a history of inflows coming into its clients’ accounts and their assigned categories. We wanted to make sure our client would be able to easily train the model on their own.
Why the AWS cloud?
The client wanted a cloud-based solution. We chose Amazon Web Services because it provides an easily available machine-learning service (Amazon SageMaker) and useful ETL tools for pre-processing bulky datasets. Because the solution was to be hosted outside the client’s infrastructure, one of the key issues was data security. It hardly needs emphasising that client account activity data are sensitive, and a possible leak would create many opportunities for their misuse. This is why it was fundamental to introduce special safeguards, especially for the two-way data transfer between the cloud environment and the bank’s systems, as well as within AWS itself (both at rest and in transit).
Other important requirements included: easy deployment in the target AWS account, secure management through appropriate policies and authorisations, and system audits.
Solution architecture overview
Functionally, the system consists of two components: a process that assigns categories to bank operations using a machine-learning model and a process that trains a new model based on previously-categorised data.
From a logical point of view, the processes can be broken down into the following steps.
1. Categorisation process:
- Pre-filtering and normalising operation records;
- Converting data, record by record, into a numerical format expected by the ML model (pre-processing);
- Performing an inference (categorisation) based on processed data;
- Processing results;
- Adding the resulting categories to the original dataset, keeping the records that were filtered out at the beginning.
2. Model training process:
- Pre-filtering and normalising operation records from the training dataset;
- Building a dictionary based on the text fields in the operation records;
- Converting data in the same way as in the categorisation process, but including already-assigned categories;
- Dividing the processed data into subsets intended for training, validation and testing;
- Training the model using the processed data;
- Registering the model and making it available for use in the categorisation process.
Importantly, we had to design each of these processes in a way that would allow various AWS services to be invoked and run over long periods, without generating unnecessary orchestration costs.
Using AWS Step Functions
AWS Step Functions proved a perfect match for our requirements. The central idea of the service is the state machine, which can be understood as a diagram in which each node describes an action carried out by an AWS service triggered for that purpose, while the edges define the transitions between actions. When triggered, each action receives an object, known as the state, as its input and generates a new output state object that is passed on to the next action in the sequence. This pattern allows you to run long processes with minimal resources.
Another advantage of Step Functions was that the state machine can be easily controlled from the AWS console, which enables process monitoring (a very helpful presentation of the machine as a diagram), but also provides access to its history.
These factors made AWS Step Functions a natural choice, and both processes, categorisation and training, were implemented as state machines. Launching a process boils down to starting the correct machine and feeding it a set of input parameters.
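As a rough illustration, a run can be started with a single API call; the state machine ARN and the input parameter names below are placeholders, not the production values:

```python
import json
import boto3

# Minimal sketch: starting a categorisation run programmatically.
sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:categorisation",
    name="categorisation-2023-01-run",  # unique execution name
    input=json.dumps({
        "input_prefix": "s3://main-bucket/input/2023-01/",    # operations to categorise
        "output_prefix": "s3://main-bucket/output/2023-01/",  # where results should land
        "model_name": "inflow-classifier-v3",                 # registered SageMaker model
    }),
)
print(response["executionArn"])
```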
AWS Step Functions and orchestration
The downside is that as an orchestration tool, Step Functions will only provide you with a backbone in which the steps of the process are embedded. To carry out any action, even the simplest one, you need to use a dedicated AWS service. The services we used specifically in this project are described below.
Data storage: Amazon S3
The first thing we had to take care of was data storage. Because of their large volume, exchange format (CSV), and need for easy access from various AWS services, we decided in favour of Amazon’s Simple Storage Service, or S3 in short. Apart from simplicity and versatility, its many advantages include low cost and data security safeguards using encryption keys (KMS encryption service).
The main S3 bucket is where all the data going through the system eventually end up; this includes input data (provided by the client), working data and output data. When the process is triggered, their location within the bucket can be described with corresponding parameters. S3 also contains saved scripts and all other necessary process resources.
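For illustration, this is roughly what writing an input file to the bucket with KMS-based server-side encryption looks like; the bucket name, object key and KMS alias are placeholders:

```python
import boto3

# Sketch: upload an input CSV file with server-side encryption using a KMS key.
s3 = boto3.client("s3")

with open("operations_2023_01.csv", "rb") as f:
    s3.put_object(
        Bucket="main-bucket",
        Key="input/2023-01/operations_2023_01.csv",
        Body=f,
        ServerSideEncryption="aws:kms",           # encrypt the object at rest with KMS
        SSEKMSKeyId="alias/transaction-data-key", # customer-managed key (placeholder)
    )
```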
Data processing: AWS Glue
As we mentioned earlier, our input data consist of the history of bank operations in a CSV format. Each row of the file describes one incoming transaction, including fields such as: unique transaction ID, sender’s data, recipient’s data, transfer purpose and amount, etc. Each single dataset consists of many CSV files and covers a complete history of operations for a set of bank accounts over a selected time period (e.g. 6 months). There are no specific requirements as to the distribution of operation records within individual CSV files or their sorting order. Generally speaking, what we have is a large volume of tabulated data with a well-defined structure, all contained in CSV files. A perfect solution for processing data of this type is Apache Spark, available in the AWS environment e.g. as part of AWS Glue, which is the service we decided to use.
What is AWS Glue?
AWS Glue is a large set of tools to support ETL processes on large volumes of data, allowing you to hook up to various data sources. For the purposes of our project, we selected only two Glue functionalities: Glue jobs and Glue Data Catalogue.
AWS Glue jobs
Glue jobs are de facto programs running in the Spark environment (in our case: PySpark); the environment is managed automatically to ensure maximum scalability and built-in libraries provide support tools. In addition, Glue also provides easy, two-way access to data in AWS-specific sources, which was important for our project, since we stored all our data in S3.
Glue jobs were used for all the data processing steps involved in categorisation and training processes:
- Filtering out, normalising, cataloguing data;
- Building file dictionaries;
- Converting data to formats expected by the model;
- Processing categorisation results;
- Preparing final datasets (with categories);
- Dividing training data into subsets.
Glue jobs read S3 input data, process (or aggregate) them and save the results in a separate S3 folder. Data obtained in the cataloguing step go to S3 through the Glue Data Catalogue.
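Below is a much simplified sketch of what such a job can look like; the job arguments, column names and filtering rules are illustrative rather than taken from the production code:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Resolve job parameters passed when the Glue job is started.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV operation records from S3.
operations = spark.read.csv(args["input_path"], header=True, inferSchema=True)

# Drop records that cannot be categorised (e.g. missing transfer purpose)
# and normalise the text fields before further processing.
normalised = (
    operations
    .filter(F.col("purpose").isNotNull())
    .withColumn("purpose", F.lower(F.trim(F.col("purpose"))))
    .withColumn("sender_name", F.lower(F.trim(F.col("sender_name"))))
)

# Save the intermediate result to a working prefix in S3.
normalised.write.mode("overwrite").parquet(args["output_path"])

job.commit()
```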
Jobs had to be prepared so that they would be easy to trigger locally, within the Glue image delivered by Amazon, which was a bit of a challenge. The ability to start jobs locally enabled easy (and cost-free) code testing and local data pre-processing as we worked on the architecture of the model.
ML model: Amazon SageMaker
The heart of the solution is the machine-learning model tasked with assigning categories to inflow operations. All the other components can be considered its packaging, which ensures the system’s “communication with the world”. The ML model is crucial for both processes of the system.
The machine learning model was created in Amazon SageMaker. The model as we understand it here is an instantiation of an inference model registered in SageMaker, which, above all, consists of a launch environment Docker image (one of the images provided by AWS) and a link to an S3 file with a trained and saved model. The file contains both the parameters learned during training and the executable code that SageMaker will run for the purposes of the inference process within the Docker image.
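As an illustration, registering such a model instantiation can be done with a call along these lines; the image URI, artefact path and IAM role are placeholders:

```python
import boto3

# Sketch: register a trained model artefact as a SageMaker model.
sm = boto3.client("sagemaker")

sm.create_model(
    ModelName="inflow-classifier-v3",
    PrimaryContainer={
        # TensorFlow serving image provided by AWS (region and version are illustrative).
        "Image": "763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.11-cpu",
        # Archive with the saved model and inference code, produced by the training job.
        "ModelDataUrl": "s3://main-bucket/models/inflow-classifier-v3/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/sagemaker-execution-role",
)
```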
Model categorisation
To perform categorisation, you need to launch a Batch Transform job in SageMaker with the name of the model as a parameter. Other parameters include a path to the operation data catalogue in S3 and a selected save path for the results. The job is launched from the Step Functions state machine responsible for category assignment.
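A sketch of such a request, made directly through the SageMaker API; all names, paths and instance sizes are placeholders:

```python
import boto3

# Sketch: run Batch Transform over pre-processed operation vectors.
sm = boto3.client("sagemaker")

sm.create_transform_job(
    TransformJobName="categorisation-2023-01",
    ModelName="inflow-classifier-v3",
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://main-bucket/work/2023-01/vectors/",  # pre-processed records
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",  # each line is one operation vector
    },
    TransformOutput={"S3OutputPath": "s3://main-bucket/work/2023-01/predictions/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```

In the state machine itself, the same job can be started through the native Step Functions integration with SageMaker, which also waits for the job to complete before moving on to the next step.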
Model training
Model training also requires launching a dedicated job in SageMaker. This involves the other Step Functions machine and this time, the job is a Training Job. The list of parameters here is longer than that required for inference, and includes, e.g. model hyperparameters, paths to individual subsets of training data (training, validation, testing), a link to the training script and the Docker image address to launch the script.
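For illustration, the request might look roughly like this; the hyperparameters, channel names, image and paths are placeholders consistent with the other sketches:

```python
import boto3

# Sketch: start a SageMaker Training Job for the classifier.
sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="inflow-classifier-training-2023-01",
    AlgorithmSpecification={
        "TrainingImage": "763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.11-cpu-py39",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/sagemaker-execution-role",
    HyperParameters={"epochs": "20", "batch_size": "1024", "learning_rate": "0.001"},
    InputDataConfig=[
        {
            "ChannelName": name,
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://main-bucket/work/training/{name}/",
            }},
        }
        for name in ("training", "validation", "testing")
    ],
    OutputDataConfig={"S3OutputPath": "s3://main-bucket/models/"},
    ResourceConfig={"InstanceType": "ml.m5.2xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 6 * 3600},
)
```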
When the training ends, resulting data are saved to the appropriate location in S3 and a new instantiation of the model is registered in SageMaker. Files saved in S3 include not only model data but also files with information on the course of training and the efficacy of the model, as measured on individual subsets.
Training script
At the heart of the training job in SageMaker is the training script. The script is responsible for building the model, training it iteratively and saving the generated model code along with the newly learned parameters. Optionally, it may test the model on a data subset isolated from the training set. The results of the test are saved and ultimately moved to S3 for analysis. During training and testing, metrics that provide a “live” peek into the changing efficacy of the model are saved to CloudWatch logs. Two metrics are particularly important here: accuracy, i.e. the percentage of correct classifications, and the value of the loss function, i.e. the mean error size (the closer it is to 0, the better). A pair of these metrics is published for each of the three subsets.
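The fragment below sketches how such a script can be structured under SageMaker’s script-mode conventions; `load_dataset` and `build_model` are hypothetical helpers standing in for the actual data-loading and model-building code:

```python
import os

# SageMaker exposes the data channels and model directory through
# SM_CHANNEL_* / SM_MODEL_DIR environment variables (script mode).
train_ds = load_dataset(os.environ["SM_CHANNEL_TRAINING"])    # hypothetical helper
val_ds = load_dataset(os.environ["SM_CHANNEL_VALIDATION"])
test_ds = load_dataset(os.environ["SM_CHANNEL_TESTING"])

# Hypothetical helper, assumed to return a compiled Keras model
# with an accuracy metric.
model = build_model()

# Anything printed or logged here ends up in the CloudWatch logs of the
# training job, which is how the "live" accuracy/loss metrics become visible.
model.fit(train_ds, validation_data=val_ds, epochs=20)

test_loss, test_accuracy = model.evaluate(test_ds)
print(f"test_loss={test_loss:.4f} test_accuracy={test_accuracy:.4f}")

# Save the trained model where SageMaker expects it; it is then packaged
# into the model.tar.gz artefact uploaded to S3.
model.save(os.path.join(os.environ["SM_MODEL_DIR"], "1"))
```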
To use a trained model for categorising operations, you just need to indicate it when triggering the categorisation process.
We also added an option for retraining an existing model on a new portion of data. In this case, the training script will load a previously-saved model and then continue with the standard training process. The files of the retrained model will be saved separately in S3 and a new instantiation of the inference model will be registered in SageMaker.
Process support: AWS Lambda
As an orchestration tool, Step Functions does not use an elaborate logic that would go beyond the conditional execution of selected diagram branches and simple state transformations. You cannot, for example, add new fields to the state, let alone operate on S3 files. To perform such actions, the state machine must be connected to an AWS service that allows any possible code to be executed. A good tool of this kind is AWS Lambda, a service that helps execute predefined, short-lived, serverless functions.
We used Lambda functions for the following purposes, which differed slightly for each of the two processes:
- Validating input parameters and preparing the initial state of the Step Functions machine;
- Preparing work files in S3;
- Cleaning work data in S3 and Glue Data Catalogue after the successful completion of the process.
In the training process, the Lambda function is also used to collect the newly generated files in a single catalogue.
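A minimal sketch of the parameter-validation function; the field names mirror the input parameters shown earlier and are illustrative:

```python
import datetime


REQUIRED_FIELDS = ("input_prefix", "output_prefix", "model_name")


def handler(event, context):
    """Validate input parameters and build the initial state object."""
    missing = [field for field in REQUIRED_FIELDS if not event.get(field)]
    if missing:
        # Failing the Lambda fails the Step Functions execution early,
        # before any Glue or SageMaker resources are started.
        raise ValueError(f"Missing required parameters: {', '.join(missing)}")

    run_id = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    return {
        **event,
        "run_id": run_id,  # used to name work folders and jobs
        "work_prefix": f"{event['output_prefix']}work/{run_id}/",
    }
```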
Solution monitoring: AWS Config and AWS CloudTrail
Two services were used to control solution security:
1. AWS Config
A service that scans the configuration of the services you use against selected security rules defined in Config. The resulting scan report allows you to quickly check whether all the services are correctly configured and to identify any issues you still need to address.
For example, one of the AWS Config rules allows you to check whether all S3 buckets have default data encryption enabled and enforce secure data exchange between S3 and other services. It is a good idea to perform regular service configuration checks, because the configuration of AWS services may change over the application’s life cycle (e.g. with the deployment of new upgrades).
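As an illustration, the results of such a check can also be queried programmatically; the rule name below is one of the AWS managed Config rules and is used here only as an example:

```python
import boto3

# Sketch: list resources that fail an S3 encryption Config rule.
config = boto3.client("config")

result = config.get_compliance_details_by_config_rule(
    ConfigRuleName="s3-bucket-server-side-encryption-enabled",
    ComplianceTypes=["NON_COMPLIANT"],
)

for evaluation in result["EvaluationResults"]:
    qualifier = evaluation["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
    print(qualifier["ResourceType"], qualifier["ResourceId"])
```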
2. AWS CloudTrail
Another service used for security control, AWS CloudTrail, collects data on any changes in service configuration, API service triggers and successful and unsuccessful authorisation attempts.
During configuration, you will need to decide which events you want to monitor, because these are the data the service will collect. As a result, if a service stops working or does not work as expected, you will be able to find out who modified its configuration. It will also let you know who tried to log in to the services or downloaded data from S3.
Together, these two simple services help ensure that the application is secure and that the events within it are monitored.
Data pre-processing
As a rule, machine-learning models can only work with numerical data, which means that for categorisation to be possible at all, each operation record must be converted into a sequence of numbers (a vector) that encodes as much information as possible to facilitate correct category assignment. While fields such as the transaction date or amount are easily converted into numbers, text fields, such as the transfer purpose or the sender’s name, pose more difficulties. How can you obtain a numerical vector for a key word in the text, considering that it may appear in different inflectional forms and contexts?
Natural Language Processing in machine learning
This is the basic problem you have to address when implementing a machine learning solution in the field of Natural Language Processing, or NLP. To solve it, you can fall back on various text vectorisation techniques. In our project, we used a relatively simple solution, which detects the most frequent words (including categories) and marks their occurrence with a non-zero value at specific vector positions. Of course, at the dictionary-building stage, the words are processed, e.g. to neutralise different forms of the same word.
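Stripped of the PySpark plumbing it actually runs in, the idea looks roughly like this; the normalisation step here is deliberately crude and only stands in for the fuller word processing mentioned above:

```python
from collections import Counter


def normalise(token: str) -> str:
    # Crude stand-in for real normalisation: lower-case and truncate
    # to collapse inflected forms of the same word.
    return token.lower()[:6]


def build_dictionary(texts: list[str], size: int = 1000) -> dict[str, int]:
    # Keep the most frequent normalised words and assign each a vector position.
    counts = Counter(normalise(t) for text in texts for t in text.split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(size))}


def vectorise(text: str, dictionary: dict[str, int]) -> list[float]:
    # Mark the occurrence of each dictionary word with a non-zero value.
    vector = [0.0] * len(dictionary)
    for token in text.split():
        idx = dictionary.get(normalise(token))
        if idx is not None:
            vector[idx] = 1.0
    return vector


dictionary = build_dictionary(["Salary for January", "ZUS benefit payment"])
print(vectorise("January salary transfer", dictionary))
```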
Vectorised text field values are an important but not the only part of the operations vector. The vector also includes a number of extra fields that provide a numerical representation of the transfer transaction and the recipient’s account. For the latter, we took into account a number of characteristics describing cash inflows, including historical indicators closely related to the analysed operation, which can tell us, e.g. to what extent the transfer is repetitive (e.g. a salary, payable every month in the same amount).
Data pre-processing turned out to be the most time-consuming step in both processes. We reduced its duration by running Glue jobs in parallel and by optimising the code. The nature of the generated data was particularly important here: the records contained a very large number of fields (several thousand).
Model architecture and implementation
The model was implemented using the TensorFlow library with the Keras API, because the tool was convenient and we already knew it well; in addition, SageMaker offers a number of machine-learning Docker images that use this particular library (in CPU and GPU versions).
The model as such was based on a feed-forward multilayer neural network, which takes records of operations converted into numerical vectors as its input and returns a vector whose size corresponds to the number of categories and whose individual values can be understood as membership coefficients for the successive categories.
Because the network is fed with so-called sparse vectors, with a large number of zero values, the role of the first layers of the model is to compress the input vector to an acceptable size (several hundred values). The system of subsequent layers was designed in a way that allows the model to optimally identify the interrelationships that may correspond to membership in a specific category.
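A schematic Keras version of this kind of architecture is shown below; the layer sizes, input width and number of categories are placeholders rather than the parameters of the production model:

```python
from tensorflow import keras

INPUT_SIZE = 4000      # width of the sparse operation vector (placeholder)
NUM_CATEGORIES = 9     # one output value per inflow category

model = keras.Sequential([
    keras.Input(shape=(INPUT_SIZE,)),
    # The first layers compress the sparse input down to a few hundred values.
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    # Subsequent layers model the interrelationships between features.
    keras.layers.Dense(128, activation="relu"),
    # Output: one membership coefficient per category.
    keras.layers.Dense(NUM_CATEGORIES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # the loss reported during training
    metrics=["accuracy"],             # the accuracy reported during training
)
model.summary()
```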
We tested more than a dozen different variants of the architecture to pick one that was simple but also highly effective (with a lower risk of overfitting, which reduces the model’s efficacy when applied to real datasets). This model architecture was implemented within the training script described previously.
The quality of the final machine-learning model
Our production model, or more specifically, its first iteration, was trained on a dataset that included operations on 100 thousand bank accounts over a period of 15 months, reduced by a testing set containing c. 10% of transactions from the final 12 months of the period in question. In total, the training used approximately 6.5 million records.
The efficacy of the model for the training set ranged from 98.9% to 99.2%. Importantly, most errors stemmed from the fact that the model “detected” irregularities in the categories assigned to specific operations in the training set provided by our client.
Solution deployment
For quick and automated installation and upgrades of the machine-learning solution on several different AWS accounts, we used HashiCorp’s Terraform, which provides an alternative to Amazon’s CloudFormation service. The solution’s infrastructure is described in configuration files containing declarations of all the necessary AWS components in Terraform’s configuration language. The configuration also covers files that have to be saved in S3 for the solution to work (e.g. Glue job scripts). Values such as the target account ID or region are read from variables, which enables easy deployment in any AWS account, as long as you have the necessary permissions.
The deployment as such boils down to running a single Terraform command, which syncs the state of the AWS account with the contents of the configuration files and locally stored resources (packaged Lambda functions, files uploaded to S3 and others).