I began my journey at Paytm back in 2015, before moving to join the Paytm Labs team in 2017. The project I've written about here is Homepage Personalization on the Paytm app, which is used by over 400 million users in India. My co-worker, J, and I carried out this project over the past year.
- Santoshi, Machine Learning Engineer.
With the scale at which Paytm operates, we generate a huge amount of rich data. This calls for leveraging machine learning techniques to help us understand and act on the story that data tells us. Being a Machine Learning Engineer at Paytm Labs means getting to work on interesting machine learning use cases like Personalisation, Adtech, Risk Modelling, Fraud Prevention, etc. The two of us MLEs focused this project on personalization, since it provides the most value to our users.
The homepage of the Paytm app and website is a collection of hundreds of different services that Paytm provides, which range from a small task like topping up your mobile talk time balance to something huge like buying health insurance. Each of these services has its own separate page, which can be navigated to from the homepage using icons (highlighted by the red box) or banners (highlighted by the pink box).
Having a personalized homepage not only means that our users get easy access to the services they want at any given time but also that users discover offers and discounts that are relevant to them.
Category level personalization
Since the icons and banners on the homepage each represent a service category, our model works at a category level, i.e., for each customer we determine which category icons and banners should be shown on their homepage.
Our personalization model has two basic elements: a measure of what a particular user is interested in, and a measure of the general interest of all users on the platform. Combining these two signals lets us show categories that a user has shown particular interest in (a category affinity model) along with content that is trending (the general perception of the category). We call this model “Bayesian Trend Filtering” because we build it on the Bayesian framework.
The input to the model is the interaction between users and content in the app. There are various types of interactions that can be modeled but to simplify let’s say we are considering a click in a category’s icon or banner as an interaction.
Assuming that clicks reflect user interest in a particular category, Bayes' rule gives:
p(click | category=c) = (p(category=c | click)p(click)) / p(category=c)
- p(category=c | click) is the probability that a user’s click falls in category c. It can be estimated from the click distribution observed over a specific time period.
- p(category=c) is the probability of a banner or icon being about category c.
One possible interpretation is to treat this as the proportion of global interest in a specific category in a given time period. For instance, on some days in a summer month the global trend could be BBQ accessories, and the prior probability should reflect this. Another interpretation of this prior is simply the proportion of banners that are about category c in the time period. For instance, in a slider, a fraction x of the banners is about category c. We can compute this probability directly from the banner data for that period, or we can approximate it with the click distribution of all customers.
- p(click) is the prior probability of the user click, regardless of the category.
Bayesian Trend Filtering
Ultimately we would like to estimate the click distribution of the user in the future. Therefore we want to model:
pnext(category=c | click) = pnext(click | category=c) pnext(category=c) / pnext(click)
Note that we can estimate pnext(click | category=c) = p(click | category=c) by assuming that user interest doesn't change drastically over a small window of observation. We can also assume that pnext(click) = p(click). Substituting the first equation for p(click | category=c), the p(click) terms cancel and we are left with:
pnext(category=c | click) = p(category=c | click) pnext(category=c) / p(category=c)
Note that pnext(category=c) is the trend or popularity component, which is estimated from the click distribution of all customers in a given time window. If pnext(category=c) = p(category=c), the above equation simplifies even further: the next click is driven purely by user interest and not by the trend.
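To make the final equation concrete, here is a minimal Python sketch of how the score could be computed from raw click logs. It is an illustration under stated assumptions rather than our production code: the function names, the toy click lists, and the add-one smoothing (used only to avoid dividing by zero for unseen categories) are all hypothetical.

```python
from collections import Counter


def distribution(clicks, categories, smoothing=1.0):
    """Turn a list of clicked categories into a smoothed probability distribution.

    The add-one style smoothing is an assumption of this sketch.
    """
    counts = Counter(clicks)
    total = len(clicks) + smoothing * len(categories)
    return {c: (counts[c] + smoothing) / total for c in categories}


def bayesian_trend_scores(user_clicks, past_global_clicks, recent_global_clicks, categories):
    """Score each category as p(c | click) * pnext(c) / p(c), then normalize."""
    user_interest = distribution(user_clicks, categories)        # p(category=c | click)
    past_trend = distribution(past_global_clicks, categories)    # p(category=c)
    next_trend = distribution(recent_global_clicks, categories)  # pnext(category=c)
    raw = {c: user_interest[c] * next_trend[c] / past_trend[c] for c in categories}
    norm = sum(raw.values())
    return {c: raw[c] / norm for c in categories}


# Toy data (hypothetical): one user's clicks and two windows of global clicks.
categories = ["recharge", "movies", "insurance"]
user_clicks = ["recharge"] * 3 + ["movies"]
past_global_clicks = ["recharge"] * 5 + ["movies"] * 3 + ["insurance"] * 2
recent_global_clicks = ["recharge"] * 3 + ["movies"] * 6 + ["insurance"] * 1

scores = bayesian_trend_scores(user_clicks, past_global_clicks, recent_global_clicks, categories)
ranked = sorted(categories, key=scores.get, reverse=True)
print(ranked)
```

In this toy data the user clicked recharge most often, yet movies ranks first because its share of global clicks rose sharply in the recent window. When the two global windows are identical, the scores reduce to the user's own click distribution.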
Why the Bayesian framework?
At Paytm, we believe in building solutions that are simple first and then iteratively adding more complexity. The two of us decided to use such a framework as it's very simple to formulate and implement. This helps us to quickly come up with a good baseline model that can be improved later on.
The formulation is easy to understand, and the resulting predictions are interpretable: since each prediction is just a multiplication of user interest and trend, we can see how much each component contributes.
The model is also very fast: the probabilities are cheap to compute, so predictions can be made in real time too.
This framework also allows us to add more complexity by replacing the individual components with more sophisticated machine learning models or distributions than simple counts.
If you’re a Machine Learning Engineer who loves solving the complex problems that come with a scaling company, you just might be the right person to join our nimble team. Our Machine Learning Engineers work on interesting use cases including Personalisation, Adtech, Risk Modelling, Fraud Prevention, and more. Think you have what it takes? Apply here!
Mule Account Detection
Hana, Cyber Analyst
The problem of mule account detection was first introduced in January 2020. A mule account is an account that is used to remove money from the system and that is intended not to be associated with the criminal. Since many types of fraud use mule accounts, detecting these accounts can hinder the entire fraud landscape.
Paytm Bank needed help to identify mule accounts in their system, and ideally block new mule accounts at the time of creation. So, some of the Fraud Prevention team members at Paytm Labs traveled to India to get as many details as possible about the problem and brainstorm ideas to detect and prevent mule accounts.
Paytm Bank already had multiple rules for detecting mule accounts. However, the performance of the bank's mule account detection mechanism was quite poor: roughly 20% precision and 50% recall, meaning that about 80% of the accounts detected as 'mule' by their approach were false positives and only half of the mule accounts were identified. Out of 90M accounts at that time, only ~120K had been labeled fraudulent; therefore, the assumption was that the actual number of fraudulent accounts was higher and the initial labeling done by Paytm Bank was not accurate.
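As a quick refresher on how these two metrics are computed, here is a short Python sketch. The counts are hypothetical and were chosen only so that they reproduce the roughly 20% precision and 50% recall quoted above; they are not Paytm Bank's actual figures.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from confusion-matrix counts."""
    flagged = true_positives + false_positives        # everything the rules flagged
    actual_mules = true_positives + false_negatives   # every real mule account
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual_mules if actual_mules else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(
    true_positives=500,    # flagged accounts that really were mules
    false_positives=2000,  # flagged accounts that were legitimate
    false_negatives=500,   # mule accounts the rules missed
)
print(f"precision={precision:.0%}, recall={recall:.0%}")
```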
The problem presented to the Paytm Labs Fraud Prevention team was to provide a more effective solution for detecting mule accounts. That is to say, we were supposed to come up with a feature set that would capture the prominent characteristics of mule accounts and then explore different machine learning models to find one that would achieve a higher recall while not sacrificing precision. The reason for emphasizing a model with higher recall over one with higher precision was that our response could be scaled to our confidence: if we were 20% sure an account was a mule account, the account could be further investigated or limits could be imposed on it. If we were 95% sure, the account could be blocked immediately. Thus, we decided that the focus of our ML model would be on minimizing false negatives.
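The tiered response described above can be sketched in a few lines of Python. The 20% and 95% cut-offs come from the text; the function name, the action labels, and the choice to treat the cut-offs as inclusive are assumptions made for this illustration.

```python
def action_for_score(mule_probability, review_threshold=0.20, block_threshold=0.95):
    """Map a model score (probability of being a mule account) to an action.

    The action labels are hypothetical names used only in this sketch.
    """
    if not 0.0 <= mule_probability <= 1.0:
        raise ValueError("mule_probability must be between 0 and 1")
    if mule_probability >= block_threshold:
        return "block"
    if mule_probability >= review_threshold:
        return "investigate_or_limit"
    return "allow"


for score in (0.05, 0.40, 0.97):
    print(score, action_for_score(score))
```

Because low-confidence detections lead only to review or limits rather than an outright block, a false positive is relatively cheap, which is why recall can be prioritized.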
Our goal was to use machine learning to create models and then benefit from our in-house rule engine - Maquette - to build rules on top of our machine learning model. This would help us monitor accounts on a daily basis, and for each account, our ML model would come up with a score which would be the probability of being a mule account.
We explored different ML algorithms, from Random Forest Regressor to Gradient Boosting Classifier and Neural Networks. We worked hand in hand with the Fraud Prevention team in India to come up with the best set of features, and we did lots of feature pruning in the process. We ran a large number of experiments to determine the right combination of features as well as the ML algorithm that would work best for our scenario. We used precision, recall, and PR and ROC curves to evaluate the performance of each version of the model. It was a challenging yet very exciting journey!
As the next step, we wanted to make sure we could proactively stop mule account activity on our platform. The offline model that we set up operated retroactively and looked at all suspicious account activity during the past day. Such a mode of operation would allow us to block fraudulent accounts from causing further damage but wouldn’t prevent or minimize the initial damage. Therefore, we decided it was time to set up a real-time model that would proactively monitor account activity on the platform and prevent potentially fraudulent transactions from happening. For each transaction, the real-time model would be able to report a score at the time of the transaction and decide whether to allow or block it.
Our current real-time model is a simpler version of our existing offline model: rather than the full feature set, it uses 17 features. The model uses a Random Forest Classifier with a class weight of 1 to 5 for non-mule vs mule accounts, and has been trained on 1.5M and verified on 700K customer accounts. This model is currently up and running in production, with a precision of 65.1% and a recall of 74.1%. Although the current model flags mule accounts far more effectively than previous versions, we know there is still room for improvement and we can always incorporate “smarter” features to prevent fraud!
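To give a feel for what a 1 to 5 class weight does, here is a small Python sketch of one common way class weights enter a tree-based model: each account's contribution to a node's impurity is multiplied by the weight of its class. This post does not state which library or split criterion is used in production, so the Gini criterion, the function name, and the node counts below are assumptions for illustration only.

```python
def weighted_gini(class_counts, class_weights):
    """Gini impurity of a tree node after scaling each class count by its weight."""
    weighted = {label: count * class_weights[label] for label, count in class_counts.items()}
    total = sum(weighted.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((value / total) ** 2 for value in weighted.values())


# Hypothetical node: mule accounts are the rare class.
node = {"non_mule": 90, "mule": 10}

unweighted = weighted_gini(node, {"non_mule": 1, "mule": 1})
weighted = weighted_gini(node, {"non_mule": 1, "mule": 5})
print(f"unweighted={unweighted:.3f}, weighted 1:5={weighted:.3f}")
```

With equal weights this node already looks fairly pure, so a tree has little incentive to split it further. With the 1 to 5 weighting, the ten mule accounts count as fifty, the node looks much more mixed, and the tree is pushed to keep separating the rare class, which tends to favor recall.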