Architecting for global real-time fraud prevention with a performant data platform
Nick Blievers:
Hi, everyone. I'm Nick Blievers, VP of Engineering at ThreatMetrix. With all the events that are going on in the world, I'm unable to travel, as I think is the case for a lot of us. So I figured I'd at least start the slideshow off with some pictures of the two cities where my team is located: San Jose and Sydney. I'm assuming you might've picked up that it's Sydney, but maybe not so much with San Jose. And today I would like to talk to you about our global real-time fraud prevention product and how we base that on a highly performant data platform.
So starting out, who are we? What is it we do? We are a company that provides risk analysis for online transactions. Our customers come to us if they have any kind of transaction, and I say transaction, but it can be a shopping cart, it could be a login scenario, it could be a new account creation. It could be anything really where there is a risk that someone might be doing something fraudulent and they want to know whether they should trust this event or not. And I think the easiest way to show what we do is to have a little demo, and we have a demo site specifically for this purpose. It's a little bit contrived, but that's the nature of these things. So a customer might have a website, and that website might allow account creation, and once that account creation page is loaded, we have some profiling tags that will collect device data on page load. Obviously on a real website you wouldn't have a little pop-up saying we're doing this, but it makes a nice demo.
This particular demonstration website has some pre-canned data that we can fill out. This scenario is, as you can see, location and device spoofing. It's a fraudulent sign-up event. And this is a scenario that many of our customers face where they have a new account sign-up; by definition, at that point they don't really know anything about that specific user and they need some way of ascertaining the risk of that. Now, on a real website you obviously wouldn't click create an account and be shown all of this information, but this demonstrates the outcome of the policy that is executed, which that website, that vendor, might use to choose the next course of action for that particular customer journey. So we have a bunch of information here. The first thing, the most important thing, is the review status. Sorry, the policy is executed, it's been assigned a score of negative 75, which is not a very good score, and it's been given a review status of reject. In a real-world scenario, if you got a reject, you would probably just deny it.
It would have to be pretty bad, but that's viable. It may also fall into the review bucket, where it's bad but it's maybe a real customer and you might want to take some kind of action at that point, whether that's a step-up scenario or something else; in a new account creation scenario, maybe you do some kind of verification over the phone or whatever. It depends on the customer. Beyond just the policy score, we have specific user information, and there's a little bit of interesting data here. We have an account name, we have an email address. We've managed to use those pieces of information, combined with some of the other information we've got, to generate a digital ID. This is an identifier that represents a collection of attributes that are seen together that perhaps represents an entity.
There's some interesting things here with the first seen and last event dates. These indicate that we have actually seen this email address before, but because the first seen and the last event are at the same time, we've possibly only seen this email address once before. And I don't know when this demo was actually taken, so these dates were quite possibly closer to when the demo was recorded. So maybe we've only seen this email address in the last week or so. The account name is the same story: we've only seen it once. That doesn't give a great deal of confidence that it's a real name or a real email address.
The age of the email address is quite an important piece of information. If an email address is newly created and has no history associated with it, we generally don't assign it much trust. Beyond the user information, we also have more device-specific information. So in this scenario, the web browser is Chrome. There's nothing inherently evil in that, arguably, but we also have the Exact ID and Smart ID. Now, Exact ID is a cookie-based identifier, and it's also something we've only seen once previously. Smart ID is a little bit more interesting in that it's an identifier that's based on a number of different attributes. It's nowhere near as simple as just, hey, we've seen this cookie before. It's using all of the attributes we found about the device, the screen resolution, the OS, the OS version, all of these things combined, and saying, well, we think we've seen this device before. And the date that we first saw this Smart ID is considerably earlier than what we've seen for the Exact ID. This suggests that cookies have been cleared and that the user has taken some steps to hide their identity.
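To make the idea a little more concrete, here is a minimal sketch in Python of deriving a device identifier from a bundle of collected attributes rather than from a cookie. It is purely illustrative: the attribute names are hypothetical, and the real Smart ID uses far more sophisticated probabilistic matching across attributes, not a single hash.

```python
import hashlib

# Purely illustrative: the real Smart ID uses probabilistic matching across many
# device attributes, not a single hash. The attribute names below are made up.
def device_fingerprint(attrs: dict) -> str:
    keys = ["os", "os_version", "browser", "screen_resolution", "timezone"]
    material = "|".join(f"{k}={attrs.get(k, '')}" for k in keys)
    return hashlib.sha256(material.encode()).hexdigest()[:16]

print(device_fingerprint({
    "os": "Windows 10", "os_version": "19042", "browser": "Chrome 86",
    "screen_resolution": "1280x720", "timezone": "UTC+3",
}))
```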
On top of device information, we have network information, which gives us quite a bit of useful information actually. Maybe we've detected a VPN, or in this case we've detected a proxy. We've been able to get some location data from those two IP addresses, the proxy IP address and what we've determined to be the end user's actual IP address. And we've been able to look up those addresses and get a geographic location for each. And you can see those two locations are not near each other; they're not the same location. That's often an indicator that there is something fishy going on. And in fact, if we look at the map, you can quite clearly see that we believe the originating device is in Russia and it's come through a VPN or a proxy and is popping out in California.
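As a rough illustration of that location check, the sketch below compares the geolocation of the proxy's exit IP with the geolocation of the true IP using the haversine distance. The coordinates and the threshold are made-up examples, not values from the product.

```python
from math import radians, sin, cos, asin, sqrt

# Illustrative only: flag a large distance between the proxy's geolocation and
# the true IP's geolocation as a possible location-spoofing signal.
def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius of ~6371 km

proxy_geo = (37.77, -122.42)  # example coordinates in California
true_geo = (55.75, 37.62)     # example coordinates in Russia
distance = haversine_km(*proxy_geo, *true_geo)
if distance > 500:            # threshold is a made-up example
    print(f"proxy geo and true IP geo are {distance:.0f} km apart")
```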
Those things all combine and are analyzed in the policy, and the policy generates a string of reason codes. A reason code is an indicator that a particular rule has fired. So we have operating system spoofing: we've received some conflicting signals as to what operating system was being used. That's an indicator that something's not quite right, and it contributes to the risk score. As I said, we've detected a proxy. That by itself may not be a big deal, but combine it with the fact that the proxy IP geo doesn't match the true IP geo, two different locations, and that's not a usual scenario for someone unless they're trying to hide where they're coming from. There's a few other things there: device language, irregular screen resolution. Screen resolution's an interesting one. If you are using a virtual machine, for example, and the VM window is not taking up the entire screen space of the monitor, you're going to end up with an odd resolution. That's something that we detect, and it's considered a fraud indicator. We also have the email in our global block list: potentially a different organization has seen that email address before, considered it bad and added it to the block list.
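To illustrate the shape of that, here is a toy policy evaluation in Python, and only a toy: the reason codes, weights and thresholds are invented for the example, and this is not ThreatMetrix's actual rules engine. Each rule that fires contributes a reason code and a weight to the overall score.

```python
# Toy policy evaluation: each rule that fires contributes a reason code and a
# weight to the score. All codes, weights and thresholds here are made up.
RULES = [
    ("OS_SPOOFING",                -20, lambda e: e["reported_os"] != e["detected_os"]),
    ("PROXY_DETECTED",             -10, lambda e: e["proxy"]),
    ("PROXY_TRUE_IP_GEO_MISMATCH", -25, lambda e: e["proxy_geo"] != e["true_ip_geo"]),
    ("EMAIL_IN_GLOBAL_BLOCKLIST",  -20, lambda e: e["email_blocklisted"]),
]

def evaluate(event):
    fired = [(code, weight) for code, weight, pred in RULES if pred(event)]
    score = sum(weight for _, weight in fired)
    status = "reject" if score <= -50 else "review" if score <= -20 else "pass"
    return score, status, [code for code, _ in fired]

print(evaluate({
    "reported_os": "Windows", "detected_os": "Linux", "proxy": True,
    "proxy_geo": "US-CA", "true_ip_geo": "RU", "email_blocklisted": True,
}))
```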
All of these things combine to give us the risk score that we saw earlier, and ultimately to allow the customer to say, "This is not an account that I want created." How they actually respond to that is obviously up to them, their business, and what makes sense in that scenario. So that's kind of the journey from the user's perspective, though obviously a lot of this stuff the user wouldn't see; that's what the web workflow would look like. However, there's also the customer's perspective. They're not looking at one event, they might be looking at millions of events in a day. And so they have the ability to log into our portal, analyze all of the events that happened over whatever time period, look for patterns or anomalies, and maybe even just respond to events that they need to review.
So I mentioned earlier that sometimes the policy score will have a result that's not strong enough to immediately block a transaction, but you also don't necessarily want to let that transaction through. There may be other steps that are required and a fraud operator could look through our portal and use the information provided to make a decision. So in this particular case, we have a week's worth of data. You can see the peaks and troughs of events throughout the day and we have a bit of a spike there on October 30th of events which had a poor risk rating.
If we identify an element that is associated with that particular spike, in this case we are going to look at an email address, we can then filter and have a look at all of the different events associated with that email address. Now, one thing I should point out is that the email address is just a hash. That's not what the customers actually see; this is what a user internal to ThreatMetrix sees when they're logging into the portal. We can log in and view customer data, but we can't see any sensitive data. Things like IP addresses we can see; email addresses we can't. We get the hashed data, which is useful for analysis, but obviously it limits what we can do with that data.
Once we have an event that we're particularly interested in, we can drill down even further and look at all of the data that's associated with that one event. In this case, we have an event that has some behavioral biometrics data. As part of the process, when the user was logging in or doing whatever it was they were doing, we captured some information about the way they interacted with that page. So you can see that the name was auto-filled, the shipping address was typed in with the keyboard, the country and the phone number were pasted, and some other things there as well. Things like using tab to navigate between fields or copying and pasting data are often an indicator of fraud. Maybe not a strong indicator, but it is an indicator and it contributes to the overall score. Which clearly means I'm a fraudster, because I copy and paste everything I can. I'm lazy.
On top of that, we have a number of different tabs here we can drill into, and I won't go into all of them, but a customer could drill into specifics around the device, around the web browser, some of the custom data that was passed in, or the network data. And that can all be analyzed here, either to process a specific transaction if it's in the review state, or to analyze how the policy's performing. So if there's known fraud, known loss, then those events could be reviewed here with the aim of improving the policy, so that the next time this kind of fraud happens, it's actually picked up.
So this brings us to the digital identity network. The digital identity network is the core of our risk decisioning process. It is all of the data that we collect and process, and it is what allows us to provide useful risk information, useful fraud information, to our customers. Going back to our example from earlier, we have the page load, we have the profiling data that's collected, and we have the actual risk query from our customer, the API call. All of these things contribute data, and we store some of this data in our digital identity network so that it can be used for future events. This might be transactional data, it might be session data that the customer passes to us, or it might be the device data that we've collected.
Over the years, we have collected quite a bit of data, and this graph shows our transaction volume over the last 10 years. We've been collecting data for the digital identity network for a little bit longer than that, probably closer to 15 years, but in those first years there wasn't a lot of data collected. Going back 10 years, we were doing 5 million transactions per month. In two years or so, just under two years, we 10x'd that up to 50 million. Another couple of years and we 10x'd that again, up to half a billion. Unfortunately, our growth rate did slow down a little bit at that point and it took nearly five years to do the 10x again and get up to 5 billion transactions a month.
Obviously, this is a lot of data, and to collect that amount of data we're doing a reasonable number of transactions: up to 4,000 to 5,000 transactions per second, 250 million transactions per day. And this results in a database that has over 4 billion devices, nearly a billion email addresses, over a billion login names, and half a billion phone numbers. This information is then combined, as I mentioned earlier, into the LexID Digital identifier that maps associations between attributes. We have one and a half billion different digital identities mapped in our digital identity network. All of this data we query at an average of 60 milliseconds per transaction. It is this data that allows a customer to see a new account being created and still be able to make a valuable risk decision for an end user they have not necessarily seen before, but that maybe someone else in the network has.
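As a very rough illustration of "mapping associations between attributes", here is a tiny union-find sketch in Python that links attributes seen together in the same event into one group. This is not the LexID Digital algorithm, which is far more nuanced; it only shows the flavour of linking attributes into a single identity, with made-up attribute values.

```python
# Illustrative only: a trivial union-find that links attributes observed in the
# same event into one group, a rough analogue of mapping attribute associations
# to a single digital identity. The attribute values below are made up.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def link(*attrs):
    roots = [find(a) for a in attrs]
    for r in roots[1:]:
        parent[r] = roots[0]

# Two events whose attributes overlap on the same device identifier:
link("email:alice@example.com", "device:smart-id-1234", "phone:+61400000000")
link("device:smart-id-1234", "login:alice99")

print(find("login:alice99") == find("email:alice@example.com"))  # True: one identity
```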
This data comes from all around the world, and you can see here in this chart that the vast majority of the world's countries contribute to this digital identity network at least roughly once an hour. The countries that we don't see are, for the most part, not countries I've been to, or in some cases even heard of, and they're not doing many online transactions. So we have this digital identity network. Great. What does it actually look like from a data perspective? How do we handle that number of records in the latency budget that we have available to us? Well, it starts with the customer. The customer calls us, gives us data. Easy, right? Well, we have 130 billion keys in our database and we need a highly performant real-time database to achieve this. And that's where Aerospike comes in. We've been using Aerospike for a while now, I'm going to say five years, that ballpark, and we've had a lot of success with it.
We started off with a single large cluster and over time we've split it into multiple smaller clusters for a variety of reasons. It's been a little bit easier to administer, scale and do capacity planning with smaller, more dedicated, specialized clusters. And whenever we collect data from a customer API or from a profiling event, and we want to use that data in future real-time decisioning, it gets stored in Aerospike. So Aerospike is a bunch of servers, but this is what it allows us to do. The top graph here is from one of our data centers, the Sacramento data center, over a week. You can see the peaks and troughs that happen throughout the day. Although our customers are global, we do have more customers in certain areas, so we still get peaks and troughs, although the trough is still a fairly high amount of traffic.
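For readers unfamiliar with Aerospike, the read/write pattern on the hot path is a simple key-value get and put. The sketch below uses the Aerospike Python client; the host, namespace, set, key and bin names are hypothetical, not ThreatMetrix's actual schema.

```python
import aerospike

# Hedged sketch of the key-value access pattern; all names are hypothetical.
client = aerospike.client({"hosts": [("10.0.0.1", 3000)]}).connect()

# (namespace, set, user key) -- here, history for a hashed email attribute.
key = ("identity", "email_history", "sha256-of-email")

# Write: store first-seen / last-seen timestamps and an event count.
client.put(key, {"first_seen": 1604016000, "last_seen": 1604620800, "events": 2})

# Read: a single low-latency get on the synchronous path of the risk query.
_, meta, record = client.get(key)
print(record["first_seen"], record["events"])

client.close()
```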
You can see there, I've been saying an average of 60 milliseconds; it's actually probably a little bit less than that, there's a bit of noise there, but we'll call it 60 milliseconds. Then there's the bottom graph. We have what we imaginatively called the storage engine, the SE. This is the component that interacts with Aerospike for the most part. It has a metric, SE get record, which is pretty much exactly what it says: the amount of time in microseconds that it takes to fetch a record over the network, do the lookup and retrieve the data. And you can see that that's pretty consistent. It does, if you look very closely, follow the peaks and troughs of the top graph, but you'd be forgiven for losing that in the noise of the graph. All in all, Aerospike allows us to have a fairly low-latency, fairly consistent solution.
And if we break that down a little bit further and have a look at what an actual API call looks like in terms of the reads and the writes: I picked a somewhat arbitrary point in the day. It wasn't peak; we were doing about 1,200 API calls per second in Sacramento. And I went through and collected all of the various writes and reads per second that we're doing and added them all up, as you can see, across our five main clusters. What we get is around 215,000 writes per second and around 130,000 reads per second. They can be mapped to the API calls, and we can say that for a single API call, we have 180 writes and 110 reads. Now, not all of these reads or writes happen synchronously with the actual API call, where we've got our 60 millisecond budget. The writes for the most part either happen during the profiling phase or happen asynchronously to the actual transaction. The reads, though, are predominantly synchronous with the API call.
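The per-call figures are just the cluster-wide rates divided by the API call rate, as the quick back-of-envelope check below shows.

```python
# Back-of-envelope check of the per-API-call numbers quoted above.
api_calls_per_sec = 1200
writes_per_sec = 215_000
reads_per_sec = 130_000

print(round(writes_per_sec / api_calls_per_sec))  # ~179, i.e. roughly 180 writes per call
print(round(reads_per_sec / api_calls_per_sec))   # ~108, i.e. roughly 110 reads per call
```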
Some of those reads are going to happen before the API call as part of the profiling step, but we could fairly safely say that we've got about 100 reads happening for a single API call. Now, to serve these requests, we have 48 bare metal servers. We have the specs up here on screen. If you know your hardware, you'll probably identify that this hardware is not the latest and greatest; it's not the most powerful hardware. And in fact, we do have some new hardware coming, which I'll talk about in just a second. But overall, this hardware has actually served us very well and allows us to have a really great response time.
These five different clusters, as you can see, they have very different IO patterns. The data that we store in each cluster is for the most part quite different. AeroCDL, for example, the last one, this is an in-memory cluster. We don't actually write anything to disk synchronously. It is replicated to disk in case there's a power failure or something. But transactionally it's all in-memory because it's a relatively small data set and we want very fast lookups there. Some of the other clusters are more data heavy and require more disk space.
So hardware, it's a fun game to be in, isn't it? It changes all the time. I'm excited about the new hardware. As you can see, we're getting more cores and more memory. We are switching to Intel PMEM, which I'm really excited about. This will allow us to have much higher-density nodes. PMEM is cheaper than RAM when you're buying large amounts of it to put in a single node, and, while we'll find out for ourselves, the performance impact should be minimal from what we've heard. SSDs have improved a lot over the last five years, and in fact the SSDs in our current systems are SATA-based. The Intel SSDs we're getting in the new nodes are NVMe. So even just changing from one bus and protocol to another, we'll get quite a bit of a performance boost.
The CPUs themselves, even ignoring the extra cores, are more modern and more powerful; there should be a 30 or 40% IPC improvement just on a single thread. So that should be great. That should buy us some more time, because we'll spend less time actually processing. And speaking of which, where do we spend time during an API call? We can carve this up in many different ways, but one way of doing it is to look at reading the profiling data, calculating some device IDs, reading the entity history, which is essentially our digital identity network, calculating our LexID Digital, and then running the customer policy. The customer can define whatever rule set they like of whatever size they like, but those first four steps all need to happen before the customer policy can actually be executed.
And there are actually some other steps in there that are not broken down here. Execute policy is not just the customer policy; we do actually have a global policy, our own internal policy, that generates some useful information that can be used by a customer later. We have, for example, a behavioral biometrics policy that does some analysis of the behavioral biometrics data, but I'm getting ahead of myself. So if we look at a specific customer here, and I won't share the name of the customer with you, but this is actually a UK customer that I'm sure all of you know, we can see that they have a thousand rules in their policy. We can see that the actual IO, reading the data from the databases, is about seven milliseconds. There's a little bit of calculation that happens as well. Then 22 milliseconds later, we can respond to the customer with the policy score, the risk score, the risk rating, all of that good stuff that they've defined that answers their questions.
So in this case, 33 milliseconds. This is perhaps on the lighter side compared to some of our other customers. And in fact, if we look at a customer that has more complex policies, this particular policy has 2,800 rules. The amount of time that is spent in IO, though, is roughly the same. It can vary a little bit depending on exactly what data gets passed to us; in this case it's 5.7 milliseconds versus five milliseconds, and I would think that is simply due to us having extra data passed to us that requires extra lookups to get the entity history. However, once all of that is done, we then process the customer rules, 2,800 of them, and that takes 70-odd milliseconds. And then we can respond to the customer again with the risk rating and the policy score and allow them to continue their transaction.
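Putting those two worked examples side by side makes the point about where the time goes. The sketch below uses approximate figures from the talk (milliseconds, measured inside our data center); the customer labels are placeholders and the exact split varies per customer and per event.

```python
# Approximate in-data-center timings from the two examples above (milliseconds).
examples = {
    "customer A (1,000 rules)": {"io": 7.0, "total": 33.0},
    "customer B (2,800 rules)": {"io": 5.7, "total": 80.0},
}

for name, t in examples.items():
    compute = t["total"] - t["io"]  # device IDs, LexID and policy execution
    print(f"{name}: {t['total']:.0f} ms total, {t['io']:.1f} ms IO "
          f"({t['io'] / t['total']:.0%}), {compute:.1f} ms calculation and policy")
```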
So this is all well and good, but why do we care how long the IO takes? What we need to understand is that our customers rely on this information, this risk score and risk rating, coming back in a timely fashion, and we have a fairly limited budget to respond in, especially if you factor in network latency. These numbers are within our data center, and if we're taking 80 milliseconds, there might be another 30 or 40 milliseconds for each direction of the packets to go between the customer's data center and ours. So the customers want a response quickly; it allows them to make the decision that they need to. And if the end user has clicked a buy button or a login button, you don't want that delay to be very long. So if we do faster IO, we have less time tied up in reading the digital identity network. And the digital identity network is not getting smaller; we're storing more data, and we want to do more complex things with it. This is where an improvement in IO allows us to do more complex things.
Like what, you might ask? Well, conveniently, I have a slide on this. Like probably all of us over the last few years, we've been exploring solutions involving machine learning. Our first foray into machine learning was perhaps a little bit simplistic, but you've got to dip your toe in the water somewhere. It is something we've been working on for a number of years now, and there are features coming up in the near future that I'm quite excited about, as well as some applications under the hood that are not generally promoted but that I'm also excited about. So one of the things we've been playing around with is Microsoft's really cool distributed machine learning toolkit. It allows the generation of LightGBM models, and LightGBM models in our internal tests have proven to be very powerful.
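As a point of reference, here is a minimal example of training a LightGBM model with the open-source lightgbm package on synthetic data. It is only meant to show the kind of model being discussed; the features, labels and parameters have nothing to do with ThreatMetrix's actual models.

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in data: 5,000 "transactions", 20 numeric features, a binary
# fraud label. Nothing here reflects ThreatMetrix's real features or data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 1).astype(int)

train = lgb.Dataset(X[:4000], label=y[:4000])
model = lgb.train({"objective": "binary", "metric": "auc", "verbosity": -1},
                  train, num_boost_round=50)

scores = model.predict(X[4000:])  # predicted fraud probabilities for held-out rows
print(scores[:5])
```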
But we've got this real-time rules engine that we've been developing for 15 years now. It's not a model execution engine. So what do we do? How do we deal with this? Well, actually it turned out to be relatively straightforward for us, because we embedded Lua into our rules engine. Lua is a scripting language. It's designed to be efficient, it's designed to be embedded in other binaries, and it's used under the hood in all sorts of things that you're probably not aware of. We already use it for various pieces, and we decided that it wouldn't be too hard for us to convert PMML, a standard file format for describing a model, into a Lua script. And this is what we did. So we created this product. It's open source; you can find it on our ThreatMetrix GitHub: pamplemouse.
We called it that because there was a naming collision with a PMML-to-Lua project that someone had created and abandoned a long time ago, so we came up with a new name. This is a project that is available for download today if you like. Its intent is to take PMML models, and it doesn't support 100% of the PMML format, but it does support a number of useful things, including LightGBM, and it generates optimized Lua. And in fact this has proven to be quite powerful. We have some models which are multiple megabytes of Lua script in size, and we can execute them in a few milliseconds.
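To give a flavour of what "generating Lua from a model" means, here is a toy Python sketch that walks a hand-written decision-tree structure and emits it as nested Lua if/else statements. This is not pamplemouse; the tree format, feature names and values are invented for the example, and it only illustrates the general compile-the-model-to-script idea.

```python
# Toy model-to-Lua code generation: walk a tree and emit nested if/else Lua.
# The tree structure, feature names and leaf values below are all invented.
def tree_to_lua(node, indent="  "):
    if "leaf" in node:
        return f"{indent}score = score + ({node['leaf']})\n"
    return (f"{indent}if {node['feature']} < {node['threshold']} then\n"
            + tree_to_lua(node["left"], indent + "  ")
            + f"{indent}else\n"
            + tree_to_lua(node["right"], indent + "  ")
            + f"{indent}end\n")

tree = {"feature": "email_age_days", "threshold": 30,
        "left": {"leaf": -0.8},
        "right": {"feature": "distinct_devices", "threshold": 3,
                  "left": {"leaf": 0.1}, "right": {"leaf": -0.4}}}

print("local score = 0\n" + tree_to_lua(tree) + "return score")
```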
As far as our internal use of these goes, we've been slowly converting some of our features, like our proxy detection feature or the behavioral biometrics I've mentioned a few times; we actually make use of machine learning models to generate some of the scores around those. They tend to fly under the radar a bit: they're features that we do expose, but we don't generally expose a lot of detail to customers around them. However, we have had for a while now a product called Smart Learning that uses machine learning techniques to optimize policies. And if we have a look at our results from Smart Learning, that would be the black line down the bottom, where we do actually get some uplift. The results are a reasonable area under the curve of 0.92, which is not terrible, but they are fairly limited, because historically we've been limited by what we've been able to do with our rules engine. This is no longer the case; now we can play around with all sorts of different things.
And so one of the first things we looked at was a linear tree approach, and we got some uplift there; it's not bad. But ultimately we didn't deploy a product around this, because we looked into a far superior approach, which is LightGBM. And you can see our area under the curve there is 0.98, which is extremely good. It does vary a little bit from customer to customer, and if we look at a different example, it's perhaps not as impressive, although it's still very impressive. The net result is that this is an approach that we will be pursuing in the future. It allows us to have more fraud detection and fewer false positives, and it also allows our customers to rely more on automation for tuning policies rather than manual fraud analysis. So this is an exciting development, and I'm pretty much out of time. I hope you've enjoyed this talk; I think there'll be a Q&A later and I'm happy to take any questions. Thank you.
About this webinar
Creating the world’s largest digital identity platform requires decisioning based on analysis of data from a number of global sources in real-time. Any digital business, regardless of industry, depends on speed and efficiency to drive operational decisions. Making faster, accurate, and real-time customer trust decisions removes friction and delivers superior business outcomes.
In this session, Nick Blievers, VP Engineering at ThreatMetrix®, discusses how:
Risk-based authentication leveraging digital identities is key to empowering customer transactions
Real-time customer trust decisions can reduce fraud and improve customer satisfaction
Selecting the right high performance data platform can improve decisioning and avoid spiralling complexity
Machine learning is powered at the data layer