Connecting Data, People and Ideas since 2016.
05 November 2018

Deep link analysis drives machine learning at massive scale

Graph databases have been on a meteoric rise for a while now. The key advantage of using a graph database is the ability to model data more intuitively and query it more efficiently when the application domain resembles a graph. But working with graphs can also be resource-intensive, so performance often degrades when dealing with large graphs.

TigerGraph is a new graph database, based on the premise of a native parallel graph architecture that aims to offer improved performance. TigerGraph came out of stealth in late 2017, but it has been in the works since 2012, and it feels like it has achieved a lot in a relatively short period.

From the outset, TigerGraph has emphasized improved performance. Over time, it has also made progress in other key areas, such as query language and schema, and support for graph algorithms and analytics. This matters a great deal for the ease of use and the value you can get out of a database. Let’s see how this works for one of the world’s largest telcos, China Mobile.


Preventing fraud at scale: it’s all in the connections


China Mobile has over 600 million subscribers and wanted to address the issue of phone-based fraud. Detecting fraud is not just a requirement in terms of regulatory compliance, but it also provides substantial business value: establishing trust is a key element of success for any brand.


With such a large user base, however, it also becomes a very hard problem. This is even more pronounced for pre-paid SIM cards. As these cards have very little KYC (Know Your Customer) data, they can become the weapon of choice for fraudsters.


In this case, the only way to find a fraudster is by understanding their behavior: their calling patterns with other subscribers, and how those subscribers relate to each other and to the potential fraudster. But with billions of calls every week, how can signs of fraudulent activity be identified?


This is where machine learning provides value, offering the ability to identify behaviors and patterns of likely fraudsters. More and more organizations are leveraging machine learning, along with graphs, to prevent various types of fraud, including phone scams, credit card chargebacks, advertising fraud, money laundering, and more.


Using a graph model, a machine becomes more adept at recognizing suspicious phone call patterns and is able to separate them from billions of regular calls. Selecting which data attributes (or features, as is the machine learning terminology) to use to monitor fraud is the first step in building an algorithm.


In the case of phone-based fraud, they could include calling history of particular phones to other phones that may be in or out of the network, the age of a prepaid SIM card, percentage of one-directional calls made, and the percentage of rejected calls.
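As a rough illustration of how two of these per-phone features could be derived from raw call records, here is a minimal Python sketch. The record layout `(caller, callee, was_rejected)`, the sample data and the function name are assumptions made for this example; they are not taken from China Mobile's or TigerGraph's implementation.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, was_rejected).
calls = [
    ("A", "B", False),
    ("B", "A", False),
    ("A", "C", True),
    ("A", "D", False),
]

def per_phone_features(calls):
    """Return {caller: (pct_one_directional, pct_rejected)}."""
    called = defaultdict(set)    # phone -> phones it has called
    made = defaultdict(int)      # phone -> number of calls made
    rejected = defaultdict(int)  # phone -> number of its calls that were rejected
    for caller, callee, was_rejected in calls:
        called[caller].add(callee)
        made[caller] += 1
        rejected[caller] += int(was_rejected)

    # A call counts as one-directional when the callee never calls the caller.
    one_way = defaultdict(int)
    for caller, callee, _ in calls:
        if caller not in called[callee]:
            one_way[caller] += 1

    return {
        phone: (100 * one_way[phone] / made[phone],
                100 * rejected[phone] / made[phone])
        for phone in made
    }

features = per_phone_features(calls)
for phone, (pct_one_way, pct_rejected) in sorted(features.items()):
    print(phone, round(pct_one_way, 1), round(pct_rejected, 1))
```

Both values describe a single phone in isolation, which is exactly why they cannot tell a salesperson from a fraudster on their own.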


Traditional rule-based approaches focus on features for the individual node, but this often leads to high volumes of false positives. The same data can be fed into a machine learning algorithm, but this does not mean the problem is solved automatically: poor data in, poor insights out!


For example, a phone involved in frequent one-directional calls may belong to a sales representative, who is calling prospects to find leads or sell goods and services. It may also be involved in harassment, where one user is calling another as a prank.


A high volume of false positives results in a wasted effort to investigate non-fraudulent phones, leading to low confidence in a machine learning solution for fraud detection.


Part of the problem is the fact that fraudulent calls are few and far between: less than 0.01 percent of total calls. This means that the volume of training data with confirmed fraud activity for machine learning algorithms is very small. Having such a limited quantity of training data results in poor accuracy.


To address this, China Mobile uses what TigerGraph calls “real-time deep link analytics”. Over 15 billion calls for 600 million mobile phones are analyzed, and 118 features are generated for each mobile phone.


These are based on deeper analysis of calling history and go beyond immediate recipients for calls. The solution identifies what type of fraud is suspected and displays a warning message on the callee’s phone, all while the phone is still ringing.



The good, the bad, and the tricky ones: it’s complicated



A customer with a good phone calls other subscribers, and the majority of their calls are returned. This helps to indicate familiarity or trusted relationships between the users. A good phone also regularly calls a set of other phones — say, every day or month — and this group of phones is fairly stable over a period of time (“Stable Group”).
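One simple way to picture the “Stable Group” feature is as the set of phones that show up among a caller's callees in every observed period. The sketch below assumes the callees are already grouped per period; the function name, the "every period" threshold and the data are illustrative assumptions, since the article does not say how the feature is actually computed.

```python
def stable_group(callees_by_period):
    """Return the phones that were called in every observed period.

    `callees_by_period` is a list with one collection of callees per period
    (for example, one per week). Requiring a phone to appear in *every*
    period is an assumption; a real feature might tolerate gaps.
    """
    periods = [set(period) for period in callees_by_period]
    if not periods:
        return set()
    return set.intersection(*periods)

# Hypothetical data: three weeks of callees for one phone.
weekly_callees = [
    {"B", "C", "D"},
    {"B", "C", "E"},
    {"C", "B"},
]
print(sorted(stable_group(weekly_callees)))
```

A phone that calls a different set of numbers each period ends up with an empty stable group under this definition.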


Another feature indicating good phone behavior is when a phone calls another phone that has been in the network for many months or years and is called back. We also see a high number of calls among the good phone, the long-term contact, and other phones in the network that frequently call both of these numbers. Good phones have many in-group connections.


Lastly, a good phone is often involved in a three-step friend connection: the good phone calls another phone (phone two), which calls a third phone (phone three), and the good phone is also in direct contact with phone three. A three-step friend connection indicates a circle of trust and interconnectedness.
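The three-step friend connection amounts to finding a triangle around a phone in the call graph. The following Python sketch assumes the graph is a plain dictionary mapping each phone to the set of phones it has called; that representation and the function name are assumptions for illustration only.

```python
def three_step_friends(graph, phone):
    """Return (phone_two, phone_three) pairs that close a triangle with `phone`.

    `graph` maps each phone to the set of phones it has called.
    """
    direct = graph.get(phone, set())
    pairs = set()
    for two in direct:
        for three in graph.get(two, set()):
            if three in (phone, two):
                continue
            # Direct contact with phone three, in either direction.
            if three in direct or phone in graph.get(three, set()):
                pairs.add((two, three))
    return pairs

# Hypothetical call graph.
graph = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": set(),
    "F": {"X", "Y"},
    "X": set(),
    "Y": set(),
}
print(three_step_friends(graph, "A"))  # {('B', 'C')}
print(three_step_friends(graph, "F"))  # set()
```

Here phone A calls B, B calls C, and A also calls C, so A has a three-step friend connection; phone F calls two phones that have no link to each other, so it has none.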


By analyzing call patterns, “bad” phones can be identified. These phones have short calls with multiple good phones, but receive no calls back. They also do not have a stable group of phones called on a regular basis (empty stable group). When a bad phone calls a long-term customer in the network, the call is not returned. Bad phones also have many rejected calls and lack three-step friend relationships.


To see how graph-based features improve accuracy for machine learning, let’s consider an example, using profiles for four mobile users: Tim, Sarah, Fred and John.


Traditional calling history features, such as the age of the SIM card used, percentage of one-directional calls and percentage of total calls rejected by their recipients, result in flagging three out of four of our customers, Tim, Fred and John, as likely or potential fraudsters as they look very similar based on these features.


Real-time deep link analytics help machine learning classify Tim as a prankster, John as a salesperson, while Fred is flagged as a likely fraudster. Let’s see how.

Tim has a stable group, so he is unlikely to be a salesperson, since salespeople call different numbers each week. Tim doesn’t have many in-group connections, which means he is likely calling strangers. He also has no three-step friend connections, which suggests that the strangers he calls are not related to one another. So it’s quite likely that Tim is a prankster.


Let’s consider John who doesn’t have a stable group, which means he is calling new potential leads every day. He calls people with many in-group connections. As John presents his product or service, some of the call recipients are most likely introducing him to other contacts if they think the product or service would be interesting or relevant to them.


John is also connected via three-step friend relations, indicating that he is closing the loop as an effective sales guy, navigating the friends or colleagues of his first contact within a group, as he reaches the final buyer for his product or service. The combination of these features classifies John as a salesperson.

In the case of Fred, he doesn’t have a stable group, nor does he interact with a group that has many in-group connections. Plus, he does not have three-step friend relations among the people he calls. This makes him a very likely candidate for investigation as a phone scam artist or fraudster.
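The reasoning for Tim, John and Fred can be restated as a small lookup over three boolean graph features. This is only a hand-written summary of the walkthrough above, with hypothetical names; in the solution described here a machine learning model learns from all 118 features rather than following fixed rules like these.

```python
from typing import NamedTuple

class GraphFeatures(NamedTuple):
    stable_group: bool
    many_in_group: bool
    three_step_friend: bool

def describe(f):
    """Map a feature combination to the label used in the walkthrough."""
    if f.stable_group and f.many_in_group and f.three_step_friend:
        return "good phone"
    if f.stable_group and not f.many_in_group and not f.three_step_friend:
        return "likely prankster"
    if not f.stable_group and f.many_in_group and f.three_step_friend:
        return "likely salesperson"
    if not (f.stable_group or f.many_in_group or f.three_step_friend):
        return "likely fraudster"
    return "unclear"  # combinations the walkthrough does not cover

profiles = {
    "Tim": GraphFeatures(True, False, False),
    "John": GraphFeatures(False, True, True),
    "Fred": GraphFeatures(False, False, False),
}
for name, feats in profiles.items():
    print(name, "->", describe(feats))
```

The three callers look alike on traditional features, yet each lands on a different label once the graph features are taken into account.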



Real-time deep link analytics at scale for the win



This solution is made possible by utilizing certain properties of TigerGraph’s graph database. Graph databases overcome the challenge of representing massive, complex and interconnected data by storing data in a format that includes nodes, edges and properties. They are well-suited for relationship analysis. But graph analytics at scale require the ability to access and process massive graph data rapidly.


Graph-based features such as “Many In-group Connections” require multiple hops. From the caller (Phone 1), go to their calls (Hop 1); from each call, go to the callee, such as Phone 4, 5 and 6 (Hop 2); from each callee, go to other calls that do not have Phone 1 as the callee (Hop 3); from each of those calls, go to the callee, such as Phone 5 or 6 in the case of Phone 4 (Hop 4); finally, find connections from these callees back to Phone 1 by looking at their calls that have Phone 1 as the callee (Hop 5).
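The hop sequence can be sketched in a few lines of Python. To keep the example short it assumes a simplified graph in which each phone maps directly to the set of phones it has called, so the separate call vertices of the description are collapsed into edges. The function name and sample data are hypothetical, and this is not TigerGraph's query language.

```python
def in_group_connections(graph, phone):
    """Return (links, callbacks) for the group of phones called by `phone`.

    links:     calls between two different members of the group
    callbacks: group members that have called `phone` back
    """
    group = graph.get(phone, set())  # hops 1-2: phones that `phone` called
    links = set()
    for member in group:
        for other in graph.get(member, set()):  # hops 3-4: the member's other calls
            if other != phone and other != member and other in group:
                links.add((member, other))
    # Hop 5: members whose calls have `phone` as the callee.
    callbacks = {m for m in group if phone in graph.get(m, set())}
    return links, callbacks

# Hypothetical call graph built around Phone 1.
graph = {
    "P1": {"P4", "P5", "P6"},
    "P4": {"P5", "P6"},
    "P5": {"P1"},
    "P6": {"P1", "P5"},
}
links, callbacks = in_group_connections(graph, "P1")
print(len(links), sorted(callbacks))  # 3 ['P5', 'P6']
```

A phone whose callees neither call each other nor call back would yield empty results for both values, which corresponds to the "no in-group connections" pattern in the profiles above.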


Real-time analytics in cases like this requires three or more hops in the graph. Native Parallel Graphs are designed to process paths in parallel, evaluating conditions and collecting the nodes and edges that meet certain criteria along the way. Examples include remembering which phone calls have already been visited during a traversal, or collecting phone calls in which Phone 1 is neither the caller nor the callee.
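The bookkeeping mentioned here, tracking what has already been visited while walking several hops out, can be illustrated with a plain breadth-first traversal. The sketch below is sequential Python over an assumed dictionary-based graph with hypothetical names; it shows the idea only and says nothing about how a Native Parallel Graph distributes the work.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Return {node: hop_count} for nodes reachable within `max_hops` of `start`."""
    visited = {start}  # remembers what has already been seen
    frontier = deque([(start, 0)])
    reached = {}
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                reached[nxt] = depth + 1
                frontier.append((nxt, depth + 1))
    return reached

# Hypothetical graph: P4 calls P1 back, which must not restart the walk.
graph = {
    "P1": ["P4"],
    "P4": ["P5", "P1"],
    "P5": ["P6"],
    "P6": ["P7"],
}
print(within_hops(graph, "P1", 3))  # {'P4': 1, 'P5': 2, 'P6': 3}
```

Without the visited set, the edge from P4 back to P1 would cause the same vertices to be expanded again at every level, and the cost grows quickly with each extra hop.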


Fraud detection for China Mobile needs to happen in real time. It’s no use finding potentially fraudulent calls two days later, as by then the fraudster has already swindled a number of people and moved on to a different SIM card.


Native Parallel Graphs are able to compute deep link features in real time at large scale, such as more than 2 billion calls per day for China Mobile, and raise real-time fraud alerts to protect customers.


The graph database platform leverages more than 100 graph features, such as Stable Group, that correlate highly with good and bad phone behavior, for each of 600 million mobile phones. At 118 features per phone (118 × 600 million ≈ 70.8 billion), this generates roughly 70 billion new training data features to feed machine learning algorithms.


The technical outcome is improved accuracy of machine learning for fraud detection, with fewer false positives (i.e., non-fraudulent phones marked as potential fraudster phones) as well as fewer false negatives (i.e., phones involved in fraud that weren’t marked as such).


The business outcomes? Less time spent on identifying fraud, therefore more time spent on productive activities. Ensuring regulatory compliance, therefore less scrutiny and revenue loss due to fines. And, perhaps most importantly, better service and a feeling of trust in the customer base. Triple bottom line: what’s not to like?


To learn more about TigerGraph, and to see it in action, come join us at Connected Data London on November 7, 2018. Todd Blaschka, TigerGraph COO, will share how to scale up business value with real-time operational graph analytics.


