This blog post shows how to combine two technologies: Google Cloud Vertex AI, an ML development platform, and Neo4j, a graph database. Together, they can be used to build and deploy graph-based machine learning models.
You can find the code that underlies this blog post in a notebook.
Graphs are useful for data science.
Many business problems involve data that can be expressed as a graph. Graphs are data structures that describe the relationships between data points as much as the data points themselves.
One way to think about graphs is in terms of nouns and verbs. The nouns, or nodes, are things like people, places, and objects. The verbs, or relationships, are what connect them: people know each other, and objects are sent to places. The signal in these relationships can be powerful.
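To make the nouns-and-verbs picture concrete, here is a minimal sketch in plain Python. The `Relationship` type, the `neighbors` helper, and the sample people and places are all made up for illustration; none of them come from the notebook or from Neo4j.

```python
from collections import defaultdict
from typing import NamedTuple


class Relationship(NamedTuple):
    """A directed, typed edge: (start node) -[verb]-> (end node)."""
    start: str
    verb: str
    end: str


# Hypothetical sample data: nouns are nodes, verbs are relationships.
relationships = [
    Relationship("Alice", "KNOWS", "Bob"),
    Relationship("Bob", "KNOWS", "Carol"),
    Relationship("Parcel-1", "SENT_TO", "London"),
    Relationship("Alice", "LIVES_IN", "London"),
]


def neighbors(rels, node):
    """Group the nodes reachable from `node` in one hop by relationship type."""
    grouped = defaultdict(list)
    for rel in rels:
        if rel.start == node:
            grouped[rel.verb].append(rel.end)
    return dict(grouped)


print(neighbors(relationships, "Alice"))
```

A graph database stores and indexes exactly this kind of structure, so that hops like the one in `neighbors` stay cheap as the data grows.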
Graph data is often large and difficult to manage. It can be nearly impossible to use it in traditional machine-learning tasks.
Google Cloud and Neo4j provide scalable, intelligent tools to make the most of graph data. Together, Neo4j Graph Data Science and Google Cloud Vertex AI make it easy to build AI models from graph data.
Dataset – PaySim Fraud Identification
Machine learning using graphs has many applications. Combating fraud in all forms is one common application. Fake transactions are identified by credit card companies, and insurers face false claims. Lenders also watch out for stolen credentials.
Machine learning and statistics have been used for decades to combat fraud. One common method is to create a classification model based on the individual characteristics of each payment and its users. Data scientists may train an XGBoost model that predicts if a transaction is fraudulent. It uses the transaction amount, date, time, origin and target accounts, and the resulting balances.
However, these models are easy for fraudsters to evade. Fraudsters can bypass checks that only look at one transaction by channeling transactions through a network. To be successful, a model must understand the relationships between fraudulent transactions, legitimate transactions, and actors.
These types of problems are best solved with graph techniques. This example shows how graphs can be used in this scenario. We then show how to build an end-to-end pipeline for training a complete model with Neo4j and Vertex AI. We’re using the PaySim dataset from Kaggle, which includes graph features.
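The gap between per-transaction checks and network-aware checks can be sketched in a few lines of plain Python. Everything here is hypothetical: the account names, the amounts, and the `THRESHOLD` rule are invented for illustration and are not taken from PaySim.

```python
from collections import defaultdict, deque

# Hypothetical transfers: (sender, receiver, amount). A large sum is split
# into smaller hops through "mule" accounts and collected again at the end.
transfers = [
    ("fraudster", "mule_1", 900),
    ("fraudster", "mule_2", 950),
    ("mule_1", "cash_out", 900),
    ("mule_2", "cash_out", 950),
    ("alice", "bob", 40),
]

THRESHOLD = 1000  # made-up per-transaction limit


def flagged_individually(txns, limit=THRESHOLD):
    """A rule that inspects each transaction in isolation."""
    return [txn for txn in txns if txn[2] >= limit]


def reachable_accounts(txns, source):
    """Follow the money with a breadth-first search over the transfer graph."""
    outgoing = defaultdict(list)
    for sender, receiver, _amount in txns:
        outgoing[sender].append(receiver)
    seen, queue = {source}, deque([source])
    while queue:
        account = queue.popleft()
        for receiver in outgoing[account]:
            if receiver not in seen:
                seen.add(receiver)
                queue.append(receiver)
    return seen - {source}


print(flagged_individually(transfers))             # no single transfer trips the rule
print(reachable_accounts(transfers, "fraudster"))  # the network view links all hops
```

No single transfer reaches the limit, yet the traversal ties the sender to the final cash-out account. That relational signal is what a graph-aware model can pick up.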
Loading Data into Neo4j
We first need to load the data into Neo4j. We’re using AuraDS for this example. AuraDS provides Neo4j Graph Database and Neo4j Graph Data Science as a managed service on top of GCP. You can sign up for a limited preview right now.
AuraDS is a great way to get started on GCP because the service is fully managed. All we have to do to set up a PaySim database is click through a few screens and then load the dump file.
Once the data has been loaded, there are many ways to explore it with Neo4j. One is to run queries through the Python API from within a notebook.
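As a rough sketch of what a first exploratory query might look like with the official `neo4j` Python driver: the helper names, and the choice to count nodes per label, are our own illustration and not necessarily what the notebook does. The URI and credentials are placeholders you would take from your own AuraDS instance.

```python
def count_nodes_by_label(session):
    """Return {label: node count} for every label in the database."""
    query = (
        "MATCH (n) "
        "UNWIND labels(n) AS label "
        "RETURN label, count(*) AS nodes "
        "ORDER BY nodes DESC"
    )
    return {record["label"]: record["nodes"] for record in session.run(query)}


def print_label_counts(uri, user, password):
    """Connect to Neo4j and print how many nodes carry each label."""
    # Imported lazily so the helper above can be tried without the driver.
    from neo4j import GraphDatabase  # pip install neo4j

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for label, nodes in count_nodes_by_label(session).items():
                print(f"{label}: {nodes}")
```

Calling `print_label_counts("neo4j+s://<your-instance>", "neo4j", "<password>")` from a notebook cell gives a quick overview of what the loaded dump contains.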
Generating Embeddings with Neo4j
Once you have explored your data set, a common next move is to use the Neo4j Graph Data Science algorithms to create features that encode complex, high-dimensional graph data into values that can be used by tabular machine learning algorithms.
To identify patterns, many users begin with basic graph algorithms. To find disjoint groups of account holders who share common logins, you can look at weakly connected components. The Louvain method can be used to identify fraud rings that are laundering money. You can use PageRank to determine which accounts are the most important. However, these techniques require that you know the exact pattern you are looking for.
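To show what weakly connected components compute, here is a small union-find sketch in plain Python. It is only an illustration of the idea, not Neo4j's implementation (Neo4j Graph Data Science provides its own under `gds.wcc`), and the account and login identifiers are made up.

```python
def weakly_connected_components(edges):
    """Group nodes into components, ignoring edge direction (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    # Largest component first; ties broken by the smallest member name.
    return sorted(components.values(), key=lambda c: (-len(c), min(c)))


# Hypothetical data: which account used which login identifier.
shared_logins = [
    ("acct_1", "phone_A"),
    ("acct_2", "phone_A"),
    ("acct_2", "email_B"),
    ("acct_3", "email_B"),
    ("acct_4", "phone_C"),
]

for component in weakly_connected_components(shared_logins):
    print(sorted(component))
```

Accounts 1 to 3 land in one component because they are chained together by shared identifiers, even though accounts 1 and 3 share nothing directly.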
Neo4j can also be used to generate graph embeddings. Graph embeddings reduce complex topological information within your graph into a fixed-length vector, such that nodes that are related in the graph end up with vectors that are close together. If graph topology is important, such as how fraudsters behave and with whom they interact, embeddings will capture it.
Some techniques use the embeddings by themselves. You can locate clusters visually with a t-SNE plot, or compute raw similarity scores. The magic happens when you combine your embeddings with Google Cloud Vertex AI, which allows you to train a supervised model.
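As an illustration of a raw similarity score, the sketch below computes cosine similarity between embedding vectors in plain Python. Cosine similarity is our choice for the example; the post does not say which score the notebook uses. The node names and two-dimensional vectors are made up to keep the example readable.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_a * norm_b)


def most_similar(embeddings, node, top_k=3):
    """Rank the other nodes by similarity to `node`, most similar first."""
    target = embeddings[node]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != node
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical 2-dimensional embeddings; real ones would have 16 dimensions.
embeddings = {
    "acct_1": [1.0, 0.0],
    "acct_2": [0.9, 0.1],
    "acct_3": [0.0, 1.0],
}

print(most_similar(embeddings, "acct_1", top_k=2))
```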
For this example, we create a 16-dimensional graph embedding with the Fast Random Projection (FastRP) method. One neat feature is the nodeSelfInfluence parameter, which controls how much each node's own initial vector contributes to its embedding, relative to the nodes further out in the graph.
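A call along these lines could produce such an embedding. Treat it as a sketch under assumptions rather than the notebook's exact code: it assumes a graph has already been projected into the Graph Data Science catalog under a name you pass in, and the `nodeSelfInfluence` default and the `randomSeed` value are placeholders we picked for illustration. The `session` argument is a session from the `neo4j` Python driver.

```python
# Cypher for FastRP in stream mode. embeddingDimension matches the
# 16 dimensions mentioned in the text; the other values are placeholders.
FASTRP_QUERY = """
CALL gds.fastRP.stream($graph_name, {
  embeddingDimension: 16,
  nodeSelfInfluence: $node_self_influence,
  randomSeed: 42
})
YIELD nodeId, embedding
RETURN nodeId, embedding
"""


def stream_embeddings(session, graph_name, node_self_influence=0.5):
    """Run FastRP on a projected graph and return {nodeId: embedding list}."""
    result = session.run(
        FASTRP_QUERY,
        graph_name=graph_name,
        node_self_influence=node_self_influence,
    )
    return {record["nodeId"]: list(record["embedding"]) for record in result}
```

Stream mode returns the vectors to the client without writing them back to the database, which is convenient when the next step is to export them.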
Once the embedding calculation is complete, we can dump it into a pandas DataFrame, convert it to a CSV, and then push it to a Google Cloud Storage bucket so that Vertex AI can use it. These steps are described in the notebook.
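The export step might look roughly like the sketch below. To keep it dependency-light, it uses Python's standard `csv` module instead of pandas, so it is not the notebook's code. The column names (`node_id`, `embedding_0` and so on, `is_fraud`) and the bucket and blob names are hypothetical. It assumes a non-empty dictionary of embeddings that all have the same length.

```python
import csv
import io


def embeddings_to_csv(embeddings, labels):
    """Flatten {node_id: vector} plus {node_id: label} into CSV text.

    Each vector component becomes its own column so that tabular
    tooling can treat it as an ordinary numeric feature.
    """
    dimension = len(next(iter(embeddings.values())))
    header = ["node_id"]
    header += [f"embedding_{i}" for i in range(dimension)]
    header += ["is_fraud"]  # hypothetical target column name

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for node_id, vector in sorted(embeddings.items()):
        writer.writerow([node_id, *vector, labels[node_id]])
    return buffer.getvalue()


def upload_csv(csv_text, bucket_name, blob_name):
    """Upload CSV text to Cloud Storage and return its gs:// URI."""
    # Imported lazily so the CSV helper works without the GCP client.
    from google.cloud import storage  # pip install google-cloud-storage

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(csv_text, content_type="text/csv")
    return f"gs://{bucket_name}/{blob_name}"
```

With pandas, the first function collapses to building a DataFrame and calling `to_csv`; the upload step stays the same.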
Vertex AI Machine Learning
Once we have encoded graph dynamics into vectors, we can use tabular methods in Google Cloud’s Vertex AI to train our machine learning models.
We first pull the data from the storage bucket and use it to create a Vertex AI dataset. Once the dataset is created, we can train a model on it. The notebook will display the results. You can also log in to the GCP console to view the results in the Vertex AI GUI.
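With the Vertex AI SDK for Python, the dataset and training steps could be sketched as follows. This assumes AutoML tabular training, which the post does not state explicitly. The display names, the training budget, and the function name are placeholders of our own.

```python
def train_fraud_classifier(project, location, gcs_uri, target_column):
    """Create a tabular dataset from a CSV in GCS and train a classifier."""
    from google.cloud import aiplatform  # pip install google-cloud-aiplatform

    aiplatform.init(project=project, location=location)

    # Register the exported CSV as a Vertex AI tabular dataset.
    dataset = aiplatform.TabularDataset.create(
        display_name="paysim-embeddings",  # placeholder name
        gcs_source=[gcs_uri],
    )

    # Assumption: an AutoML job for a classification target.
    job = aiplatform.AutoMLTabularTrainingJob(
        display_name="paysim-fraud-automl",  # placeholder name
        optimization_prediction_type="classification",
    )

    # run() blocks until training finishes and returns the trained model.
    return job.run(
        dataset=dataset,
        target_column=target_column,
        budget_milli_node_hours=1000,  # 1,000 milli node hours = 1 node hour
        model_display_name="paysim-fraud-model",  # placeholder name
    )
```

Training runs as a managed job, so the call can take a while; progress is also visible in the console while it runs.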
Console views are great because they include ROC curves as well as the confusion matrix. These are useful in helping to understand how the model performs.
Vertex AI also offers useful tools for deploying the trained model. You can load the dataset into a Vertex AI Feature Store. An endpoint can then be deployed and called to compute new predictions. This can be found in the notebook.
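Deployment and online prediction could be sketched like this, again with the Vertex AI SDK for Python. It leaves out the Feature Store step. The `embedding_<i>` column names and the machine type are assumptions made for illustration, and `model` is expected to be a trained `aiplatform.Model`. Values are sent as strings, which is the usual convention for AutoML tabular models; other model types may expect different input.

```python
def embedding_to_instance(embedding):
    """Map a vector to feature columns named embedding_0, embedding_1, ...

    The column names are an assumption; they must match whatever names
    were used in the training CSV.
    """
    return {f"embedding_{i}": str(value) for i, value in enumerate(embedding)}


def deploy_and_predict(model, embedding, machine_type="n1-standard-4"):
    """Deploy `model` to a new endpoint and score one embedding."""
    # deploy() creates an endpoint, attaches the model, and returns the endpoint.
    endpoint = model.deploy(machine_type=machine_type)
    response = endpoint.predict(instances=[embedding_to_instance(embedding)])
    return response.predictions
```

A deployed endpoint keeps running, and billing, until it is undeployed, so remember to clean it up after experimenting.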
Future Work
We quickly realized how much work could be done in this field while working on the notebook. Machine learning with graphs, especially when compared to studying methods for tabular data, is still a new field.
We would like to expand our knowledge in the following areas:
Improved dataset – It’s difficult to share fraud datasets publicly due to privacy concerns. This is why we used PaySim, which is a synthetic dataset. Our investigation revealed that the generated dataset contains very little information. A real dataset will likely have more structure.
In future work, we’d love to explore the graph of SEC EDGAR Form 4 transactions. These forms show trades made by officers of public companies. We expect the graph to be quite interesting, as many of these people are officers at multiple companies. Workshops are planned for 2022, where participants can work together to explore the data using Vertex AI and Neo4j. You can already find a loader to pull this data into Google BigQuery.
Boosting and Embedding – Graph embeddings such as Fast Random Projection duplicate data, because copies of subgraphs end up in every tabular data point. XGBoost and other boosting techniques also duplicate data to improve results. Vertex AI uses XGBoost. This means that the models in this example likely have too much data duplication. It is possible that we would see better results using other machine learning methods, such as neural networks.
Graph Features – This example shows how to generate graph features automatically by using embeddings. You can also engineer new graph features manually. Combining these two methods would likely result in richer features.