This blog post will document all the machine learning I use and learn about in my professional and personal projects, explaining what's going on "under the hood" and pointing out broader applications of certain topics. Please note that I am writing this as I learn, so the order of topics will probably be very different from what you would find in a textbook or a machine learning degree. I will also update it continuously, so if you look at it twice six months apart, don't be surprised if some things have changed. Before I get into the specific project, here are a few definitions:
Supervised Learning: Machine Learning where we input labelled data and give the model a specific problem of predicting an outcome based on that data. Examples include link prediction, predicting credit card fraud, etc. In short, the model predicts the outcome of a situation based on data pertaining to that situation; this can be thought of as the model doing predictive data analysis.
Unsupervised Learning: Machine Learning where we input unlabelled data and give the model the problem of finding patterns within the data. Examples include clustering (placing data points on a scatter plot into categories), dimensionality reduction (reducing the amount of data without sacrificing important features), etc. In short, the model gives us information about the features of the data we feed into it; this can be thought of as the model doing exploratory data analysis.
Reinforcement Learning: Machine Learning where we let a model repeatedly attempt a problem and give it feedback every time until it learns how to do it on its own.
The problem of social media recommendation can use all three types of Machine Learning, but in this project we only use supervised and unsupervised learning.
Project 1: Social Media Friend Recommendation Tool Using Node2Vec Algorithm
The system diagram for the entirety of this project: we get social media data from the web scraper, put it into the link prediction pipeline, and voila: a friend recommendation system!
"Node2Vec" was created in 2016 by Aditya Grover and Jure Leskovec. Much thanks for them for allowing the public use of their paper found here. Another few resources I found helpful when doing this project were this AnalyticsVidhya article by Prateek Joshi and this Medium article by Vatsal. The main topics covered in this project are:
1) Data Preparation for Machine Learning
- Binary Classification Problems
- Class Imbalances
2) Node2Vec
- Word2Vec / Skip-Gram Architecture
- Random Walks
- Stochastic Gradient Descent (SGD)
- Network Embeddings
- Link Prediction
3) Model Validation
- AUC ROC score
Part 0: Introduction
This project is on a specific type of link prediction in networks: "future" link prediction, as opposed to "static" link prediction. Future link prediction is when we have a network that is growing over time, and we want to predict which new links are going to crop up in the future. Static link prediction is when we have incomplete data about a network, that is, we don't know where all the links are, and we want to predict where these "missing links" are. This project is an example of supervised machine learning: we give the model some training data, that is, we show it a situation and tell it what happens in the future, and then we give it a new situation and it predicts what will happen based on the data we have trained it on. In particular it is an example of binary classification: the model gives a probability of an event occurring, and (by extension) a probability that it doesn't, in a situation where there are only two possible cases. Another example of this could be a model that guesses whether a person is a woman or a man based on some information we give it (assuming we don't tell it about non-binary people!). There are countless different ways to perform link prediction, some of them involving machine learning and some not. This blog only covers one small corner of this massive field: using this model to perform link prediction on a simple, undirected graph.
A big issue with machine learning on networks is the level of complexity of network data. It's very hard to just put network data into a machine learning model and get a useful output. To solve this, node2vec was created. The basic idea of this algorithm is to create a vector "embedding" of all the nodes in a network, and then feed these vectors into a neural network that then predicts new links in the network. The idea of the embedding is to reduce the dimension of the input while still capturing the structure of the network.
This project is split into 3 parts: data preparation, the node2vec algorithm, and then actually doing the link prediction.
Part 1: Data Collection/Preparation
Before we can organise the data to feed into the machine learning pipeline, we need to collect that data from social media. The details on this project are in this blog post.
Definition 1.1: Variable Types: The "predictor variable" in our data set is the thing we are using to predict the "target variable".
Image 1: A simple 3 node network with 2 connections
To illustrate this definition, we go through an example. Suppose we have the network shown above, and we want to predict what connections are going to arise in the future. We can think of the predictor variables as being a description of the state of the network now, and the target variables as being a description of the network in the future.
Look at the following tables, 1&2:
Table 1: Predictor Variables
Node Pairs | Link (Predictor Variable) |
1-2 | 1 |
2-3 | 0 |
3-1 | 1 |
(0 means there isn't a link and 1 means there is a link.)
Table 2: Target Variables
Node Pairs | Link (Target Variable) |
1-2 | 1 |
2-3 | x |
3-1 | 1 |
# Note 1: For the purposes of this project we have only considered the case where the network is growing "uniformly", i.e. no 2 nodes are being "disconnected" over time. This makes it simpler because we only need to look at the node pairings that don't have a link, instead of the whole network. (Later on we will see this is actually the overwhelming majority of the pairings anyway, but still, this is only a beginner project.)
Here we are using the predictor variables (1, 0, 1) to predict the target variable x. In reality we will have a lot more variables than this, but you get the point. The goal of the project is to get a probability that x is 1. Notice how I said "x is 1", not "x will be 1". This is because x is a variable that describes whether there is a link between nodes 2 and 3 in the future. This is an important concept in this project: the predictor and target variables describe the same thing, just at two different times.
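To make the toy example concrete, here is a minimal sketch of Tables 1 & 2 combined into a single pandas DataFrame (the column names are my own choice, not anything standard):

```python
import pandas as pd

# Predictor and target variables for the 3-node network in Image 1.
# link_now is the predictor variable, link_future is the target variable;
# the unknown value x is represented as None.
data = pd.DataFrame({
    "node_pair":   ["1-2", "2-3", "3-1"],
    "link_now":    [1, 0, 1],
    "link_future": [1, None, 1],
})
print(data)
```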
# Note 2: When I say "in the future", it may be useful to think of a set time period, like a day, hour or week. Theoretically you could somehow iterate this model on a simulation of a growing network, and create distinct predictions for longer/shorter time periods. This is entirely theoretical though, and to the best of my knowledge this hasn't been researched (and it might be a stupid idea), but it's cool to think about.
Our task now is to create this data from a single "snapshot" of a network in time. To do that we are going to randomly remove certain edges from the network. Doing this essentially simulates what the network might have looked like at some point in the past. You might be thinking "but then we know the target variables". Yes, this is the point: we use this data to "train" the model to accurately predict new links, by giving it the answers so it can learn (Machine "learning").
# Note 3: In reality for the model to actually work, we need to either feed in specific growth data, ie give it 2 instances of the network, or remove edges in some non-random way that models network growth behaviour. This is because fundamentally, networks do not grow randomly, and not all networks grow in the same way; you don't choose a person entirely at random to become friends with, or to do business with etc, and connections in business are not made in the same way as connections in a biological network. Giving a model this data essentially trains it to think that networks do grow randomly, and do grow the same, and this makes its predictions less accurate. We have chosen to ignore this detail at this stage so that we can focus on the actual machine learning being done. I do have plans to extend this project to include this though.
An important part of the training data in this project is the node pairs that aren't connected. We call these the "negative samples" (there's more on this later). In order to find these pairs efficiently we draw up an "adjacency matrix" for the network. This is a symmetric matrix with entries of 0 and 1 indicating whether 2 nodes have an edge between them (1), or not (0). Consider the network from image 1 above. Its adjacency matrix would look like this:
Image 2: Adjacency Matrix of Image 1
0 | 1 | 1 |
1 | 0 | 0 |
1 | 0 | 0 |
Notice how the diagonal values are always zero. That's because we are working with simple graphs, so every edge has to have 2 distinct endpoints, meaning that a node cannot be connected to itself. If we just had the matrix then we could draw the network by looking at where the 1s are above the diagonal of zeros. There are 1s in the cells with coordinates (1,2) and (1,3) (coordinates meaning the row number followed by the column number), so there are edges between 1-2 and 1-3. The coordinate (2,3) has a 0 in it, so 2 and 3 are not connected. In the image above the important cells are shaded.
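Assuming the network is stored as a networkx graph (any graph library would do, but this is what I use), a minimal sketch of building this adjacency matrix and reading off the unconnected pairs looks like this:

```python
import networkx as nx

# The network from Image 1: three nodes, with edges 1-2 and 1-3
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3)])

# Dense adjacency matrix, rows/columns ordered by node label
nodes = sorted(G.nodes())
A = nx.to_numpy_array(G, nodelist=nodes)
print(A)
# [[0. 1. 1.]
#  [1. 0. 0.]
#  [1. 0. 0.]]

# Negative samples: pairs above the diagonal whose entry is 0
negative_samples = [
    (nodes[i], nodes[j])
    for i in range(len(nodes))
    for j in range(i + 1, len(nodes))
    if A[i, j] == 0
]
print(negative_samples)  # [(2, 3)]
```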
Something we need to make sure of is that we don't accidentally disconnect the network when we remove edges, so we only remove edges where both endpoints have more than 1 edge connected to them, so as to not cut off any nodes from the rest of the network. If we disconnected the network it would massively change the way the model is trained and make it very inaccurate.
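Here is a rough sketch of how that edge removal might look in code, again using networkx; the function name and the fraction of edges removed are my own choices, and the explicit connectivity check is a belt-and-braces addition on top of the degree check described above:

```python
import random
import networkx as nx

def remove_edges_safely(G, fraction=0.1, seed=42):
    """Randomly remove a fraction of edges to simulate the 'past' network,
    without cutting any node off from the rest of the network."""
    rng = random.Random(seed)
    G_past = G.copy()
    removed = []
    candidates = list(G_past.edges())
    rng.shuffle(candidates)
    target = int(fraction * G_past.number_of_edges())
    for u, v in candidates:
        if len(removed) >= target:
            break
        # Only consider edges where both endpoints have more than one edge
        if G_past.degree(u) > 1 and G_past.degree(v) > 1:
            G_past.remove_edge(u, v)
            if nx.is_connected(G_past):
                removed.append((u, v))   # these become the positive samples
            else:
                G_past.add_edge(u, v)    # undo if we disconnected the network
    return G_past, removed
```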
Definition 1.2: Positive and Negative Samples:
When dealing with a binary classification problem, we separate training data into 2 sets: positive and negative samples. These correspond to the two possible outcomes we are trying to predict. In our example the positive examples are node pairings that end up connecting in the future, while negative examples are ones that stay unconnected.
Ok so now we need to obtain our positive and negative samples from our training data. Remember that in this example we are randomly removing edges, so the positive samples are the ones we have removed. I know this may seem a bit backwards at first, it did to me, but the idea is just that they are the edges that used to not exist, but now do. This is because when we remove the edges we are simulating what the network looked like in the past.
The negative samples are the pairings that aren't connected in our original network (that is, our "future network"). The image below demonstrates this with a simple example:
Image 3: We randomly choose to remove 2-3. Here 2-3 is the positive sample and there are no negative samples because the future network (blue) is fully connected (a complete graph).
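Putting the two sample sets together, a minimal sketch of building the labelled training data might look like this (it assumes the `removed` and `negative_samples` lists from the earlier sketches, and the column names are again my own):

```python
import pandas as pd

# Positive samples: edges removed to simulate the past network -> label 1
# Negative samples: pairs with no edge in the original ("future") network -> label 0
train = pd.DataFrame(
    [(u, v, 1) for u, v in removed] + [(u, v, 0) for u, v in negative_samples],
    columns=["node_u", "node_v", "link"],
)
```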
Now that we understand positive and negative samples, we introduce an important concept: class balance.
Definition 1.3: Class Balance & Imbalance: In a binary classification problem, if the positive and negative samples are of a similar size, then we say the classes are balanced. If one is significantly bigger than the other, then we say that we have class imbalance. Class imbalance can cause a model to be biased towards the larger class. In our example, if we had many more negative samples than positive samples, our model may predict there to be only negative results, that is, it will predict that there will be no new edges.
# Note 4: Class Imbalance can also have more serious negative effects; If for example we train a model to predict whether a person has skin cancer based on a picture of their face (a real thing btw), and we train it on people who overwhelmingly have healthy skin, then it may under-diagnose and thus give people a false sense of security. There are, however, ways of getting around class imbalances, which I intend to cover later. These methods are important because (as we will see in a minute) certain problems have class imbalance built in to them (and so can't be outright avoided by considering possible bias in the training data itself).
In our example, we inevitably get highly imbalanced classes: in the training data I used, the positive samples represented around 8% of the total amount of data. At the moment I am ignoring this for simplicity, but I plan to cover it later in a more complex version of this project.
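A quick way to see the imbalance for yourself, using the training table from the sketch above:

```python
# Proportion of each class in the training data; in my run the positive
# class (link = 1) made up only around 8% of the rows.
print(train["link"].value_counts(normalize=True))
```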
Part 2: The Node2Vec Algorithm and Logistic Regression:
It's important to note that node2vec doesn't do link prediction itself; it just converts the nodes in a network into vectors, which can then be passed into a machine learning model like logistic regression to actually accomplish the task of predicting links. The node2vec algorithm does what its name suggests: it takes in the nodes of a network and gives a vector representation of them as an output. This is useful because it converts the nodes into a type of data (vectors) that can be used as an input to the logistic regression model.
node2vec is an example of unsupervised machine learning, whereas logistic regression is an example of supervised machine learning.
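To make the whole pipeline concrete, here is a rough sketch using the `node2vec` package on PyPI (one implementation of the algorithm; the hyperparameters below are illustrative, not tuned) together with scikit-learn's logistic regression. It assumes the `G_past` graph and `train` table from the earlier sketches:

```python
import numpy as np
from node2vec import Node2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 1) Unsupervised step: learn node embeddings on the "past" network
n2v = Node2Vec(G_past, dimensions=64, walk_length=30, num_walks=200, workers=4)
model = n2v.fit(window=10, min_count=1)  # returns a gensim Word2Vec model

# 2) Turn each node pair into a single feature vector. The element-wise
#    (Hadamard) product of the two node embeddings is one common choice;
#    the node2vec paper discusses several alternatives.
def pair_features(u, v):
    return model.wv[str(u)] * model.wv[str(v)]

X = np.array([pair_features(u, v) for u, v in zip(train["node_u"], train["node_v"])])
y = train["link"].values

# 3) Supervised step: logistic regression, evaluated with the ROC AUC score
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```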
To understand how node2vec works on a deep level you should get a grasp of what neural networks are, and in particular the Skip-Gram model.