
ogbl_vessel_baselines

Simple baselines for ogbl-vessel dataset

GCN and GraphSAGE cannot even outperform MLP in the original ogbl-vessel examples, and all three are outperformed by SEAL by a huge margin (~50% vs. ~80% Val/Test ROC-AUC). The simplicity of GCN and GraphSAGE may contribute to their poor performance compared to MLP. However, other factors turn out to be empirically more important.

Add self-loops

There exist some isolated nodes in this graph (nodes involved only in val/test edges), possibly due to the data split. Their neighbor-averaging output is zero, which can cause an unnecessary imbalance in the data distribution, even with residual tricks. Self-loops should be added for these nodes. GCN already adds self-loops in its standard implementation, but its symmetric normalization hurts model performance. Self-loops can improve GraphSAGE, for example, with this script:

python main.py --model sage --num_layers 1 --hidden_channels 64 --predictor DOT --node_feat_process node_normalize --add_self_loops
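
Why self-loops matter here can be seen in a toy neighbor-mean aggregation. The sketch below is plain PyTorch; `mean_aggregate` is our illustrative helper, not the repo's code:

```python
import torch

def mean_aggregate(x, edge_index, num_nodes, add_self_loops=False):
    """Toy GraphSAGE-style neighbor mean (illustrative helper)."""
    if add_self_loops:
        loops = torch.arange(num_nodes)
        edge_index = torch.cat([edge_index, torch.stack([loops, loops])], dim=1)
    src, dst = edge_index
    out = torch.zeros(num_nodes, x.size(1))
    out.index_add_(0, dst, x[src])  # sum neighbor features
    deg = torch.zeros(num_nodes).index_add_(0, dst, torch.ones(dst.size(0)))
    return out / deg.clamp(min=1).unsqueeze(1)  # mean over neighbors

# Node 2 is isolated, as val/test-only nodes can be in ogbl-vessel.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
edge_index = torch.tensor([[0, 1], [1, 0]])  # only nodes 0 and 1 are connected
print(mean_aggregate(x, edge_index, 3)[2])                       # all zeros
print(mean_aggregate(x, edge_index, 3, add_self_loops=True)[2])  # keeps [5., 6.]
```

Without self-loops, every isolated node collapses to the zero vector regardless of its coordinates; with them, it keeps its own features.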

We can obtain performance similar to MLP (~50.28% Test ROC-AUC). However, isolated nodes are not the only reason for GraphSAGE's poor performance: it is still significantly outperformed by MLP (with other preprocessing methods and predictors) on this dataset.

Node embedding

After adding self-loops, replacing (or concatenating) the coordinate features with learnable or pretrained node embedding vectors improves GraphSAGE (with the official preprocessing method and predictor) to ~60% Val/Test ROC-AUC. But the overfitting problem (much higher training scores) still exists. Here is an example script:

python main.py --model sage --num_layers 1 --hidden_channels 64 --predictor DOT --node_feat_process node_normalize --add_self_loops --use_node2vec_embedding --lr 0.0001

In our main.py, the argument use_node_embedding refers to using learnable node embedding vectors, and the argument use_node2vec_embedding refers to using pretrained Node2Vec embedding generated by node2vec.py.
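
What using a learnable embedding table amounts to can be sketched as follows. This is a minimal illustration with `torch.nn.Embedding`; the dimensions and the concatenation with coordinates are our assumptions, not the exact main.py code:

```python
import torch

num_nodes, coord_dim, emb_dim = 5, 3, 8
coords = torch.randn(num_nodes, coord_dim)         # stand-in for data.x
node_emb = torch.nn.Embedding(num_nodes, emb_dim)  # trained jointly with the model

# Concatenate per-node coordinates with the learnable embedding vectors.
node_ids = torch.arange(num_nodes)
x = torch.cat([coords, node_emb(node_ids)], dim=1)
print(x.shape)  # torch.Size([5, 11])
```

For pretrained Node2Vec vectors, the table would instead be initialized from the output of node2vec.py.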

However, with a suitable preprocessing method and predictor, the coordinate features alone can achieve much higher scores.

Preprocessing (for raw coordinate features)

The raw coordinates of nodes may be too large for stable model optimization. The official ogbl-vessel example applies node-wise normalization as the preprocessing method:

data.x[:, 0] = torch.nn.functional.normalize(data.x[:, 0], dim=0)
data.x[:, 1] = torch.nn.functional.normalize(data.x[:, 1], dim=0)
data.x[:, 2] = torch.nn.functional.normalize(data.x[:, 2], dim=0)

or equivalently:

data.x = torch.nn.functional.normalize(data.x, dim=0)

This method maps the raw (possibly large) coordinates into [-1, 1] for all nodes. However, the normalization is conducted over nodes, which means the differences (including distances) between nodes are over-shrunk.
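
A toy check of this shrinkage (illustrative only; the node count and coordinate magnitudes are made up):

```python
import torch

torch.manual_seed(0)
x = torch.rand(100000, 3) * 1000.0            # raw coordinates with magnitude ~1e3
xn = torch.nn.functional.normalize(x, dim=0)  # official node-wise normalization

d_raw = (x[0] - x[1]).norm()
d_norm = (xn[0] - xn[1]).norm()
print(d_raw.item(), d_norm.item())  # normalized distance is orders of magnitude smaller
```

Each channel is divided by its L2 norm over all ~100k nodes, so pairwise distances shrink by roughly that norm.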

Thus, we propose other preprocessing methods to control the magnitude of coordinates: channel-wise normalization:

data.x = torch.nn.functional.normalize(data.x, dim=1)

max_min:

data.x = (data.x - data.x.min(dim=0)[0]) / (data.x.max(0)[0] - data.x.min(0)[0] + 1e-9)

z-score:

data.x = (data.x - data.x.mean(0)) / (data.x.std(0) + 1e-9)

log:

data.x = data.x.abs().clamp(min=1e-9).log() * data.x / data.x.abs().clamp(min=1e-9)
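
The options above can be collected into one dispatch helper (a sketch: the function name and structure are ours, but each branch mirrors the snippets in this section):

```python
import torch

def preprocess(x: torch.Tensor, method: str) -> torch.Tensor:
    if method == "node_normalize":     # official: unit L2 norm per channel, over nodes
        return torch.nn.functional.normalize(x, dim=0)
    if method == "channel_normalize":  # unit L2 norm per node, over channels
        return torch.nn.functional.normalize(x, dim=1)
    if method == "max_min":            # rescale each channel to [0, 1]
        return (x - x.min(0)[0]) / (x.max(0)[0] - x.min(0)[0] + 1e-9)
    if method == "z_score":            # zero mean, unit std per channel
        return (x - x.mean(0)) / (x.std(0) + 1e-9)
    if method == "log":                # log of magnitude, original sign preserved
        return x.abs().clamp(min=1e-9).log() * x / x.abs().clamp(min=1e-9)
    return x                           # 'none'

coords = torch.tensor([[100.0, -2000.0, 3.0],
                       [400.0,  5000.0, 6.0],
                       [700.0,  8000.0, 9.0]])
scaled = preprocess(coords, "max_min")
print(scaled.min().item(), scaled.max().item())  # within [0, 1]
```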

Predictor

The design of the predictor is critical. GCN, GraphSAGE, and MLP take the element-wise multiplication of two node representation vectors followed by a feed-forward network (called DotLinear here), while SEAL concatenates the top-K (after sorting) vectors of the sampled subgraph and then applies a feed-forward network. These designs are very different; in particular, DotLinear loses distance information.

We propose another simple predictor, 'DIFF':

'DIFF' (DiffLinear): the difference vector of two (encoded) coordinate vectors, followed by a feed-forward network.

DiffLinear should better preserve the distance and relative-position information of realistic coordinates.

We also propose other predictors (not fully evaluated): 'CONCAT', 'MEAN', 'COS', 'SUM', 'MAX'

Note that most of these predictors are not symmetric in their inputs (predictor(x_i, x_j) != predictor(x_j, x_i)). Thus, we sort each pair of vectors by their last element to ensure the same result for swapped inputs.
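
A minimal sketch of a DiffLinear predictor with the sort-by-last-element trick (the layer sizes and sigmoid output are our assumptions, not the repo's exact code):

```python
import torch

class DiffLinear(torch.nn.Module):
    """Sketch of the 'DIFF' predictor: difference vector -> feed-forward net."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, 1),
        )

    def forward(self, x_i, x_j):
        # Order each pair by its last element so that
        # predictor(x_i, x_j) == predictor(x_j, x_i).
        swap = (x_i[:, -1] > x_j[:, -1]).unsqueeze(1)
        a = torch.where(swap, x_j, x_i)
        b = torch.where(swap, x_i, x_j)
        return torch.sigmoid(self.net(a - b))

pred = DiffLinear(in_dim=3, hidden_dim=16)
x_i, x_j = torch.randn(4, 3), torch.randn(4, 3)
assert torch.equal(pred(x_i, x_j), pred(x_j, x_i))  # symmetric under swapping
```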

GCN/GraphSAGE and MLP gain significant improvements with this predictor. Moreover, MLP can even achieve a very high score (~94% Val/Test ROC-AUC), outperforming SEAL (~80%).

Results

By running a script like:

python main.py --model mlp --epochs 1000 --eval_steps 20 --num_layers 3 --hidden_channels 64 --dropout 0. --predictor {choose_your_predictor} --node_feat_process {choose_your_preprocessing_method} --lr 0.001

we obtain the following results for MLP (3 layers of 256 hidden dims; 10 runs, including the first row):

| Preprocessing | Model | Predictor | Train ROC-AUC | Val ROC-AUC | Test ROC-AUC |
| --- | --- | --- | --- | --- | --- |
| node_norm. | MLP | DotLinear | 50.35 ± 0.00 | 50.40 ± 0.00 | 50.28 ± 0.00 |
| channel_norm. | MLP | DotLinear | 64.83 ± 2.74 | 64.82 ± 2.72 | 64.83 ± 2.74 |
| log | MLP | DotLinear | 68.62 ± 5.12 | 68.58 ± 5.11 | 68.59 ± 5.09 |
| max-min | MLP | DotLinear | 68.63 ± 3.50 | 68.64 ± 3.52 | 68.63 ± 3.50 |
| z-score | MLP | DotLinear | 75.22 ± 1.26 | 75.20 ± 1.26 | 75.19 ± 1.24 |
| none | MLP | DotLinear | 50.00 ± 0.00 | 50.00 ± 0.00 | 50.00 ± 0.00 |
| node_norm. | MLP | DiffLinear | 82.77 ± 5.66 | 82.76 ± 5.67 | 82.77 ± 5.66 |
| channel_norm. | MLP | DiffLinear | 85.38 ± 0.01 | 85.39 ± 0.02 | 85.40 ± 0.02 |
| log | MLP | DiffLinear | 93.24 ± 0.14 | 93.23 ± 0.14 | 93.22 ± 0.14 |
| max-min | MLP | DiffLinear | 94.03 ± 0.03 | 94.04 ± 0.03 | 94.02 ± 0.03 |
| z-score | MLP | DiffLinear | 94.16 ± 0.02 | 94.15 ± 0.02 | 94.14 ± 0.01 |
| none | MLP | DiffLinear | 94.01 ± 0.03 | 94.01 ± 0.03 | 94.00 ± 0.03 |

By running a script like:

python main.py --model sage --epochs 100 --eval_steps 10 --num_layers 1 --hidden_channels 64 --dropout 0. --predictor {choose_your_predictor} --node_feat_process {choose_your_preprocessing_method} --lr 0.001

we obtain preliminary results for GraphSAGE (1 layer of 64 hidden dims with self-loops added; 5 runs due to limited time):

| Preprocessing | Model | Predictor | Train ROC-AUC | Val ROC-AUC | Test ROC-AUC |
| --- | --- | --- | --- | --- | --- |
| node_norm. | SAGE | DotLinear | 50.35 ± 0.00 | 50.40 ± 0.00 | 50.28 ± 0.00 |
| channel_norm. | SAGE | DotLinear | 50.50 ± 0.00 | 50.53 ± 0.01 | 50.36 ± 0.01 |
| log | SAGE | DotLinear | 50.47 ± 0.03 | 50.50 ± 0.02 | 50.38 ± 0.03 |
| max-min | SAGE | DotLinear | 50.71 ± 0.03 | 50.71 ± 0.03 | 50.55 ± 0.03 |
| z-score | SAGE | DotLinear | 50.83 ± 0.02 | 50.77 ± 0.02 | 50.62 ± 0.01 |
| none | SAGE | DotLinear | 50.00 ± 0.00 | 50.00 ± 0.00 | 50.00 ± 0.00 |
| node_norm. | SAGE | DiffLinear | 79.97 ± 1.95 | 72.00 ± 1.11 | 71.99 ± 1.10 |
| channel_norm. | SAGE | DiffLinear | 91.90 ± 0.31 | 76.11 ± 0.62 | 76.11 ± 0.62 |
| log | SAGE | DiffLinear | 96.74 ± 0.38 | 82.75 ± 0.79 | 82.74 ± 0.80 |
| max-min | SAGE | DiffLinear | 97.81 ± 0.23 | 84.71 ± 0.54 | 84.72 ± 0.53 |
| z-score | SAGE | DiffLinear | 98.06 ± 0.02 | 85.64 ± 0.38 | 85.65 ± 0.38 |
| none | SAGE | DiffLinear | 98.11 ± 0.06 | 87.71 ± 0.07 | 87.71 ± 0.07 |

Empirically, the coordinate features alone are strong enough to obtain 90+% Train/Val/Test ROC-AUC. In most cases, applying graph convolution makes the model overfit, with significant degradation. The neighbor-averaging paradigm fails to capture additional valuable information and overfits (possibly oversmooths), even with a single layer.

Since a high ROC-AUC may coexist with a low Hits@K, we should perhaps also measure model performance with Hits@K.
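
Hits@K follows the OGB link-prediction evaluator's definition: the fraction of positive edges scored above the K-th highest negative score. A sketch (the helper name is ours):

```python
import torch

def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positive edges ranked above the K-th best negative edge."""
    threshold = torch.topk(neg_scores, k).values[-1]
    return (pos_scores > threshold).float().mean().item()

pos = torch.tensor([0.9, 0.8, 0.2, 0.3])  # scores of true edges
neg = torch.tensor([0.7, 0.5, 0.4, 0.1])  # scores of negative samples
print(hits_at_k(pos, neg, 2))  # 0.5: two of four positives beat the 2nd-best negative
```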

Reference

  1. Open Graph Benchmark
  2. ogbl-vessel
