Simple baselines for ogbl-vessel dataset
GCN and GraphSAGE cannot even outperform MLP in the original ogbl-vessel examples, and all three are outperformed by SEAL by a huge margin (~50% vs. ~80% Val/Test ROC-AUC). The simplicity of GCN and GraphSAGE may contribute to their poor performance relative to MLP, but other factors turn out to be empirically more important.
There are some isolated nodes in this graph (nodes involved in val/test edges), possibly due to the data split. Their neighbor-averaging output is zero, which can introduce unnecessary imbalance into the data distribution, even with residual tricks. Self-loops should be added for these nodes. GCN already adds self-loops in its standard implementation, but its symmetric normalization hurts model performance. Adding self-loops improves GraphSAGE, for example with this script:
```bash
python main.py --model sage --num_layers 1 --hidden_channels 64 --predictor DOT --node_feat_process node_normalize --add_self_loops
```
This gives performance (~50.28% Test ROC-AUC) similar to MLP's. However, isolated nodes are not the only reason for GraphSAGE's poor performance; it is still significantly outperformed by MLP (with other preprocessing methods and predictors) on this dataset.
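For reference, the self-loop fix is conceptually just the following minimal PyG sketch, assuming a standard `data` object; in the repo it is wired through the `--add_self_loops` flag:

```python
from torch_geometric.utils import add_self_loops

# Append an (i, i) edge for every node so that isolated nodes aggregate
# their own features instead of an all-zero neighborhood average.
data.edge_index, _ = add_self_loops(data.edge_index, num_nodes=data.num_nodes)
```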
After adding self-loops, replacing (or concatenating) the coordinate features with learnable or pretrained node embedding vectors improves GraphSAGE (with the official preprocessing method and predictor) to ~60% Val/Test ROC-AUC. However, the overfitting problem (much higher training scores) remains. Here is an example script:
```bash
python main.py --model sage --num_layers 1 --hidden_channels 64 --predictor DOT --node_feat_process node_normalize --add_self_loops --use_node2vec_embedding --lr 0.0001
```
In our `main.py`, the argument `use_node_embedding` refers to using learnable node embedding vectors, and the argument `use_node2vec_embedding` refers to using pretrained Node2Vec embeddings generated by `node2vec.py`.
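Conceptually, the embedding path looks like the following sketch (names, dimensions, and the checkpoint path are illustrative, not the exact code in `main.py`):

```python
import torch

num_nodes, emb_dim = data.num_nodes, 64
# Learnable lookup table, trained jointly with the encoder (--use_node_embedding).
embedding = torch.nn.Embedding(num_nodes, emb_dim)

# Either replace the coordinates or concatenate the embeddings to them.
x = torch.cat([data.x, embedding.weight], dim=1)  # [num_nodes, 3 + emb_dim]

# For --use_node2vec_embedding, the weights would instead be initialized from
# the pretrained vectors saved by node2vec.py (hypothetical file name):
# embedding.weight.data.copy_(torch.load('embedding.pt'))
```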
However, with a suitable preprocessing method and predictor, the coordinate features alone can achieve much higher scores.
The raw coordinates of nodes may be too large for stable model optimization. The official ogbl-vessel example offers node-wise normalization as the preprocessing method:
```python
data.x[:, 0] = torch.nn.functional.normalize(data.x[:, 0], dim=0)
data.x[:, 1] = torch.nn.functional.normalize(data.x[:, 1], dim=0)
data.x[:, 2] = torch.nn.functional.normalize(data.x[:, 2], dim=0)
```
or equivalently:
```python
data.x = torch.nn.functional.normalize(data.x, dim=0)
```
This maps the raw (possibly large) coordinates into [-1, 1] for all nodes. However, the normalization is conducted across all nodes, so the differences (including distances) between nodes are over-shrunk.
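A quick synthetic illustration of the shrinkage (random stand-in coordinates, not the real dataset):

```python
import torch

x = torch.randn(3_000_000, 3) * 100.0  # stand-in for large raw coordinates
x_norm = torch.nn.functional.normalize(x, dim=0)  # divide each channel by its norm over ALL nodes

print((x[0] - x[1]).norm())            # distance on the raw scale, order of 1e2
print((x_norm[0] - x_norm[1]).norm())  # order of 1e-3: every channel was divided by its norm over 3M nodes
```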
Thus, we propose other preprocessing methods to control the magnitude of the coordinates. Channel-wise normalization:
```python
data.x = torch.nn.functional.normalize(data.x, dim=1)
```
max_min:
```python
data.x = (data.x - data.x.min(dim=0)[0]) / (data.x.max(dim=0)[0] - data.x.min(dim=0)[0] + 1e-9)
```
z-score:
```python
data.x = (data.x - data.x.mean(0)) / (data.x.std(0) + 1e-9)
```
log (a signed log transform, sign(x) * log|x|):
```python
data.x = data.x.abs().clamp(min=1e-9).log() * data.x / data.x.abs().clamp(min=1e-9)
```
The design of the predictor is critical. GCN, GraphSAGE, and MLP feed the element-wise product of two node representation vectors into a feed-forward network (we call this DotLinear), while SEAL concatenates the top-K (after sorting) node vectors of the sampled subgraph and feeds them into a feed-forward network. These are very different designs; in particular, DotLinear loses distance information.
We propose another simple predictor, 'DIFF':
'DIFF' (DiffLinear): the difference vector of two (encoded) coordinate vectors, fed into a feed-forward network.
DiffLinear should be better at preserving the distance and relative-displacement information of realistic coordinates.
We also propose other predictors (not fully evaluated): 'CONCAT', 'MEAN', 'COS', 'SUM', 'MAX'.
Note that some of these predictors are not symmetric in their inputs (predictor(x_i, x_j) != predictor(x_j, x_i)). Thus, we sort the two vectors by their last element so that swapped inputs yield the same result, as in the sketch below.
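To make the contrast concrete, here is a minimal sketch of the two predictor families (illustrative only; the repo's actual modules may differ in depth and activations):

```python
import torch
import torch.nn as nn

class DotLinear(nn.Module):
    """Element-wise product of two node vectors, then a feed-forward network."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        return self.mlp(h_i * h_j)  # symmetric by construction, but drops h_i - h_j

class DiffLinear(nn.Module):
    """Difference of two node vectors, then a feed-forward network ('DIFF')."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        # Sort the pair by the last element so swapped inputs give the same score.
        swap = (h_i[:, -1] < h_j[:, -1]).unsqueeze(1)
        h_i, h_j = torch.where(swap, h_j, h_i), torch.where(swap, h_i, h_j)
        return self.mlp(h_i - h_j)  # keeps the relative displacement between nodes
```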
GCN/GraphSAGE and MLP gain significant improvements with this predictor. Moreover, MLP can even achieve a very high score (~94% Val/Test ROC-AUC), clearly outperforming SEAL (~80%).
By running a script like:
```bash
python main.py --model mlp --epochs 1000 --eval_steps 20 --num_layers 3 --hidden_channels 64 --dropout 0. --predictor {choose_your_predictor} --node_feat_process {choose_your_preprocessing_method} --lr 0.001
```
we obtain the following results for MLP (3 layers of 256 hidden dims; 10 runs, including the first row):
Preprocessing | Model | Predictor | Train ROC-AUC | Val ROC-AUC | Test ROC-AUC |
---|---|---|---|---|---|
node_norm. | MLP | DotLinear | 50.35 ± 0.00 | 50.40 ± 0.00 | 50.28 ± 0.00 |
channel_norm. | MLP | DotLinear | 64.83 ± 2.74 | 64.82 ± 2.72 | 64.83 ± 2.74 |
log | MLP | DotLinear | 68.62 ± 5.12 | 68.58 ± 5.11 | 68.59 ± 5.09 |
max-min | MLP | DotLinear | 68.63 ± 3.50 | 68.64 ± 3.52 | 68.63 ± 3.50 |
z-score | MLP | DotLinear | 75.22 ± 1.26 | 75.20 ± 1.26 | 75.19 ± 1.24 |
none | MLP | DotLinear | 50.00 ± 0.00 | 50.00 ± 0.00 | 50.00 ± 0.00 |
node_norm. | MLP | DiffLinear | 82.77 ± 5.66 | 82.76 ± 5.67 | 82.77 ± 5.66 |
channel_norm. | MLP | DiffLinear | 85.38 ± 0.01 | 85.39 ± 0.02 | 85.40 ± 0.02 |
log | MLP | DiffLinear | 93.24 ± 0.14 | 93.23 ± 0.14 | 93.22 ± 0.14 |
max-min | MLP | DiffLinear | 94.03 ± 0.03 | 94.04 ± 0.03 | 94.02 ± 0.03 |
z-score | MLP | DiffLinear | 94.16 ± 0.02 | 94.15 ± 0.02 | 94.14 ± 0.01 |
none | MLP | DiffLinear | 94.01 ± 0.03 | 94.01 ± 0.03 | 94.00 ± 0.03 |
By running a script like:
```bash
python main.py --model sage --epochs 100 --eval_steps 10 --num_layers 1 --hidden_channels 64 --dropout 0. --predictor {choose_your_predictor} --node_feat_process {choose_your_preprocessing_method} --lr 0.001
```
we obtain preliminary results for GraphSAGE (1 layer of 64 hidden dims, with self-loops added; 5 runs due to limited time):
Preprocessing | Model | Predictor | Train ROC-AUC | Val ROC-AUC | Test ROC-AUC |
---|---|---|---|---|---|
node_norm. | SAGE | DotLinear | 50.35 ± 0.00 | 50.40 ± 0.00 | 50.28 ± 0.00 |
channel_norm. | SAGE | DotLinear | 50.50 ± 0.00 | 50.53 ± 0.01 | 50.36 ± 0.01 |
log | SAGE | DotLinear | 50.47 ± 0.03 | 50.50 ± 0.02 | 50.38 ± 0.03 |
max-min | SAGE | DotLinear | 50.71 ± 0.03 | 50.71 ± 0.03 | 50.55 ± 0.03 |
z-score | SAGE | DotLinear | 50.83 ± 0.02 | 50.77 ± 0.02 | 50.62 ± 0.01 |
none | SAGE | DotLinear | 50.00 ± 0.00 | 50.00 ± 0.00 | 50.00 ± 0.00 |
node_norm. | SAGE | DiffLinear | 79.97 ± 1.95 | 72.00 ± 1.11 | 71.99 ± 1.10 |
channel_norm. | SAGE | DiffLinear | 91.90 ± 0.31 | 76.11 ± 0.62 | 76.11 ± 0.62 |
log | SAGE | DiffLinear | 96.74 ± 0.38 | 82.75 ± 0.79 | 82.74 ± 0.80 |
max-min | SAGE | DiffLinear | 97.81 ± 0.23 | 84.71 ± 0.54 | 84.72 ± 0.53 |
z-score | SAGE | DiffLinear | 98.06 ± 0.02 | 85.64 ± 0.38 | 85.65 ± 0.38 |
none | SAGE | DiffLinear | 98.11 ± 0.06 | 87.71 ± 0.07 | 87.71 ± 0.07 |
Empirically, the coordinate features alone are strong enough to reach 90+% Train/Val/Test ROC-AUC. In most cases, applying graph convolution makes the model overfit, with significant degradation. The neighbor-averaging paradigm fails to capture additional valuable information and overfits (possibly oversmooths) even with a single layer.
Since a high ROC-AUC can coexist with a low Hits@K, perhaps we should also use Hits@K to measure model performance.
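A minimal sketch of Hits@K in the style of other OGB link-prediction datasets (assuming 1-D tensors of positive and negative edge scores):

```python
import torch

def hits_at_k(pos_pred: torch.Tensor, neg_pred: torch.Tensor, k: int) -> float:
    """Fraction of positive edges scored strictly above the k-th highest negative score."""
    if neg_pred.numel() < k:
        return 1.0
    kth_neg_score = neg_pred.topk(k).values[-1]
    return (pos_pred > kth_neg_score).float().mean().item()
```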