Simple baselines for ogbl-vessel dataset
GCN and GraphSAGE cannot even outperform MLP in the original ogbl-vessel examples, and all three are outperformed by SEAL by a huge margin (~50% vs. ~80% Val/Test ROC-AUC). The simplicity of GCN and GraphSAGE may contribute to their poor performance relative to MLP, but other factors turn out to be empirically more important.
There are some isolated nodes in this graph (nodes involved in val/test edges), possibly due to the data split. Their neighbor-averaging output is zero, which can introduce unnecessary imbalance into the data distribution, even with residual tricks. Self-loops should be added for these nodes. GCN already adds self-loops in its standard implementation, but its symmetric normalization hurts model performance. Adding self-loops improves GraphSAGE, for example with this script:
```bash
python main.py --model sage --num_layers 1 --hidden_channels 64 --predictor DOT --node_feat_process node_normalize --add_self_loops
```
This gives performance (~50.28% Test ROC-AUC) similar to MLP's. However, isolated nodes are not the only reason for GraphSAGE's poor performance; it is still significantly outperformed by MLP (with other preprocessing methods and predictors) on this dataset.
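For reference, the self-loop fix is conceptually just the following minimal PyG sketch, assuming a standard `data` object; in the repo it is wired through the `--add_self_loops` flag:

```python
from torch_geometric.utils import add_self_loops

# Append an (i, i) edge for every node so that isolated nodes aggregate
# their own features instead of an all-zero neighborhood average.
data.edge_index, _ = add_self_loops(data.edge_index, num_nodes=data.num_nodes)
```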
After adding self-loops, replacing (or concatenating) the coordinate features with learnable or pretrained node embedding vectors improves GraphSAGE (with the official preprocessing method and predictor) to ~60% Val/Test ROC-AUC. However, the overfitting problem (much higher training scores) remains. Here is an example script:
```bash
python main.py --model sage --num_layers 1 --hidden_channels 64 --predictor DOT --node_feat_process node_normalize --add_self_loops --use_node2vec_embedding --lr 0.0001
```
In our `main.py`, the argument `use_node_embedding` refers to using learnable node embedding vectors, and the argument `use_node2vec_embedding` refers to using pretrained Node2Vec embeddings generated by `node2vec.py`.
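Conceptually, the embedding path looks like the following sketch (names, dimensions, and the checkpoint path are illustrative, not the exact code in `main.py`):

```python
import torch

num_nodes, emb_dim = data.num_nodes, 64
# Learnable lookup table, trained jointly with the encoder (--use_node_embedding).
embedding = torch.nn.Embedding(num_nodes, emb_dim)

# Either replace the coordinates or concatenate the embeddings to them.
x = torch.cat([data.x, embedding.weight], dim=1)  # [num_nodes, 3 + emb_dim]

# For --use_node2vec_embedding, the weights would instead be initialized from
# the pretrained vectors saved by node2vec.py (hypothetical file name):
# embedding.weight.data.copy_(torch.load('embedding.pt'))
```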
However, with a suitable preprocessing method and predictor, the coordinate features alone can achieve much higher scores.
The raw coordinates of nodes may be too large for stable model optimization. The official ogbl-vessel example offers node-wise normalization as the preprocessing method:
```python
data.x[:, 0] = torch.nn.functional.normalize(data.x[:, 0], dim=0)
data.x[:, 1] = torch.nn.functional.normalize(data.x[:, 1], dim=0)
data.x[:, 2] = torch.nn.functional.normalize(data.x[:, 2], dim=0)
```
or equivalently:
```python
data.x = torch.nn.functional.normalize(data.x, dim=0)
```
This maps the raw (possibly large) coordinates into [-1, 1] for all nodes. However, the normalization is conducted across all nodes, so the differences (including distances) between nodes are over-shrunk.
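A quick synthetic illustration of the shrinkage (random stand-in coordinates, not the real dataset):

```python
import torch

x = torch.randn(3_000_000, 3) * 100.0  # stand-in for large raw coordinates
x_norm = torch.nn.functional.normalize(x, dim=0)  # divide each channel by its norm over ALL nodes

print((x[0] - x[1]).norm())            # distance on the raw scale, order of 1e2
print((x_norm[0] - x_norm[1]).norm())  # order of 1e-3: every channel was divided by its norm over 3M nodes
```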
Thus, we propose other preprocessing methods to control the magnitude of the coordinates. Channel-wise normalization:
```python
data.x = torch.nn.functional.normalize(data.x, dim=1)
```
max_min:
```python
data.x = (data.x - data.x.min(dim=0)[0]) / (data.x.max(dim=0)[0] - data.x.min(dim=0)[0] + 1e-9)
```
z-score:
```python
data.x = (data.x - data.x.mean(0)) / (data.x.std(0) + 1e-9)
```
log (a signed log transform, sign(x) * log|x|):
```python
data.x = data.x.abs().clamp(min=1e-9).log() * data.x / data.x.abs().clamp(min=1e-9)
```
The design of the predictor is critical. GCN, GraphSAGE, and MLP feed the element-wise product of two node representation vectors into a feed-forward network (we call this DotLinear), while SEAL concatenates the top-K (after sorting) node vectors of the sampled subgraph and feeds them into a feed-forward network. These are very different designs; in particular, DotLinear loses distance information.
We propose another simple predictor, 'DIFF':
'DIFF' (DiffLinear): the difference vector of two (encoded) coordinate vectors, fed into a feed-forward network.
DiffLinear should be better at preserving the distance and relative-displacement information of realistic coordinates.
We also propose other predictors (not fully evaluated): 'CONCAT', 'MEAN', 'COS', 'SUM', 'MAX'.
Note that some of these predictors are not symmetric in their inputs (predictor(x_i, x_j) != predictor(x_j, x_i)). Thus, we sort the two vectors by their last element so that swapped inputs yield the same result, as in the sketch below.
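To make the contrast concrete, here is a minimal sketch of the two predictor families (illustrative only; the repo's actual modules may differ in depth and activations):

```python
import torch
import torch.nn as nn

class DotLinear(nn.Module):
    """Element-wise product of two node vectors, then a feed-forward network."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        return self.mlp(h_i * h_j)  # symmetric by construction, but drops h_i - h_j

class DiffLinear(nn.Module):
    """Difference of two node vectors, then a feed-forward network ('DIFF')."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        # Sort the pair by the last element so swapped inputs give the same score.
        swap = (h_i[:, -1] < h_j[:, -1]).unsqueeze(1)
        h_i, h_j = torch.where(swap, h_j, h_i), torch.where(swap, h_i, h_j)
        return self.mlp(h_i - h_j)  # keeps the relative displacement between nodes
```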
GCN/GraphSAGE and MLP gain significant improvements with this predictor. Moreover, MLP can even achieve a very high score (~94% Val/Test ROC-AUC), clearly outperforming SEAL (~80%).
By running a script like:
```bash
python main.py --model mlp --epochs 1000 --eval_steps 20 --num_layers 3 --hidden_channels 64 --dropout 0. --predictor {choose_your_predictor} --node_feat_process {choose_your_preprocessing_method} --lr 0.001
```
we obtain the following results for MLP (3 layers of 256 hidden dims; 10 runs, including the first row):
Preprocessing | Model | Predictor | Train ROC-AUC | Val ROC-AUC | Test ROC-AUC |
---|---|---|---|---|---|
node_norm. | MLP | DotLinear | 50.35 ± 0.00 | 50.40 ± 0.00 | 50.28 ± 0.00 |
channel_norm. | MLP | DotLinear | 64.83 ± 2.74 | 64.82 ± 2.72 | 64.83 ± 2.74 |
log | MLP | DotLinear | 68.62 ± 5.12 | 68.58 ± 5.11 | 68.59 ± 5.09 |
max-min | MLP | DotLinear | 68.63 ± 3.50 | 68.64 ± 3.52 | 68.63 ± 3.50 |
z-score | MLP | DotLinear | 75.22 ± 1.26 | 75.20 ± 1.26 | 75.19 ± 1.24 |
none | MLP | DotLinear | 50.00 ± 0.00 | 50.00 ± 0.00 | 50.00 ± 0.00 |
node_norm. | MLP | DiffLinear | 82.77 ± 5.66 | 82.76 ± 5.67 | 82.77 ± 5.66 |
channel_norm. | MLP | DiffLinear | 85.38 ± 0.01 | 85.39 ± 0.02 | 85.40 ± 0.02 |
log | MLP | DiffLinear | 93.24 ± 0.14 | 93.23 ± 0.14 | 93.22 ± 0.14 |
max-min | MLP | DiffLinear | 94.03 ± 0.03 | 94.04 ± 0.03 | 94.02 ± 0.03 |
z-score | MLP | DiffLinear | 94.16 ± 0.02 | 94.15 ± 0.02 | 94.14 ± 0.01 |
none | MLP | DiffLinear | 94.01 ± 0.03 | 94.01 ± 0.03 | 94.00 ± 0.03 |
By running a script like:
```bash
python main.py --model sage --epochs 100 --eval_steps 10 --num_layers 1 --hidden_channels 64 --dropout 0. --predictor {choose_your_predictor} --node_feat_process {choose_your_preprocessing_method} --lr 0.001
```
we obtain preliminary results for GraphSAGE (1 layer of 64 hidden dims, with self-loops added; 5 runs due to limited time):
Preprocessing | Model | Predictor | Train ROC-AUC | Val ROC-AUC | Test ROC-AUC |
---|---|---|---|---|---|
node_norm. | SAGE | DotLinear | 50.35 ± 0.00 | 50.40 ± 0.00 | 50.28 ± 0.00 |
channel_norm. | SAGE | DotLinear | 50.50 ± 0.00 | 50.53 ± 0.01 | 50.36 ± 0.01 |
log | SAGE | DotLinear | 50.47 ± 0.03 | 50.50 ± 0.02 | 50.38 ± 0.03 |
max-min | SAGE | DotLinear | 50.71 ± 0.03 | 50.71 ± 0.03 | 50.55 ± 0.03 |
z-score | SAGE | DotLinear | 50.83 ± 0.02 | 50.77 ± 0.02 | 50.62 ± 0.01 |
none | SAGE | DotLinear | 50.00 ± 0.00 | 50.00 ± 0.00 | 50.00 ± 0.00 |
node_norm. | SAGE | DiffLinear | 79.97 ± 1.95 | 72.00 ± 1.11 | 71.99 ± 1.10 |
channel_norm. | SAGE | DiffLinear | 91.90 ± 0.31 | 76.11 ± 0.62 | 76.11 ± 0.62 |
log | SAGE | DiffLinear | 96.74 ± 0.38 | 82.75 ± 0.79 | 82.74 ± 0.80 |
max-min | SAGE | DiffLinear | 97.81 ± 0.23 | 84.71 ± 0.54 | 84.72 ± 0.53 |
z-score | SAGE | DiffLinear | 98.06 ± 0.02 | 85.64 ± 0.38 | 85.65 ± 0.38 |
none | SAGE | DiffLinear | 98.11 ± 0.06 | 87.71 ± 0.07 | 87.71 ± 0.07 |
Empirically, the coordinate features alone are strong enough to reach 90+% Train/Val/Test ROC-AUC. In most cases, applying graph convolution makes the model overfit, with significant degradation. The neighbor-averaging paradigm fails to capture additional valuable information and overfits (possibly oversmooths) even with a single layer.
Since a high ROC-AUC can coexist with a low Hits@K, perhaps we should also use Hits@K to measure model performance.
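A minimal sketch of Hits@K in the style of other OGB link-prediction datasets (assuming 1-D tensors of positive and negative edge scores):

```python
import torch

def hits_at_k(pos_pred: torch.Tensor, neg_pred: torch.Tensor, k: int) -> float:
    """Fraction of positive edges scored strictly above the k-th highest negative score."""
    if neg_pred.numel() < k:
        return 1.0
    kth_neg_score = neg_pred.topk(k).values[-1]
    return (pos_pred > kth_neg_score).float().mean().item()
```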