
CSSU x Purefacts Programming Challenge

Geometric Point Sampling Problem

Table of Contents

  0. Problem statement
  1. Prepared outputs
  2. Usage
    2.0. Prerequisites
    2.1. Python Script
    2.2. Jupyter Notebook
  3. Analysis
    3.1. General Approach
    3.2. Point Coverage Efficacy Evaluation
    3.3. Computational Performance
    3.4. Conclusion

0. Problem statement

Given a set A of points in a 2D space (input.csv, provided), select N random Representative Points from this set such that they best "cover" the shape of the point set (as opposed to its density).

The output should be the source code and 5 separate output.csv files, one for each of N = 10, 25, 63, 159, 380.

1. Prepared outputs

👁️ See the outputs folder.

🥇 hierarchical_clustering contains results generated using the hierarchical clustering approach. It has the best rep-rep/average-distance ratio and performs more consistently across different N values. This is the preferred set of outputs.

🥈 graph_disconnect contains results generated using the graph-disconnect approach. It has the best rep-rep/max-distance ratio for small N values only.

See the Analysis section for more details.

2. Usage

2.0. Prerequisites

This program requires Python 3.10 or above. Create a virtual environment and install the required packages listed in requirements.txt:

pip install -r requirements.txt

Contents of requirements.txt:

jupyter==1.0.0
matplotlib==3.6.2
matplotlib-inline==0.1.6
networkx==2.8.8
numpy==1.23.4
scikit-learn==1.1.3

2.1. Python Script

👉 point_coverage.py

🟩 Computing for a single N value:

The Python program computes the required representative points for a single N value.

usage: point_coverage.py [-h] [-v] -n NUM_REPS -a ALGORITHM [-r RADIUS] [-b BRANCHING_FACTOR] -o OUTPUT input_file

positional arguments:
  input_file

options:
  -h, --help            show this help message and exit
  -v, --verbose         When set, print analytics.
  -n NUM_REPS, --num-reps NUM_REPS
                        Number of representative points to pick from the input data. Positive integer.
  -a ALGORITHM, --algorithm ALGORITHM
                        One of 'clustering', 'graph'.
  -r RADIUS, --radius RADIUS
                        If using 'graph' algorithm, select the max radius between connected points.
  -b BRANCHING_FACTOR, --branching-factor BRANCHING_FACTOR
                        If using 'graph' algorithm, select the branching factor for each point in the graph.
  -o OUTPUT, --output OUTPUT
                        Path to the output csv file. (e.g. output.csv)

➡️ Example with clustering algorithm:

python3.10 ./point_coverage.py -v -n 10 -a clustering -o out_c.csv input.csv

➡️ Example with graph algorithm:

python3.10 ./point_coverage.py -v -n 10 -a graph -r 6000 -b 180 -o out_g.csv input.csv

🟩 Computing for all specified N values:

We can use the provided run.sh to produce output files for all N values specified in the handout:

bash ./run.sh <algorithm> <input-file>

The script will also time the execution for each N value.

Output files will be generated in the current directory, named output_<algorithm>_<N>.csv.

➡️ Example with clustering algorithm:

./run.sh clustering input.csv
Computing for N = 10
Took 0.39 seconds.

Computing for N = 25
Took 0.41 seconds.

Computing for N = 63
Took 0.39 seconds.

Computing for N = 159
Took 0.39 seconds.

Computing for N = 380
Took 0.39 seconds.

➡️ Example with graph algorithm:

./run.sh graph input.csv
Computing for N = 10
Took 49.95 seconds.

Computing for N = 25
Took 91.06 seconds.

Computing for N = 63
Took 96.84 seconds.

Computing for N = 159
Took 99.31 seconds.

Computing for N = 380
Took 98.65 seconds.

2.2. Jupyter Notebook

👉 experiment.ipynb

The Jupyter Notebook contains step-by-step computation and analysis, as well as scatter plots to visualize results. Useful if you are interested in the details. The same required packages apply.

You can view and edit Jupyter Notebooks on your own computer using Visual Studio Code. For online editing, feel free to check out Google Colaboratory.

3. Analysis

The following three approaches to sampling N representatives from a collection of input points for optimal geometric coverage are evaluated:

  • k-Means Clustering
  • Hierarchical Clustering
  • Graph-Disconnect

Out of the three, only "Hierarchical Clustering" and "Graph-Disconnect" are deemed acceptable, and only "Hierarchical Clustering" performs consistently across many N values.

We will look at the three approaches in detail below.

If you would like to follow along with the analysis, please feel free to check out the Jupyter Notebook, as it provides step-by-step computation. You can even tweak parameters if you download the notebook to your local computer.

3.1. General Approach

3.1.1. Classic Clustering Algorithms

Initially, the given problem appeared to me as a clustering problem. In this type of approach, we group the input points into N clusters, find the geometric centroid of each cluster, and pick the point closest to each centroid as a representative.

I evaluated various clustering algorithms provided by the Scikit-Learn library. Algorithms that do not allow specifying the number of output clusters must be excluded, because they are not suitable for the given problem.

k-Means Clustering


This algorithm is perhaps the most well-known unsupervised clustering algorithm, so naturally, it became my first candidate.

It starts by picking N random points as centroids, assigns each point to its nearest centroid, and recomputes the centroids. This process repeats until the centroids no longer move significantly across iterations.

The KMeans function provided by Scikit-Learn returns the cluster centroids at the end of the algorithm, which are then ready to be converted to input data points.
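A minimal sketch of this route, assuming points is an (M, 2) NumPy array loaded from input.csv (illustrative only, not the exact code in point_coverage.py):

import numpy as np
from sklearn.cluster import KMeans

def kmeans_centroids(points: np.ndarray, n_reps: int) -> np.ndarray:
    """Cluster the input into n_reps groups and return the centroids."""
    km = KMeans(n_clusters=n_reps, n_init=10, random_state=0).fit(points)
    return km.cluster_centers_  # floating-point centroids, one per cluster

These centroids are then mapped back to actual input points, as described in Section 3.1.3.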

Hierarchical/Agglomerative Clustering


This class of clustering algorithms either starts with one cluster containing all input points and divides it repeatedly until N clusters are formed, or starts with each point in its own cluster and merges clusters repeatedly until N final clusters are formed. When the bottom-up merging approach is used, it is called agglomerative clustering.

I found the most success with agglomerative clustering using maximum/complete linkage. Linkage is the criterion by which the algorithm decides which clusters are most "linked" to each other, in order to merge them. Maximum linkage takes the maximum distance between all points in two clusters, and the two clusters that are closest by this maximum distance are merged.

The AgglomerativeClustering function provided by Scikit-Learn does not return a collection of cluster centroids. It instead returns a label for each input data point describing which of the N clusters the point belongs to. As a result, I grouped the input data by cluster label and computed a centroid for each cluster, so that the centroids are ready to be converted to input data points.
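A sketch of this approach under the same assumptions as the k-Means example above (points is an (M, 2) NumPy array; names are illustrative):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def hierarchical_centroids(points: np.ndarray, n_reps: int) -> np.ndarray:
    """Cluster with complete linkage, then compute a mean centroid per cluster."""
    labels = AgglomerativeClustering(
        n_clusters=n_reps, linkage="complete"
    ).fit_predict(points)
    # Group points by their cluster label and average each group.
    return np.array([points[labels == k].mean(axis=0) for k in range(n_reps)])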

Other Clustering Algorithms

I also evaluated Spectral Clustering and BIRCH for the given problem, but they produced underwhelming results, so I will not discuss them in more detail.

3.1.2. Graph Connectivity

Rather than using existing clustering algorithms, another approach to this problem can be described in terms of graph splitting, which I will refer to as "Graph-Disconnect".

Consider the following:

  • Create a graph containing all input data points as vertices.
  • Connect all vertices to each other, with each edge weighted by the Euclidean distance between its endpoints.
  • Repeatedly remove the longest edge from the graph, until N connected components are left.
  • Compute the mean centroid of the vertices in each connected component.

In order to build and operate on such a graph, I used the NetworkX library.
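A naive sketch of the steps above (this is the unoptimized version, whose performance problems are discussed next; names are illustrative):

import itertools
import networkx as nx
import numpy as np

def graph_disconnect(points: np.ndarray, n_reps: int) -> np.ndarray:
    """Split a complete graph into n_reps components; return component centroids."""
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    # Connect every pair of vertices, weighted by Euclidean distance.
    for i, j in itertools.combinations(range(len(points)), 2):
        G.add_edge(i, j, weight=float(np.linalg.norm(points[i] - points[j])))
    # Remove the longest remaining edge until N connected components remain.
    for u, v, _ in sorted(G.edges(data="weight"), key=lambda e: -e[2]):
        if nx.number_connected_components(G) >= n_reps:
            break
        G.remove_edge(u, v)
    # Mean centroid of the vertices in each connected component.
    return np.array([points[list(c)].mean(axis=0) for c in nx.connected_components(G)])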

The initial attempt following the steps described above suffered from major performance issues. The number of edges grows quadratically with the number of input data points, and the process of removing edges until N components were formed took in excess of 10 minutes with no results.

To optimize performance, I first tried the geometric_edges function provided by the NetworkX library. It connects only vertices that are within a certain radius of each other, which reduces the number of edges in the final graph. However, the resulting graph still contained far too many edges, since the input data contains dense regions with many points a close distance from each other.

To further optimize performance, I modified the source code of the geometric_edges function to also apply a branching factor constraint. The branching factor is the maximum number of edges that can be incident to a vertex (i.e., the size of its neighborhood). This keeps the graph reasonably sized even in dense regions of the input data.
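A sketch in the spirit of that modification, using Scikit-Learn's radius queries rather than NetworkX internals (build_limited_edges is a hypothetical name, not the repo's API):

import networkx as nx
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_limited_edges(points: np.ndarray, radius: float, branching: int) -> nx.Graph:
    """Connect each point to at most `branching` of its nearest neighbors
    within `radius` (a vertex's final degree may be somewhat higher, since
    edges are also added from the other endpoint's side)."""
    G = nx.Graph()
    G.add_nodes_from(range(len(points)))
    nn = NearestNeighbors().fit(points)
    dists, idxs = nn.radius_neighbors(points, radius=radius, sort_results=True)
    for i, (d_row, j_row) in enumerate(zip(dists, idxs)):
        # The query point itself appears at distance 0, hence branching + 1.
        for d, j in list(zip(d_row, j_row))[: branching + 1]:
            if i != int(j):
                G.add_edge(i, int(j), weight=float(d))
    return G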

The final optimization allowed the Graph-Disconnect algorithm to complete successfully within a few minutes. For the given input data, I found that a radius constraint of 6000 and a branching factor constraint of 180 produce optimal results. However, these constraints will differ depending on the input data, and there is no trivial way to compute them automatically.

3.1.3. Processing Centroids

The centroids returned from clustering are not part of the input dataset. To form the final output, we must find the input data points closest to these centroids to serve as the points that represent their region ("reps").

Nearest neighbors algorithms are well suited for this task, as they find the k input points closest to a given set of reference points. The Scikit-Learn library provides such a function. For our purpose, we find the 1 nearest neighbor in the input data to each of the computed centroids, and return these as the representatives.
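A minimal sketch of this step, assuming points and centroids are NumPy arrays of shape (M, 2) and (N, 2) respectively (centroids_to_reps is an illustrative name):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def centroids_to_reps(points: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each floating-point centroid to the closest actual input point."""
    nn = NearestNeighbors(n_neighbors=1).fit(points)
    _, indices = nn.kneighbors(centroids)  # index of the closest input point
    return points[indices.ravel()]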

3.2. Point Coverage Efficacy Evaluation

3.2.1. Visual Inspection

In the following plots:

  • dark gray points are input data
  • blue points are floating-point centroids
  • red points are input points chosen as representatives, which are closest to the blue centroids

For the Graph-Disconnect algorithm, the radius is chosen to be 6000, and the branching factor is chosen to be 180.

Scatter Plots for N=10

k-Means: The reps appear to be concentrated in dense areas, rather than across the geometric span of the input points.
Hierarchical: The reps are reasonably distributed across the geometric region of the input points, with a few outlier points being far from a rep.
Graph-Disconnect: The reps are dispersed far from each other, and most outlier points are close to a rep.

Scatter Plots for N=63

k-Means: Continues to favor dense regions, with many points on the exterior far from a rep.
Hierarchical: Even coverage across the geometric span of the input data, with both dense and sparse regions reasonably covered. Few points are far from a rep.
Graph-Disconnect: Outlier points seem to be favored over densely congregated points. Many points in dense regions are not covered by a nearby rep.

3.2.2. Quantitative Measures

I will discuss 4 ways to evaluate the point coverage for each of the three approaches described above; a combined code sketch of all four metrics follows their definitions.

3.2.2.1. Naive Total Distance to Reps

  • Take all points in the input data, compute their distance to every selected rep point, and take the total sum.
  • For optimal point coverage, unselected points should be as close to rep points as possible. The total of such distances is therefore a general measure of how close points are to reps throughout the dataset.
  • Small values in this metric are desirable.
  • However, this metric favors dense input point distributions, since for dense inputs, more points will be closer to reps compared to sparse inputs.

3.2.2.2. Distances to Closest Rep Points

  • Take all points not selected as a rep, find each one's closest rep, and compute the distance. Then take all of these closest-rep distances and find the max, min, and average.
  • For optimal point coverage, unselected points should be as close to their rep as possible. This improves upon the naive total distance measure by focusing on the rep that is supposed to cover each point's region.
  • Small values in this metric are desirable.

3.2.2.3. Closest Distance Between Reps

  • Take all points chosen as reps, find the minimum distance between any 2 reps.
  • For optimal point coverage, the selected reps should be as far from each other as possible, so that each rep can cover as many unselected points as possible.
  • Large values in this metric are desirable.

3.2.2.4. Min-Rep-Rep to Point-Rep Ratios

  • The ratio of the previous two measures ("3.2.2.3. closest distance between reps" over "3.2.2.2. max|average distance from a point to a rep").
  • The greater this ratio, the farther apart the reps are from each other, and the closer the unselected points are to a rep.
  • Large values in this ratio are desirable.
  • The ratio of min distance between reps to max distance from a point to a rep (rr/max) is a summary measure of point coverage in the worst case, and favors outlier input points.
  • The ratio of min distance between reps to average distance from a point to a rep (rr/avg) is a summary measure of point coverage in the average case, and should be considered the optimal measure.
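A combined sketch of all four metrics, assuming points (M, 2) and reps (N, 2) are NumPy arrays (coverage_metrics is an illustrative name, not the repo's API):

import numpy as np
from sklearn.metrics import pairwise_distances

def coverage_metrics(points: np.ndarray, reps: np.ndarray) -> dict:
    """Compute the coverage metrics defined in Section 3.2.2."""
    d = pairwise_distances(points, reps)  # (M, N) point-to-rep distances
    total = d.sum()                       # 3.2.2.1 naive total distance
    closest = d.min(axis=1)               # distance to each point's closest rep
    closest = closest[closest > 0]        # 3.2.2.2: exclude the reps themselves
    rr = pairwise_distances(reps)         # rep-to-rep distances
    rr_min = rr[np.triu_indices_from(rr, k=1)].min()  # 3.2.2.3 closest rep pair
    return {
        "total": total,
        "dist_min": closest.min(),
        "dist_max": closest.max(),
        "dist_avg": closest.mean(),
        "rep_rep_min": rr_min,
        "rr/max": rr_min / closest.max(),   # 3.2.2.4 worst-case ratio
        "rr/avg": rr_min / closest.mean(),  # 3.2.2.4 average-case ratio
    }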

3.2.3. Comparing Efficacy Quantitatively

Compare all metrics on N=10

Metric            k-Means          Hierarchical     Graph-Disconnect
Total Distance    29272141.3949    34850046.2278    37936438.9315
Dist-Rep Min      7.2111           43.8406          59.4811
Dist-Rep Max      3246.6056        1976.1905        2120.6284
Dist-Rep Avg      525.7065         639.4779         895.2549
Rep-Rep Min       997.0245         1411.0978        1880.5130
rr/max            0.3071           0.7140           0.8868
rr/avg            1.8965           2.2066           2.1005

(k-Means has the worst ratios overall; Hierarchical has the best average-case ratio, rr/avg; Graph-Disconnect has the best worst-case ratio, rr/max.)

As we can see, the k-Means approach does not live up to standard both visually and quantitatively, as it is too biased toward point density, and produces suboptimal rep-rep/dist-rep ratios.

We will exclude k-Means from now on and focus on the other two methods.

Compare ratios for all given N values

        Hierarchical         Graph-Disconnect
N       rr/avg     rr/max    rr/avg     rr/max
10      2.2066     0.7140    2.1005     0.8868
25      1.5561     0.6281    0.2790     0.1034
63      1.1474     0.3402    0.1688     0.0426
159     0.9536     0.3582    0.2131     0.0779
380     1.1536     0.4710    0.1439     0.0456

We can see that, for both Hierarchical Clustering and Graph-Disconnect, the efficacy ratios generally decrease with larger N values.

For N=10, Graph-Disconnect has superior worst-case coverage, while Hierarchical Clustering has superior average-case coverage.

For N>10, Graph-Disconnect suffers from a sharp drop in efficacy ratios, while Hierarchical Clustering continues to perform consistently well, and beats Graph-Disconnect on all measures.

It is worth noting that, in my trials, the coverage efficacy of Graph-Disconnect is highly influenced by its performance constraint parameters (radius and branching factor). Increasing these parameters tends to improve coverage ratios, but causes severe performance degradation. For large N, I have not found performance parameters that allow Graph-Disconnect to outperform Hierarchical Clustering in coverage while completing in a practical amount of time.

In summary, Hierarchical Clustering produces better point coverage for most scenarios.

3.3. Computational Performance

The following are recorded execution times for both algorithms on the same computer, measured in seconds.

For the Graph-Disconnect algorithm, the radius is chosen to be 6000, and the branching factor is chosen to be 180. These parameters have a significant impact on the execution time of the algorithm.

N       Hierarchical    Graph-Disconnect
10      0.41            49.89
25      0.44            86.59
63      0.49            92.02
159     0.60            96.90
380     0.82            96.15

As seen here, Hierarchical Clustering is far superior to Graph-Disconnect when it comes to computational performance.

3.4. Conclusion

Across the different approaches to selecting representative points for optimal coverage of a collection of input points, Hierarchical Clustering is overwhelmingly successful, both in coverage metrics and in computational performance.

The only case that favors Graph-Disconnect is very small N values, and only if long execution times are acceptable.
