===========================
Distributed Join Algorithms
===========================
This documentation describes how to build, configure, and run the distributed radix hash
and sort-merge join algorithms.
===============
1. Dependencies
===============
The distributed join algorithms can be compiled for x86 CPUs. The processor needs
to support AVX2 vector instructions.
MPI is used as the communication library. Version 3 or above is required: earlier
versions of MPI do not offer the required one-sided remote memory access (RMA) operations.
Specify the MPI location in the "MPI_FOLDER" variable in the Makefile. The code has been
tested with openMPI version "1.10.2".
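For illustration, the MPI-3 one-sided operations this requirement refers to (window
allocation, PUT, and FLUSH) look roughly as follows. This is a minimal stand-alone
sketch, not code taken from this project:

    #include <mpi.h>
    #include <cstdint>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, numProcesses;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);

        // Allocate a window, i.e. memory that remote processes can write to directly.
        uint64_t *windowBase = nullptr;
        MPI_Win window;
        MPI_Win_allocate(1024 * sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &windowBase, &window);

        // Passive-target epoch: issue a PUT to the next process, overlap local work
        // with the transfer, and complete the operation with a FLUSH.
        MPI_Win_lock_all(0, window);
        uint64_t value = 42 + rank;
        int target = (rank + 1) % numProcesses;
        MPI_Put(&value, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, window);
        // ... local computation could proceed here while the PUT is in flight ...
        MPI_Win_flush(target, window);
        MPI_Win_unlock_all(window);

        MPI_Win_free(&window);
        MPI_Finalize();
        return 0;
    }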
On Cray XC30 and XC40 machines, we recommend using fast one-sided MPI [1] (foMPI).
The code has been tested with foMPI version "0.2.1". To compile the join with foMPI,
use the compile-time option "-D USE_FOMPI". Specify the location of the foMPI headers
and implementation files in the "FOMPI_LOCATION" variable in the Makefile. Rename/use
the file called "MakefileCray".
Performance counters are accessed through the PAPI [2] performance counter interface.
The code has been tested with PAPI version "5.4.3". Specify the location of the
headers and implementation files in the "PAPI_FOLDER" variable in the Makefile.
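As a rough illustration of how PAPI is typically used (this is a generic sketch, not
code from this project), hardware counters could be read as follows:

    #include <papi.h>
    #include <cstdio>

    int main() {
        // Initialize the library and create an event set with two preset counters.
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            return 1;
        }
        int eventSet = PAPI_NULL;
        PAPI_create_eventset(&eventSet);
        PAPI_add_event(eventSet, PAPI_TOT_CYC);   // total CPU cycles
        PAPI_add_event(eventSet, PAPI_TOT_INS);   // total instructions

        long long values[2] = {0, 0};
        PAPI_start(eventSet);
        // ... section to be measured, e.g. one phase of the join ...
        PAPI_stop(eventSet, values);

        printf("cycles=%lld instructions=%lld\n", values[0], values[1]);
        return 0;
    }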
[1] https://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI/
[2] http://icl.cs.utk.edu/papi/
============
2. Compiling
============
Specify the location of the dependencies (see Section 1). Be sure that the following
variables are set correctly:
* MPI_FOLDER: Location of the MPI library
* PAPI_FOLDER: Location of the PAPI library
* FOMPI_LOCATION: Location of the foMPI library (only for Cray machines)
Use the "COMPILER_FLAGS" variable to pass any additional flags. Make sure to add the
option "-D USE_FOMPI" if you to compile the code against foMPI.
In the project folder call the following Make commands:
* make all: Builds the program
* make clean: Removes the build and output folders
After a successful build, the project should contain the following folder structure:
* ./src: The source code
* ./build: The object files
* ./release: The assembled program
===================
3. Running the join
===================
The binary can be executed as a single process by invoking it directly. For a multi-
process execution, use the following command:
* mpiexec --np N ./release/join-binary
Replace "join-binary" with the name of the binary (by default casm-join/cahj-join).
Replace "N" by the number of processes.
=====================
4. Join configuration
=====================
The primary configuration file is located under "./src/hpcjoin/core/Configuration.h".
An illustrative sketch of these parameters is given at the end of this section.
4.1. Sort-Merge Join:
---------------------
* SORT_RUN_ELEMENT_COUNT: The size (in tuples) of a sorted run which is transmitted
over the network during the reshuffling phase.
* MAX_MERGE_FAN_IN: The maximum fan-in used for merging sorted runs.
4.2. Hash Join:
---------------
* CACHELINES_PER_MEMORY_BUFFER: Size of a network buffer in number of cachelines.
* MEMORY_BUFFERS_PER_PARTITION: Number of network buffers per partition.
* ENABLE_TWO_LEVEL_PARTITIONING: Configures the algorithm to use one or two partitioning passes.
* NETWORK_PARTITIONING_FANOUT: Fan-out of the first (network) partitioning pass.
* LOCAL_PARTITIONING_FANOUT: Fan-out of the second (local) partitioning pass.
4.3. Common elements:
---------------------
* RESULT_AGGREGATION_NODE: The process which will return the performance data after
the join has completed.
* CACHELINE_SIZE_BYTES: The size of a cacheline in bytes.
* ALLOCATION_FACTOR: A scaling factor for preallocated memory.
* PAYLOAD_BITS: The number of non-zero bits of the payload/key data used for data
compression.
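To illustrate how these parameters fit together, a sketch of what the corresponding
entries in Configuration.h could look like is given below. The names are the ones
listed above; the values are purely illustrative assumptions and not necessarily the
defaults shipped with the code:

    // Illustrative values only; consult ./src/hpcjoin/core/Configuration.h for the actual defaults.

    // Sort-merge join
    #define SORT_RUN_ELEMENT_COUNT         (1 << 20)   // tuples per sorted run sent over the network
    #define MAX_MERGE_FAN_IN               64          // maximum number of runs merged at once

    // Hash join
    #define CACHELINES_PER_MEMORY_BUFFER   16          // network buffer size in cachelines
    #define MEMORY_BUFFERS_PER_PARTITION   2           // network buffers per partition
    #define ENABLE_TWO_LEVEL_PARTITIONING  true        // one or two partitioning passes
    #define NETWORK_PARTITIONING_FANOUT    10          // fan-out of the first (network) pass
    #define LOCAL_PARTITIONING_FANOUT      9           // fan-out of the second (local) pass

    // Common
    #define RESULT_AGGREGATION_NODE        0           // process that reports the performance data
    #define CACHELINE_SIZE_BYTES           64          // cacheline size in bytes
    #define ALLOCATION_FACTOR              1.1         // scaling factor for preallocated memory
    #define PAYLOAD_BITS                   27          // non-zero bits of key/payload used for compression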
==========
5. Output
==========
The join algorithms write their performance data to a newly created subfolder. You
therefore need write permission in the folder from which you call the binary.
Each process will create a {id}.perf and a {id}.info file.
5.1. Info File:
---------------
The information file contains the following data points:
NUMNODES: number of processes
NODEID: process id
HOST: name of the machine
GISZ: global size of the inner relation
GOSZ: global size of the outer relation
LISZ: local size of the inner relation
LOSZ: local size of the outer relation
5.2. Hash Join:
---------------
The performance numbers captured by the hash join are as follows:
CTOTAL: total number of CPU cycles of the join
JTOTAL: total execution time of the join
JHIST: time required to compute the histogram
HILOCAL: time required for creating a local histogram of the inner relation
HILOCELEM: number of histogram elements of the inner relation
HILOCRATE: histogram processing rate for the inner relation
HOLOCAL: time required for creating a local histogram of the outer relation
HOLOCELEM: number of histogram elements of the outer relation
HOLOCRATE: histogram processing rate for the outer relation
HIGLOBAL: time required for computing a global histogram of the inner relation
HOGLOBAL: time required for computing a global histogram of the outer relation
HASSIGN: time required to compute a partition-node assignment
HIOFFCOMP: time required to compute the partitioning offsets for the inner relation
HOOFFCOMP: time required to compute the partitioning offsets for the outer relation
SWINALLOC: time required to allocate the MPI windows
JMPI: time required to partition the data
MIMEMALLOC: buffer memory allocation time for inner relation partitioning
MIMAINPART: time required to partition the data of inner relation
MIFLUSHPART: time required to flush data of inner relation
MOMEMALLOC: buffer memory allocation time for outer relation partitioning
MOMAINPART: time required to partition the data of outer relation
MOFLUSHPART: time required to flush data of outer relation
MWINPUT: time needed for setting up PUT requests
MWINPUTCNT: number of PUT requests
MWINWAIT: time spent in FLUSH call
MWINWAITCNT: number of FLUSH requests
SNETCOMPL: waiting time for incoming data
SLOCPREP: time required to initialize data structures after the partitioning phase
JPROC: time required to process the data after all the data has been received
LPPART: total time of the local partitioning phase
LPTASKTIME: time required to partition the data
LPTASKCOUNT: number of partitioning tasks
LPHISTCOMP: time required to compute fine-grained histograms
LPHISTELEM: number of histogram elements
LPOFFSET: time required to compute sub-partition offsets
LPMEMALLOC: time spent in memory pool malloc call
LPMEMSIZE: requested memory size during local partitioning phase
LPELEMENTS: number of partitioned elements
BPTASKTIME: total time to execute build/probe tasks
BPTASKCOUNT: number of build/probe tasks
BPMEMALLOC: time spent in the malloc call
BPMEMSIZE: requested memory size during build/probe phase
BPBUILD: time required to build hash tables
BPBUILDELEM: number of hash table elements
BPPROBE: time required to probe hash tables
BPPROBEELEM: number of probe elements
5.3. Sort-Merge Join:
---------------------
CTOTAL: total number of CPU cycles of the join
JTOTAL: total execution time of the join
JPART: time required to partition the data
ILOCHIST: time required for creating a local histogram of the inner relation
ILOCHISTE: number of histogram elements of the inner relation
OLOCHIST: time required for creating a local histogram of the outer relation
OLOCHISTE: number of histogram elements of the outer relation
IPART: time required to partition the data of the inner relation
IPARTE: number of partitioned elements of the inner relation
OPART: time required to partition the data of the outer relation
OPARTE: number of partitioned elements of the outer relation
JWINALLOC: time required to allocate the MPI windows
WINPREP: time required to prepare window object
JSORT: time required to sort the data
RUNPREP: time required to compute run sizes and prepare sort tasks
SORTTSUM: time required to complete all sort tasks
SORTTCNT: number of sort tasks
SORTECNT: total number of sorted elements
PUTSUM: time spent in PUT calls
PUTCNT: number of PUT calls
FLUSH: time spent in FLUSH calls
WAIT: waiting time for incoming data
JMERG: time required to merge runs
MERGLTIME: time required to process the levels of the merge tree
MERGLCNT: merge tree levels
MERGTTIME: time required to execute merge tasks
MERGTCNT: number of merge tasks
JMATCH: time required to find matching tuples
MATCHTTIME: time required to scan through relations
===========
6. Various:
===========
6.1. Compression:
-----------------
The input data tuples are composed of a 64-bit key and a 64-bit record-identifier.
Hence, they are 128 bits wide. Since the value range of 64-bit integers is not fully
used, several bits of the key and the payload are always zero. Data is compressed by
removing these zero bits. During the partitioning pass, data is compressed into
64-bit tuples.
Partitioning is done by looking at a subset of N bits of the key. Since each node needs
to have at least one partition assigned to it, we can assume that #nodes <= 2^N. Within
a partition, these N key bits are identical for all tuples and can therefore be dropped.
As a result, we only need to store PAYLOAD_BITS bits for the key and (PAYLOAD_BITS+N)
bits for the record-identifier.
For the compression to work, the following two conditions need to hold:
(1) N+2*PAYLOAD_BITS <= 64
and
(2) The key/payload data is contained in the range between 0 and 2^(PAYLOAD_BITS+N).
Example: For 1024 (=2^10) nodes, i.e. N=10, the maximum value of PAYLOAD_BITS is 27
(since 10 + 2*27 = 64). The join can then support input keys and record-identifiers
between 0 and 2^(27+10) = 2^37, i.e. roughly 137.4 billion.
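As an illustration of the scheme described above (this is a sketch, not the project's
actual code, and it assumes the N partitioning bits are the low-order bits of the key),
the compression could look as follows:

    #include <cstdint>

    // Illustrative constants matching the example above: N = 10, PAYLOAD_BITS = 27.
    constexpr uint32_t RADIX_BITS   = 10;   // N: bits of the key used for partitioning
    constexpr uint32_t PAYLOAD_BITS = 27;   // remaining significant bits of the key

    // Pack a (key, record-identifier) pair into one 64-bit tuple. The low N key bits
    // are implied by the partition the tuple is stored in and are therefore dropped:
    // PAYLOAD_BITS bits of key + (PAYLOAD_BITS + N) bits of record-identifier = 64 bits.
    uint64_t compress(uint64_t key, uint64_t recordId) {
        uint64_t keyBits = key >> RADIX_BITS;
        return (keyBits << (PAYLOAD_BITS + RADIX_BITS)) | recordId;
    }

    // Recover the original values; the partition id supplies the dropped N key bits.
    void decompress(uint64_t packed, uint64_t partitionId, uint64_t &key, uint64_t &recordId) {
        recordId = packed & ((1ULL << (PAYLOAD_BITS + RADIX_BITS)) - 1);
        key      = ((packed >> (PAYLOAD_BITS + RADIX_BITS)) << RADIX_BITS) | partitionId;
    }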
6.2. Performance:
-----------------
The join algorithms depend on a well-tuned communication library. Take the time to tune
your MPI installation and select the MPI implementation with care. Make sure it supports
at least:
* RDMA / OS bypassing / zero-copy messages
* Asynchronous processing
Not all implementations interleave the processing and the communication, which results in
poor performance and a significant slow-down of the join algorithms.
Try to pin threads/processes to CPU cores. Avoid placing two processes on the same core or
hyperthread, as both algorithms are sensitive to interference in the last-level cache.
6.3. Sorting/Merging Implementation:
------------------------------------
The sort and merge implementation of the sort-merge join is based on previous work and has
been written by Balkesen et al. [3] as part of the "parallel joins" project at ETH Zurich.
[3] http://www.systems.ethz.ch/projects/paralleljoins
6.4. Tested MPI Versions:
-------------------------
* openMPI 1.10.1 and 1.10.2
* foMPI 0.2.1
Other versions should be compatible as well. Use the "-D JOIN_DEBUG_PRINT" flag to enable
debug output in case you experience problems.
========================
7. Copyright & Contacts:
========================
Parts of the partitioning routine of the radix hash join and the sort implementation of the
sort-merge join are modified versions of the code published by Cagri Balkesen. You can find
the full source code and documentation at https://www.systems.ethz.ch/projects/paralleljoins.
Copyright (c) 2016, Claude Barthels
Copyright (c) 2016, Systems Group, ETH Zurich
Contacts: Claude Barthels - [email protected]
www.systems.ethz.ch