Serial reductions in Halide (e.g.\ summation over an array, histogram, etc.) are implemented using \code{RVar}s or \code{RDom}s. An \code{RVar} is an implicit serial loop, and an \code{RDom} is an ordered list of \code{RVar}s specifying a serial loop nest. Since \code{RVar}s are not trivially parallelizable or vectorizable, a programmer must manually \emph{factor} a reduction into an \emph{intermediate} function that performs reduction over distinct slices of the domain, and a \emph{merge} function that combines those partial results.
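For concreteness, here is a minimal sketch of a serial summation expressed with an \code{RDom}, over a hypothetical one-dimensional buffer \code{vec}; the update definition implies a serial loop over \code{r.x}:
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm: a serial sum over a 1D buffer "vec"
Func total;
total() = 0;
RDom r(0, vec.width());
total() += vec(r.x); // implicit serial loop over r.x
\end{lstlisting}
\end{minipage}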
To further complicate matters, it is hard to infer what binary reduction operator is equivalent to a Halide update definition, and even when one can, many binary operators are not obviously associative (e.g.\ $x + y + 7xy$ is in fact associative). We will defer these issues to Section~\ref{synthesize}, and for now assume that, given a Halide update definition, we can deduce the equivalent associative binary operator and its identity. Note that some transformations (e.g.\ factoring the inner loop of a reduction as in Figure~\ref{fig:rfactor}-3b) require the binary operator to be commutative in addition to associative, as they may change the order of computation.
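As a quick check, expanding both groupings of $f(x, y) = x + y + 7xy$ yields the same symmetric polynomial, which is why the operator is associative despite appearances:
\begin{align*}
f(f(x, y), z) &= x + y + z + 7(xy + xz + yz) + 49xyz,\\
f(x, f(y, z)) &= x + y + z + 7(xy + xz + yz) + 49xyz.
\end{align*}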
To remove the burden of factoring a reduction from the programmer, we introduce a new scheduling primitive called \code{rfactor}. This splits a reduction into a pair of reductions, which we will call the \emph{intermediate} function and the \emph{merge} function. \code{rfactor} takes a list of (\code{RVar}, \code{Var}) pairs. The \emph{intermediate} reduces only over the \code{RVar}s \emph{not} mentioned in the list, and gains the associated \code{Var}s as new pure data-parallel dimensions. The \emph{merge} then reduces over these additional dimensions using the \code{RVar}s that \emph{are} mentioned in the list. The intermediate is thus \emph{higher}-dimensional than the original, but both the intermediate and the merge perform \emph{lower}-dimensional reductions than the original. We will specify the transformation in more detail after looking at several examples.
As a first example, consider computing the histogram of an image in Halide:
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm
Func hist;
Var i;
hist(i) = 0;
RDom r(0, input.width(), 0, input.height());
hist(input(r.x, r.y)) += 1;
// Schedule
hist.compute_root();
\end{lstlisting}
\end{minipage}
The \code{RDom} defines an implicit loop nest over \code{r.x} and \code{r.y}. Halide will not permit either of these loops to be parallelized, as that would introduce a race condition on the \code{+=} operation. Without the \code{rfactor} transformation, a user would need to rewrite the algorithm to manually factor the histogram:
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm
Func intm;
Var i, y;
intm(i, y) = 0;
RDom rx(0, input.width());
intm(input(rx, y), y) += 1;
Func hist;
hist(i) = 0;
RDom ry(0, input.height());
hist(i) += intm(i, ry);
// Schedule
intm.compute_root().update().parallel(y);
hist.compute_root().update().vectorize(i, 4);
\end{lstlisting}
\end{minipage}
Above, the programmer introduced an intermediate function \code{intm} that computes the histogram over each row of the input. This intermediate function is data-parallel over \code{y}, and so it can be parallelized. The original function \code{hist} now merely sums these partial histograms; since \code{hist} is data-parallel over histogram buckets, the programmer has vectorized it.
Using \code{rfactor}, the programmer can produce the same machine code as the manually-transformed version while keeping the simpler algorithm from the original \code{hist} implementation. While the schedule is more complex, recall that only the five lines of the algorithm determine correctness. The programmer can freely transform the schedule to exploit parallelism without risking a correctness bug:
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm
Func hist;
Var i;
hist(i) = 0;
RDom r(0, input.width(), 0, input.height());
hist(input(r.x, r.y)) += 1;
// Schedule
Var y;
hist.compute_root();
Func intm = hist.update().rfactor(r.y, y);
intm.compute_root().update().parallel(y);
hist.update().vectorize(i, 4);
\end{lstlisting}
\end{minipage}
Reduction domains need not be rectangular. Consider the function below, which computes the maximum over a circular domain using \code{RDom::where} to restrict the reduction domain to the points that lie within a circle of radius 10.
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm
Func max_val;
max_val() = 0;
RDom r(0, input.width(), 0, input.height());
r.where(r.x*r.x + r.y*r.y <= 100);
max_val() = max(max_val(), input(r.x, r.y));
// Schedule
max_val.compute_root();
\end{lstlisting}
\end{minipage}
In this case, manually factoring the reduction also requires manipulating the predicate associated with the \code{RDom}. The identity for \code{max} is the minimum value of the type in question, so the manually-factored algorithm becomes:
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm
Func intm;
Var y;
intm(y) = input.type().min();
RDom rx(0, input.width());
rx.where(rx*rx + y*y <= 100);
intm(y) = max(intm(y), input(rx, y));
Func max_val;
max_val() = 0;
RDom ry(0, input.height());
max_val() = max(max_val(), intm(ry));
// Schedule
intm.compute_root().update().parallel(y);
max_val.compute_root();
\end{lstlisting}
\end{minipage}
Using \code{rfactor} in the schedule, the programmer can produce the same machine code from the original, simpler form of the algorithm:
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm
Func max_val;
max_val() = 0;
RDom r(0, input.width(), 0, input.height());
r.where(r.x*r.x + r.y*r.y <= 100);
max_val() = max(max_val(), input(r.x, r.y));
// Schedule
Var y;
max_val.compute_root();
Func intm = max_val.update().rfactor(r.y, y);
intm.compute_root().update().parallel(y);
\end{lstlisting}
\end{minipage}
In both of these examples we factored a two-dimensional reduction into two one-dimensional reductions. In general, one can simultaneously factor out any subset of the dimensions of an N-dimensional reduction. Recall that low-dimensional reductions can be reshaped into higher-dimensional ones using \code{split}, as in Figure~\ref{fig:rfactor}.
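For instance, a one-dimensional summation can be split and then factored. The following hypothetical sketch (reusing the \code{vec} buffer from the earlier sum example) parallelizes the sum over 16-element chunks:
\noindent
\begin{minipage}{\linewidth}
\begin{lstlisting}
// Algorithm
Func total;
total() = 0;
RDom r(0, vec.width());
total() += vec(r.x);
// Schedule
Var y;
RVar rxo, rxi;
total.compute_root();
Func intm = total.update()
    .split(r.x, rxo, rxi, 16) // assumes width divides by 16
    .rfactor(rxo, y);         // factor out the outer serial loop
intm.compute_root().update().parallel(y);
\end{lstlisting}
\end{minipage}
Factoring the outer variable \code{rxo} preserves the order of computation within each chunk and so requires only associativity; factoring the inner variable \code{rxi} instead would interleave the computation and, as noted earlier, additionally requires commutativity.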
With these concrete examples in mind, we now specify the general form of the transformation. Given a list of (\code{RVar}, \code{Var}) pairs, \code{rfactor} does the following:
\begin{enumerate}
\item The associative binary operator equivalent to the reduction is synthesized (see Section~\ref{synthesize}).
\item An \emph{intermediate} function is created (called \code{intm} in the examples above). It is given a pure definition equal to the identity of the associative operator. The function has the same pure \code{Var}s as the original, plus the \code{Var}s specified in the \code{rfactor} call. The \emph{intermediate} is thus a higher-dimensional function than the original.
\item The \emph{intermediate} is given an \emph{update} definition which is a copy of the update definition to which \code{rfactor} was applied. The reduction domain for this definition is a copy of the original reduction domain, minus the \code{RVar}s being factored out. It is thus a lower-dimensional reduction than the original. In all expressions that comprise this update definition, each \code{RVar} being factored out is replaced with its associated \code{Var}, and those \code{Var}s also appear as pure variables on the left-hand-side of the update definition.
\item The original update definition being factored is deleted, and a new update definition is injected in its place which reduces over the \emph{intermediate} using the associative operator. The domain of this reduction is the set of \code{RVar}s which were factored out of the intermediate. The update operates element-wise along any pure dimensions of the merge stage (which may create yet more data parallelism, as in the histogram example). Other stages of the original function (e.g.\ its pure definition) are left unchanged.
\end{enumerate}
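Applied to the histogram example above, these steps reproduce exactly the manually-factored version: step 2 creates \code{intm(i, y) = 0}, step 3 gives it the update \code{intm(input(rx, y), y) += 1} over a domain covering only the image width, and step 4 replaces the original update of \code{hist} with \code{hist(i) += intm(i, ry)}.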
Note that this transformation requires that the associative operator have an identity. This is not true of all associative operators: for example, $2xy$ is associative, and while it has the identity $\tfrac{1}{2}$ over the reals, it has no identity when $x, y \in \mathds{Z}$.