This project takes two raw ingredients:
- A long snippet of audio (.WAV format)
- A shorter snippet of audio (also .WAV format)
and layers repeated echoes of (2), scaled and shifted, to try and approximate (1). Along the way, it writes some of these successive approximations out to their own .WAV files.
Let's call (1) the target audio, and (2) the basis audio.
Here's a fantastic example of how lots and lots of repeated versions of the same basis sound (one duck honk) can smear together to approximate something quite different (human-like screams):
This project is that, except you also get to guide the smearing of the repeated basis audio towards some target audio of your choosing.
Right now, if you provide a WAV file with multiple channels, e.g., in stereo, for either the target or the basis, it's compressed down to a single mono channel signal.
From this directory, run this command:
```
sbt/sbt 'runMain com.gawalt.wav_project.WavProject
  path/to/long_target_audio.wav
  path/to/short_basis_audio.wav
  path/to/output_filename_base'
```
This will perform an approximation of the target audio by layering scaled and shifted echoes of the basis audio. The approximation is computed iteratively, one basis echo per iteration.
Unfortunately, right now that takes like two minutes per iteration. I might be able to speed this up?
The approximation being calculated is a multivariate linear regression, fit by coordinate descent.
Suppose the target WAV file is `N` samples long, and the basis WAV is `K` samples. The approximation we want to produce should also be `N` samples long. We can represent the target as an array of doubles of length `N`, and the basis as an array of length `K`.

We start with an `N`-length vector, full of zeros, as our approximation. The first `N - K + 1` elements of that approximation vector are all "starting places" where we could drop a scaled version of the `K`-length basis without the basis overrunning the end of the approximation array.

Let's define the residual array as the difference of target and approximation: `resid[i] = target[i] - approx[i]` for `i = 0, ..., N - 1`. Let's also define the sequence of starting places, `s = 0, 1, ..., N - K`.
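In Scala terms, that setup looks roughly like this (toy values and names of my own choosing, not necessarily the repo's):

```scala
// Toy stand-ins for the decoded mono samples; the real arrays come from the .wav files.
val target: Array[Double] = Array(0.1, 0.6, -0.2, 0.4, 0.0, -0.3, 0.2, 0.1)  // N samples
val basis: Array[Double]  = Array(0.5, -0.5, 0.25)                           // K samples

val n = target.length
val k = basis.length

// The approximation starts as silence, so the residual starts out equal to the target.
val approx = Array.fill(n)(0.0)
val resid  = target.clone()

// Every valid starting place for dropping a scaled copy of the basis.
val startingPlaces = 0 to (n - k)
```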
Given those definitions, let's run this loop until we're bored:
1. For each starting place `s`, calculate the optimal amount to scale the basis vector so that it resembles the residual snippet running from `resid[s]` to `resid[s + K - 1]`. We want to pick a scaling factor `a` such that the sum of `(a*basis[j] - resid[s + j])^2` for `j = 0, ..., K - 1` is minimized. It's easy to calculate a closed-form answer to this -- it's essentially vector projection of that segment of the residual onto the basis. (There's a Scala sketch of the whole loop just after this list.)
2. Find the starting position for which that squared sum was smallest. Call that position `s'` and its associated optimal scaling factor `a'`.
3. Update the approximation by adding in the optimally scaled basis function shifted to the starting place: `approx[s' + j] += a'*basis[j]` for `j = 0, ..., K - 1`.
4. Recalculate the residual from element `resid[s']` to `resid[s' + K - 1]` and repeat.
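Here's a single-threaded Scala sketch of one pass of that loop, reusing the toy arrays from the snippet above; the real project spreads Step (1) across Akka actors, described below:

```scala
// One iteration of the fit: Step (1) projects every residual window onto the basis,
// Step (2) keeps the best window, Steps (3) and (4) update approx and resid in place.
def oneIteration(resid: Array[Double], approx: Array[Double], basis: Array[Double]): (Int, Double) = {
  val n = resid.length
  val k = basis.length
  val basisEnergy = basis.map(b => b * b).sum  // sum of basis[j]^2, shared by every projection

  var bestStart = 0
  var bestScale = 0.0
  var bestErr = Double.PositiveInfinity
  var s = 0
  while (s <= n - k) {
    var dot = 0.0
    var sumSq = 0.0
    var j = 0
    while (j < k) {
      val r = resid(s + j)
      dot += r * basis(j)
      sumSq += r * r
      j += 1
    }
    // With the optimal scale a = dot / basisEnergy, the leftover squared error in
    // this window is sumSq - dot^2 / basisEnergy (the vector-projection closed form).
    val err = sumSq - dot * dot / basisEnergy
    if (err < bestErr) {
      bestErr = err
      bestStart = s
      bestScale = dot / basisEnergy
    }
    s += 1
  }

  // Drop the scaled, shifted basis into the approximation and patch the residual.
  var j = 0
  while (j < k) {
    approx(bestStart + j) += bestScale * basis(j)
    resid(bestStart + j) -= bestScale * basis(j)
    j += 1
  }
  (bestStart, bestScale)
}
```

Calling `oneIteration(resid, approx, basis)` over and over is the whole algorithm: each call layers one more scaled, shifted echo of the basis into the approximation.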
It's fun to solve this using coordinate descent, so that each update step of the algorithm is adding one whole, continuous, recognizable version of the basis sound. It makes the evolution of the approximation a wackier listening experience.
Also, after the first loop, you only need to recalculate projections in Step (1) for starting positions that involve some overlap with the interval `s', ..., s' + K - 1`. That's about `2K` positions instead of `N - K + 1`.
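Concretely, the affected positions are the ones whose windows overlap `[s', s' + K - 1]`, which a hypothetical helper could enumerate like this:

```scala
// Only windows that overlap [s', s' + K - 1] see a changed residual, so only their
// projections need recomputing; that range has at most 2K - 1 starting places.
def affectedStartingPlaces(sPrime: Int, n: Int, k: Int): Range =
  math.max(0, sPrime - k + 1) to math.min(n - k, sPrime + k - 1)
```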
I've used the Akka library to parallelize the computation of Step (1). I did this because I think Akka is fun. There's a single `Conductor` actor in the system doing Steps (2), (3), and (4). It has a small army of `BasisFitter` actors, one for each of the `N - K + 1` starting positions. The `Conductor` sends residual snippets to each `BasisFitter`, waits to hear back about how well each one's vector projection went, then updates the residual accordingly and kicks it off again.
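As a rough idea of the shape of that setup -- this is my guess at the message protocol, not the repo's actual classes -- a `BasisFitter` in classic Akka might look like:

```scala
import akka.actor.Actor

// Hypothetical message shapes; the repo's actual protocol may differ.
case class FitRequest(snippet: Array[Double])
case class FitResult(start: Int, scale: Double, leftoverError: Double)

// One BasisFitter per starting place: it knows its start index and the basis, and
// reports how well a scaled copy of the basis matches whatever snippet it's handed.
class BasisFitter(start: Int, basis: Array[Double]) extends Actor {
  private val basisEnergy = basis.map(b => b * b).sum

  def receive: Receive = {
    case FitRequest(snippet) =>
      var dot = 0.0
      var sumSq = 0.0
      var j = 0
      while (j < basis.length) {
        dot += snippet(j) * basis(j)
        sumSq += snippet(j) * snippet(j)
        j += 1
      }
      // Report the optimal scale and the squared error it leaves behind in this window.
      sender() ! FitResult(start, dot / basisEnergy, sumSq - dot * dot / basisEnergy)
  }
}
```

The `Conductor` would collect one `FitResult` per fitter, pick the winner, and apply Steps (2)-(4) itself.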
It turns out to unfortunately take a while to run even a single iteration. A basis array of length `K` means doing `O(K^2)` floating point multiplications each iteration of the loop.

It would certainly go faster if I rented a machine with like 32 CPU-equivalents instead of my 4-core laptop. Those cost like two bucks an hour, plus the headache of working with a remote machine. The current WAVProjects I've put together have definitely maxed out the available CPUs, so the speedup is almost certainly there.
Instead of testing all starting positions in Step (1), I could just pick a random 10 or 100 or 1000. If I coupled that with lazy evaluation of the vector projection, it might save some time. And the approximation would probably approach the target audio at roughly the same speed.
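Something as simple as this hypothetical helper would do for the sampling part:

```scala
import scala.util.Random

// Instead of polling all N - K + 1 starting positions each iteration, pick a random
// handful of them. Duplicates are possible but harmless.
def sampleStartingPlaces(n: Int, k: Int, sampleSize: Int, rng: Random = new Random()): Seq[Int] =
  Seq.fill(sampleSize)(rng.nextInt(n - k + 1))
```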
In general, there's almost certainly a way I can cut down on having to make new copies of `resid` snippets all the time. It'd break the nice little "shared nothing" world my actors currently enjoy.
Right now, there are `N - K + 1` messages relayed back to the `Conductor`, which processes them all one-by-one. I could instead set up a tournament-style tree of other actors to weed out suboptimal projection results. That could help parallelize Step (2) as well as Step (1). (Though from my naive look at system resource utilization, WAVProject seems like it's making use of 3 out of my 4 laptop cores, so it's probably spending all its time in the already-parallelized Step (1) anyway.)
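The reduction itself is simple; here's a plain-function sketch of the tournament idea (no actors, and a `FitResult` shape of my own invention), just to show the comparisons a tree of actors would spread out:

```scala
// Hypothetical result shape, matching the earlier actor sketch.
case class FitResult(start: Int, scale: Double, leftoverError: Double)

// Keep the better of two fits: the one leaving less squared error behind.
def better(a: FitResult, b: FitResult): FitResult =
  if (a.leftoverError <= b.leftoverError) a else b

// Pair results up, keep each pair's winner, and repeat until one remains.
// Assumes a non-empty input; each round's comparisons could run in parallel.
def tournament(results: Seq[FitResult]): FitResult =
  if (results.size <= 1) results.head
  else tournament(results.grouped(2).map(_.reduce(better)).toSeq)
```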
I could also try reimplementing this all in NumPy instead. I just really enjoy writing in Scala, though.
https://www.youtube.com/watch?v=t-7mQhSZRgM
Moved to a multi-basis system, instead of all workers fitting the same basis. Tried out a new example soundwave pair at first:
Implemented "laziness" in the routine: moved the dot product calculations back to the moment of a BasisFitRequest, and enabled the Conductor to request only a subsample of basis fitters for each step.
To accommodate this, had to push the residual into a global variable. Otherwise, basis fitters would need to keep a local copy of their snippet of the residual, which means a quadratic blowup in the number of samples being stored. (Maybe this could've been fixed by using `List`s instead of `Array`s for the residual and its snippets, but that seems computationally slow, and this is working fine, if inelegantly.)
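The rough shape of that global-residual arrangement, as I'd sketch it (not the repo's exact code):

```scala
// One shared residual array that the Conductor mutates and the basis fitters read
// by index, instead of each fitter holding its own copied snippet. Note that
// @volatile only guards the reference swap; element-level visibility still relies
// on the Conductor sequencing its updates and fit requests.
object SharedResidual {
  @volatile var resid: Array[Double] = Array.empty

  // Dot product of the basis against the residual window starting at `start`.
  def dotWithBasis(start: Int, basis: Array[Double]): Double = {
    var dot = 0.0
    var j = 0
    while (j < basis.length) {
      dot += resid(start + j) * basis(j)
      j += 1
    }
    dot
  }
}
```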
Now that residual updates are lazy, we can also use sampling instead of a full census. That's where the real speed-up comes in. There doesn't appear to be a huge impact to quality, at least to my ear. Two thousand updates per minute is a lot.
Moved dot-product calculation in `BasisFitter` to a `while` loop.
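For what it's worth, that change looks something like this (generic example, not the repo's exact code):

```scala
// Collection-style dot product: clear, but allocates tuples and an intermediate array.
def dotZip(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// While-loop dot product: same arithmetic, no intermediate allocation.
def dotWhile(a: Array[Double], b: Array[Double]): Double = {
  val len = math.min(a.length, b.length)
  var acc = 0.0
  var i = 0
  while (i < len) {
    acc += a(i) * b(i)
    i += 1
  }
  acc
}
```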
Started passing residual snippets as arrays instead of lists. (This fixed a goof-up where I was checking `.length` on a `List`, which is expensive.)