Replies: 1 comment
-
Have you tried inspecting the sharding annotations on the grads and putting sharding constraints around them so that the all-reduce isn't issued before the accumulation is finished? It also seems you'd need shard_map, which would be better suited for such a non-trivial setup.
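A rough sketch of the shard_map layout suggested here is below; the mesh axis name, placeholder loss, and microbatch count are assumptions for illustration, not taken from the original code:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

NUM_MICROBATCHES = 4  # assumed; must divide the per-device batch evenly

# Hypothetical 1D data-parallel mesh; the axis name "data" is an assumption.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

def loss_fn(params, x, y):
    # Placeholder quadratic loss standing in for the real model.
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

def accumulate(params, x, y):
    # Inside shard_map each device sees only its local shard, so the
    # per-microbatch loss/grad computation stays local: no all-reduce here.
    grads = jax.tree_util.tree_map(jnp.zeros_like, params)
    for mx, my in zip(jnp.split(x, NUM_MICROBATCHES),
                      jnp.split(y, NUM_MICROBATCHES)):
        g = jax.grad(loss_fn)(params, mx, my)
        grads = jax.tree_util.tree_map(jnp.add, grads, g)
    grads = jax.tree_util.tree_map(lambda g: g / NUM_MICROBATCHES, grads)
    # Single cross-device collective at the end of the accumulation loop.
    return jax.tree_util.tree_map(lambda g: jax.lax.pmean(g, "data"), grads)

train_step = jax.jit(
    shard_map(accumulate, mesh=mesh,
              in_specs=(P(), P("data"), P("data")),
              out_specs=P()))
```

With everything inside shard_map operating on per-device shards, the only cross-device collective in the compiled step should be the final pmean over the "data" axis, which can be checked in the profiler.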
-
Hi,
I am in the process of moving over some data-parallel training code written with xmap to the new jit API, with the intent to extend it to allow model-parallel training. The training step function, which was previously xmapped, takes a batch of data, splits the data into microbatches, and then performs gradient accumulation on the microbatches, with a single pmean operation at the end of the accumulation loop to synchronize gradients across devices.

I've been trying to replicate this behavior with jit, unsuccessfully (minimal code reproduction attached below). Every time jnp.mean(loss) is called in the accumulation loop, an all-reduce across all devices is performed, which I have been able to confirm with the JAX profiler. I've tried sharding the batch and then re-sharding every microbatch within the accumulation loop, but the compiled code seems to want to perform this all-reduce no matter the sharding annotations. Is there something I am missing with respect to the sharding annotations, or is this a bug?

Thank you!
Code to reproduce:
What jax/jaxlib version are you using?
Which accelerator(s) are you using?
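For concreteness, a minimal sketch of the kind of jitted accumulation loop described above might look like the following; the mesh axis name, placeholder loss, and microbatch count are illustrative assumptions, and this is not the attached reproduction:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

NUM_MICROBATCHES = 4  # assumed

# Hypothetical 1D data-parallel mesh; the axis name "data" is an assumption.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
batch_sharding = NamedSharding(mesh, P("data"))

def loss_fn(params, x, y):
    # Placeholder quadratic loss standing in for the real model.
    pred = x @ params
    # jnp.mean reduces over the batch axis, which is sharded across devices,
    # so the compiler lowers this to a cross-device all-reduce per microbatch.
    return jnp.mean((pred - y) ** 2)

@jax.jit
def train_step(params, batch_x, batch_y):
    grads = jax.tree_util.tree_map(jnp.zeros_like, params)
    for mx, my in zip(jnp.split(batch_x, NUM_MICROBATCHES),
                      jnp.split(batch_y, NUM_MICROBATCHES)):
        # Re-applying the sharding constraint keeps each microbatch sharded,
        # but does not remove the all-reduce triggered by the mean above.
        mx = jax.lax.with_sharding_constraint(mx, batch_sharding)
        my = jax.lax.with_sharding_constraint(my, batch_sharding)
        g = jax.grad(loss_fn)(params, mx, my)
        grads = jax.tree_util.tree_map(jnp.add, grads, g)
    return jax.tree_util.tree_map(lambda g: g / NUM_MICROBATCHES, grads)
```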