
Refactor MPI serialization #689

Merged
merged 20 commits into develop from refactor/serialization on Aug 26, 2020

Conversation

kks32
Contributor

@kks32 kks32 commented Aug 14, 2020

MPM Particle serialization

Summary

Add functionality to handle particle serialization and deserialization to transfer particles across MPI tasks.

Motivation

The existing design uses a Plain-Old-Data (POD) structure: the Particle class writes all the relevant information to an HDF5Particle struct, which is then serialized and sent over MPI using MPI_Type_Create_Struct. This requires registering every particle type with MPI and becomes harder to maintain when more than one particle type is involved. The serialization/deserialization functions offer a unified interface to transfer particles.

Design Detail

The Particle class will have a serialize and a deserialize function, both using a vector<uint8_t> as the buffer. In addition, we need to compute the pack size to initialize the buffer with the correct size; this size is cached in a private variable.

//! Serialize particle data
template <unsigned Tdim>
std::vector<uint8_t> mpm::Particle<Tdim>::serialize() {
  // Compute pack size (cached after the first call)
  if (pack_size_ == 0) pack_size_ = compute_pack_size();
  // Initialize data buffer
  std::vector<uint8_t> data;
  data.resize(pack_size_);
  uint8_t* data_ptr = &data[0];
  int position = 0;

#ifdef USE_MPI
  // Type
  int type = ParticleType.at(this->type());
  MPI_Pack(&type, 1, MPI_INT, data_ptr, data.size(), &position, MPI_COMM_WORLD);

  // Material id
  unsigned nmaterials = material_id_.size();
  MPI_Pack(&nmaterials, 1, MPI_UNSIGNED, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
  MPI_Pack(&material_id_[0], 1, MPI_UNSIGNED, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);

  // ID
  MPI_Pack(&id_, 1, MPI_UNSIGNED_LONG_LONG, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
  // Mass
  MPI_Pack(&mass_, 1, MPI_DOUBLE, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
  // Volume
  MPI_Pack(&volume_, 1, MPI_DOUBLE, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
#endif
  return data;
}
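
The buffer size itself can be computed with MPI_Pack_size, following the same order as the packing above. A minimal sketch of such a compute_pack_size is shown below (assuming it only covers the members shown in serialize; the actual implementation packs more members):

//! Compute pack size of the serialized particle
//! (sketch mirroring the pack order of serialize above)
template <unsigned Tdim>
int mpm::Particle<Tdim>::compute_pack_size() const {
  int total_size = 0;
#ifdef USE_MPI
  int partial_size = 0;
  // Particle type
  MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &partial_size);
  total_size += partial_size;
  // Number of materials + material id
  MPI_Pack_size(1, MPI_UNSIGNED, MPI_COMM_WORLD, &partial_size);
  total_size += 2 * partial_size;
  // Particle id
  MPI_Pack_size(1, MPI_UNSIGNED_LONG_LONG, MPI_COMM_WORLD, &partial_size);
  total_size += partial_size;
  // Mass and volume
  MPI_Pack_size(1, MPI_DOUBLE, MPI_COMM_WORLD, &partial_size);
  total_size += 2 * partial_size;
#endif
  return total_size;
}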

The deserialization function will read from the buffer.

//! Deserialize particle data
template <unsigned Tdim>
void mpm::Particle<Tdim>::deserialize(
    const std::vector<uint8_t>& data,
    std::vector<std::shared_ptr<mpm::Material<Tdim>>>& materials) {
  uint8_t* data_ptr = const_cast<uint8_t*>(&data[0]);
  int position = 0;

#ifdef USE_MPI
  // Type
  int type = ParticleType.at(this->type());
  MPI_Unpack(data_ptr, data.size(), &position, &type, 1, MPI_INT,
             MPI_COMM_WORLD);
  // Material id
  unsigned nmaterials = 0;
  MPI_Unpack(data_ptr, data.size(), &position, &nmaterials, 1, MPI_UNSIGNED,
             MPI_COMM_WORLD);

  MPI_Unpack(data_ptr, data.size(), &position, &material_id_[0], 1,
             MPI_UNSIGNED, MPI_COMM_WORLD);

  // ID
  MPI_Unpack(data_ptr, data.size(), &position, &id_, 1, MPI_UNSIGNED_LONG_LONG,
             MPI_COMM_WORLD);
  // Mass
  MPI_Unpack(data_ptr, data.size(), &position, &mass_, 1, MPI_DOUBLE,
             MPI_COMM_WORLD);
  // Volume
  MPI_Unpack(data_ptr, data.size(), &position, &volume_, 1, MPI_DOUBLE,
             MPI_COMM_WORLD);

#endif
}

Important consideration: we expect all future derived particle types to place the particle type in the first few bytes of the buffer, followed by the material information, so that the mesh class can read them to initialize the particle before the subsequent deserialization.

In addition, the particle type is added to the Particle class.

  //! Type of particle
  std::string type() const override { return (Tdim == 2) ? "P2D" : "P3D"; }

This is used to identify the type of a particle and create it when particles are transferred across MPI tasks. Moreover, we have added ParticleType and ParticleTypeName as global maps between an index value (int) and a string such as "P2D". The reason is that serializing a string would also require transmitting its length, which complicates the buffer layout. Since we will only ever have a few particle types, it is easier to set up a map for a quick lookup.

particle.cc
namespace mpm {
// ParticleType
std::map<std::string, int> ParticleType = {{"P2D", 0}, {"P3D", 1}};
std::map<int, std::string> ParticleTypeName = {{0, "P2D"}, {1, "P3D"}};
}  // namespace mpm

The MPI transfer_halo_particles function will be altered to send one particle at a time rather than a bulk of particles. This allows different particle types in a cell to be sent sequentially through a single code path, instead of iterating over each particle type separately. A sketch of the receive path follows below.
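
On the receiving rank, the leading type integer can be peeked to create a particle of the correct derived type before handing it the full buffer. A rough sketch under assumed context (buffer_size, sender, tag, id, coordinates and materials come from the surrounding transfer code, and create_particle is a stand-in for whatever factory call is used; only the MPI calls and the maps above are from this proposal):

#ifdef USE_MPI
  // Receive the serialized particle buffer
  std::vector<uint8_t> buffer(buffer_size);
  MPI_Status status;
  MPI_Recv(buffer.data(), buffer.size(), MPI_UNSIGNED_CHAR, sender, tag,
           MPI_COMM_WORLD, &status);

  // Peek at the leading particle type (the first packed int)
  int ptype;
  int position = 0;
  MPI_Unpack(buffer.data(), buffer.size(), &position, &ptype, 1, MPI_INT,
             MPI_COMM_WORLD);
  const std::string particle_type = mpm::ParticleTypeName.at(ptype);

  // Create a particle of the matching derived type (hypothetical factory
  // helper), then let it deserialize the full buffer, type included
  auto particle = create_particle(particle_type, id, coordinates);
  particle->deserialize(buffer, materials);
#endif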

These changes remove the need to register MPI particle types and also take us one step closer to removing the limit of 20 on the state_vars.

Drawbacks

No potential drawback has been identified.

Rationale and Alternatives

Why is this design the best in the space of possible designs?

The relative speed of serialization vs MPI_Type_Create_Struct is unknown; we may have to run a performance benchmark to see the difference. Using struct data types means we have to register each particle type and are restricted to a fixed number of state variables.
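
For reference, registering a POD with MPI requires calls of roughly the following shape for every particle type (an abbreviated sketch with only three fields; the HDF5Particle member names are assumptions, and mpm::register_mpi_particle_type covers the full member list):

#include <cstddef>  // offsetof
#include <mpi.h>

// Abbreviated sketch of POD registration; not the project's actual helper
MPI_Datatype register_pod_type_sketch() {
  const int nfields = 3;
  int blocklengths[nfields] = {1, 1, 1};
  MPI_Aint displacements[nfields];
  displacements[0] = offsetof(mpm::HDF5Particle, id);      // assumed member
  displacements[1] = offsetof(mpm::HDF5Particle, mass);    // assumed member
  displacements[2] = offsetof(mpm::HDF5Particle, volume);  // assumed member
  MPI_Datatype types[nfields] = {MPI_UNSIGNED_LONG_LONG, MPI_DOUBLE,
                                 MPI_DOUBLE};
  MPI_Datatype particle_type;
  MPI_Type_create_struct(nfields, blocklengths, displacements, types,
                         &particle_type);
  MPI_Type_commit(&particle_type);
  return particle_type;  // must later be released with MPI_Type_free
}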

What other designs have been considered and what is the rationale for not choosing them?

Different serialization libraries were considered: [Boost Serialization](https://www.boost.org/doc/libs/1_56_0/libs/serialization/doc/tutorial.html), [Cereal](http://uscilab.github.io/cereal/), and [bitsery](https://github.com/fraillt/bitsery). The fastest, bitsery, does not support serializing Eigen types; we could implement a custom serializer, but it would take some time. MPI Pack/Unpack seems to be one of the fastest options.

[Benchmark charts: serialized size and serialization time for the libraries considered]

What is the impact of not doing this?

If not done, we will be left with a clunkier interface for handling MPI transfers of different particle types.

Prior Art

Prior art, both good and bad, in relation to this proposal:

https://github.com/STEllAR-GROUP/cpp-serializers

https://github.com/fraillt/bitsery#why-use-bitsery

Unresolved questions

What parts of the design do you expect to resolve through the RFC process before this gets merged?

The MPI transfer_halo_particles function is yet to be implemented. We don't foresee an issue, but it is still TBD.

Related issues

#680
#681

Changelog

@bodhinandach
Contributor

@kks32 It would be nice to see a performance comparison of serialize/deserialize vs the normal HDF5 POD approach, just to make sure there is no performance regression. Also, can we check it for different numbers of MPI ranks?

@kks32
Contributor Author

kks32 commented Aug 17, 2020

We won't have a big difference in the amount of information being sent/received. Furthermore, it would be hard to measure any significant speed difference in the MPI transfer unless we run hundreds of nodes with millions of particles, and even then I don't think it would be a big difference, since the change in data size is very small. However, as previously mentioned in the RFC, the time to serialize/deserialize particles as PODs vs as a vector of uint8_t has been benchmarked, and the results show that serialization with uint8_t is faster than POD + MPI_Type_Create_Struct. Serialization/deserialization of a POD in itself is faster; however, registering and deregistering the MPI data types takes more time than serializing/deserializing into a vector of unsigned bytes.

[Image: serialization benchmark results]

SECTION("Performance benchmarks") {
  // Number of iterations
  unsigned niterations = 1000;

  // Serialization benchmarks
  auto serialize_start = std::chrono::steady_clock::now();
  for (unsigned i = 0; i < niterations; ++i) {
    // Serialize particle
    auto buffer = particle->serialize();
    // Deserialize particle
    std::shared_ptr<mpm::ParticleBase<Dim>> rparticle =
        std::make_shared<mpm::Particle<Dim>>(id, pcoords);

    REQUIRE_NOTHROW(rparticle->deserialize(buffer, materials));
  }
  auto serialize_end = std::chrono::steady_clock::now();

  // HDF5 serialization
  auto hdf5_start = std::chrono::steady_clock::now();
  for (unsigned i = 0; i < niterations; ++i) {
    // Serialize particle as POD
    auto hdf5 = particle->hdf5();
    // Deserialize particle with POD
    std::shared_ptr<mpm::ParticleBase<Dim>> rparticle =
        std::make_shared<mpm::Particle<Dim>>(id, pcoords);
    // Initialize MPI datatypes
    MPI_Datatype particle_type = mpm::register_mpi_particle_type(hdf5);
    REQUIRE_NOTHROW(rparticle->initialise_particle(hdf5, material));
    mpm::deregister_mpi_particle_type(particle_type);
  }
  auto hdf5_end = std::chrono::steady_clock::now();
}
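
The elapsed times of the two schemes can then be reported from the time points above, for example (a sketch placed inside the SECTION, not part of the test as merged; requires <chrono> and <iostream>):

  // Report elapsed wall-clock time of both schemes in milliseconds
  std::chrono::duration<double, std::milli> serialize_time =
      serialize_end - serialize_start;
  std::chrono::duration<double, std::milli> hdf5_time = hdf5_end - hdf5_start;
  std::cout << "Pack/Unpack: " << serialize_time.count() << " ms\n"
            << "POD/Struct:  " << hdf5_time.count() << " ms\n";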

@codecov

codecov bot commented Aug 17, 2020

Codecov Report

Merging #689 into develop will decrease coverage by 0.11%.
The diff coverage is 67.70%.


@@             Coverage Diff             @@
##           develop     #689      +/-   ##
===========================================
- Coverage    96.81%   96.69%   -0.11%     
===========================================
  Files          131      130       -1     
  Lines        25811    25822      +11     
===========================================
- Hits         24987    24968      -19     
- Misses         824      854      +30     
| Impacted Files | Coverage Δ |
|---|---|
| include/mesh.h | 100.00% <ø> (ø) |
| include/mesh.tcc | 82.65% <0.00%> (-1.48%) ⬇️ |
| include/particles/particle_base.h | 100.00% <ø> (ø) |
| tests/graph_test.cc | 100.00% <ø> (ø) |
| include/particles/particle.tcc | 91.92% <82.69%> (-2.01%) ⬇️ |
| include/particles/particle.h | 100.00% <100.00%> (ø) |
| include/solvers/mpm_explicit.tcc | 95.16% <100.00%> (+0.08%) ⬆️ |
| tests/particle_serialize_deserialize_test.cc | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@kks32 kks32 marked this pull request as ready for review August 17, 2020 23:31
@kks32
Contributor Author

kks32 commented Aug 17, 2020

@bodhinandach or @tianchiTJ or @jgiven100 would you be able to test the MPI scheme with a material model that has state variables (NorSand or MC)? Check with load balancing or any problem that involves migration of particles.

@tianchiTJ
Contributor

tianchiTJ commented Aug 18, 2020

I tested it with the MC model, and I think the result is good.

@jgiven100
Collaborator

@kks32 NorSand test looks good

@kks32
Contributor Author

kks32 commented Aug 18, 2020

Thanks @jgiven100 and @tianchiTJ for testing with materials that have state variables.

@ezrayst
Contributor

ezrayst commented Aug 20, 2020

@kks32, I would like to understand the data being presented.

The pack/unpack serialization in this PR is faster than the POD struct implementation for the 2D sliding block with 4 MPI ranks. The results are an average of 5 different runs.

| Scheme | Avg Time (ms) | SD (ms) |
|---|---|---|
| Pack/Unpack | 13201 | 326 |
| POD/Struct | 13815 | 540 |

What is SD here? Previously you showed POD has a 0.4 to 0.7 speedup compared to Pack/Unpack, but why does this result show that POD takes longer? (I think I am missing something here, sorry.)

@kks32
Contributor Author

kks32 commented Aug 20, 2020

POD alone is insufficient, as you need to register the data type with MPI_Type_Create_Struct, which adds additional run-time. Compared to our current implementation on develop, Pack/Unpack is slightly faster, and it is the best way to handle different particle types.

Contributor

@bodhinandach bodhinandach left a comment

Thanks for the awesome refactoring idea @kks32. Some comments from my side; I am working on the two-phase and fluid particles.

Contributor

@bodhinandach bodhinandach left a comment

This looks great to me @kks32. A similar implementation for the two-phase particle has been implemented in PR #680 and tested to work well with dynamic load balancing. Thanks for addressing my comments too.

@kks32 kks32 merged commit 19570dd into develop Aug 26, 2020
@kks32 kks32 deleted the refactor/serialization branch August 26, 2020 12:14