Implement GNU jobserver posix client support #2474

Open
wants to merge 1 commit into
base: master

mcprat
Contributor

@mcprat mcprat commented Aug 10, 2024

a rework of #2450 supporting all versions of GNU Make, but without Windows support
(I'm not able to test for Windows, and I have doubts with proposed Windows support)

resolves #1139 for posix systems

thanks to @hundeboll for much of the work with this newer implementation

ping @jhasse @digit-google

significant differences:

  • no changes to any function parameters
  • no new intermediary functions
  • instantiate client support in real_main() instead of Plan
  • pass pointers to the Jobserver class into other classes
  • use a constructor to initialize jobserver client
  • release all tokens on any fatal error
  • calculate a value for "load capacity" instead of returning SIZE_MAX
  • supports both fifo and simple pipe file descriptors from Make
  • detect invalid or closed pipe and inform user about the most likely reason

@mcprat
Contributor Author

mcprat commented Aug 10, 2024

I forgot I have to adapt to a windows build even if I'm not going to support it on windows...

@mcprat mcprat force-pushed the jobserver-final branch 4 times, most recently from c1c6829 to 8530799 Compare August 10, 2024 06:09
@mcprat
Contributor Author

mcprat commented Aug 10, 2024

The CI for Windows is happy now, but it would be nice to have a tester for Windows...

src/jobserver.h Outdated
Comment on lines 81 to 71
/// The number of currently acquired tokens, or the jobserver status if negative.
/// Used to verify that all acquired tokens have been released before exiting,
/// and when the implicit (first) token has been acquired (initialization).
/// -1: initialized without a token
/// 0: uninitialized or disabled
/// +n: number of tokens in use
int token_count_ = 0;
Contributor Author

I want to point out this concept and ask whether or not this usage of an int is too unusual or non-standard.
This coincides with the last line of the constructor Jobserver::Jobserver() and the value of capacity in CanRunMore() using the absolute value function.

It's pretty easy to rework this. I just happened to have this idea first, to use the token number in place of Enabled() when there actually are no tokens yet.

@mcprat
Contributor Author

mcprat commented Aug 11, 2024

found and fixed some minor mistakes, added some commit tags

@mcprat mcprat force-pushed the jobserver-final branch 3 times, most recently from da3903d to dd128f2 Compare August 12, 2024 19:48
@mcprat
Contributor Author

mcprat commented Aug 12, 2024

sorry I didn't realize that there were "builder" constructors for the test suite, I built and ran the test suite this time.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

oops... I went too fast and everything is still "green" on segfault...

@jhasse
Collaborator

jhasse commented Aug 13, 2024

We're very wary of changes that increase the complexity of Ninja, so a PR that implements both methods while one of them is technically superior and results in less code in Ninja (and to my understanding that's the case for fifo) is very unlikely to get merged.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

The previous struggle was about the creation of the Jobserver object in real_main(); it had nothing to do with the new files and new functionality. I was trying to avoid making functions that call Jobserver functions through another class and instead pass references to the object wherever it's needed, but I can always go back to the other way of creating the object within the Plan struct.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

ah, I see what you mean, it errored on readability...

but the other thing I have doubts about

Run ctest -C Release -vv
CMake Error: Unknown argument: -vv
CMake Error: Run 'ctest --help' for all supported options.
Error: Process completed with exit code 1.

@mcprat
Contributor Author

mcprat commented Aug 13, 2024

the bool now defaults to false and is set to true with a simple if instead of a ternary, like in the rest of the project 👍🏼

@mcprat
Contributor Author

mcprat commented Aug 14, 2024

I believe that all the minor issues caught by the CI are handled now...

@mcprat
Contributor Author

mcprat commented Aug 15, 2024

some simplification:

I was reading the Google Style Guide and saw this

...we never allow non-const reference parameters.

so changes in the last push are:

  1. Converted all new references to pointers, to comply with the style guide for Jobserver being a non-const object.

Then I realized that I no longer need to create a Jobserver object for build_test.cc (at least in this commit), so

  1. Pass NULL for Jobserver* in Builder instantiations in build_test.cc, and check for NULL before dereferencing Jobserver in build.cc

then finally, another opportunity to save lines:

  1. Create Jobserver object in NinjaMain instead of real_main() since I realized that only 1 NinjaMain is created in a single process run.

@mcprat
Contributor Author

mcprat commented Aug 15, 2024

updated commit message

@mcprat
Contributor Author

mcprat commented Aug 17, 2024

@jhasse can we run the CI workflow again?

@mcprat
Contributor Author

mcprat commented Aug 29, 2024

small organization update:

  • comment rewriting
  • style formatting
  • save some more lines
  • line wrapping
  • moved new const functions to header

Jobserver::Jobserver() {
assert(!Enabled());

// Return early if no makeflags are passed in the environment.
Contributor

Please move all parsing logic to a separate static method that can be unit-tested properly. Also be aware that the GNU Make documentation states explicitly: Be aware that the MAKEFLAGS variable may contain multiple instances of the --jobserver-auth= option. Only the last instance is relevant. Hence this should be implemented properly (and tested).
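A rough sketch of what such a unit-testable parser could look like is below. The names `JobserverAuth` and `ParseJobserverAuth` are hypothetical and not taken from this PR; the sketch only extracts the value of the last `--jobserver-auth=` option, splitting words on space and tab, and leaves interpretation of the value (fifo path or descriptor pair) to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type (an assumption for this sketch, not the PR's API).
struct JobserverAuth {
  bool found = false;   // true if at least one --jobserver-auth= was seen
  std::string value;    // text after '=', e.g. "3,4" or "fifo:/tmp/js"
};

// Scans MAKEFLAGS word by word and keeps the LAST --jobserver-auth= value,
// since GNU Make documents that only the last instance is relevant.
JobserverAuth ParseJobserverAuth(const std::string& makeflags) {
  static const std::string kPrefix = "--jobserver-auth=";
  JobserverAuth result;
  std::size_t pos = 0;
  while (pos < makeflags.size()) {
    // Skip leading blanks, then find the end of the current word.
    std::size_t start = makeflags.find_first_not_of(" \t", pos);
    if (start == std::string::npos)
      break;
    std::size_t end = makeflags.find_first_of(" \t", start);
    if (end == std::string::npos)
      end = makeflags.size();
    const std::string word = makeflags.substr(start, end - start);
    // Later matches overwrite earlier ones on purpose.
    if (word.compare(0, kPrefix.size(), kPrefix) == 0) {
      result.found = true;
      result.value = word.substr(kPrefix.size());
    }
    pos = end;
  }
  return result;
}
```

Because the function is free of I/O and of calls such as Warning() or Fatal(), a test can feed it arbitrary strings and compare the result directly.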

Contributor

Btw, see also to implement the following requirement:

Your tool may also examine the first word of the MAKEFLAGS variable and look for the character n. If this character is present then make was invoked with the ‘-n’ option and your tool may want to stop without performing any operations

From https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html

Contributor Author

Btw, see also to implement the following requirement:

Your tool may also examine the first word of the MAKEFLAGS variable and look for the character n. If this character is present then make was invoked with the ‘-n’ option and your tool may want to stop without performing any operations

From https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html

I wish they didn't mention this in that part of the manual. It's completely out of scope for the jobserver. The jobserver client should not be responsible for determining whether ninja does a "dry run" even though it happens to be parsing MAKEFLAGS which can have many different flags for many different reasons...

This should be implemented in a separate commit (and separate PR) and directly with the BuildConfig object instead of involving the Jobserver objects. Traditionally, ninja does not rely on the environment for any of its configuration flags, so that's another conversation as well.

// Tokenize string to characters in flag_, then words in flags_.
while (flag_char_ < strlen(makeflags)) {
while (flag_char_ < strlen(makeflags) &&
!isblank(static_cast<unsigned char>(makeflags[flag_char_]))) {
Contributor

nit: isblank() is locale-dependent, which leads to surprises and is generally slow. It is easier to just compare with ' ' and '\t' in this case.


#include "util.h"

Jobserver::Jobserver() {
Contributor

I would recommend splitting this PR into multiple commits, i.e.:

  1. One that adds the Jobserver class, and its Posix implementation + appropriate unit-tests for it. Also try to make the class as independent from the rest of Ninja as possible (e.g. to not call Warning() or Info() in the parser function, leave that to clients).

  2. One that adds usage of the class to build.cc / ninja.cc.

Contributor Author

I would recommend splitting this PR into multiple commits, i.e.:

In my opinion, the first commit should be functional so that other projects can easily pull the patch while waiting for a release, and also for the "bisect rule", so each individual checkout is functional on its own. I'm planning for the Windows implementation and tests to be separate commits while trying to keep this one small (except for the new files)...

if (!jobserver_fifo_)
Warning("pipe closed: %d (mark the command as recursive)", rfd_);
else
Fatal("failed to read from jobserver: %d: %s", rfd_, strerror(errno));
Contributor

Is there a way to fallback gracefully to the usual mode when this happens instead?

In general, it is better to avoid calling Fatal() in methods like these, because these conditions cannot be properly unit-tested.

Contributor Author

the remaining calls to Fatal() are for the following cases:

  1. MAKEFLAGS from the environment told ninja that a fifo object needs to have file descriptors opened for it. The open() syscall was run and failed.
  2. File descriptors have been opened to the fifo object, but reading from it has failed not due to blocking. This likely means that the fifo object has been deleted while we have file descriptors that point to nothing, or maybe a problem with the filesystem itself.
  3. File descriptors have been opened to the fifo object, but writing to it has failed, again suggesting that the fifo object has been deleted while we have file descriptors that point to nothing or a problem with the filesystem.

They can become warnings, but I think they protect execution from continuing when something is extremely wrong. It should be exceedingly rare for the program to step into these Fatal() calls and they are probably out of scope for unit tests anyway, but I'll let you decide.
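For comparison, here is a minimal sketch of the alternative raised in the review: the token read and write report a status instead of aborting, so the caller decides whether to warn, fall back, or exit. `AcquireResult`, `TryAcquire`, and `TryRelease` are hypothetical names, and the sketch assumes the read end has already been put into non-blocking mode.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical status type; the PR itself calls Warning()/Fatal() instead.
enum class AcquireResult { kAcquired, kNoToken, kClosed, kError };

// Tries to read one token byte from a non-blocking read descriptor.
AcquireResult TryAcquire(int rfd, unsigned char* token) {
  assert(token != nullptr);
  for (;;) {
    ssize_t n = read(rfd, token, 1);
    if (n == 1)
      return AcquireResult::kAcquired;
    if (n == 0)
      return AcquireResult::kClosed;  // every write end has been closed
    if (errno == EINTR)
      continue;  // interrupted by a signal, retry
    if (errno == EAGAIN || errno == EWOULDBLOCK)
      return AcquireResult::kNoToken;  // pool is currently empty
    return AcquireResult::kError;  // e.g. EBADF; caller inspects errno
  }
}

// Writes the token byte back; returns false on failure instead of aborting.
bool TryRelease(int wfd, unsigned char token) {
  for (;;) {
    ssize_t n = write(wfd, &token, 1);
    if (n == 1)
      return true;
    if (n < 0 && errno == EINTR)
      continue;
    return false;
  }
}
```

With this shape, each of the three cases listed above maps to a return value that a unit test can trigger with an ordinary pipe(), and the decision to call Fatal() moves to the code that owns the build loop.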

@mcprat
Contributor Author

mcprat commented Sep 5, 2024

big update...

summary:

  • the destructor has been removed. I realized that it never gets run at all. Also, it is not the responsibility of the client to close the pipe. Emergency return of tokens before exiting still happens during call to Clear().
  • the constructor has been split into a Parse() function for the majority of the parsing again.
  • the Jobserver class is split into base and derived classes, reducing preprocessor #if's to 1.
  • Enabled() uses jobserver_closed_ instead of using jobserver_closed_ to put fake FDs.
  • macros for constant strings have been converted to static constexpr's as suggested.
  • removed underscores from local variables that were previously member variables.
  • moved <vector> include from the header to the source.
  • small rewrites to comments and warnings.

@mcprat
Contributor Author

mcprat commented Sep 9, 2024

smaller update, this might be considered mergeable now.
hopefully all major issues are handled so we can focus on style and nitpicking...

summary:

  • now handling and storing the actual token character instead of just a count. this allowed for some simplification so there is a line decrease. this is written assuming a char is 1 byte.
  • added a case to fallback to non-parallel build, when the token server provides an FD to read tokens from that is blocking (Make versions 4.2.1 and earlier).
  • some light cleanup and rewording
  • tested all the way back to Make 4.0, which is when Windows support for jobserver was added. anything earlier than that can be considered "ancient" enough to ignore...

/// It must be called for each successful call to Acquire() after the command
/// even if subprocesses fail or in the case of errors causing Ninja to exit.
/// Ninja is aborted on write errors, and otherwise calls always succeed.
virtual void Release(unsigned char*) {};
Contributor

Please explain what the pointer being passed as argument here should point to. E.g. is it ok to call this with a pointer that points to a value that was never the result of a previous Acquire() call (as suggested by your Clear() function implementation)? In which case, what is the default value to be used (also apparently \0 from the code, but you should make that clear in the documentation)? Also will the function modify the pointed value or not?

Ideally, users of the API should not have to guess these details by looking at the source code.

I suggest writing a dedicated move-only Token class, even if trivial, to better encapsulate these semantics.

#include <cassert>
#include <cstdint>

/// A wrapper for token values acquired or released to the pool.
/// A default instance has no value, and is used to indicate that
/// no token is available.
struct Token {
  // Default constructor builds a value-less token.
  Token() = default;

  // Explicit constructor for a Token with a value from the pipe.
  explicit Token(uint8_t value) : value_(static_cast<int>(value)) {}

  // Move operations are allowed. The moved-from token becomes value-less.
  Token(Token&& other) noexcept : value_(other.value_) { other.value_ = -1; }
  Token& operator=(Token&& other) noexcept {
    if (this != &other) {
      value_ = other.value_;
      other.value_ = -1;
    }
    return *this;
  }

  // Copy operations are forbidden.
  Token(const Token&) = delete;
  Token& operator=(const Token&) = delete;

  /// Returns true if this instance contains a value received from the pipe.
  bool HasValue() const { return value_ != -1; }

  /// Return underlying value. It is a runtime error to call this method
  /// if HasValue() returns false.
  uint8_t GetValue() const {
    assert(HasValue());
    return static_cast<uint8_t>(value_);
  }

  int value_ = -1;
};

Then you can have Acquire() return a Token by value, and Release() take a token by value, and forget about pointers entirely, e.g.:

/// Try to acquire a token from the pool. A value-less token instance is returned
/// if no token is available.
virtual Token Acquire() { return Token(); }

/// Release a previously acquired token to the pool. Does nothing if the
/// token argument has no value.
virtual void Release(Token token) {}
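To illustrate how callers would use this by-value API, here is a small sketch with an in-memory pool standing in for the jobserver pipe. `FakePool` is a hypothetical test double, not part of the PR, and the `Token` below is a compact restatement of the move-only idea so the example compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Compact move-only token: -1 means "no value".
class Token {
 public:
  Token() = default;
  explicit Token(uint8_t value) : value_(value) {}
  Token(Token&& other) noexcept : value_(other.value_) { other.value_ = -1; }
  Token& operator=(Token&& other) noexcept {
    if (this != &other) {
      value_ = other.value_;
      other.value_ = -1;
    }
    return *this;
  }
  Token(const Token&) = delete;
  Token& operator=(const Token&) = delete;

  bool HasValue() const { return value_ != -1; }
  uint8_t GetValue() const {
    assert(HasValue());
    return static_cast<uint8_t>(value_);
  }

 private:
  int value_ = -1;
};

// Hypothetical in-memory pool used only for illustration.
class FakePool {
 public:
  explicit FakePool(std::size_t count)
      : tokens_(count, static_cast<uint8_t>('+')) {}

  // Returns a value-less Token when the pool is empty.
  Token Acquire() {
    if (tokens_.empty())
      return Token();
    uint8_t value = tokens_.back();
    tokens_.pop_back();
    return Token(value);
  }

  // Takes ownership of the token; value-less tokens are ignored.
  void Release(Token token) {
    if (token.HasValue())
      tokens_.push_back(token.GetValue());
  }

  std::size_t available() const { return tokens_.size(); }

 private:
  std::vector<uint8_t> tokens_;
};
```

A caller would write `Token t = pool.Acquire();` and later `pool.Release(std::move(t));`. After the move, `t.HasValue()` is false, so releasing the same token twice by accident becomes a no-op rather than a double write to the pipe.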

Contributor Author

I rewrote comments in the header to reflect the change in functionality and new returns and arguments.

Making a whole class just to manage the value of the tokens seems complicated and overkill to me. It makes the function declarations read better, but the Token class itself is not very readable and adds a lot of object constructs for not much benefit...

Is there any benefit to avoiding pointers here? or avoiding char?

What if we just made a simple typedef for the tokens?

I'm very confident that a jobserver will never return a NUL char unless it's to indicate "no tokens available" just as I am using it to mean. I'm also pretty confident that Ninja will never receive anything other than a '+' when being used with Make, probably even with forks of Make...

@mcprat
Contributor Author

mcprat commented Sep 9, 2024

applied some review comments

summary:

  • moved new Clear() function to the RealCommandRunner class as ClearJobTokens()
  • ClearJobTokens() function uses range-based for loop statement
  • ClearJobTokens() function takes a const reference for the vector
  • removed headers for vectors and Edge class from jobserver.h
  • reworded comments in jobserver.h

@mcprat
Contributor Author

mcprat commented Sep 17, 2024

  • Add a std::unique_ptr<Jobserver> jobserver_ member to this class, and default-initialize it with a base/null instance that does nothing. This avoids modifying tests that don't care about the feature at all.
  • Add a SetJobserver(std::unique_ptr<Jobserver> jobserver) method to replace the instance's jobserver with a new value. This can be called after constructing a Builder instance where it is needed only. This passes ownership to avoid lifecycle management issues.
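The two suggestions quoted above could be sketched roughly as follows. The `Jobserver` and `Builder` classes here are stripped-down stand-ins written for this example (the real classes have many more members), and `AlwaysOnJobserver` is a hypothetical subclass used only to show the swap.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Base class doubles as the "null" jobserver: never enabled, grants nothing.
class Jobserver {
 public:
  virtual ~Jobserver() = default;
  virtual bool Enabled() const { return false; }
  virtual bool Acquire() { return false; }
  virtual void Release() {}
};

// Hypothetical derived client, standing in for a POSIX implementation.
class AlwaysOnJobserver : public Jobserver {
 public:
  bool Enabled() const override { return true; }
  bool Acquire() override { return true; }
};

// Simplified stand-in for Builder: owns its jobserver, never holds null.
class Builder {
 public:
  Builder() : jobserver_(new Jobserver()) {}

  // Takes ownership; a null argument keeps the current instance so that
  // jobserver_ can always be dereferenced without a check.
  void SetJobserver(std::unique_ptr<Jobserver> jobserver) {
    if (jobserver)
      jobserver_ = std::move(jobserver);
  }

  bool JobserverEnabled() const { return jobserver_->Enabled(); }

 private:
  std::unique_ptr<Jobserver> jobserver_;
};
```

Tests that do not care about the feature construct `Builder` as before and get the do-nothing base instance. Only the code path that detects a jobserver in MAKEFLAGS needs to call `SetJobserver()`.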

nit: isblank() is locale-dependent, which leads to surprises and is generally slow. It is easier to just compare with ' ' and '\t' in this case.

I suggest writing a dedicated move-only Token class, even if trivial, to better encapsulate these semantics.

@digit-google for these comments from you, would you consider any of them critical that I must do? Or perhaps keeping things simple may be better at this point?

Presence of #ifdef _WIN32 is down to 1, and I don't think that last one can be removed, or can it?

@mcprat
Contributor Author

mcprat commented Sep 17, 2024

  • use calls to Enabled() more instead of rewriting checks

@digit-google
Contributor

Oh, I realize I was completely wrong in one of my earlier comments where I stated that Ninja would not pass file descriptors to command sub-processes. It actually totally does, so using --jobserver-auth=R,W works fine on Posix. I actually tested that with this patch. Sorry for the noise.

(Turns out that posix_spawn() doesn't close inheritable file descriptors as I mistakenly thought).

@mcprat
Contributor Author

mcprat commented Sep 18, 2024

No problem, I had to double-check because I wasn't sure myself, but at least documentation on file descriptors is pretty clear. There is a lot of confusion about how O_CLOEXEC is used by Make, but it has nothing to do with the external pipes.
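As a small illustration of the point about inheritance: on POSIX, a descriptor stays open in a spawned or exec'ed child unless its FD_CLOEXEC flag is set, and that flag can be queried with fcntl(). The helper name `IsInheritable` below is made up for this sketch.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Returns true if `fd` is open and does NOT have FD_CLOEXEC set, i.e. it
// would remain open in a child process after exec.
bool IsInheritable(int fd) {
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1)
    return false;  // not an open descriptor (e.g. EBADF)
  return (flags & FD_CLOEXEC) == 0;
}
```

Descriptors created with a plain pipe() call start out without FD_CLOEXEC, so a check like this on the R and W values from `--jobserver-auth=R,W` can tell whether they are open at all and whether they would be passed on to subprocesses.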

The core principle of a jobserver is simple:
before starting a new job (edge in ninja-speak),
a token must be acquired from an external entity as approval.

Once a job is finished, the token is returned to represent a free job slot.
In the case of GNU Make, this external entity is the parent process
which has executed Ninja and is managing the load capacity for
all subprocesses which it has spawned. Introducing client support
for this model allows Ninja to give load capacity management
to its parent process, allowing it to control the number of
subprocesses that Ninja spawns at any given time.

This functionality is desirable when Ninja is part of a bigger build,
such as Yocto/OpenEmbedded, OpenWrt/Linux, Buildroot, and Android.
Here, multiple compile jobs are executed in parallel
in order to maximize CPU utilization, but if each compile job in Ninja
uses all available cores, the system is overloaded.

This implementation instantiates the client in real_main()
and passes pointers to the Jobserver class into other classes.
All tokens are returned whenever the CommandRunner aborts,
and the current number of tokens compared to the current number
of running subprocesses controls the available load capacity,
used to determine how many new tokens to attempt to acquire
in order to try to start another job for each loop to find work.

Jobserver related functions are defined as no-op for Windows
pending Windows-specific support for the jobserver.

Co-authored-by: Martin Hundebøll <[email protected]>
Co-developed-by: Martin Hundebøll <[email protected]>
Signed-off-by: Martin Hundebøll <[email protected]>
Signed-off-by: Michael Pratt <[email protected]>
@mcprat
Contributor Author

mcprat commented Sep 18, 2024

  • added const to makeflags variable again

}

// --jobserver-auth=<val>
for (size_t n = 0; n < flags.size(); n++)
Contributor

these loops look like they're missing break;

Contributor Author

lack of break is intentional, so only the last matching value is the one that takes effect

Contributor

Sounds like it would be better to reverse the loop then.

Contributor Author

sure it can be reversed, but since this will only be run once per execution of ninja it's not performance critical, so I lean towards the simplest way to write it with fewer lines and no brackets, etc...
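For reference, the reversed variant discussed in this thread might look like the sketch below. It assumes the flags have already been split into a `std::vector<std::string>`, and `LastJobserverAuth` is a hypothetical helper name rather than a function from the PR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Walks the flags from the end and stops at the first match, which is the
// last --jobserver-auth= occurrence in MAKEFLAGS. Returns "" if none exists.
std::string LastJobserverAuth(const std::vector<std::string>& flags) {
  static const std::string kPrefix = "--jobserver-auth=";
  for (auto it = flags.rbegin(); it != flags.rend(); ++it) {
    if (it->compare(0, kPrefix.size(), kPrefix) == 0)
      return it->substr(kPrefix.size());
  }
  return std::string();
}
```

Both directions give the same answer. The forward loop without `break` relies on later matches overwriting earlier ones, while the reverse loop makes the "last one wins" rule visible through the early return.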

while (flag_char < strlen(makeflags)) {
while (flag_char < strlen(makeflags) &&
!isblank(static_cast<unsigned char>(makeflags[flag_char]))) {
flag.push_back(static_cast<unsigned char>(makeflags[flag_char]));
Contributor

I wonder if emplace_back and no cast is better here.

Contributor Author

I'm not sure exactly what the difference is, I would have to read about it...

the cast is not strictly necessary, it's just protection against invalid characters that I ran into while reading about tokenizing strings.

I saw something about using a std::vector<uint8_t> instead of std::string but that might be confusing...

Contributor Author

apparently after optimization, push_back and emplace_back are exactly the same, unless you are doing fancy constructors for the elements in the vector.

Contributor Author

... no cast is better here.

@neheb I got it from the example here

https://en.cppreference.com/w/cpp/string/byte/isblank

@mcprat
Contributor Author

mcprat commented Sep 20, 2024

@digit-google @neheb would this be better? to make the range explicitly equal to isgraph():

  // Tokenize string to characters in flag, then words in flags.
  while (flag_char < strlen(makeflags)) {
    while (flag_char < strlen(makeflags) &&
           ' ' < makeflags[flag_char] && makeflags[flag_char] <= '~') {
      flag.push_back(makeflags[flag_char]);
      flag_char++;
    }

...
...
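One way to check the claim that this range matches isgraph() is a brute-force comparison over every byte value, sketched below. `IsGraphAscii` and `MatchesIsGraphInCurrentLocale` are hypothetical helper names; the comparison is only expected to hold in the default "C" locale, which is what a program starts in unless it calls setlocale().

```cpp
#include <cassert>
#include <cctype>

// Locale-independent test for printable, non-space ASCII (0x21 to 0x7E).
// Works whether plain char is signed or unsigned: bytes >= 0x80 are either
// negative (not > ' ') or larger than '~'.
inline bool IsGraphAscii(char c) {
  return ' ' < c && c <= '~';
}

// Compares the range check against std::isgraph() for all 256 byte values
// in whatever locale is currently active.
bool MatchesIsGraphInCurrentLocale() {
  for (int i = 0; i < 256; ++i) {
    char c = static_cast<char>(static_cast<unsigned char>(i));
    bool expected = std::isgraph(i) != 0;
    if (IsGraphAscii(c) != expected)
      return false;
  }
  return true;
}
```

Note that this range is stricter than the `' '` and `'\t'` comparison suggested earlier: it also ends a word at any control character or non-ASCII byte, which may or may not be the desired behaviour for MAKEFLAGS.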

Development

Successfully merging this pull request may close these issues.

Add GNU make jobserver client support
8 participants