Added TF-IDF transformer benchmark
andrewdalpino committed Aug 29, 2020
1 parent 7902520 commit e8bf1d3
Showing 7 changed files with 66 additions and 16 deletions.
14 changes: 7 additions & 7 deletions CONTRIBUTING.md
@@ -12,6 +12,9 @@ Here are a few things to check off before sending in a pull request ...

> New to pull requests? Github has a great [howto](https://help.github.com/articles/about-pull-requests/) to get you started.
### Code Review
We use pull requests as an opportunity to communicate with our contributors. Oftentimes, we can improve code readability, find bugs, and make optimizations during the code review process. Every pull request must have the approval from at least one core engineer before merging into the main codebase.

## Static Analysis
Static code analysis is an integral part of our overall testing and quality assurance strategy. Static analysis allows us to catch bugs before they make it into the codebase. Therefore, it is important that your updates pass static analysis at the level set by the project lead.

@@ -53,7 +56,10 @@ $ composer fix
```

### Naming
Use accurate, descriptive, and concise nomenclature. A variable name should only describe the data that the variable contains. With few exceptions, interfaces and the classes that implement them should be named after what the object *does* whereas value objects and classes that extend a base class should be named after what the object *is*. Method and function names should be verbs unless in the case of an accessor/getter function, in which case, the 'get' prefix may be dropped. Prioritize full names over abbreviations unless in the case where the abbreviation is the more common usage.
Use accurate, descriptive, consistent, and concise nomenclature. A variable name should only describe the data that the variable contains. With some exceptions, interfaces and the classes that implement them should be named after what the object *does* whereas value objects and classes that extend a base class should be named after what the object *is*. Prefer verbs for function and method names unless in the case of an accessor/getter function where the 'get' prefix may be dropped. Prioritize full names over abbreviations unless in the case where the abbreviation is the more common usage.

#### Domain-driven Design
We employ the Domain Driven Design (DDD) methodology in our naming and design. Our goal is to allow contributors and domain experts to be able to use the same language when referring to concepts. Therefore, it is crucial that your naming reflects the domain that your abstraction operates within. For example, Bayesian probability-based learners might use terms like 'likelihood', 'density', 'mass', and 'PDF.'

### Mutability
Objects implemented in Rubix ML have a mutability policy of *generally* immutable which means properties are private or protected and state must be mutated only through a well-defined public API.
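
A minimal sketch of what this policy could look like in practice. `RunningMean` and its methods are hypothetical names chosen for illustration and are not part of Rubix ML. The properties are protected, and the only way to change them is through the public `update()` method.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical example class; not part of Rubix ML.
 */
class RunningMean
{
    /**
     * The sum of the values seen so far.
     *
     * @var float
     */
    protected $sum = 0.0;

    /**
     * The number of values seen so far.
     *
     * @var int
     */
    protected $n = 0;

    /**
     * The only public entry point that mutates internal state.
     */
    public function update(float $value) : void
    {
        $this->sum += $value;

        ++$this->n;
    }

    /**
     * Accessor named without the 'get' prefix.
     */
    public function mean() : float
    {
        return $this->n > 0 ? $this->sum / $this->n : 0.0;
    }
}

$mean = new RunningMean();

$mean->update(2.0);
$mean->update(4.0);

echo $mean->mean(), PHP_EOL;
```

Callers cannot assign to `$sum` or `$n` directly, so the object can never report a mean that disagrees with the values it was given.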
@@ -71,9 +77,3 @@ To run the benchmarking suite:
```sh
$ composer benchmark
```

## Code Review
We use pull requests as an opportunity to communicate with our contributors. Oftentimes, we can improve code readability, find bugs, and make optimizations during the code review process. Every pull request must have the approval from at least one core engineer before merging into the main codebase.

## Anti Plagiarism Policy
Our community takes a strong stance against plagiarism, or the copying of another author's code without attribution. Since the spirit of open source is to make code freely available, it is up to the community to enforce policies that deter plagiarism. As such, we do not allow contributions from those who violate this policy.
11 changes: 5 additions & 6 deletions benchmarks/Kernels/Distance/CosineBench.php
@@ -28,12 +28,15 @@ class CosineBench
     */
    protected $kernel;

    public function setUp() : void
    {
        $this->kernel = new Cosine();
    }

    public function setUpDense() : void
    {
        $this->aSamples = Matrix::gaussian(self::NUM_SAMPLES, 8)->asArray();
        $this->bSamples = Matrix::gaussian(self::NUM_SAMPLES, 8)->asArray();

        $this->kernel = new Cosine();
    }

    /**
@@ -62,8 +65,6 @@ public function setUpSparse() : void
        $this->bSamples = Matrix::gaussian(self::NUM_SAMPLES, 8)
            ->multiply($mask)
            ->asArray();

        $this->kernel = new Cosine();
    }

    /**
@@ -92,8 +93,6 @@ public function setUpVerySparse() : void
        $this->bSamples = Matrix::gaussian(self::NUM_SAMPLES, 8)
            ->multiply($mask)
            ->asArray();

        $this->kernel = new Cosine();
    }

    /**
50 changes: 50 additions & 0 deletions benchmarks/Transformers/TfIdfTransformerBench.php
@@ -0,0 +1,50 @@
<?php

namespace Rubix\ML\Benchmarks\Transformers;

use Tensor\Matrix;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\TfIdfTransformer;

/**
 * @Groups({"Transformers"})
 * @BeforeMethods({"setUp"})
 */
class TfIdfTransformerBench
{
    protected const NUM_SAMPLES = 10000;

    /**
     * @var \Rubix\ML\Datasets\Unlabeled
     */
    public $dataset;

    /**
     * @var \Rubix\ML\Transformers\TfIdfTransformer
     */
    protected $transformer;

    public function setUp() : void
    {
        $mask = Matrix::rand(self::NUM_SAMPLES, 100)
            ->greater(0.8);

        $samples = Matrix::gaussian(self::NUM_SAMPLES, 100)
            ->multiply($mask)
            ->asArray();

        $this->dataset = Unlabeled::quick($samples);

        $this->transformer = new TfIdfTransformer();
    }

    /**
     * @Subject
     * @Iterations(3)
     * @OutputTimeUnit("seconds", precision=3)
     */
    public function apply() : void
    {
        $this->dataset->apply($this->transformer);
    }
}
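
The `setUp()` method above makes its samples sparse by multiplying them element-wise by a binary mask. The following plain-PHP sketch shows the same idea without the Tensor library. It assumes that `Matrix::rand()` draws uniformly from [0, 1], so that a threshold of 0.8 keeps roughly 20% of the entries. The `sparseSamples()` helper is hypothetical, and it uses small random integers as stand-in values where the benchmark uses Gaussian draws.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: build a rows x columns array in which roughly
 * (1 - threshold) of the entries are non-zero.
 *
 * @return list<list<float>>
 */
function sparseSamples(int $rows, int $columns, float $threshold = 0.8) : array
{
    $samples = [];

    for ($i = 0; $i < $rows; ++$i) {
        $row = [];

        for ($j = 0; $j < $columns; ++$j) {
            // Uniform draw in [0, 1]; the mask is 1 only above the threshold.
            $keep = mt_rand() / mt_getrandmax() > $threshold;

            // Stand-in value; the benchmark multiplies Gaussian draws by the mask.
            $row[] = $keep ? (float) mt_rand(1, 10) : 0.0;
        }

        $samples[] = $row;
    }

    return $samples;
}

mt_srand(42);

$samples = sparseSamples(1000, 100);

// Measure the fraction of non-zero entries that survived the mask.
$nonZero = 0;

foreach ($samples as $row) {
    foreach ($row as $value) {
        if ($value != 0.0) {
            ++$nonZero;
        }
    }
}

$density = $nonZero / (1000 * 100);

echo 'Density: ', round($density, 3), PHP_EOL;
```

Raising the threshold toward 1.0 produces a sparser matrix, which appears to be how the sparse and very sparse cases in `CosineBench` differ from the dense one.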
2 changes: 1 addition & 1 deletion docs/transformers/tf-idf-transformer.md
@@ -1,6 +1,6 @@
<span style="float:right;"><a href="https://github.com/RubixML/RubixML/blob/master/src/Transformers/TfIdfTransformer.php">[source]</a></span>

### TF-IDF Transformer
# TF-IDF Transformer
*Term Frequency - Inverse Document Frequency* is a measurement of how important a word is to a document. The TF-IDF value increases proportionally with the number of times a word appears in a document (*TF*) and is offset by the frequency of the word in the corpus (*IDF*).

> **Note:** This transformer assumes that its input is made up of word frequency vectors such as those produced by [Word Count Vectorizer](word-count-vectorizer.md).
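
As a rough illustration of the weighting described above, the sketch below applies one common textbook variant, `tf * log(N / df)`, to a small table of word counts. This is an assumption made for the example: the exact formula and any smoothing used by the transformer are not shown here, and `tfIdf()` is a hypothetical helper, not the Rubix ML API.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: weight raw term counts by a textbook inverse
 * document frequency. Rows are documents, columns are terms.
 *
 * @param list<list<int|float>> $samples
 * @return list<list<float>>
 */
function tfIdf(array $samples) : array
{
    $n = count($samples);

    if ($n === 0) {
        return [];
    }

    // Document frequency: the number of documents that contain each term.
    $dfs = array_fill(0, count($samples[0]), 0);

    foreach ($samples as $sample) {
        foreach ($sample as $column => $tf) {
            if ($tf > 0) {
                ++$dfs[$column];
            }
        }
    }

    // Inverse document frequency; terms that never occur get a weight of 0.
    $idfs = [];

    foreach ($dfs as $column => $df) {
        $idfs[$column] = $df > 0 ? log($n / $df) : 0.0;
    }

    $weighted = [];

    foreach ($samples as $sample) {
        $row = [];

        foreach ($sample as $column => $tf) {
            $row[] = (float) ($tf * $idfs[$column]);
        }

        $weighted[] = $row;
    }

    return $weighted;
}

// Two documents over a three-word vocabulary.
$counts = [
    [3, 1, 0],
    [0, 1, 2],
];

$weighted = tfIdf($counts);

print_r($weighted);
```

The word in the second column appears in every document, so its weight collapses to zero under this variant, while words confined to a single document keep a weight proportional to their count.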
2 changes: 1 addition & 1 deletion src/Backends/Serial.php
@@ -13,7 +13,7 @@
* than a parallel backend in cases where the computations are minimal such as with
* small datasets.
*
* > **Note:** The Serial backend is the default for most objects that capable of
* > **Note:** The Serial backend is the default for most objects that are capable of
* parallel processing.
*
* @category Machine Learning
2 changes: 1 addition & 1 deletion src/Extractors/CSV.php
@@ -144,7 +144,7 @@ public function getIterator() : Generator
}

if (!$record) {
throw new RuntimeException("Malformed CSV on line $line.");
throw new RuntimeException("Malformed record on line $line.");
}

yield $record;
1 change: 1 addition & 0 deletions src/Other/Helpers/CPU.php
@@ -51,6 +51,7 @@ public static function cores() : int
preg_match_all(self::CORE_REGEX, $cpuinfo, $matches);

return count($matches[0]);

default:
throw new RuntimeException('Could not detect number'
. ' of processor cores.');