Added extra whitespace pattern to Regex Filter
andrewdalpino committed May 3, 2021
1 parent 0646362 commit 2faf539
Showing 8 changed files with 44 additions and 30 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
- 1.0.0-beta3
- 1.0.0-rc1
- Added Token Hashing Vectorizer transformer
- Added Word Stemmer tokenizer from Extras
- Remove HTML Stripper and Whitespace Remover transformers
@@ -7,6 +7,7 @@
- Remove rules() method on CART
- Removed results() and best() methods from Grid Search
- Change string representation of NAN to match PHP
- Added extra whitespace pattern to Regex Filter

- 1.0.0-beta2
- Interval Discretizer now uses variable width histograms
16 changes: 10 additions & 6 deletions docs/transformers/regex-filter.md
@@ -1,7 +1,10 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/Transformers/RegexFilter.php">[source]</a></span>

# Regex Filter
Filters the text columns of a dataset by matching a list of regular expressions.
Filters the text features of a dataset by matching and removing patterns from a list of regular expressions.

!!! note
Patterns are filtered in the same sequence as they are given in the constructor.

**Interfaces:** [Transformer](api.md#transformer)

@@ -27,18 +30,19 @@ $transformer = new RegexFilter([
## Predefined Regex Patterns
| Class Constant | Description |
|---|---|
| URL | An alias for the default URL matching pattern (GRUBER 1). |
| EMAIL | A pattern to match any email address. |
| URL | An alias for the default URL matching pattern. |
| GRUBER_1 | The original Gruber URL matching pattern. |
| GRUBER_2 | The improved Gruber URL matching pattern. |
| EMAIL | A pattern to match any email address. |
| EXTRA_CHARACTERS | Matches consecutively repeated non word or number characters such as punctuation and special characters. |
| EXTRA_WORDS | Matches consecutively repeated words. |
| EXTRA_WHITESPACE | Matches consecutively repeated whitespace characters. |
| MENTION | A pattern that matches Twitter-style mentions (@example). |
| HASHTAG | Matches Twitter-style hashtags (#example). |
| EXTRA_CHARACTERS | Matches extra non word or number characters such as repeated punctuation and special characters. |
| EXTRA_WORDS | Matches extra (consecutively repeated) words.
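
Because patterns are applied in the order given, it can matter where `EXTRA_WHITESPACE` sits in the list. The following is a minimal sketch using plain `preg_replace()` rather than the transformer itself. It assumes each match is replaced with an empty string, which this diff does not show. The two patterns are copied from the constants in this commit; the function name `filterText` and the sample string are illustrative only.

```php
<?php

// Patterns copied verbatim from the RegexFilter constants in this commit.
const EXTRA_WORDS = '/\b(\w+)(?=\s+\1+\b)/ui';
const EXTRA_WHITESPACE = '/\s(?=\s+)/';

/**
 * Hypothetical helper: remove every match of each pattern, in order.
 * Assumption: matches are replaced with an empty string.
 *
 * @param string[] $patterns
 */
function filterText(string $text, array $patterns) : string
{
    // When given an array of patterns, preg_replace() applies them one
    // after another, each operating on the result of the previous one.
    return preg_replace($patterns, '', $text);
}

$text = 'The quick quick   brown fox';

// Whitespace last: the gap left by the removed word is collapsed.
$whitespaceLast = filterText($text, [EXTRA_WORDS, EXTRA_WHITESPACE]);

// Whitespace first: removing the repeated word afterwards leaves a double space.
$whitespaceFirst = filterText($text, [EXTRA_WHITESPACE, EXTRA_WORDS]);

var_dump($whitespaceLast);  // string(19) "The quick brown fox"
var_dump($whitespaceFirst); // string(20) "The  quick brown fox"
```

Placing `EXTRA_WHITESPACE` at the end of the list, as the updated test in this commit does, cleans up the gaps left behind by the earlier patterns.
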

## Additional Methods
This transformer does not have any additional methods.

## References:
[^1]: J. Gruber. (2009). A Liberal, Accurate Regex Pattern for Matching URLs.
[^2]: J. Gruber. (2010). An Improved Liberal, Accurate Regex Pattern for Matching URLs.
[^2]: J. Gruber. (2010). An Improved Liberal, Accurate Regex Pattern for Matching URLs.
2 changes: 2 additions & 0 deletions src/Transformers/KNNImputer.php
@@ -234,6 +234,8 @@ public function transform(array &$samples) : void
}

$value = argmax($scores);

break;
}
}
}
29 changes: 18 additions & 11 deletions src/Transformers/RegexFilter.php
@@ -14,7 +14,7 @@
/**
* Regex Filter
*
* Filters the text columns of a dataset by matching a list of regular expressions.
* Filters the text features of a dataset by matching and removing patterns from a list of regular expressions.
*
* References:
* [1] J. Gruber. (2009). A Liberal, Accurate Regex Pattern for Matching URLs.
@@ -26,6 +26,13 @@
*/
class RegexFilter implements Transformer
{
/**
* A pattern to match email addresses.
*
* @var string
*/
public const EMAIL = '/[a-z0-9_\-\+\.]+@[a-z0-9\-]+\.([a-z]{2,4})(?:\.[a-z]{2})?/i';

/**
* The default URL matching pattern.
*
@@ -48,39 +55,39 @@ class RegexFilter implements Transformer
public const GRUBER_2 = '%(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))%s';

/**
* A pattern to match email addresses.
* Matches consecutively repeated non word or number characters such as punctuation and special characters.
*
* @var string
*/
public const EMAIL = '/[a-z0-9_\-\+\.]+@[a-z0-9\-]+\.([a-z]{2,4})(?:\.[a-z]{2})?/i';
public const EXTRA_CHARACTERS = '/([^\w\s])(?=[^\w\s]*\1)/';

/**
* A pattern to match Twitter-style mentions (ex. @RubixML).
* Matches consecutively repeated words.
*
* @var string
*/
public const MENTION = '/(@\w+)/';
public const EXTRA_WORDS = '/\b(\w+)(?=\s+\1+\b)/ui';

/**
* A pattern to match Twitter-style hashtags (ex. #MachineLearning).
* Matches consecutively repeated whitespace characters.
*
* @var string
*/
public const HASHTAG = '/(#\w+)/';
public const EXTRA_WHITESPACE = '/\s(?=\s+)/';

/**
* Matches extra non word or number characters such as repeated punctuation and special characters.
* A pattern to match Twitter-style mentions (ex. @RubixML).
*
* @var string
*/
public const EXTRA_CHARACTERS = '/([^\w\s])(?=[^\w\s]*\1)/';
public const MENTION = '/(@\w+)/';

/**
* Matches extra (consecutively repeated) words.
* A pattern to match Twitter-style hashtags (ex. #MachineLearning).
*
* @var string
*/
public const EXTRA_WORDS = '/\b(\w+)(?=\s+\1+\b)/ui';
public const HASHTAG = '/(#\w+)/';

/**
* A list of regular expression patterns used to filter the text columns of the dataset.
3 changes: 1 addition & 2 deletions src/Transformers/TSNE.php
@@ -509,8 +509,7 @@ protected function gradient(Matrix $p, Matrix $y, Matrix $distances) : Matrix
->add(1.0)
->pow((1.0 + $this->dofs) / -2.0);

$q = $q->divide($q->sum()->multiply(2.0))
->clipLower(EPSILON);
$q = $q->divide($q->sum()->multiply(2.0)->clipLower(EPSILON));

$pqd = $p->subtract($q)->multiply($distances);

2 changes: 1 addition & 1 deletion src/constants.php
@@ -9,7 +9,7 @@
*
* @var string
*/
const VERSION = '1.0.0-beta2';
const VERSION = '1.0.0-rc1';

/**
* A small number used in substitution of 0.
15 changes: 8 additions & 7 deletions tests/Transformers/RegexFilterTest.php
@@ -31,18 +31,19 @@ protected function setUp() : void
$this->dataset = Unlabeled::quick([
['I was not proud of what I had learned, but I never doubted that it was worth $$$ knowing..'],
['Too weird to live, [email protected] too rare to die https://rubixml.com'],
['A man who procrastinates in @his choosing will inevitably have his choice made for him by #circumstance'],
['A man who procrastinates in @his choosing will inevitably have his choice made for him by #circumstance'],
['The quick quick brown fox jumped over the lazy man sitting at a bus stop drinking a can of Cola cola'],
['Diese äpfel Äpfel schmecken sehr gut'],
]);

$this->transformer = new RegexFilter([
RegexFilter::URL,
RegexFilter::EMAIL,
RegexFilter::MENTION,
RegexFilter::HASHTAG,
RegexFilter::EXTRA_CHARACTERS,
RegexFilter::EXTRA_WORDS,
RegexFilter::MENTION,
RegexFilter::HASHTAG,
RegexFilter::EXTRA_WHITESPACE,
]);
}

@@ -64,10 +65,10 @@ public function transform() : void

$expected = [
['I was not proud of what I had learned, but I never doubted that it was worth $ knowing.'],
['Too weird to live, too rare to die '],
['A man who procrastinates in choosing will inevitably have his choice made for him by '],
['The quick brown fox jumped over the lazy man sitting at a bus stop drinking a can of cola'],
['Diese Äpfel schmecken sehr gut'],
['Too weird to live, too rare to die '],
['A man who procrastinates in choosing will inevitably have his choice made for him by '],
['The quick brown fox jumped over the lazy man sitting at a bus stop drinking a can of cola'],
['Diese Äpfel schmecken sehr gut'],
];

$this->assertEquals($expected, $this->dataset->samples());
4 changes: 2 additions & 2 deletions tests/Transformers/TSNETest.php
@@ -1,11 +1,11 @@
<?php

namespace Rubix\ML\Tests\Embedders;
namespace Rubix\ML\Tests\Transformers;

use Rubix\ML\Verbose;
use Rubix\ML\DataType;
use Rubix\ML\Transformers\TSNE;
use Rubix\ML\Loggers\BlackHole;
use Rubix\ML\Transformers\TSNE;
use Rubix\ML\Datasets\Generators\Blob;
use Rubix\ML\Kernels\Distance\Euclidean;
use Rubix\ML\Datasets\Generators\Agglomerate;