Sentence Splitting Approach in BERT Preprocessing #1394

AliHaiderAhmad001 · 2023-10-17T10:39:37Z

Hi,

I am very impressed with your work on BERT.

Currently I am reproducing the BERT model from scratch for educational purposes. I have finished building the model, but I have a question about preprocessing the data. Note that I am not using the same dataset; instead, I am using the IMDB dataset. I am trying to emulate your approach as much as possible.

The case

I treat each review as a document and break each document down into sentences. Since the way the sentences are divided seems crucial, I've decided to take the following approach:

In 10 percent of cases the maximum possible number of words is taken (256 words).
In 80 percent of cases, the text is split at ., !, ; or ?.
In 10 percent of cases, randomly.

import random

def split_sentences(text, delimiters=".!?;", max_words=250):
  # Draw a single random number so the three branches get exactly
  # 10% / 80% / 10% of the cases and the function always returns a result.
  r = random.random()

  # Split sentences based on maximum word count (10% of cases)
  if r < 0.1:
      return split_text_by_maximum_word_count(text, max_words)

  # Split sentences based on common punctuation marks (80% of cases)
  if r < 0.9:
      return split_text_by_punctuation_marks(text, delimiters, max_words)

  # Random splitting (remaining 10% of cases)
  return random_splitting(text, max_words)
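
The three helpers called above are not shown in this post. Below is a minimal sketch of what they might look like, under these assumptions: words are whitespace-separated tokens, every helper returns a list of strings, and a punctuation-delimited sentence longer than max_words is cut into chunks. The function bodies, the `chunk_words` helper, and the `rng` parameter are hypothetical and only illustrate the idea.

```python
import random
import re


def chunk_words(words, max_words):
    """Group a list of words into strings of at most max_words words (hypothetical helper)."""
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def split_text_by_maximum_word_count(text, max_words):
    """Cut the text into consecutive chunks of at most max_words words."""
    return chunk_words(text.split(), max_words)


def split_text_by_punctuation_marks(text, delimiters, max_words):
    """Split after any delimiter character, then cap each piece at max_words words."""
    # The lookbehind keeps the delimiter attached to the sentence it ends.
    pattern = r"(?<=[" + re.escape(delimiters) + r"])\s+"
    sentences = []
    for piece in re.split(pattern, text.strip()):
        if piece:
            sentences.extend(chunk_words(piece.split(), max_words))
    return sentences


def random_splitting(text, max_words, rng=random):
    """Cut the text at random word boundaries; each chunk has 1..max_words words."""
    words = text.split()
    sentences, start = [], 0
    while start < len(words):
        size = rng.randint(1, max_words)  # assumed: uniform chunk length
        sentences.append(" ".join(words[start:start + size]))
        start += size
    return sentences
```

Passing a seeded `random.Random` instance as `rng` makes the random split reproducible, which helps when comparing preprocessing runs.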

The question:

I would like to know whether my approach is wrong. How did you separate the sentences in your approach?

Thanks
