This project implements a full pipeline: news gathering -> preprocessing -> vectorization -> classification
Content:
Parser_news_LENTA creates 12 datasets (one per month of 2020), stored as data/data_on_months/news_lenta_XX_2020
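A minimal sketch of the monthly output layout; the zero-padded month numbering is an assumption inferred from the `news_lenta_XX_2020` pattern, and `monthly_paths` is a hypothetical helper, not a function from the repository.

```python
def monthly_paths(year: int = 2020) -> list[str]:
    """Return one dataset path per month: news_lenta_01_2020 .. news_lenta_12_2020.

    Zero-padded months are an assumption based on the XX placeholder
    in the documented path pattern.
    """
    return [
        f"data/data_on_months/news_lenta_{month:02d}_{year}"
        for month in range(1, 13)
    ]
```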
data_news_corrector_2020 combines all collected data and normalizes its shape, keeping only the tags, dt, main_text, and website columns. Final dataset: data/news_main_2020
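The combine-and-normalize step might look like the sketch below, assuming the monthly files load into pandas DataFrames; the toy in-memory frames stand in for the real monthly files, whose exact format (CSV, pickle, etc.) is not stated in the source.

```python
import pandas as pd

def combine_months(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate monthly frames and keep only the four target columns.

    In the real script the frames would come from reading each
    data/data_on_months/news_lenta_XX_2020 file; here the caller
    supplies them directly.
    """
    combined = pd.concat(frames, ignore_index=True)
    return combined[["tags", "dt", "main_text", "website"]]
```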
data_preprocessing performs stemming/lemmatization, stop-word removal, and replacement of numbers with a unified token.
It also derives category labels from the sites' own tags for downstream classification:
- economy
- entertainment
- traditions
- science
- society
- sports
- technology
Final dataset: data/news_main_prepr_2020, plus data/data_stem and data/data_lemm
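The light normalization steps above can be sketched as follows; the stop-word set is a tiny illustrative stand-in, the "num" token is an assumed choice for the unified number replacement, and stemming/lemmatization (e.g. with pymorphy2 or a Snowball stemmer for Russian) is deliberately omitted.

```python
import re

# Tiny illustrative stop-word set, not the project's real list.
STOPWORDS = {"и", "в", "на", "the", "a"}

def preprocess(text: str) -> str:
    """Lowercase, replace every number with a unified token, drop stop words.

    Stemming/lemmatization is omitted in this sketch; only the light
    normalization steps described in the README are shown.
    """
    text = text.lower()
    text = re.sub(r"\d+", "num", text)  # unify all numbers into one token
    tokens = [t for t in re.findall(r"\w+", text) if t not in STOPWORDS]
    return " ".join(tokens)
```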
vector_model_creator builds the vectorization models:
- tfidf_lemm_500k - TF-IDF
- d2v_300 - Doc2Vec
- ft_lemm_300 - FastText
- w2v_tfidf_vector_data - Word2Vec
- glove_tfidf_vector_data - GloVe
- use_vector_data - Universal Sentence Encoder
- bert_vector_data - BERT
All models are stored in the models/ folder.
Classifier_news performs the classification with:
- LogisticRegression
- SVM
- Single-layer perceptron
- BERT
- GPT-2
For the LogisticRegression, SVM, and single-layer perceptron models, the texts are first vectorized with one of the methods described above.
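The vectorize-then-classify combination can be sketched as a scikit-learn pipeline; the toy texts, labels, and the choice of TF-IDF as the vectorizer are illustrative assumptions, not the project's actual training setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy texts with obvious class signals; the real setup trains on the
# preprocessed 2020 news corpus with one of the vectorizations above.
texts = ["матч гол команда", "матч гол победа",
         "курс рубль банк", "банк кредит рубль"]
labels = ["sports", "sports", "economy", "economy"]

# Chain the vectorizer and classifier so raw text goes in one end
# and a predicted label comes out the other.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
preds = clf.predict(texts)
```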