Clustering of English texts on the basis of automated extraction of key properties

To run the code, download the Jupyter notebook "REALEC_clustering.ipynb" and unzip the dataset "REALEC_cleaned.zip" into the same directory.

Next, set the following parameters, each as a string:

data_type - the data to analyze; can be set to:

'unknown_data' - unlabeled data; in this case, also set the path to the folder with the essays (example: folder = "datasets/REALEC_clean/exam/Exam2018")
'test_graph' - labeled data (graph descriptions from Exam2017), used to plot their representations for the test results
'test_essay' - labeled data (opinion essays from Exam2017), used to plot their representations for the test results
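The data_type options above can be checked before running the rest of the notebook. The following is a minimal sketch, not the notebook's own code: the helper check_data_type and the constant VALID_DATA_TYPES are hypothetical names introduced here for illustration; only data_type, folder, and the three option strings come from this README.

```python
from pathlib import Path

# The three values listed in this README.
VALID_DATA_TYPES = {"unknown_data", "test_graph", "test_essay"}


def check_data_type(data_type, folder=None):
    """Hypothetical helper: validate data_type and return the essay folder.

    Returns a Path for 'unknown_data' (which needs a folder) and None for
    the two labeled test modes.
    """
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"data_type must be one of {sorted(VALID_DATA_TYPES)}")
    if data_type == "unknown_data":
        if folder is None:
            raise ValueError("'unknown_data' requires a path to the essay folder")
        return Path(folder)
    return None


data_type = "unknown_data"
folder = "datasets/REALEC_clean/exam/Exam2018"
essay_dir = check_data_type(data_type, folder)
```

The check only validates the settings; it does not verify that the folder exists on disk.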

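method - the method used to compute embeddings; can take the following values:

'tf-idf' - the sklearn TF-IDF method
'bm25' - BM25, a weighted variant of TF-IDF
'dm' - the Distributed Memory (PV-DM) implementation of doc2vec (Paragraph Vector)
'dbow' - the Distributed Bag of Words (PV-DBOW) implementation of doc2vec (Paragraph Vector)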
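As an illustration of how the method values could map to embedding code, here is a minimal sketch. It is an assumption about the general approach, not the notebook's implementation: the function embed, its vector_size argument, and the sample texts are hypothetical. 'tf-idf' uses sklearn's TfidfVectorizer, 'dm' and 'dbow' use gensim's Doc2Vec with dm=1 and dm=0 respectively (gensim must be installed for those two branches), and 'bm25' is left out of the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def embed(texts, method="tf-idf", vector_size=50):
    """Hypothetical dispatcher: return one embedding row per input text."""
    if method == "tf-idf":
        # Sparse TF-IDF matrix converted to a dense array for simplicity.
        return TfidfVectorizer().fit_transform(texts).toarray()
    if method in ("dm", "dbow"):
        # Imported lazily so the TF-IDF path works without gensim.
        import numpy as np
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        docs = [TaggedDocument(t.lower().split(), [i]) for i, t in enumerate(texts)]
        model = Doc2Vec(
            docs,
            vector_size=vector_size,
            dm=1 if method == "dm" else 0,  # 1 = PV-DM, 0 = PV-DBOW
            min_count=1,
            epochs=20,
        )
        return np.vstack([model.dv[i] for i in range(len(texts))])
    if method == "bm25":
        raise NotImplementedError("BM25 weighting is not covered in this sketch")
    raise ValueError(f"unknown method: {method!r}")


# Made-up sample texts, for illustration only.
texts = [
    "The graph shows a steady rise in sales.",
    "In my opinion, education should be free.",
    "The chart illustrates a sharp decline in exports.",
]
X = embed(texts, method="tf-idf")
```

Each row of X is the embedding of one text and can be passed to a clustering algorithm.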

After analyzing the values of the internal measures, set the optimal number of clusters ('best_k', 4 by default). This value is used by all subsequent clustering algorithms and for keyword extraction for the resulting groups.
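One way to compare candidate values for best_k is sketched below. It assumes the silhouette score as the internal measure and KMeans as the clustering algorithm; this README does not say which measures or algorithms the notebook uses, so both are stand-ins. The function pick_best_k and the synthetic data are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def pick_best_k(X, k_values=range(2, 7), random_state=0):
    """Hypothetical helper: score each k by silhouette and return the best one."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for document embeddings: four well-separated groups.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

best_k, scores = pick_best_k(X)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

On real embeddings the scores are usually much closer together than on this synthetic data, so it helps to inspect the whole table rather than only the maximum before setting best_k.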

Dataset REALEC: https://yadi.sk/d/AVFzedaVtSporg

Dataset EFCAMDAT: https://yadi.sk/d/LXn7PifcOmOjYQ
