REAL

These scripts generate keyword, ConceptNet, Stanford NER and feature-based features from a given text. They were initially developed for creating features from problem bodies in ASSISTments but can be used for other text.

Feature generators

Four different methods are used to generate features from the problem bodies.

Keywords

The problem body is checked for the presence of user-defined keywords.
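A minimal sketch of such a keyword check is shown here. The category names, the keyword lists and the function name are hypothetical and are not taken from driver.py; the sketch only assumes that a feature is 1 when any keyword of a category occurs in the text.

```python
import re

# Hypothetical category -> keywords mapping, for illustration only.
example_keywords = {
    "car": ["car", "truck", "vehicle"],
    "money": ["dollar", "dollars", "cents"],
}


def keyword_features(problem_body, categories):
    """Return {category: 1 or 0}, 1 if any keyword of the category occurs."""
    # Lowercase and tokenize so that "Car," still matches "car".
    tokens = set(re.findall(r"[a-z0-9']+", problem_body.lower()))
    return {
        name: int(any(word in tokens for word in words))
        for name, words in categories.items()
    }


print(keyword_features("A car travels 60 miles.", example_keywords))
```

Matching on whole tokens rather than substrings avoids false hits such as "car" inside "card", at the cost of missing inflected forms that are not listed.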

ConceptNet 5

The problem body is checked for the presence of words related to user-defined concepts. Relationships are inferred from MIT ConceptNet 5.
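The idea can be sketched as follows. To keep the example offline, the ConceptNet 5 lookup is replaced by a small hand-written table; STUB_RELATIONS, stub_related_terms and concept_features are all hypothetical names, and the real scripts may query and filter ConceptNet differently.

```python
import re

# Stand-in for ConceptNet 5: concept -> words assumed to be related to it.
STUB_RELATIONS = {
    "animal": {"dog", "cat", "horse"},
    "food": {"apple", "bread"},
}


def stub_related_terms(concept):
    """Return the related words for a concept (empty set if unknown)."""
    return STUB_RELATIONS.get(concept, set())


def concept_features(problem_body, concepts, related_terms):
    """Flag each concept that has at least one related word in the text."""
    tokens = set(re.findall(r"[a-z0-9']+", problem_body.lower()))
    return {
        concept: int(bool(tokens & set(related_terms(concept))))
        for concept in concepts
    }


print(concept_features("Sam has 3 dogs and a cat.", ["animal", "food"], stub_related_terms))
```

Passing the lookup in as a function makes it easy to swap the stub for a real ConceptNet-backed lookup, or for a cached one.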

Stanford Named Entity Recognizer (NER)

The problem body is checked for the presence of entities recognized by the Stanford NER. The entity types include location, time, person, organization, money, percentage and date.
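As a rough illustration, the sketch below turns tagger output into presence features. It assumes the text has already been tagged and is available as (token, tag) pairs using the tag names of the Stanford NER 7-class model; the sample pairs are written by hand, not produced by the tagger, and ner_features is a hypothetical helper.

```python
# Tag names of the Stanford NER 7-class model; "O" marks non-entity tokens.
ENTITY_TAGS = ("LOCATION", "TIME", "PERSON", "ORGANIZATION", "MONEY", "PERCENT", "DATE")


def ner_features(tagged_tokens):
    """Return {entity_type: 1 or 0} from a list of (token, tag) pairs."""
    found = {tag for _, tag in tagged_tokens}
    return {tag.lower(): int(tag in found) for tag in ENTITY_TAGS}


# Hand-written sample standing in for real tagger output.
tagged = [
    ("Maria", "PERSON"), ("paid", "O"), ("$", "MONEY"),
    ("5", "MONEY"), ("in", "O"), ("Boston", "LOCATION"),
]
print(ner_features(tagged))
```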

Feature-based features

Features based on other features can also be created. For example, a real-world-reference feature was created by checking for the presence of other features in the problem (e.g., car, animal, person, ...).
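One way to derive such a feature is sketched below. The mapping example_derived and the function add_derived_features are hypothetical; the sketch assumes a derived feature is 1 when at least one of its source features is 1.

```python
# Hypothetical mapping: derived feature -> features it is built from.
example_derived = {
    "real_world_reference": ["car", "animal", "person"],
}


def add_derived_features(features, derived):
    """Return a copy of features extended with the derived ones."""
    out = dict(features)
    for name, sources in derived.items():
        # A missing source feature counts as absent (0).
        out[name] = int(any(out.get(source, 0) for source in sources))
    return out


base = {"car": 0, "animal": 1, "person": 0}
print(add_derived_features(base, example_derived))
```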

Feature generation

Retrieve unique problems from the ASSISTments database and store them in CSV format.
Change the source and target variables in driver.py to control the location of the ASSISTments CSV and the output file location.
If needed, modify the keyword_categories and conceptnet_categories lists in driver.py to include other keywords and concepts to be used as features.
If needed, modify the columnval_categories dictionary in driver.py to create other feature-based features.
If needed, modify the DATA_COL variable in driver.py to the column containing the problem body in the ASSISTments CSV.
Performance can be tuned with the batch and pool_size parameters in driver.py. batch is the number of problems loaded into memory before they are processed. Setting batch to 1 runs the generators on a single problem at a time, so choose a value as high as will still fit in your computer's memory. pool_size controls the number of parallel generator processes. The best value depends on your machine and will need tuning; as a general rule, set pool_size to 1 less than the number of processors in your machine.
Run using python driver.py
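To illustrate how batch and pool_size interact, here is a minimal sketch using Python's multiprocessing module. It is not the code in driver.py: iter_batches, suggested_pool_size, generate and process_all are hypothetical names, and generate is only a stand-in for the real feature generators.

```python
import multiprocessing
from itertools import islice


def iter_batches(rows, batch):
    """Yield lists of at most `batch` rows, so only one batch is in memory."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch))
        if not chunk:
            return
        yield chunk


def suggested_pool_size():
    """One less than the number of processors, but never below 1."""
    return max(1, multiprocessing.cpu_count() - 1)


def generate(problem_body):
    """Stand-in for the real generators: counts whitespace-separated tokens."""
    return len(problem_body.split())


def process_all(rows, batch, pool_size):
    """Run `generate` over all rows, one batch at a time, in parallel."""
    results = []
    with multiprocessing.Pool(pool_size) as pool:
        for chunk in iter_batches(rows, batch):
            results.extend(pool.map(generate, chunk))
    return results


if __name__ == "__main__":
    problems = ["2 + 2", "A car travels 60 miles", "x"]
    print(process_all(problems, batch=2, pool_size=suggested_pool_size()))
```

A larger batch gives each pool.map call more work to spread across the worker processes, which is why very small batches tend to waste the pool.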

Output files

history.cfg - stores the last row processed by the script so that it can resume from that row instead of redoing the entire process whenever an unexpected error is encountered
wordmap.json - JSON-formatted mapping of words and features used by the generators, so that previously used words do not have to be re-processed
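The resume-and-cache behaviour of these two files can be sketched as follows. The actual layout of history.cfg and wordmap.json is not documented here, so the section name "history", the key "last_row" and all function names below are assumptions made for illustration.

```python
import configparser
import json
import os
import tempfile


def load_last_row(path):
    """Return the last processed row, or 0 if no history exists yet."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    return parser.getint("history", "last_row", fallback=0)


def save_last_row(path, row):
    """Record the last processed row (assumed section and key names)."""
    parser = configparser.ConfigParser()
    parser["history"] = {"last_row": str(row)}
    with open(path, "w") as handle:
        parser.write(handle)


def load_wordmap(path):
    """Return the cached word mapping, or an empty dict if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return json.load(handle)


def save_wordmap(path, wordmap):
    """Write the word mapping as JSON."""
    with open(path, "w") as handle:
        json.dump(wordmap, handle)


# Demonstration in a temporary directory, so no real files are touched.
with tempfile.TemporaryDirectory() as workdir:
    history_path = os.path.join(workdir, "history.cfg")
    wordmap_path = os.path.join(workdir, "wordmap.json")
    start = load_last_row(history_path)       # nothing saved yet
    save_last_row(history_path, start + 100)  # pretend 100 rows were processed
    save_wordmap(wordmap_path, {"dog": ["animal"]})
    print(load_last_row(history_path), load_wordmap(wordmap_path))
```

Deleting both files would make a run start from the first row with an empty cache.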
