jdi-testing/jdi-qasp-ml

JDI-QASP-ml v.2.0

Model

MUI

Our model is a neural network built on the PyTorch framework and organized as follows:

-> Input linear layer
-> Dropout layer
-> LeakyReLU activation layer
-> Batch normalization layer
-> Hidden linear layer
-> Dropout layer
-> LeakyReLU activation layer
-> Output linear layer
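
The layer order above can be sketched in PyTorch as follows. This is only an illustration of the stacking order: the function name, the layer sizes, and the dropout rate are assumptions and are not taken from the repository.

```python
import torch
from torch import nn


def build_classifier(in_features: int, hidden: int, n_classes: int,
                     dropout: float = 0.3) -> nn.Sequential:
    """Stack the layers in the listed order (all sizes are assumptions)."""
    return nn.Sequential(
        nn.Linear(in_features, hidden),   # input linear layer
        nn.Dropout(dropout),              # dropout layer
        nn.LeakyReLU(),                   # LeakyReLU activation layer
        nn.BatchNorm1d(hidden),           # batch normalization layer
        nn.Linear(hidden, hidden),        # hidden linear layer
        nn.Dropout(dropout),              # dropout layer
        nn.LeakyReLU(),                   # LeakyReLU activation layer
        nn.Linear(hidden, n_classes),     # output linear layer (raw logits)
    )


model = build_classifier(in_features=128, hidden=64, n_classes=10)
model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    logits = model(torch.randn(4, 128))  # a batch of 4 random feature vectors
```

The output has one row per input element and one column per label; the real input width depends on the feature groups the model is trained on.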

As input for the NN we use the following calculated groups of features:

  • Attribute features (one-hot encoded info about whether certain attributes are present on the object, its parent, and its upper and lower siblings)
  • Class features (TF-IDF encoded info about the class attribute of the object, its parent, and its upper and lower siblings)
  • Type features (one-hot encoded info about the type attribute of the object, its parent, and its upper and lower siblings)
  • Role features (one-hot encoded info about the role attribute of the object, its parent, and its upper and lower siblings)
  • TAG features (one-hot encoded info about the tag of the object, its parent, and its upper and lower siblings)
  • Followers TAG features (TF-IDF encoded info about all tags of the children or, more generally, followers)
  • Numerical general features (general features of the object such as the number of followers, the number of children, max_depth, etc.)
  • Binary general features (general features with binary values, such as whether the object or its parent is hidden, displayed, a leaf, etc.)
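
As a rough illustration of the one-hot groups above, the sketch below encodes the tag of an element, its parent, and its two siblings into one flat vector. The vocabulary, the dictionary keys, and the function names are hypothetical; the repository's actual feature builder may differ.

```python
from typing import Dict, List, Optional

# Hypothetical tag vocabulary; the real one is derived from the dataset.
TAG_VOCAB = ["button", "div", "input", "span"]


def one_hot(value: Optional[str], vocab: List[str]) -> List[int]:
    """Return a one-hot vector; unknown or missing values map to all zeros."""
    return [1 if value == item else 0 for item in vocab]


def tag_features(element: Dict[str, Optional[str]]) -> List[int]:
    """Concatenate one-hot tag vectors for the element and its neighbours."""
    vector: List[int] = []
    for key in ("tag", "parent_tag", "upper_sibling_tag", "lower_sibling_tag"):
        vector.extend(one_hot(element.get(key), TAG_VOCAB))
    return vector


features = tag_features({
    "tag": "button",
    "parent_tag": "div",
    "upper_sibling_tag": None,   # no sibling above this element
    "lower_sibling_tag": "span",
})
```

Each neighbour contributes one block of the vector, so a missing sibling simply leaves its block at zero.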

Angular

Same model as for MUI.

HTML5

Our model is a decision tree, chosen because of the simplicity of the classic HTML5 element structure.

A picture of the tree can be found in HTML5_model/model/tree.jpeg

Install environment to train / test model

  1. Clone the repository.
  2. Download and install Anaconda from https://www.anaconda.com/products/individual.
    Alter your PATH environment variable so that you can run python as well as the conda utility.
  3. Create the conda virtual environment using this command (see create-env.bat if you use Windows):
    conda env create -f environment.yml --name jdi-qasp-ml
  4. Run cmd.exe on Windows or a terminal on macOS, and from the command prompt run:
    conda activate jdi-qasp-ml

Generating the dataset for training model

MUI

The generator for MUI element library sites is located in generators/MUIgenerator/
To generate sites, go to the MUIgenerator directory and run:

    sh generate_data.sh

After that, in the directory /data/mui_dataset/build you will find directories named like "site-N"

Next, go to the directory MUI_model and run:

    python build_datasets_for_mui_sites.py

After that, in the directory /data/mui_dataset you will find the following structure:

  • /annotations (not used, may be removed later)
  • /cache-labels (not used, may be removed later)
  • /df - directory with pickles of site datasets
  • /html - directory with HTML files of sites (for information only)
  • /images - directory with images of sites (for information only)
  • classes.txt - file with all possible labels to detect. Do not change it!!!
  • EXTRACT_ATTRIBUTES_LIST.json - file with all attributes taken into account by the model (needed for feature building). Do not change it!!!

Angular

The generator for Angular element library sites is located in generators/NgMaterialGenerator/
To generate sites, go to the NgMaterialGenerator directory and run:

    sh generate_data.sh

After that, in the directory /data/angular_dataset/build you will find directories named like "site-N"

Next, go to the directory Angular_model and run:

    python build_datasets_for_angular_sites.py

After that, in the directory /data/angular_dataset you will find the following structure:

  • /annotations (not used, may be removed later)
  • /cache-labels (not used, may be removed later)
  • /df - directory with pickles of site datasets
  • /html - directory with HTML files of sites (for information only)
  • /images - directory with images of sites (for information only)
  • classes.txt - file with all possible labels to detect. Do not change it!!!
  • EXTRACT_ATTRIBUTES_LIST.json - file with all attributes taken into account by the model (needed for feature building). Do not change it!!!

HTML5

The generator for HTML5 element library sites is located in generators/HTMLgenerator/
To generate sites, go to the HTMLgenerator directory and run:

    python generate-html.py

After that, in the directory /data/html5_dataset/build/ you will find a directory named "html5"

Next, go to the directory HTML5_model and run:

    python build_datasets_for_html5_sites.py

After that, in the directory /data/html5_dataset you will find the following structure:

  • /annotations (empty, may need to be deleted)
  • /cache-labels (empty, may need to be deleted)
  • /df - directory with pickles of site datasets
  • /html - directory with HTML files of sites (for information only)
  • /images - directory with images of sites (for information only)
  • classes.txt - file with all possible labels to detect. Do not change it!!!
  • EXTRACT_ATTRIBUTES_LIST.json - file with all attributes taken into account by the model (needed for feature building). Do not change it!!!
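
Since the MUI, Angular, and HTML5 dataset directories share the same layout, a quick check before training can catch an incomplete build. The helper below is a hypothetical sketch (it is not part of the repository) and assumes the dataset path is given relative to wherever the data directory lives on your machine.

```python
from pathlib import Path
from typing import List

# Entries from the listed structure that are actually used by the pipeline.
EXPECTED_DIRS = ["df", "html", "images"]
EXPECTED_FILES = ["classes.txt", "EXTRACT_ATTRIBUTES_LIST.json"]


def missing_entries(dataset_dir: str) -> List[str]:
    """Return the names of expected directories and files that are absent."""
    root = Path(dataset_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

For example, `missing_entries("data/mui_dataset")` returns an empty list when the build step has produced everything listed above.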

Train model

MUI

To train the model you need to go to the directory /MUI_model and run:

    python train.py

To adjust the training parameters, change the following variables used by train.py (located in vars/mui_train_vars.py):

  • BATCH_SIZE (2048 by default)
  • TRAIN_LEN and TEST_LEN
  • NUM_EPOCHS (2 by default)
  • EARLY_STOPPING_THRESHOLD (2 by default)

At the end of the process, the table with training results is saved to MUI_model/tmp/train_metrics.csv

Angular

To train the model you need to go to the directory /Angular_model and run:

    python train.py

HTML5

To train the model you need to go to the directory /HTML5_model and run:

    python train.py

To adjust the training parameters, change the following variables used by train.py (located in vars/html5_train_vars.py):

  • TRAIN_LEN and TEST_LEN
  • parameters of the decision tree (DT)

At the end of the process, the table with training results is saved to MUI_model/tmp/train_metrics.csv

Predicting

To get predictions, we need to run the API (main.py); it is better to do this via Docker, as discussed below. When the API is running, we can send input JSON data to the following URL:

Validate model

MUI

To validate model quality, we use test web pages located in the directory notebooks/MUI/Test-backend

Only change notebooks whose names end in "new", such as "Test-backend_mui-Buttons_new.ipynb" (the others are legacy and kept for comparison).

In those notebooks we load a specific web page, create a dataset, and predict labels for that dataset. You may need to correct some paths in the notebooks (especially the ports in them).

To use these notebooks, main.py needs to be running or Docker needs to be up.

HTML5

To validate model quality, we use test web pages located in the directory notebooks/HTML5/Test-backend

Docker

Get the Docker image from GitHub:

Release version

macOS / Linux

curl --output jdi-bootstrap.sh https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/master/scripts/bootstrap.sh && \
bash jdi-bootstrap.sh

Windows

curl --output jdi-bootstrap.ps1 https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/master/scripts/bootstrap.ps1 && ^
powershell -executionpolicy bypass .\jdi-bootstrap.ps1

Development version

macOS / Linux

curl --output jdi-bootstrap.sh https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/develop/scripts/bootstrap.sh && \
bash jdi-bootstrap.sh -b develop -l develop

Windows

curl --output jdi-bootstrap.ps1 https://raw.githubusercontent.com/jdi-testing/jdi-qasp-ml/develop/scripts/bootstrap.ps1 && ^
powershell -executionpolicy bypass .\jdi-bootstrap.ps1 -Branch develop -Label develop

Installing version from any other repository branch:

Example with branch "branch_name":

Installing for the first time:

  1. Clone the repository to your machine:
    git clone https://github.com/jdi-testing/jdi-qasp-ml.git
  2. After the process has finished, go to the project folder:
    cd jdi-qasp-ml
  3. Check out the branch you need:
    git checkout branch_name
  4. Copy the .env.dist file to .env:
    cp .env.dist .env
  5. Adjust the variables in the .env file to your needs (refer to the Settings section).
  6. Build and start the containers:
    docker-compose -f docker-compose.dev.yaml up --build

The next time you want to run or rerun the containers, use the following commands:

  1. Stop the running containers:
    docker-compose -f docker-compose.dev.yaml down -v
  2. Update the repository with new commits:
    git pull
  3. Restart the containers:
    docker-compose -f docker-compose.dev.yaml up

Settings

Variable name | Description | Default value
SELENOID_PARALLEL_SESSIONS_COUNT | Total number of parallel Selenoid sessions. Also used to determine the number of processes used to calculate the visibility of page elements. Set it to the number of parallel threads supported by your processor, or optionally to that number minus 2 if you'd like to reduce CPU load. | 4
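
One way to pick a value for this setting is to derive it from the logical CPU count of the machine. The helper below is a hypothetical sketch and not part of the project; it assumes that `os.cpu_count()` is a reasonable stand-in for the number of parallel threads your processor supports.

```python
import os


def suggested_sessions(reduce_cpu_load: bool = False) -> int:
    """Suggest a SELENOID_PARALLEL_SESSIONS_COUNT value for this machine."""
    threads = os.cpu_count() or 4  # fall back to the documented default of 4
    if reduce_cpu_load:
        threads -= 2               # leave two threads free to reduce CPU load
    return max(1, threads)         # never suggest fewer than one session


# Print a line that can be pasted into the .env file.
print(f"SELENOID_PARALLEL_SESSIONS_COUNT={suggested_sessions(reduce_cpu_load=True)}")
```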

Docker - get debugging info:

    docker system prune --all --force

Development:

API service dependencies

New dependencies can be added with the pipenv command:

pipenv install <package>==<version>

If conflicts arise when creating a new pipenv environment on your local machine, add the dependencies inside the container instead:

docker compose -f docker-compose.dev.yaml run --rm api pipenv install <package>==<version>

API

You can see the available API methods in Swagger at http://localhost:5050/docs

Websocket commands

The following commands can be sent to the websocket and are processed by the back end:

1. Schedule Xpath Generation for an element in some document:

Request sent:

{
    "action": "schedule_xpath_generation",
    "payload": {
        "document": '"<head jdn-hash=\\"0352637447734573274412895785\\">....',
        "id": "1122334455667788990011223344",
        "config": {
            "maximum_generation_time": 10,
            "allow_indexes_at_the_beginning": false,
            "allow_indexes_in_the_middle": false,
            "allow_indexes_at_the_end": false
        }
    }
}

Response from websocket:

{
    "action": "tasks_scheduled",
    "payload": {"1122334455667788990011223344": "1122334455667788990011223344"}
}
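
Building the request with a JSON serializer avoids hand-written quoting mistakes such as trailing commas. The sketch below assembles a schedule_xpath_generation message with the field names shown above; the function name and the short HTML string are placeholders, and whether the document needs any extra encoding before sending is not confirmed here.

```python
import json
from typing import Any, Dict


def xpath_generation_message(document: str, element_id: str,
                             max_time: int = 10) -> str:
    """Serialize a schedule_xpath_generation request as strict JSON."""
    message: Dict[str, Any] = {
        "action": "schedule_xpath_generation",
        "payload": {
            "document": document,
            "id": element_id,
            "config": {
                "maximum_generation_time": max_time,
                "allow_indexes_at_the_beginning": False,
                "allow_indexes_in_the_middle": False,
                "allow_indexes_at_the_end": False,
            },
        },
    }
    # json.dumps writes Python False as lowercase false and escapes quotes.
    return json.dumps(message)


# Placeholder document; a real request carries the full page markup.
raw = xpath_generation_message(
    '<head jdn-hash="0352637447734573274412895785"></head>',
    "1122334455667788990011223344",
)
```

The resulting string can be sent as a text frame by whichever websocket client you use.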

2. Schedule CSS locators generation for specific elements in a document:

Request sent:

{
    "action": "schedule_css_locators_generation",
    "payload": {
        "document": '"<head jdn-hash=\\"0352637447734573274412895785\\">....',
        "id": ["1122334455667788990011223344"]
    }
}

Response with the generated locator:

{
    "action": "result_ready",
    "payload": {
        "id": "css-selectors-gen-47e475cd-3696-400e-b761-8db6ec38857d",
        "result": [
            {
                "id": "1122334455667788990011223344",
                "result": "p:nth-child(3)"
            }
        ]
    }
}

The result will be an empty string if the algorithm fails to find a CSS locator for the element.
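
On the client side, the empty-string case is easy to overlook. The sketch below parses a result_ready message of the shape shown above and turns empty results into None; the function name is hypothetical, and the second entry in the example message is made up to illustrate a failed lookup.

```python
import json
from typing import Dict, Optional


def css_locators(raw_message: str) -> Dict[str, Optional[str]]:
    """Map element id to CSS locator, using None where generation failed."""
    message = json.loads(raw_message)
    if message.get("action") != "result_ready":
        return {}  # not a result message, nothing to extract
    return {
        item["id"]: item["result"] or None  # empty string means no locator
        for item in message["payload"]["result"]
    }


example = json.dumps({
    "action": "result_ready",
    "payload": {
        "id": "css-selectors-gen-47e475cd-3696-400e-b761-8db6ec38857d",
        "result": [
            {"id": "1122334455667788990011223344", "result": "p:nth-child(3)"},
            {"id": "1122334455667788990011223345", "result": ""},  # illustrative failure
        ],
    },
})
locators = css_locators(example)
```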

3. Get task status:

Request sent:

{
    "action": "get_task_status",
    "payload": {"id": "1122334455667788990011223344"}
}

4. Get task statuses:

Request sent:

{
    "action": "get_task_status",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346"
        ]
    }
}

5. Revoke tasks:

Request sent:

{
    "action": "revoke_tasks",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346"
        ]
    }
}

Response from websocket:

{
    "action": "tasks_revoked",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346"
        ]
    }
}

6. Get task result:

Request sent:

{
    "action": "get_task_result",
    "payload": {"id": "1122334455667788990011223344"}
}

7. Get task results:

Request sent:

{
    "action": "get_task_results",
    "payload": {
        "id": [
            "1122334455667788990011223344",
            "1122334455667788990011223345",
            "1122334455667788990011223346"
        ]
    }
}
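
The status, result, and revoke requests above all share one shape: an action name plus a payload holding either a single id or a list of ids. A small hypothetical helper (not part of the project) can build any of them as strict JSON:

```python
import json
from typing import List, Union


def task_message(action: str, task_ids: Union[str, List[str]]) -> str:
    """Build a websocket request whose payload carries one id or a list of ids."""
    return json.dumps({"action": action, "payload": {"id": task_ids}})


single = task_message("get_task_result", "1122334455667788990011223344")
batch = task_message("revoke_tasks", ["1122334455667788990011223344",
                                      "1122334455667788990011223345"])
```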