The most common use case of this project is that your organization (or your company's clients) have one or more websites so big that data mining helps to understand how the content is organized and, from that, how to generate more value.
PROTIP 1: While people new to both machine learning and its real-world usage are likely to assume that the value delivered by data mining for large websites is forecasting and replicating success cases, often the best value is discovering patterns that already exist and fixing small issues with high impact. In other words: you learn how humans produce/optimize content (even if that means asking them instead of analyzing the data), then discover the outliers that are not as good. The next step, based on data visualization, is helping the humans to improve themselves with the new feedback.
PROTIP 2: if you're really into the forecasting mentioned in PROTIP 1, machine learning can (and often will) work better with a small but well-representative list of items (think 100 rows) than with 1,000s to 100,000s of rows of poorly prepared data.
- 1. Overview
- 2. Data extraction, preparation and cleaning
- 3. Data Mining and Machine learning ("understanding the data")
- 4. Patterns often found on (large) websites to be aware of
TL;DR of Concepts
- Joomla! is an award-winning content management system (CMS), which enables you to build web sites and powerful online applications.
- Data mining (in Portuguese, Mineração de dados) is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
- The joomla-data-mining-and-machine-learning project gives a general idea of how to do data mining and machine learning with websites, but dedicates special attention to the implementation of non-generic tasks, like SQL queries, for a typical Joomla! CMS installation.
What this reference is NOT:
- Do not mistake data mining for web scraping.
- This project requires administrative access (like database access and server access logs).
- Do not mistake this project for something that you "install" on a site
- Some people search for "Data mining WordPress" or "Data Mining Joomla" asking for something to install that automatically "enables data mining" or "enables machine learning" on the website.
- OK, strictly speaking, yes, one discussed source of data, Google Analytics, can be installed, but this project has a broader scope.
- Do not confuse this guide with something whose final result runs
automatically and gives insights.
- While some parts can be automated and even individual software tools can
brute-force insights, a human would still need to glue different tools
together, discard irrelevant or obvious conclusions, and focus on what can
be useful
- "Wow, Orange Data Mining discoreverd that
content_created_date
can predictcontent_hits
" is the same as one machine discover that as older as one content is, more likely to have an total number of views. But is this useful?
- "Wow, Orange Data Mining discoreverd that
- While some parts can be automated and even individual softwares can
bruteforce insights, an human would still need to glue different tools
and discard irrelevant or obvious conclusions and focus on what can be
useful
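One hedged sketch of what "going further" can look like in practice: instead of accepting that older content has more total hits, normalize the hits by the age of each article. The columns below (`hits`, `created`, `state` in the content table) are standard Joomla, but this specific query is only an illustration, not one of the files shipped with the project.

```sql
-- Sketch only: rank published articles by hits per day of existence,
-- which removes the trivial "older content has more hits" effect and
-- makes real outliers easier to spot.
-- Replace JOSTABLEPREFIX_ with your real database prefix.
SELECT
    c.id,
    c.title,
    c.created,
    c.hits,
    c.hits / GREATEST(DATEDIFF(NOW(), c.created), 1) AS hits_per_day
FROM JOSTABLEPREFIX_content AS c
WHERE c.state = 1                 -- 1 = published
ORDER BY hits_per_day DESC
LIMIT 100;
```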
- Directory: sql/
- Database engine: MySQL-like
- Replace 'JOSTABLEPREFIX_' with your real database prefix.
- For Joomla CMS sites using PostgreSQL or SQL Server/SQL Azure you will need some small changes in syntax, but the field names are guaranteed to be compatible.
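As an illustration of the prefix convention used by the queries (this particular query is a hedged example, not necessarily one of the files in sql/):

```sql
-- Example of the JOSTABLEPREFIX_ placeholder convention:
-- list published articles with their total recorded hits.
-- Replace JOSTABLEPREFIX_ with your real database prefix (e.g. jos_).
SELECT
    c.id,
    c.title,
    c.alias,
    c.hits
FROM JOSTABLEPREFIX_content AS c
WHERE c.state = 1   -- 1 = published
ORDER BY c.hits DESC;
```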
- Directory: output/
Each data mining project is unique, not just per organization, but also per point in time. Still, this project contains examples of the data output produced when using the implementation with the SQL queries.
Note: this section is a minimal draft. Since the implementation is more generic than use with the Joomla CMS, it may not be implemented at all in future releases. But it is still mentioned here, since it is a potential source of data.
Note: since this topic is more generic than use with Joomla CMS websites only, it is here more as a quick reference. In fact, a reference related to data mining for Joomla would be incomplete if it did not at least mention a source of data that is very likely to already be in use on the average site in production.
Both Google Analytics and Google Search Console can be used as a source of data. While both (especially for very big sites with pages that get few accesses) are unlikely to be as structured as the data extracted from the database export when it comes to getting a full picture, they still allow comparing data.
Export the list of all URLs to Excel (XLSX), CSV or Google Spreadsheets
- Go to https://analytics.google.com/. Select the account of the website
- On the sidebar, click `Behavior` (in Portuguese, `Comportamento`)
- Click on `Site Content` (in Portuguese, `Conteúdo do site`)
- Click on `All Pages` (in Portuguese, `Todas as Páginas`)
- Change how many list items to show; recommended: 5000.
- Exported items are exactly the ones previewed on screen.
- When more than 5000 items exist, click to see the next page. Each page needs to be exported (see the sketch after this list for one way to combine the files).
- Personalize the date range for the table at the top of the screen, right
side; recommended: from at least a few years ago until the current date.
- Keep in mind that older pages may not have been accessed by users in recent months. For big sites, this means you will miss parts of your site
- But also keep in mind that older date ranges will return pages that no longer exist today
- At the top of the screen, right side, click `Export`. Excel (XLSX) or Google Spreadsheets can be a good format to start.
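Since each 5000-item page has to be exported as a separate file, one possible way to combine the exports for later cross-checking against the database is to load them into a scratch table. This is only a sketch: the `ga_all_pages` table and its column names are hypothetical, and it assumes each CSV was first trimmed to a single header line followed by data rows (the raw Google Analytics export also contains comment and summary lines).

```sql
-- Hypothetical scratch table for the combined "All Pages" exports.
CREATE TABLE ga_all_pages (
    page_path        VARCHAR(2048),
    pageviews        INT,
    unique_pageviews INT
);

-- Repeat for each exported CSV file (one file per 5000-item page).
LOAD DATA LOCAL INFILE 'all-pages-export-1.csv'
INTO TABLE ga_all_pages
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(page_path, pageviews, unique_pageviews);
```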
The steps are similar to Google Analytics, but with the limitation of 1000
URLs. Also, a single exported Excel (XLSX) file has more content than the
export from Google Analytics.
Export the list of all URLs to Excel (XLSX), CSV or Google Spreadsheets
- Go to https://search.google.com/search-console/. Select the account of the website
- On the sidebar, click `Performance` (in Portuguese, `Desempenho`)
- There is no need to change how many items to export (the maximum, non-configurable, is 1000)
- Personalize the date range; the older you can go, the better.
- At the top of the screen, right side, click `Export`. Excel (XLSX) or Google Spreadsheets can be a good format to start.
At the moment the bin/ directory, even for a working draft, has no significant content.
At the moment this section is a draft.
- The fields named `content_hits`, `category_hits` and `tag_hits` from the data extraction of the Joomla site database ("SQL queries"), while very useful to get an overview without the need to relate them to other sources (like Google Analytics and Apache/Nginx access logs), are affected by caching mechanisms (both the standard Joomla cache and full-page caches like Varnish). Do not ask to disable caches, as this affects performance, in particular for big sites. But, as a rule of thumb, when you confirm that caching does affect these variables:
  - The real value is equal to or higher than the exported one (sometimes much higher)
  - Pages with a disproportionate number of accesses in a short time will have significantly fewer hits (think articles that go viral)
  - Pages with infrequent (but constant) access over the same day or week may not be affected at all by the cache.
- The `content_hits`, `category_hits` and `tag_hits` fields are still worth using for small benchmarks; for serious reports, extra testing is needed to detect the impact of the cache (see the sketch after this list).
- Data from Google Analytics and Google Search Console is limited to the
page URLs most accessed by users and their evolution over time (including
URL changes)
- (TODO: explain more on this)
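One hedged way to "detect the impact" is to compare the database hits with the pageviews from a Google Analytics export loaded into a scratch table (such as the hypothetical `ga_all_pages` table sketched in the Google Analytics section). The join below assumes SEF URLs that contain the article alias, which depends on each site's routing and menu configuration, so treat it as a starting point rather than a ready-made report.

```sql
-- Sketch only: compare cache-affected database hits with Google Analytics
-- pageviews to estimate how much content_hits undercounts real traffic.
-- Replace JOSTABLEPREFIX_ with your real database prefix.
SELECT
    c.id,
    c.title,
    c.hits            AS database_hits,
    SUM(ga.pageviews) AS analytics_pageviews
FROM JOSTABLEPREFIX_content AS c
JOIN ga_all_pages AS ga
  ON ga.page_path LIKE CONCAT('%', c.alias, '%')
WHERE c.state = 1
GROUP BY c.id, c.title, c.hits
ORDER BY analytics_pageviews DESC
LIMIT 100;
```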
- Site: https://libreoffice.org/
- Site: https://www.microsoft.com/microsoft-365/excel
- Site: https://www.google.com/sheets/about/
TODO: this entire section is a working draft
For sites that allow users to create accounts (even if there is no registration link on the site, but the CMS allows account creation), the exported data will show trends of non-human (bot) registrations.
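A hedged sketch of how such trends can surface from the database itself (the `registerDate` and `block` columns are standard in the Joomla users table; the thresholds and the interpretation are up to the analyst):

```sql
-- Sketch only: count new accounts per day. Sudden spikes (hundreds of
-- registrations in a single day) usually point to non-human registrations
-- rather than real audience growth.
-- Replace JOSTABLEPREFIX_ with your real database prefix.
SELECT
    DATE(u.registerDate) AS register_day,
    COUNT(*)             AS new_accounts,
    SUM(u.block)         AS already_blocked
FROM JOSTABLEPREFIX_users AS u
GROUP BY DATE(u.registerDate)
ORDER BY new_accounts DESC
LIMIT 30;
```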
TODO: add content
SPDX-License-Identifier: MIT