From b3213e6cd12382dbc5f6b9d5ff880970bce18a0e Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 06:48:12 +0100 Subject: [PATCH 1/9] Review abstract --- section/abstract.tex | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/section/abstract.tex b/section/abstract.tex index 4a84a96..3c18f5c 100644 --- a/section/abstract.tex +++ b/section/abstract.tex @@ -2,20 +2,21 @@ % Context Centralization of web information can raise legal and ethical problems, especially in the context of social applications. % Need - Decentralizing this information offers a potential solution, but maintaining query performance remains a challenge. - Link Traversal Query Processing (LTQP) enables querying in large-scale networks of decentralized data but suffers from long execution times and high data usage, - largely due to the extensive HTTP requests required for network exploration. + Decentralizing this information offers a potential solution, but achieving acceptable query performance remains a challenge. + Link Traversal Query Processing (LTQP) enables querying in large-scale networks of decentralized data but suffers from long execution times and high data transfer, + largely due to the extensive number of HTTP requests required for network exploration. % Task - This paper introduces a shape-based pruning approach to minimize the search space of traversal queries. + To solve this problem, we introduce a shape-based pruning approach to minimize the search space of traversal queries. The approach utilizes \emph{shape indexes} provided by data providers in networks of decentralized knowledge graphs to reduce the search space using a \emph{query-shape containment} algorithm. % Object - This work introduces link pruning in LTQP by formalizing the shape index and query-shape containment approach, and evaluates its impact on the performance of traversal queries. 
+ In this article, we formalize this shape index and query-shape containment approach as a link pruning mechanism for LTQP, + and evaluate its impact on the performance of queries in a social media context. % Findings - Our findings show that shape-based data summarization can reduce the query execution time and network usage of selective traversal queries by up to 7 times in our benchmark. + Our findings show that shape-based link pruning can reduce the query execution time and network usage of selective queries by up to 7 times.\rt{Can you also mention if there is an increase in server cost? If not, also mention that.} % Conclusion - This performance gain, achieved without delegating queries to endpoints, makes our approach a strong candidate for handling selective queries in large networks of structured, decentralized knowledge graphs. + Our work shows the benefits of exposing shape-based metadata for handling selective LTQP queries in large networks of structured, decentralized knowledge graphs. -\keywords{Linked data, +\keywords{Linked Data, Link Traversal Query Processing, RDF data shapes, Decentralization, From bb54df54fbc4db9b0d7468c36a71d647a00bda5e Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 09:13:50 +0100 Subject: [PATCH 2/9] Review intro --- section/introduction.tex | 60 ++++++++++++++++++++++------------------ 1 file changed, 33 insertions(+), 27 deletions(-) diff --git a/section/introduction.tex b/section/introduction.tex index 71e601a..b961db5 100644 --- a/section/introduction.tex +++ b/section/introduction.tex @@ -10,52 +10,58 @@ \section{Introduction} It can be defined as ``the self-determination of individuals and organizations concerning to the use of their data''~\cite{verstraete2022solid}, which in practice can be interpreted as the power to choose where one's data is stored and who has access to it~\cite{verstraete2022solid}. 
Multiple studies have denoted problems of ownership, democracy, reinforcement of inequality, and antagonism between users and owners of web social applications~\cite{Terranova2000FreeLP, Curran2016ch1, Sevignani2013, 9663788}. -Several authors consider decentralizing web data an insufficient solution~\cite{9663788, Curran2016ch1}; however, it can still be an integral component of initiatives focused on data sovereignty. -Thus, technical research on that topic is a relevant endeavor. -Linked data has the potential to create seamless decentralized knowledge graphs (DKG) through the use of dereferencable IRIs. -These IRIs allow access to additional knowledge graph (KG) containing information relevant to what an IRI identifies. +Several authors consider decentralizing web data an insufficient solution~\cite{9663788, Curran2016ch1}; yet, it is an integral component of initiatives focused on data sovereignty, +which necessitates technical research. +Linked Data enables the creation of Decentralized Knowledge Graphs (DKGs) through the use of dereferencable IRIs. +These IRIs allow access to additional Knowledge Graphs (KGs) containing information relevant to what an IRI identifies. For example, a WebID~\sepfootnote{sf:webID} represents a user, dereferencing it can provide the name of the user, among other information, without having to store the information locally. -Despite these benefits, most SPARQL query processing occurs in centralized setups, partly because optimizing queries is easier in centralized environments. +Despite these benefits, most SPARQL query processing occurs in centralized setups, partly because there is a better understanding of how to optimize queries in centralized environments. -A query paradigm called Link Traversal Query Processing (LTQP)~\cite{Hartig2012} has been developed to leverage the potential descriptive power of IRI dereferencing. 
-LTQP consists of recursively dereferencing IRIs discovered in a query engine's internal triple store during query execution to expand its base of information. -The main difficulty of LTQP is the large domain of exploration, which results in the performance of a high number of HTTP requests as demonstrated by \citeauthor{hartig2016walking}~\cite{hartig2016walking}. -From another perspective, and without contradicting \citeauthor{hartig2016walking}~\cite{hartig2016walking}, \citeauthor{Taelman2023}~\cite{Taelman2023} has demonstrated that in descentralized environments with structural properties (DESP), -it is possible to attain query completeness for various types of practical queries within acceptable execution times. +Link Traversal Query Processing (LTQP)~\cite{Hartig2012} is a more natural paradigm for querying in decentralized environments, +which leverages the potential descriptive power of IRI dereferencing. +LTQP involves recursively dereferencing IRIs discovered in a query engine's internal triple store during query execution to expand its base of information. +The main difficulty of LTQP is the large domain of exploration, which leads to a high number of HTTP requests as demonstrated by \citeauthor{hartig2016walking}~\cite{hartig2016walking}. +From another perspective, and without contradicting \citeauthor{hartig2016walking}~\cite{hartig2016walking}, \citeauthor{Taelman2023}~\cite{Taelman2023} has demonstrated that in Decentralized Environments with Structural Properties (DESPs), +it is possible to attain query completeness for various types of practical queries within acceptable execution times.\rt{We need to be more precise on what \emph{acceptable} is here, as it depends on the use case. In Taelman2023 we focus on the social media use case (like in your paper), and this requires query exec times that are within the human attention ranges (see Taelman2023). 
For high-volatile applications such as smart city sensors, this may not be a good fit.} Moreover, they showed that query planning could significantly influence the execution time. -Structural properties ensure data discoverability, which in turn helps guarantee result completeness through a concept called structural assumptions. -In practice, DESP can be used in the context of personal data and social networks, among others. -Examples of environments that adhere to these constraints are datasets following the Solid protocol~\cite{Taelman2023} and the TREE specification~\cite{tam_iswc_traversalsensortree_2024}. -The work of \citeauthor{Taelman2023}~\cite{Taelman2023} indicates that there are multiple optimizations possible in LTQP in the context of descentralized environments with structural properties as opposed to the +Structural properties ensure data discoverability, which in turn helps guarantee result completeness by making \emph{structural assumptions}. +In practice, DESPs emerge in different places, such as personal data, social networks, and more. +Concrete DESPs have been shown to be beneficial for datasets following the Solid protocol~\cite{Taelman2023} and the TREE specification~\cite{tam_iswc_traversalsensortree_2024}. +The work of \citeauthor{Taelman2023}~\cite{Taelman2023} indicates that there are multiple optimizations possible in LTQP in the context of decentralized environments with structural properties as opposed to the more pessimist conclusion of the work of \citeauthor{hartig2016walking}~\cite{hartig2016walking} for Open Linked Data without structural properties. -From a holistic perspective, the web does not have a structure exploitable by query engines for optimization. -On the web, any document can be published at any location, and there is no index or trust mechanism to guide source selection. 
-From a perspective where the web is divided into small subsections controlled by data providers, data publishers can implicitly or explicitly organize their data -in a way that query engines could exploit its organization for optimization. +In general, the Web does not have a structure exploitable by query engines for optimization. +That is because any document can be published at any location, and there is no standard index or trust mechanism to guide discovery. +However, when we consider subsections of the Web that are controlled by specific data providers as \emph{subwebs}, +implicit or explicit data organizations with specific structural properties can emerge +which can be exploited by query engines. +\rt{The following two sentences seem out of place for a high-level introduction like this.} We propose to put our focus on the completeness of results when considering that the completeness of traversal has been fixed. Placing our focus on results allows us to investigate ways to optimize the search space of link traversal queries by pruning irrelevant sources. -We propose using a decentralized dataset summary method, called the shape index~\cite{tam2024opportunitiesshapebasedoptimizationlink}, as the support mechanism for our pruning approach. +In this work, we build upon a new dataset summarization approach for decentralized environments called a \emph{shape index}~\cite{tam2024opportunitiesshapebasedoptimizationlink}, +to enable pruning links within LTQP. This method involves mapping the content of a decentralized dataset using RDF data shapes. -The intuition behind this approach is that publishing explicit RDF schemas is relatively inexpensive for data providers when publishing decentralized datasets but they could be highly beneficial for clients. -Although RDF does not enforce schemas on its data, the nature of the usually modeled objects and the formalism often results in implicit schemas~\cite{Neumann2011CharacteristicSA}. 
+The intuition behind this approach is that publishing explicit shapes is relatively inexpensive for data providers when publishing decentralized datasets, while being potentially highly beneficial for clients. +Even if such shapes are not defined explicitly, they often emerge in practice~\cite{Neumann2011CharacteristicSA}. -This paper is organized as follows: first, we introduce the problem statement; next, we discuss related work and present preliminary concepts. +This paper is organized as follows: first, we introduce the problem statement; next, we discuss related work and present preliminary concepts.\rt{Can you reference the sections as well?} We then describe our approach, followed by the experimental setup and results. Finally, we conclude the article. \section{Problem Statement} To guide our study we formulated the following research question. -\textbf{Can LTQP use shape-based pruning in networks of decentralized Knowledge Graphs to reduce the number of HTTP requests while maintaining the same completeness of results, and does this reduction of HTTP requests lead to a decrease in query execution time?} +\textbf{Can LTQP use shape-based pruning in networks of decentralized Knowledge Graphs to reduce the number of HTTP requests while maintaining the same completeness of results \rt{I would splice everything after this to a separate hypothesis}, and does this reduction of HTTP requests lead to a decrease in query execution time?} We formulated the following hypotheses: \newcounter{hypothesisCounter} \setcounter{hypothesisCounter}{1} \begin{itemize}[label=\textbf{H\arabic{hypothesisCounter}}\,\stepcounter{hypothesisCounter}] \item Using shape indexes will reduce the number of non-contributing data sources acquired - \item There is a linear relationship between the reduction of the number of HTTP requests and the reduction of query execution time - \item Executing a query-shape containment is negligible in the context of social media applications - \item More detailed 
shapes will provide a higher reduction in the number of HTTP requests - \item A network with more \emph{complete} shape index will reduce more the number of HTTP requests and the query execution time than one with less + \item There is a linear correlation between the reduction of the number of HTTP requests and the reduction of query execution time + \item Executing a query-shape containment is negligible in the context of social media applications\rt{negligible in terms of what? Its effect and benefit? Or execution time?} + \item More detailed \rt{What are more detailed shapes? Just larger shapes?} shapes will provide a higher reduction in the number of HTTP requests + \item A network with a more \emph{complete} shape index will reduce the number of HTTP requests and the query execution time more than one with a less complete index \item The shape index approach can be adaptative, so not every dataset in the network needs to have an index to see a performance improvement \end{itemize} + +\rt{This section looks good, but I still think it would be better to move it to later in the paper. Because there are concepts mentioned here that are not known yet to the reader, such as "complete shape index". So either the hypotheses must be reformulated to already be understandable to the reader, or this section must come later.} From 4ca413e6f273def80e84932bb643be5ef8af0d9e Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 09:13:55 +0100 Subject: [PATCH 3/9] Review related work --- section/related_work.tex | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/section/related_work.tex b/section/related_work.tex index dcf5f9b..1e2c24d 100644 --- a/section/related_work.tex +++ b/section/related_work.tex @@ -1,26 +1,33 @@ \section{Related Work} -LTQP is a SPARQL querying paradigm that answers queries by exploring the web using the follow-your-nose principle~\cite{hartig2016walking}. 
-The main challenge of LTQP is the web's open-ended nature leading to large search space. +\rt{Shall we add some subsections here, since each paragraph has a different topic?} + +LTQP is a SPARQL querying paradigm that answers queries by exploring the Web using the follow-your-nose principle~\cite{hartig2016walking}. +The main challenge of LTQP is the Web's open-ended nature, which leads to a large search space. Completeness in LTQP is defined by the traversal of a well-defined set of links~\cite{Hartig2012}. The first method employed to define this set was the reachability criteria~\cite{Hartig2012}, which are boolean functions that determine whether a link should be dereferenced. Building on this, the theoretical query language LDQL~\cite{hartigLDQL} was introduced, separating the traversal definition from the query definition. Further advancements include the subweb specifications language~\cite{Bogaerts2021LinkTW}, which allows data providers to define how their DKG should be traversed. These contributions are centered on guiding the engine in selecting links to follow in a discovery process. -However, they do not explicitly address mechanisms for restricting certain links in what we could call a pruning process. +However, they do not explicitly address mechanisms for restricting certain links in what we could call a pruning process\rt{That's not entirely accurate. SWSL does pruning as well, but it's more pruning for trust (see the Carol's wrong name for Bob example), while you do pruning based on shapes and the current query. Let's elaborate a bit on this difference here.}. Link pruning can be very useful to reduce the search domain of queries when information about the data model of DKG is found. +\rt{Emphasize why pruning matters for your work} RDF data shapes are used for validating, describing, and communicating data structures, as well as generating data and driving user interfaces~\cite{Gayo2018a,Gayo2018}. 
-The two main RDF data shape formalisms are SHACL and ShEx. +The two most well known RDF data shape formalisms are SHACL and ShEx.\rt{Cite the specs} %Both describe RDF data but differ in focus: ShEx emphasizes graph structure, while SHACL targets constraints. For common use cases, they are equally expressive and interchangeable~\cite{Gayo2018c}. +\rt{The following 4 lines should at least form a separate paragraph.} Shape Trees~\cite{shapetreesShapeTrees} are an index structure for validating and organizing decentralized knowledge graphs (DKGs). However, Shape Trees have not been used for query optimization. Due to their \emph{virtual hierarchy}~\cite{shapetreesShapeTrees}, it can be challenging for a query engine to efficiently capture the relationship between a resource IRI and its corresponding shapes. -Furthermore, because Shape Trees are not widely adopted, proposing an alternative formalization for query optimization addresses a gap in the literature. -RDF data shapes have also been used in the litterature for querying of centralized KG~\cite{kashif2021}. -Automatic generation of RDF data shapes based on KG~\cite{fernandez2023extracting} and shape-based integration of data~\cite{LabraGayo2023} are also topics that have been studied and can support shape-based summary approaches for DKGs. +Furthermore, because Shape Trees are not widely adopted, proposing an alternative formalization for query optimization addresses a gap in the literature.\rt{I don't understand this sentence.} +\rt{Contrast shape trees with your shape index approach: how are they different? And why did you introduce something yourself instead of building upon shape trees?} +RDF data shapes have also been used in the literature for querying centralized KGs~\cite{kashif2021}. 
+Automatic generation of RDF data shapes based on KGs~\cite{fernandez2023extracting} and shape-based integration of data~\cite{LabraGayo2023} are also topics that have been studied and can support shape-based summary approaches for DKGs. +\rt{How does this relate to your work?} +\rt{The following paragraph needs to be better structured. There's a lot of stuff in there, but it's not very coherent. It's introduced as source selection, but that's not accurate, as you mention SERVICE (which is an operator in SPARQL), summarization techniques for query optimization, ... You also mix sparql federation and link traversal. And then you end with some centralized techniques. You may even want to have separate subsections for some of these topics. Also think about what point you want to make for each subsection, and how it relates to your work (and then also make this clear to the reader).} Source selection is a crucial challenge in decentralized querying. Approaches like SPARQL \texttt{SERVICE} clauses, service descriptions, basic statistics on triple counts, and histogram methods have been studied~\cite{hose2012towards, Harth2010}. However, most of those source selection methods face the limitation of assuming a small number of data sources~\cite{Harth2010}, leaving their suitability for LTQP uncertain. 
From b3f3e84b2c519ab95262329016dc8cea15f4ddfc Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 09:27:13 +0100 Subject: [PATCH 4/9] Review preliminaries --- section/preliminaries.tex | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/section/preliminaries.tex b/section/preliminaries.tex index 3fc4927..21f96ee 100644 --- a/section/preliminaries.tex +++ b/section/preliminaries.tex @@ -2,16 +2,21 @@ \section{Preliminaries} \subsection{RDF knowledge graphs and SPARQL queries} -Our work focuses on conjunctive and disjunctive queries over RDF knowledge graphs~(KG) using the state-of-the-art SPARQL query language~\cite{w3SPARQLQuery}. +Our work focuses on conjunctive and disjunctive queries over RDF knowledge graphs~(KG) using the SPARQL query language~\cite{w3SPARQLQuery}. The core components of SPARQL queries and of KGs are respectively the triple patterns and the triples defined at Definition~\ref{def:triplePattern} and ~\ref{def:triple}. +\rt{Can you first introduce triple, and then triple pattern. The reverse direction is a bit odd.} + \begin{definition}[Triple pattern]\label{def:triplePattern} A triple pattern $tp = (s_{tp}, p_{tp}, o_{tp})$ is formed by a subject term $s_{tp} \in \mathcal{I} \cup \mathcal{B} \cup \mathcal{V}$, - a property path $p_{tp}$ as defined by \citeauthor{Kostylev2015} (Definition 2) and the SPARQL specification (section 18.2)~\cite{w3SPARQLQuery} + a property path \rt{A triple pattern does not contain property paths. Just use predicate here. (I, B, or V)} $p_{tp}$ as defined by \citeauthor{Kostylev2015} (Definition 2) and the SPARQL specification (section 18.2)~\cite{w3SPARQLQuery} and an object term $o_{tp} \in \mathcal{I} \cup \mathcal{L} \cup \mathcal{V} \cup \mathcal{B}$. Where $\mathcal{I}$, $\mathcal{B}$, $\mathcal{L}$, $\mathcal{V}$ are respectively the set of every possible IRI, blank node, literal and variable. 
\end{definition} +\rt{What is missing here is an explanation of what a triple pattern does (returns a solution sequence with solution mappings).} +\rt{You'll also at least need to hint to its relationship to BGPs up to broader query processing.} + \begin{definition}[Triple]\label{def:triple} An RDF triple $t = (s,p,o)$ is similar to a triple pattern, where $s \in\mathcal{I} \cup \mathcal{B}$, @@ -23,19 +28,19 @@ \subsection{RDF knowledge graphs and SPARQL queries} \subsection{Reachability Criteria} -Relying on results or the traversal of a large network like the World Wide Web is not feasible for defining completeness in LTQP. -Thus, to formalize the completeness of queries, a link discrimination formalism has been developed called \emph{Reachability criteria}~\cite{Hartig2012}. +Relying on results or the traversal of a large network like the Web is not feasible for defining completeness in LTQP.\rt{For defining completeness, this would definitely be feasible. But for practical use cases, it would not be feasible.} +Thus, to formalize the completeness of queries, a link discrimination formalism has been developed called \emph{Reachability criteria}~\cite{Hartig2012}.\rt{I would just introduce this by saying something along the lines of "following all possible links is not feasible in practise, so different reachability criteria have been introduced to define which subsets of links need to be followed."} Reachability criteria are boolean functions ($c_i$) restricting the dereferencing of links from the internal data source of the query engine. -They take as parameters an RDF triple $t$ from an internal triple store, a dereferenceable IRI $iri$ from $t$, and the query $B$~\cite{Hartig2012}. -If $c_i$ returns $true$, the query engine must try to dereference $iri$. +They take as parameters an RDF triple $t$ from an internal triple store, a dereferenceable IRI $iri$ from $t$, and the query $B$~\cite{Hartig2012} \rt{Are you sure it's the full query (I don't remember). 
I thought it was just applied to each triple pattern.}. +If $c_i$ returns $true$, the query engine must dereference $iri$. More formally \begin{equation}\label{eq:reachabilityCriteria} c_i(t, iri, B) \rightarrow \{\mathrm{true}, \mathrm{false}\} \end{equation} -\subsection{Decentralized Knowledge Graph} +\subsection{Decentralized Knowledge Graphs} -We define a decentralized knowledge graph as a KG $G$ materialized in a network of resources $R$. +We define a Decentralized Knowledge Graph (DKG) as a KG $G$ materialized in a network of resources $R$. A resource $r_i \in R$ contains a KG $g_i \subseteq G$ and is mapped and exposed by an IRI $iri_i$. The network forms a graph where the resources $r_i$ are the nodes and the $iri_j \in t \in g_i$ are directed edges starting from $r_i$ to $r_j$. $G$ is formed by the union of all the $g \in r$, such that $G = \bigcup_{i=0}^{n}g_i$ given $n$ resources in the network. \ No newline at end of file From 9072e4dfae8c1143874acb2a42a8937dadce8c1d Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 10:44:43 +0100 Subject: [PATCH 5/9] Review method --- section/method.tex | 101 +++++++++++++++++++++++++-------------------- 1 file changed, 56 insertions(+), 45 deletions(-) diff --git a/section/method.tex b/section/method.tex index 4376845..6b8917f 100644 --- a/section/method.tex +++ b/section/method.tex @@ -30,9 +30,11 @@ \section{Approach} \rt{This footnote seems unnecessary, since delva2023 is already cited.} } -\subsection{LTQP Completeness Under Prunning Regime}\label{sec:slde} +\rt{Introduce your subsections here in a few lines, so the reader knowns what to expect. This should help explain to the reader why the formalization is needed before you introduce your shape index approach.} -We propose to focus the completeness of LTQP on results instead of traversal in the context DESP. 
+\subsection{LTQP Completeness when Pruning}\label{sec:slde} + +We propose to focus the completeness of LTQP on results instead of traversal in the context of DESPs.\rt{Say why} We formalize our approach as follows. A query is executed over a DKG $G$ formed by the union of all the $g \in r$ in a network $R$. The query engine has to build a KG $G^{\prime}$ using a reachability $C^{\prime}$ in its internal data store from the $g \in r$ by dereferencing resources in $R$ such that @@ -48,13 +50,13 @@ \subsection{LTQP Completeness Under Prunning Regime}\label{sec:slde} \end{equation} for any network $R$. Because the $g \in G^{\prime}$ can only be acquired from the dereferencing of resources $r \in R$, a smaller $G^\prime$ implies that a lesser number of HTTP requests has been performed to answer a query. -Naturally, query execution is faster with a smaller KG instance. +Naturally, query execution is faster with a smaller KG instance.\rt{This is not entirely accurate, needs to be rephrased. It mostly is so though. Perf depends on distributions. You could create a smaller KG that is slower to query over than a large KG. For example, when you have many distinct predicates.} Additionally, HTTP requests can be slow and unpredictable~\cite{hartig2016walking}, making them a significant factor in overall query execution time. -Thus, the benefit of reducing the number of HTTP request is twofold. +Thus, the benefit of reducing the number of HTTP requests is twofold. To define less selective reachabilities to produce $G^{\prime\prime}$, we propose extending the reachability criteria by formalizing a chain of criteria in a concept called \emph{composite reachability criteria}. In this form, a reachability criterion $cp_i$ is said to \emph{prune} links, and $cd_i$ is said to \emph{discover} links. -A reachability $cp_i$ act upon all the links that have yet to be dereferenced as well as on the incoming links. 
+A reachability $cp_i$ acts upon all the links that have yet to be dereferenced as well as on the incoming links.\rt{What is the difference? Does it even matter?} Equation~\ref{eq:cReachabilityCriteria} formalizes a composite reachability criterion $C$. \begin{equation}\label{eq:cReachabilityCriteria} @@ -67,34 +69,38 @@ \subsection{LTQP Completeness Under Prunning Regime}\label{sec:slde} \subsection{Shape Index} Pruning in LTQP requires information about the data models of dereferenced sources. -However, obtaining complete, up-to-date, and expensive information for each source in a large decentralized network is unrealistic. -To address this, we propose the concept of shape indexes. -A shape index is a mapping between sets of RDF documents and RDF data shapes, which describe a region of the web controlled by a data provider. +However, obtaining complete, up-to-date, and expensive information for each source in a large decentralized network is unrealistic.\rt{But yet, it is what you do with shape indexes, so it is realistic then? Probably needs some rephrasing.} +To address this, we propose the concept of the \emph{shape index}. +We define a shape index as a mapping between sets of RDF documents and RDF data shapes, which describe a subweb controlled by a data provider. When a set of RDF documents is mapped, the associated KG must conform to the shape. Unlike statistics on triples, shapes are independent of the size of the KG or updates that remain compliant with the shape, making them a more cost-effective alternative for use cases where the data model remains stable. -Moreover, shapes support open-world assumptions, offering data providers flexibility at the cost of being unsuitable for pruning (if we do not consider negative statement). 
+Moreover, shapes support open-world assumptions, offering data providers flexibility at the cost of being unsuitable for pruning (if we do not consider negative statement).\rt{I don't understand this sentence.} -A shape index can be formalized as follow +We formalize a shape index as follows: \begin{equation}\label{eq:shapeIndex} SI = \{s_1 \mapsto IRI_1, s_2 \mapsto IRI_2 \cdots, s_n \mapsto IRI_n\} \end{equation} +\rt{You must define what $s_i$ and $IRI_i$ are. (Probably iri instead of IRI to be aligned with the earlier formalisms?)} given $n$ entries. -The region of the web described by the index is defined by $D = \bigcup_{i=1}^{n} \bigcup_{iri \in IRI_i} iri$. +The subweb described by the index is defined by $D = \bigcup_{i=1}^{n} \bigcup_{iri \in IRI_i} iri$. A shape index \emph{must} map every resource in $D$. -We denote a shape index as \emph{complete} when every shape $s_i \in \text{dom}(SI)$ is closed and \emph{incomplete} otherwise. -A mapping between a shape and a set of IRI has implications in the distribution of the data in $D$. +We denote a shape index as \emph{complete} when every shape $s_i \in \text{dom}(SI)$ is closed and \emph{incomplete} otherwise.\rt{When is a shape closed? (may also be part of preliminaries if that is easier)} +A mapping between a shape and a set of IRIs has implications in the distribution of the data in $D$. When a shape $s$ is mapped to an $IRI$, then the KG targeted by the mapping $G = \{g | g \in r, iri \mapsto r \land iri \in IRI\}$ satisfies $s$. Given that the shape is closed, then every set of triples in the resource associated with $D$ respecting the shape must be in a resource mapped to an $iri \in IRI$. -We provide a web specification of the shape index online~\sepfootnote{sf:shapeIndexURL}. +We provide a detailed technical description of the shape index in an online specification~\sepfootnote{sf:shapeIndexURL}. 
+ +\rt{This feels like a good place to start a new subsection...} -RDF shapes use the concept of targets to determine the root node for validation. +RDF shapes use the concept of \emph{targets} to determine the root node for validation. In this work, we consider all entities in a KG within a document to conform to the same schema. -We refer to these entities as tree stars (patterns), an extension of the existing star patterns concept in RDF. +\rt{A link is missing here, which says that you use the root of star patterns as validation roots (?)} +We refer to these entities as tree stars (patterns) \rt{This is confusing. Introduce tree stars and tree star patterns as separate concepts}, an extension of the existing star patterns concept in RDF \rt{citation needed}. Star patterns consist of triples sharing the same subject. We extend this idea by considering all star patterns linked via the object term of a preceding star pattern, forming a tree-like structure of star patterns. -For example, lets consider a user that have among multiple properties the comments it has posted and the comments, among other properties, are a reply to posts. -Our approach encapsulates these relationships within a single conceptual framework by linking the user star pattern with the nested comment star pattern and the deeper nested post star pattern. -This concept, formalized at Definition~\ref{def:starPattern}, serves two purposes: defining targets for validation and capturing relationships between the triple patterns of our query and shape entities. +For example, consider a user that links to his posts with recursive replies. +Tree stars can capture this through a star pattern defining a user, and recursively nested star patterns representing posts. +This concept, formalized in Definition~\ref{def:starPattern}, serves two purposes: defining targets for validation and capturing relationships between the triple patterns of our query and shape entities.\rt{Aha, here it is! 
This should have come as first sentence :-) Always explain why something is needed before you explain it.} \begin{definition}[Tree Star Pattern]\label{def:starPattern} We define a star pattern $Q_{star}$ as a set of $tp \in Q$~\cite{Karim2020} with the same subject such that @@ -131,17 +137,18 @@ \subsection{Shape Index} cannot be a $v \in \mathcal{V}$. We denote this structure a tree star. \end{definition} -Thus, the target shapes in the shape index correspond to the subject of each root star pattern when a KG is divided into tree stars with no shared partial tree stars. +Thus, the target shapes in the shape index correspond to the subject of each root star pattern when a KG is divided into tree stars with no shared partial tree stars.\rt{This should also have come much sooner.} -\subsection{Link Prunning Using Shape Indexes}\label{sec:sourceSelection} +\subsection{Link Pruning Using Shape Indexes}\label{sec:sourceSelection} -In this section we make the link between the shape index and link prunning in LTQP. -A decentralized dataset exposing a shape index provides a query engine with the opportunity to reduce its search domain by knowing which resources are query-relevant. -More formally, instead of traversing the web region $D$ associated with a shape index, the engine will traverse a subregion $d \subseteq D$, ignoring the knowable nonquery-relevant sections. +In this section we make the link between the shape index and link prunning \rt{Make sure to go through the whole paper with a spell-checker. I've seen you make this typo regularly.} in LTQP. +A decentralized dataset \rt{DKG?} exposing a shape index provides a query engine with the opportunity to reduce its search domain by knowing which resources are query-relevant. +More formally, instead of traversing the whole subweb $D$ associated with a shape index, the engine will traverse a subweb $d \subseteq D$, ignoring the knowable query-irrelevant sources. 
The concept of composite reachability criteria allows us to ignore certain sources during traversal based on the knowledge acquired during traversal. -Our approach involves dynamically constructing new reachability criteria during traversal by adding pruning criterion as we discover and analyze shape indexes. -These criteria are designed so that, they will always produce the same completeness of results as the one that was define at the beginning of the traversal. +Our approach involves dynamically constructing new reachability criteria during traversal by adding pruning criteria as we discover and analyze shape indexes. +These criteria are designed so that they will always produce the same completeness of results as the one that was defined at the beginning of the traversal.\rt{I don't understand this. Do you mean that the order of finding shape indexes does not matter?} +\rt{I still think introducing time here is a really bad idea. I'm almost certain that reviewers will shoot the paper down because of this. In the formalization, let's just assume prior knowledge of all shape indexes and corresponding reachability criteria. Combining them during query execution is an implementation detail.} More formally, let us introduce time $t$ as a factor for our reachability criteria $C_t$. The query execution begins with an initial reachability criterion $C_0$. At any time $t$, Equation~\ref{eq:evalQueryStructuralAssumption} must hold if we consider $C_t = C_{t+1} = C_{t+2} \dots = C_{tf}$ until the end of the execution $tf$, given that $G^{\prime}$ is produced using $C_0$. @@ -161,42 +168,46 @@ \subsection{Link Prunning Using Shape Indexes}\label{sec:sourceSelection} \subsection{Query Shape Containment}\label{sec:containment} -To determine if a source is query-relevant we propose to perform a query-shape containment problem. -The understanding of this problem is similar to the classic query containment problem.
-Query containment aims at determining independent of the specific KG (database) instance if a query can be answer from the answer of another query. -In our context we are trying to determine if a tree star pattern from a query (given no share partial tree star pattern) can be answer by any sources respecting a specific shape. -Intuitively shapes expressions can be treated as a segment of a query~\cite{delva2023}. +We consider determining if a source is query-relevant a \emph{query-shape containment} problem. +The understanding of this problem is similar to the classic query containment problem \rt{Citation needed}. +Query containment determines if a query can be answered from the answer of another query, independent of the database or KG. +In our work, we want to determine if a tree star pattern from a query can be answered by any sources respecting a specific shape. +\rt{There is an unnatural jump here in your storyline. Needs another sentence, or separate paragraph?} +Intuitively, shape expressions can be treated as a segment of a query~\cite{delva2023} \rt{What is a segment of a query? Do you mean that shapes can be translated into queries?}. Furthermore it is a common approach for shape validation over an RDF graph is to convert shapes into SPARQL queries~\cite{labragayo2017validatingdescribinglinkeddata, Corman2019,Prestamo2023, spapeExpressionConvert}.~\sepfootnote{sf:recursiveShape} -Let's define this transformation $T(S)$ producing a query $Q_s$. -We consider open shapes to always represent queries that retrieve entire KGs (ignoring any negative statements). -This is because, under the open-world assumption, the shape defines the minimum constraints of a graph. -With this transformation the problem become similar to query containment problems. +We refer to this transformation of a shape $s$ as $T(s)$ producing a query $q_s$. +\rt{Another jump?} +We consider open shapes to always represent queries that retrieve entire KGs (ignoring any negative statements).
\rt{You'll need to elaborate on these negative statements} +This is because, under the open-world assumption, a shape defines the minimum constraints of a graph. +With this transformation the problem \rt{Unclear what \emph{the} problem is here.} becomes similar to query containment problems. +\rt{Another jump?} Query containment may intuitively seem unsuitable for dynamic reachability due to its NP-complete time complexity~\cite{Spasi2023}. -However, this complexity depends on the size of the queries, which are typically small~\cite{Doan2012}; we do not expect in most cases queries containing thousands or millions of triple patterns~\cite{Bonifati2019}. +However, this complexity depends on the size of the queries, which are typically small in practice~\cite{Doan2012}; we do not expect in most cases queries containing thousands or millions of triple patterns~\cite{Bonifati2019}. Additionally, practical use cases can often leverage polynomial-time algorithms~\cite{Doan2012}. -In our context, the container query (the shape) adheres to a tree-star pattern template structure, where predicates are always IRIs. +In our context, the container query \rt{This concept is unknown to the reader} (the shape) adheres to a tree-star pattern template structure, where predicates are always IRIs. This structure arises because shapes predominantly describe predicate terms and object terms and are isomorphic to the specific KG. By exploiting this structure, it is possible to design an algorithm with polynomial time complexity. We consider that a query $Q$ is contained in a shape $S$, denoted as $Q \sqsubseteq_{qs} S$, if a tree star pattern in $Q$ is contained in $Q_s = T(S)$. -For a tree star pattern to be contained, we need to consider the two parts of $Q = Q_{\text{body}} \cup Q_{\text{unions}}$. +For a tree star pattern to be contained, we need to consider the two parts of $Q = Q_{\text{body}} \cup Q_{\text{unions}}$.\rt{What if bgps and unions are nested? Limitation of your work?
(which is fine, but must be mentioned)} $Q_{\text{body}}$ is the Basic Graph Pattern (BGP) of the query, and $Q_{\text{unions}} = \bigcup Q_u$ represents the Union Graph Patterns (UGP), where $Q_u = q_0 \cup q_1 \cup q_2 \dots \cup q_n$. A tree star pattern $Q_{\text{starT}_i}$ is contained in $Q_s$ if its segment in $Q_{\text{body}}$ is contained in $Q_s$, and if its segment in at least one $q_i$ in each $Q_u$ is contained in $Q_s$. If $Q_{\text{starT}_i}$ is not part of any $Q_u$, then $Q_u$ is ignored. -In our containment problem, we ignore \texttt{GROUP BY} segments. +\rt{I would not mention everything after this, and instead at the start of this paragraph say that for simplicity of this formalization, we focus on just BGPs of triple patterns and unions.} +In our containment problem, we ignore \texttt{GROUP BY} segments. We make this choice because, in the context of shape queries, \texttt{GROUP BY} is primarily used to set a cardinality, and we are not attempting to identify sources that can fully answer specific segments of a query. Instead, our goal is to disregard data sources that are irrelevant to the query, given the constraints imposed by the shape index. Discriminating based on cardinalities could potentially affect query results. Additionally, this article does not consider filter expressions, as they can significantly increase the complexity of the problem. -Moreover, "false" negatives do not impact the correctness of our approach. +Moreover, "false" negatives do not impact the correctness of our approach. \rt{But this is important to mention indeed! But needs some more elaboration on the fact that doing more (useless) requests is not a problem.} However, incorporating filter expressions into future work would be an interesting direction to explore. 
% https://en.wikipedia.org/wiki/Master_theorem_(analysis_of_algorithms) \begin{algorithm}[h] - \caption{Determine if a tree star pattern is contained ($isContain_{T}$)}\label{alg:containmentTree} + \caption{Determine if a tree star pattern is contained ($isContain_{T}$)}\label{alg:containmentTree}\rt{I think I also mentioned this in my previous review, but you probably mean isContained? Also not sure what the T in there means.} \begin{algorithmic} \REQUIRE $Q_{star}$, $Q_{starT_i}$, $Q_s = Q_{s\text{body}} \cup Q_{s\text{unions}}$ and $Eval_{star}$ - \ENSURE \TRUE $ $ or \FALSE $ $ whether the tree star is contained in the shape + \ENSURE \TRUE $ $ or \FALSE $ $ whether the tree star \rt{pattern?} is contained in the shape \IF{$S_{star}(Q_{star}) \in Eval_{star}$} \RETURN \TRUE @@ -230,11 +241,11 @@ \subsection{Query Shape Containment}\label{sec:containment} \end{algorithm} -We define the function $isContain_{T}$ in Algorithm~\ref{alg:containmentTree} to evaluate whether a tree star with a root star pattern $Q_{star_i}$ from $Q_{starT_i}$ is contained in $Q_s$. +We define the function $isContain_{T}$ in Algorithm~\ref{alg:containmentTree} to evaluate whether a tree star \rt{pattern?} with a root star pattern $Q_{star_i}$ from $Q_{starT_i}$ is contained in $Q_s$. The algorithm also takes a set $Eval_{star}$ to track which partial tree stars have already been evaluated. The algorithm works by examining each triple pattern in the root star pattern $Q_{star_i}$ and checking if an equivalent triple pattern (ignoring the variable names) can be found in the BGP of $Q_s$ using the $match$ function. If the triple pattern cannot be found in the BGP, the algorithm then looks into the UGPs of $Q_s$. -Since we assume that the union statements are not nested (perhaps through rewriting), this limits the number of recursive calls. +Since we assume that the union statements are not nested (perhaps \rt{This \emph{perhaps} is not very scientific...} through rewriting. 
Either say it's a limitation, or cite a paper showing you can rewrite. (or prove yourself)), this limits the number of recursive calls. If an equivalent triple pattern is found, the algorithm checks whether the object of the triple pattern is the subject of a partial tree star pattern in $Q_{starT_i}$. If it is, the algorithm recursively applies the same procedure to this partial tree star pattern as $Q_{star_i}$. To avoid cycles and redundant evaluations when processing the object of the star patterns, we maintain a set of evaluated subjects in $Eval_{star}$. @@ -244,4 +255,4 @@ \subsection{Query Shape Containment}\label{sec:containment} This operation results again in a polynomial time complexity algorithm. %To solve $Q \sqsubseteq_{qs} S$, we need to consider the $n_{starT}$ tree star from the BGP with their $n_{starTu}$ segment in the UGP and $n_{starTui}$ BGP in the UGP. %This operation results in a polynomial time complexity of $O(n_o \times n_{tp}^2 \times n_{union}^2 \times n_{starT} \times n_{starTu} \times n_{starTui})$. -In the \nameref{sec:appendix} Algorithm~\ref{alg:containment} present the full resolution. +In the \nameref{sec:appendix} Algorithm~\ref{alg:containment} present the full resolution.\rt{What do you mean by "the full resolution"? Is it a different algorithm?} From b6a27157afa89f7288be760d3806295bba83d856 Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 10:58:15 +0100 Subject: [PATCH 6/9] Review experimental setup --- section/experiment.tex | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/section/experiment.tex b/section/experiment.tex index fa1144d..bfbf78b 100644 --- a/section/experiment.tex +++ b/section/experiment.tex @@ -41,24 +41,24 @@ \section{Experimental Setup} The implementation of our shape index approach is open source~\sepfootnote{sf:implementationComunica}, as well as our query-shape containment solver~\sepfootnote{sf:implementationQueryShapeContainment}. 
We use SolidBench~\cite{Taelman2023}, based on the LDBC social network benchmark~\cite{Angles2020}, to evaluate our contribution. To facilitate this, we created an open-source module~\sepfootnote{sf:shapeIndexGenerator} to generate shape indexes in SolidBench, based on user-provided mappings between ShEx shapes and data model objects. -The shape annotated portion of the data model includes posts, comments on posts, user profiles, cities, and likes. -The datasets are Solid Pods~\cite{Taelman2023} +The shape-annotated portion of the data model includes posts, comments on posts, user profiles, cities, and likes. +The datasets are Solid Pods~\cite{Taelman2023}. \rt{There's probably a better citation for Solid} Each Solid Pod contains alongside the data a shape index and separate files for each shape definition. Some shapes are nested within others. -For example, profiles are associated with cities, and comment are associated with posts. +For example, \rt{user profiles?} profiles are associated with cities, and comments are associated with posts. Depending on the pod instance, certain data model objects are materialized in a single file, while others are distributed across multiple files. -The entire data model and query templates are available online~\sepfootnote{sf:solidbench}. +The entire data model and query templates are available online~\sepfootnote{sf:solidbench}.\rt{I would say that queries simulate typical (read) actions in social media use cases, perhaps with an example.} -To evaluate our approach, we conducted the following experiments, we first measured the execution time and results of our query-shape containment algorithm using the shapes from the study.
-Then, We compared the Solid Pod network optimal traversal algorithm, which uses the LDP specification and type-index~\cite{Taelman2023}, with the LDP traversal algorithm~\cite{Taelman2023} and our shape index approach in a network where each Solid Pods (datasets) provides a complete shape index with the most descriptive shapes. -Additionally, we evaluated the adaptivity of our approach by reducing the shape index information across the network: +To evaluate our approach, we conducted the following experiments, we first measured the execution time and results of our query-shape containment algorithm using the shapes from the study. +Then, we compared the state-of-the-art Solid Pod network traversal algorithm, which uses the LDP specification and type-index~\cite{Taelman2023}, with the LDP traversal algorithm~\cite{Taelman2023} and our shape index approach in a network where each Solid Pod provides a complete shape index with the most descriptive \rt{which shapes are that?} shapes. +Additionally, we evaluated the adaptivity \rt{Not sure adaptivity is the right term here. Something like \emph{shape index variability} may be better.} of our approach by reducing the shape index information across the network: \begin{itemize} - \item We compared the impact of query execution time in a network where 0\%, 20\%, 50\%, and 80\% of Solid Pods expose a shape index. - \item We compared the impact of using shape indexes with 20\%, 50\%, and 80\% of entries using closed shapes. - \item We compared the impact of using shapes that incorporate only data from the Solid Pods, and shapes providing a minimal dataset description where the object constraints are always an IRI or a literal. + \item Query execution time in a network where 0\%, 20\%, 50\%, and 80\% of Solid Pods expose a shape index. \rt{Why mention a metric here, but not in the next two?} + \item Impact of shape indexes with 20\%, 50\%, and 80\% of entries using closed shapes. 
+ \item Impact of shapes that incorporate only data from the Solid Pods, and shapes providing a minimal dataset description where the object constraints are always an IRI or a literal.\rt{Unclear why this is needed, and what it would look like.} \end{itemize} -We used query templates from SolidBench, each with five instances varying the starting pod. -Experiments were repeated 50 times with a 2 minute timeout (120,000 ms). +We used query templates from SolidBench, each replicated five times with varying starting pods. +Experiments were repeated 50 times with a 2 minute timeout. They were conducted on an Ubuntu 20.04.6 LTS machine with a 2x Hexacore Intel E5645 CPU and 24GB RAM. All experiments are reproducible, with raw data and complementary materials available online~\sepfootnote{sf:complementaryMaterial}. From f4f3a83913fdfde1fe36526b74eac706f3d8219b Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 11:22:42 +0100 Subject: [PATCH 7/9] Review results --- section/results.tex | 64 +++++++++++++++++++++++++-------------------- 1 file changed, 36 insertions(+), 28 deletions(-) diff --git a/section/results.tex b/section/results.tex index 13b0c6b..059c568 100644 --- a/section/results.tex +++ b/section/results.tex @@ -1,4 +1,4 @@ -\section{Results} +\section{Discussion of Results} \subsection{Evaluation Against Other Approaches} @@ -7,10 +7,11 @@ \subsection{Evaluation Against Other Approaches} \includegraphics[width=0.45\linewidth]{analysis/artefact/variation_approach/reduction_query_execution_time} \caption{ This figure compares the type index with the LDP approach against other approaches. - The shape index approach is better or similar to the other approaches except with S4. + The shape index approach is faster or similar to the other approach except for S4. } \label{fig:compApproach} \end{figure} +\rt{Same comment as before, can we use execution time instead of ratio on the Y-axis? 
The results are not intuitive atm.} Figure~\ref{fig:compApproach} shows that the shape index approach for all query templates except S4 performs better or comparably to the Solid Pod network optimal traversal algorithm. @@ -20,16 +21,17 @@ \subsection{Evaluation Against Other Approaches} Query templates D6 and D7 show no reduction because they require nearly every document in the dataset to be processed by the engine, making our approach ineffective in these cases. We notice that query template S4 with the shape index performed worse in every instance, with an increase in query execution time of up to 2.80 times. This is further illustrated in Table~\ref{tab:ratioUsefulResources}, which shows that for these queries, the Solid Pod network optimal traversal algorithm achieves a ratio of useful resources dereferenced of 100\% or 50\%, compared to only 6\% with the shape index approach. -The poor performance is due to the fact that other approaches rely solely on the reachability \texttt{Cmatch}~\cite{hartig2016walking} to achieve completeness, without leveraging the structural properties of the datasets. +The poor performance is due to the fact that other approaches rely solely on the reachability \texttt{Cmatch}~\cite{hartig2016walking} \rt{I think it's c\_match} to achieve completeness, without leveraging the structural properties of the datasets. In contrast, the shape index approach always enforces the use of these properties, resulting in additional HTTP requests and increased processing time. It has to be highlighted that our formalization assumes that structural assumptions are always used and that the shape index, if present, will be part of the traversal, see section~\ref{sec:sourceSelection}. However, those queries were already fast, with the Solid Pod network optimal traversal algorithm approach being executed in approximately 0.30\% of the maximum execution time allowed. 
Nonetheless, these results still highlight a category of queries and networks for which our approach is not well-suited. +\rt{Separate subsection here.} The empirical evaluation of the query-containment algorithm shows that its execution time with the more detailed shapes from our experiment is negligible, with a maximum execution time of 4.655 ms (0.0039\% of the timeout). -The result tables are available in the supplementary material. +The result tables are available in the supplementary material \rt{Reference needed}. This outcome is expected, as the algorithm has polynomial time complexity, and the shapes and queries in the experiments are small and not deeply nested. -This suggests that the primary overhead in this approach is not from the algorithm itself but likely from the state retention required for the pruning reachability criteria. +This suggests that the primary overhead in this approach is not from the algorithm itself but likely from the \rt{No idea what the following is...} state retention required for the pruning reachability criteria. \begin{figure}[htbp] \centering @@ -55,27 +57,29 @@ \subsection{Evaluation Against Other Approaches} \label{fig:http_req_exec_time_cor} \end{figure} -To determine the relationship between the reduction of HTTP requests and the query execution time, we evaluated their ratio using +\rt{Another new subsection.} +To determine the relationship between the reduction of HTTP requests and the query execution time, we evaluated their ratio \rt{correlation?} using the data of our experiments with the Solid Pod network optimal traversal algorithm results as the baseline. -We use Figure~\ref{fig:http_req_exec_time_cor} as a tool of analysis. -The relationship between HTTP request and query execution time can be divided into two regimes. +We use Figure~\ref{fig:http_req_exec_time_cor} as a tool of analysis \rt{Figures are usually the results of an analysis, not the tool to achieve them.}. 
+The relationship between HTTP request and query execution time can be divided into two regimes.\rt{How do you divide them?} In the first regime (left figure), where the shape index approach reduces the number of HTTP requests, we notice a positive linear correlation with a -Pearson correlation coefficient (PCC) of 0.84 and a high statistical significance given a p-value of 3.00E-93. -However, evaluating an $R^2$ score with an exponential best fit curve we get a score of 0.72 and 0.71 for a linear curve. +Pearson correlation coefficient (PCC) of 0.84 and a high statistical significance given a p-value of 3.00E-93\rt{No need to be so precise, you can just say $< 0.01$}. +However, evaluating an $R^2$ score with an exponential best fit curve we get a score of 0.72 and 0.71 for a linear curve.\rt{Is Pearson not enough? Why this one as well? Can we stick to just one?} We can notice that toward the end, the curve appears to exhibit more exponential behavior. Below a ratio of approximately 0.83 of HTTP requests, the shape index approach did not guarantee a reduction in query execution time. -With this obsevation, the exponential behavior might be explained by the approach's overhead. +With this observation, the exponential behavior might be explained by the approach's overhead. It is possible that with a small reduction of HTTP requests, the state retention required for the pruning reachability criteria could offset the gain in the reduction of HTTP requests. In the second regime (right figure), the shape index increases the number of HTTP requests. We notice a weaker positive linear correlation with a PCC of 0.44 and a high statistical significance given a p-value of 9.01E-05 though the significance is lower than in the first regime. The overall correlation between reducing HTTP requests and query execution time is positively linear, with a PCC of 0.56 and a high statistical significance with a p-value of 5.83E-36. 
The overall correlation is more linear than exponential with $R^2$ scores respectively of 0.31 and 0.24, however due to the low score it is difficult to determine the nature of the distribution. -Explaining the two regimes' behavior exhibited by the data is challenging. +Explaining the two regimes' behavior exhibited by the data is challenging.\rt{Same comment as before; pretty sure RubenE's work can help explaining.} A possible explanation can be the lack of samples when the shape index approach performs poorly. However, we can also notice that the relationship between the two variables in the first regime is closer to one-on-one (slope of approximately 0.91) than in the second regime (slope of approximately 0.08), where the ratio of HTTP requests has less of an impact. Futhermore, the number of HTTP requests increased by 16 times compared to the baseline, which initially required only 1 or 2 requests (see Table ~\ref{tab:ratioUsefulResources}). However, despite this increase, the engine ultimately handled a modest number of links to dereference. -This indicates that the actual impact might not be as significant as the figure suggests, particularly considering that HTTP requests are made in concurrence. +This indicates that the actual impact might not be as significant as the figure suggests, particularly considering that HTTP requests are done concurrently. +\rt{I think this paragraph can be shortened quite a bit if you run into space issues.} %This observation can lead us to question how complex the queries are in that regime; queries where the shape index increases vastly in the number of HTTP requests are the queries from the S4 template. %However, those queries were already answered quickly and consisted of only four triple patterns and a union statement (with the alternative property path).
%Thus, it is possible that the number of HTTP requests has less of an impact because it is easier for the engine to perform the join operation upon reception of the data than when processing more complex queries. @@ -89,37 +93,41 @@ \subsection{Evaluation Against Other Approaches} \centering \includegraphics[width=1\linewidth]{analysis/artefact/variation_shape_index_all/plot} \caption{ - Shape index approaches tend to perform less effectively with limited network information and comparatively better where the baseline shape index underperforms. + Shape index approaches tend to perform less effectively with limited network information and comparatively better where the baseline shape index underperforms.\rt{Seems like a very negative comment for results that are actually very positive? Try to make this more glass-half-full.} However, the utility of shape index information can vary depending on the specific queries and network characteristics. } \label{fig:adaptShapeIndex} -\end{figure} +\end{figure}\rt{Use exec times instead of ratios.} -\subsection{Evaluation of the Adaptivity} +\subsection{Evaluation of the Adaptivity}\rt{New title suggestion: Impact of variability in shape indexes} The final part of the results analysis focuses on the adaptivity of the shape index approach. In this analysis, we examine the impact of reducing the shape index information in the network and compare the results with a network in which all pods are exposed to detailed, complete shape indexes. Figure~\ref{fig:adaptShapeIndex} presents three plots that illustrate the results of our evaluation of the approach's adaptivity. -The plot on the left shows the variation in the percentage of shape index across the network. +The plot on the left \rt{Reference it} shows the variation in the availability of shape indexes across the network. 
As expected, we observe that queries that performed better in Figure~\ref{fig:compApproach} tend to perform worse with reduced shape index information, while queries that performed poorly improve. Queries that were unaffected by the shape index changes remain unaffected. -The plot in the middle shows the variation in the percentage of shape index entries using closed shapes. + +\rt{I did not really understand the following paragraph.} +The plot in the middle \rt{Reference it} shows the variation in the percentage of shape index entries using closed shapes. The results here are more nuanced. -While there is a general trend for query evaluations with a lower percentage of closed shapes to behave similarly to the plot on the left, we also observe both performance gains and a drastic performance loss for query S1 when 80\% of the shape entries are closed. -The performance gain occurs because not every entry needs to be closed to affect query performance. -Entries mapped to an open shape are always considered relevant. -If the containment resolution leads to the same conclusion then if the entry is closed the execution will be more expensive. -Additionally, shapes can be nested, and when open shapes are used, we do not need to dereference the nested shapes. +While there is a general trend for query evaluations with a lower percentage of closed shapes to behave similarly to the plot on the left \rt{Reference it}, we also observe both performance gains and a drastic performance loss \rt{Is it gain or loss? Can't be both at the same time.} for query S1 when 80\% of the shape entries are closed. +The performance gain occurs because not every entry needs to be closed to affect query performance.\rt{I don't understand this.} +Entries mapped to an open shape are always considered relevant.\rt{Explain why} +If the containment resolution leads to the same conclusion than if the entry is closed, the execution will be more expensive. 
\rt{Why?} +Additionally, shapes can be nested, and when open shapes are used, we do not need to dereference the nested shapes.\rt{Why?} For query S1, with 80\% of closed shape entries, the performance lost was due to random chance, as the discriminatory entries were provided with open shapes in multiple instances when looking at the raw data. -The right plot shows the variation in the level of detail of the shapes. -Most queries tend to perform similarly or better, with the exception of S1. -Upon analyzing the output of our query containment algorithm, we observe that the additional information provided in our base approach does not affect the algorithm’s results. -However, the engine needs to dereference more shapes, which can decrease the execution time. + +The right plot \rt{Reference it} shows the variation in the level of detail of the shapes. +Most queries tend to perform similarly or better \rt{when increasing or reducing level of detail?}, with the exception of S1. +Upon analyzing the output of our query containment algorithm, we observe that the additional information \rt{What information? Do you mean more precise shapes?} provided in our base approach does not affect the algorithm’s results. +However, the engine needs to dereference more shapes, which can decrease the execution time.\rt{Is that correct? Dereferencing more should increase exec time I guess?} Query S1 is the only one where the added information can discriminate multiple parts of the datasets' domain, indicating the intuitive results that, in some situations, adding more information can be beneficial. -This sensitivity to the quality of the information in the index also helps explain the results for S1 in the middle plot. +This sensitivity to the quality \rt{What is "the sensitivity to the quality"?} of the information in the index also helps explain the results for S1 in the middle plot.
In that case, the query engine still had to dereference sources from each dataset, and the information available was likely insufficient to significantly discriminate between sources. \subsection{Validation of Hypotheses} +\rt{I think you don't need this section if you just incorporate it in the subsections above.} In this section, we revisit our hypotheses. \textbf{H1} is mostly valid as the shape index reduces HTTP requests for queries using structural properties, though it can drastically increase requests when such properties are not used. \textbf{H2} is valid when HTTP requests are reduced, but near a ratio of 0.83, the curve is more exponential toward an increase in execution time. From 8496153a955bf241c7d39d25689f2f966609f966 Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 11:28:31 +0100 Subject: [PATCH 8/9] Review conclusions --- section/conclusion.tex | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/section/conclusion.tex b/section/conclusion.tex index 404d06d..c4c3832 100644 --- a/section/conclusion.tex +++ b/section/conclusion.tex @@ -2,8 +2,10 @@ \section{Conclusion} In this article, we minimized the search space for link traversal queries by leveraging a shape index and addressing the query-shape containment problem. Additionally, we introduced pruning in LTQP by extending the concept of reachability criteria. -Using the Solidbench benchmark, we demonstrated that our approach can improve query execution times by up to 7 times. -Our adaptive method effectively handles scenarios with reduced information in shape indexes or their partial absence in a network. +Using the Solidbench benchmark, we demonstrated that our approach can significantly reduce the number of HTTP requests, which leads to query execution time reductions by up to 7 times. 
+Our adaptive method effectively handles scenarios with reduced information in shape indexes or their partial absence in a network.\rt{This sentence can go, as it does not say a lot here.} This study highlights that shape-based pruning can be highly effective for LTQP in decentralized environments with structural properties, especially for selective queries. These findings are particularly relevant for decentralization initiatives that aim to enable users or third-party clients to perform efficient queries over large, diverse networks. Future work could explore further advancements in this area, such as enhancing query planning in LTQP~\cite{taelman2024towards} with RDF data shapes. + +\rt{I still have some open questions after reading the full article: What is the impact on server load (CPU usage)? Does adding shapes lead to an increase of that? Because if not, you could conclude here by saying that adding shapes adds significant benefits to the client, with no overhead to the server, except for a (possibly offline) process for shape creation/derivation. Also, does adding shapes influence query result arrival times negatively or positively?} \ No newline at end of file From eb01a5959a69af7f516bffd821574586f7527e9b Mon Sep 17 00:00:00 2001 From: Ruben Taelman Date: Tue, 10 Dec 2024 11:30:42 +0100 Subject: [PATCH 9/9] Add TODO in related work --- section/related_work.tex | 1 + 1 file changed, 1 insertion(+) diff --git a/section/related_work.tex b/section/related_work.tex index 1e2c24d..e2ea1eb 100644 --- a/section/related_work.tex +++ b/section/related_work.tex @@ -12,6 +12,7 @@ \section{Related Work} However, they do not explicitly address mechanisms for restricting certain links in what we could call a pruning process\rt{That's not entirely accurate. SWSL does pruning as well, but it's more pruning for trust (see the Carol's wrong name for Bob example), while you do pruning based on shapes and the current query. Let's elaborate a bit on this difference here.}. 
Link pruning can be very useful to reduce the search domain of queries when information about the data model of DKG is found. \rt{Emphasize why pruning matters for your work} +\rt{A paragraph/subsection is needed explaining the related work around LTQP in DESPs (contrasting to LTQP in Linked Open Data).} RDF data shapes are used for validating, describing, and communicating data structures, as well as generating data and driving user interfaces~\cite{Gayo2018a,Gayo2018}. The two most well known RDF data shape formalisms are SHACL and ShEx.\rt{Cite the specs}