Context and Current Terminology
We currently use inconsistent terminology across different parts of our data infrastructure, especially in how we describe content from various sources (forums, mailing lists, transcripts, etc.). This inconsistency complicates understanding the system, maintaining the codebase, and onboarding new team members.
Issues with Current Terminology:
Inconsistent terms: Different sources use different names for similar concepts (e.g., "Topics" vs. "Questions" vs. "Threads").
Confusion in usage: Terms such as "board", "topic", "post", "thread", "article", and "email" are used almost interchangeably across scrapers and systems, even when they represent slightly different things.
Scalability: As the system grows with more sources and resources, the lack of a well-structured terminology/taxonomy makes it harder to add new content types or adapt the system for new use cases.
Lack of clarity for reusable components: These inconsistencies make it harder to build generic components that work across all sources.
Proposal for Standardized Terminology
To reduce confusion and create a clearer, more maintainable system, we can use the following standardized terms for the structure of our data infrastructure.
Proposed Terminology:
Source:
Represents the overarching collection of content from a particular platform or system.
Resource:
The main content entity within a source. A resource is the primary unit that contains smaller elements.
Examples: Thread (in forums), Transcript (in transcripts), Question (in Q&A), Email Thread (in mailing lists), Newsletter (in newsletters).
Item (Optional Chunking):
A smaller, self-contained unit within a resource. Not all resources need to be chunked, but when applicable, this term will refer to the smaller parts of a resource.
Examples: Post (in a thread), Chapter (in a transcript), Answer (in Q&A), Email (in an email thread), Section (in newsletters).
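The three levels can be sketched as a small data model. This is only an illustration of the proposed hierarchy, not existing code: the class layout and the field names (`item_id`, `resource_id`, `title`, `body`) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A smaller, self-contained unit within a resource (e.g. a post or an email)."""
    item_id: str  # hypothetical field name
    body: str     # hypothetical field name


@dataclass
class Resource:
    """The main content entity within a source (e.g. a thread or a transcript)."""
    resource_id: str  # hypothetical field name
    title: str        # hypothetical field name
    # Chunking is optional, so a resource may have no items at all.
    items: list[Item] = field(default_factory=list)


@dataclass
class Source:
    """The overarching collection of content from one platform or system."""
    name: str
    resources: list[Resource] = field(default_factory=list)


# A chunked resource: a thread made of posts.
thread = Resource(
    resource_id="thread-1",
    title="Example thread",
    items=[Item("post-1", "First post"), Item("post-2", "A reply")],
)

# An unchunked resource: a blog post with no items.
blog_post = Resource(resource_id="blog-1", title="Example blog post")

forum = Source(name="Example forum", resources=[thread])
```

Keeping `items` empty by default reflects that chunking is optional: a blog post is still a valid Resource without any Item.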
Why This Matters
Clarity and Consistency: By using uniform terms across all sources, we reduce confusion for anyone working with the system and for future developers.
Improved Maintainability: With standardized terms, our codebase becomes easier to navigate and maintain, leading to fewer errors and smoother development.
Reusable Components: Generic components can be built more effectively when the underlying terminology is consistent. This will allow us to handle a variety of content types without needing custom solutions for each source.
Easier Onboarding: New team members will have a clearer understanding of how the system is organized, making onboarding quicker and more efficient.
Mapping the Terminology
To further illustrate the proposal, here’s how different sources will be structured under this new terminology:
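| Source              | Resource     | Item (Optional) |
|---------------------|--------------|-----------------|
| BitcoinTalk         | Topic        | Post            |
| Mailing-lists       | Email Thread | Email           |
| Bitcoin Transcripts | Transcript   | Chapter         |
| Stack Exchange      | Question     | Answer          |
| Delving Bitcoin     | Topic        | Post            |
| BitcoinOps          | Newsletter   | Section         |
| Individual Blogs    | Blogpost     | (none)          |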
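A generic component could consume this mapping as a lookup table. The sketch below is one possible encoding, not part of the current codebase: the names `TERMINOLOGY` and `standard_term` are hypothetical, and `None` is used here to mean "this source's resources are not chunked".

```python
from typing import Optional

# Source name -> (resource term, item term).
# None as the item term means the resource is not chunked.
TERMINOLOGY: dict[str, tuple[str, Optional[str]]] = {
    "BitcoinTalk": ("Topic", "Post"),
    "Mailing-lists": ("Email Thread", "Email"),
    "Bitcoin Transcripts": ("Transcript", "Chapter"),
    "Stack Exchange": ("Question", "Answer"),
    "Delving Bitcoin": ("Topic", "Post"),
    "BitcoinOps": ("Newsletter", "Section"),
    "Individual Blogs": ("Blogpost", None),
}


def standard_term(source: str, local_term: str) -> str:
    """Translate a source-specific term into "Resource" or "Item"."""
    resource_term, item_term = TERMINOLOGY[source]
    if local_term == resource_term:
        return "Resource"
    if item_term is not None and local_term == item_term:
        return "Item"
    raise ValueError(f"{local_term!r} is not a known term for {source!r}")


print(standard_term("Stack Exchange", "Question"))  # Resource
print(standard_term("BitcoinTalk", "Post"))         # Item
```

Raising on unknown terms, rather than guessing, makes a scraper that still uses an unmapped name fail loudly.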
Addressing Metadata and Resource Reference Issues
Metadata Inconsistencies:
We have inconsistencies in the way metadata is defined across scrapers, particularly in the type and thread_url fields:
type field:
BitcoinTalk: type="topic" if message_number == #1, else type="post"
BitcoinOps: type="topic" for documents in _topics/en, type="post" for documents in _posts/en
Stack Exchange: type="answer" or type="question"
Mailing-lists, Delving Bitcoin: type="original_post" or type="reply"
Bitcoin Transcripts: no type field is used
thread_url field:
Used in: Stack Exchange, Delving Bitcoin, Mailing-lists
Not used in: BitcoinTalk
To resolve this, we will adopt consistent metadata field definitions across all sources, ensuring that a type field and a consistent way to reference resources are applied uniformly.
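One way to enforce uniform fields is a small check that every scraper runs before emitting a document. The sketch below assumes documents are plain dictionaries and that `type` and `thread_url` are the required fields. The helper name `missing_fields` and the sample documents are hypothetical.

```python
# Fields every scraped document is expected to carry (assumption for this sketch).
REQUIRED_FIELDS = ("type", "thread_url")


def missing_fields(document: dict) -> list[str]:
    """Return the required metadata fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not document.get(name)]


# Hypothetical sample documents shaped like the cases listed above.
stack_exchange_doc = {
    "type": "answer",
    "thread_url": "https://example.com/questions/1",
}
bitcointalk_doc = {"type": "post"}  # thread_url is not set by this scraper today

print(missing_fields(stack_exchange_doc))  # []
print(missing_fields(bitcointalk_doc))     # ['thread_url']
```

A check like this only verifies that the fields are present. Agreeing on the allowed values of `type` is a separate decision.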
Resource Reference Problem:
One key issue is the lack of a clear way to refer to a Resource. Currently, we treat the first post or element of a resource (e.g., the first post in a thread) as a reference point, but this leads to problems. For instance:
Thread Summaries (Combined Summaries): When we create summaries for threads, there is no document that represents the thread itself. This forced us to generate a separate summary document, which added complexity to the infrastructure.
By defining a Resource as the main reference point, we create a clear structure for referring to the full entity (e.g., the thread as a whole, rather than just its first post). This approach allows us to handle ranking algorithms, mappings across resources, and summaries more effectively, without the added complexity of treating parts of a resource as a substitute for the whole.
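As a rough illustration of a resource-level reference point, the sketch below groups item documents by `thread_url` and builds one record per resource that a summary can attach to. The record layout (`url`, `item_ids`, `summary`), the `id` field on items, and the `"resource"` type value are all assumptions made for the example, not the existing schema.

```python
from collections import defaultdict


def group_by_resource(items: list[dict]) -> dict[str, dict]:
    """Build one resource record per thread_url, referencing its items."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        grouped[item["thread_url"]].append(item)

    return {
        url: {
            "type": "resource",  # hypothetical type value
            "url": url,
            "item_ids": [member["id"] for member in members],
            "summary": None,  # a thread summary would be stored here
        }
        for url, members in grouped.items()
    }


# Hypothetical item documents from two threads.
items = [
    {"id": "p1", "thread_url": "https://example.com/t/1"},
    {"id": "p2", "thread_url": "https://example.com/t/1"},
    {"id": "p3", "thread_url": "https://example.com/t/2"},
]

resources = group_by_resource(items)
print(resources["https://example.com/t/1"]["item_ids"])  # ['p1', 'p2']
```

With a record like this, the summary belongs to the thread as a whole, so the first post no longer has to stand in for it.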
Next Steps
Adopt the standardized terminology in all parts of the system where sources and resources are handled.
Update existing documentation to reflect the new terminology.
Standardize metadata fields across all scrapers, especially for type and thread_url.
In a follow-up proposal, we'll discuss how resources can be optionally chunked into smaller documents for improved semantic search and processing.