
Allow site configuration to not index tag pages #10835

Open · 1 of 2 tasks
undergroundwires opened this issue Jan 13, 2025 · 3 comments
Labels
feature: This is not a bug or issue with Docusaurus, per se. It is a feature request for the future. · status: needs triage: This issue has not been triaged by maintainers

Comments

undergroundwires commented Jan 13, 2025

Have you read the Contributing Guidelines on issues?

Description

Solution

Proposed API:

User experience: add a new flag to DocusaurusConfig in the docusaurus.config file, such as deindexTags: true.
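
For illustration, a sketch of what this could look like in docusaurus.config.ts (note: deindexTags is the proposed option and does not exist today, so this is illustrative only and would not type-check against the current Config type):

// docusaurus.config.ts — sketch of the proposed API.
// `deindexTags` is hypothetical; the option does not exist yet.
export default {
  title: 'My Site',
  url: 'https://example.com',
  baseUrl: '/',
  deindexTags: true, // proposed: emit noindex metadata on all tag pages
};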

Proposed changes:

Behavior Changes

  • Tag URL <a> elements will have rel="noindex nofollow" attributes on tag list pages. Update Tag to do the check from siteConfig.
  • The tag list page (root component of the tags list page) will have <meta name="robots" content="noindex, nofollow">. Update DocTagsListPage and BlogTagsListPage to add the Head with the noindex meta.
  • The tag page (root component of the "containing tag X" page) will have <meta name="robots" content="noindex, nofollow">. Update DocTagDocListPage and BlogTagsPostsPage to add the Head with the noindex meta.
  • The sitemap ignores /tags/**: the theme sets sitemap.ignorePatterns: ['${tagsBasePath}/**'] (a config sketch follows below).
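
For the sitemap part, a rough sketch of what the theme could set internally, expressed with the existing ignorePatterns option of @docusaurus/plugin-sitemap (the /docs/tags and /blog/tags patterns below assume default base paths; adjust to your tagsBasePath):

// docusaurus.config.ts — excluding tag pages from the generated sitemap
// using the sitemap plugin's existing `ignorePatterns` option.
export default {
  title: 'My Site',
  url: 'https://example.com',
  baseUrl: '/',
  presets: [
    [
      'classic',
      {
        sitemap: {
          // Glob patterns for pages to omit from sitemap.xml.
          ignorePatterns: ['/docs/tags/**', '/blog/tags/**'],
        },
      },
    ],
  ],
};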

Motivation

Why

Tag pages are thin/low quality, creating duplicated content.
This leads to search engines scoring the website lower, or indexing tag pages before the specific pages.

Google says:

Block crawling of duplicate content on your site, or unimportant resources (such as small, frequently used graphics such as icons or logos) that might overload your server with requests. Don't use robots.txt as a mechanism to prevent indexing; use the noindex tag or login requirements for that

I have solved this by wrapping/swizzling the list pages and tag components, and adding a custom sitemap.ignorePatterns rule in the config file, but that's a lot of workaround; a best practice like this would be appreciated as a default.

Background

This has led to issues for me with all search engines for privacylearn.com to present open-source scripts pre-launch.

Siteliner analysis: (screenshot)

Google indexing status: (screenshot)

Engines include Google, Yandex, and Bing, where thousands of my pages got de-indexed over time and tag pages took priority over proper pages.

API design

No response

Have you tried building it?

No response

Self-service

  • I'd be willing to contribute this feature to Docusaurus myself.
slorber (Collaborator) commented Jan 16, 2025

The concept of tags is plugin-specific, so it can only be a plugin config, not a core site config.

The sitemap plugin does not need to be modified, it already ignores all pages having noindex by default, so adding noindex meta on rendered pages is enough.


Tag pages are thin/low quality

Is this really that different from regular blog pagination, and blog authors pages for example?

In the future, we may allow you to provide an MDX file with a valuable tag description:
https://docusaurus.io/blog/tags/release

Similar for blog authors:
https://docusaurus.io/blog/authors/slorber

Image

For now, it's just a string description, but we'd like to allow MDX so that at least the 1st paginated page could contain meaningfully unique content.

creating duplicated content

Afaik we are using canonical URLs and structured data so search engines know that this is not to be considered as duplicate content.

This leads to search engines scoring the website lower

Can you prove it? Show your SEO score before/after the change.

Or share a resource from an authority such as Google explaining why this is a bad practice that decreases the score.

Note that docs tags barely create duplicate content because they only render title + description:

https://docusaurus.io/tests/docs/tags

Unlike blog posts, which present an excerpt that you can truncate using a {/* truncate */} marker.

As far as I understand, you only use docs tags:
https://privacylearn.com/tags/remove-windows-apps

I'm not sure that creating such an index with relatively small excerpts is going to be considered "duplicate content".

Maybe the problem is that you are not using frontMatter.description and are relying on the "inferred" description, which is relatively long and doesn't really look good. This looks messy, and as a search engine I'd probably penalize that:

(screenshot)

I find it surprising that you are presenting here an advanced SEO topic, while you are not even using the most basic SEO metadata correctly on your canonical pages 😅

Indexing tag pages before the specific pages.

Specific pages receive more backlinks from external domains, and also more links from internal pages, since paginated pages only receive links from previous pages while all paginated pages link to the actual canonical pages. I doubt this is the case, so please prove it or provide an authority link explaining this behavior.

This has led to issues for me with all search engines for privacylearn.com to present open-source scripts pre-launch.

I don't understand what you mean here 😅

Engines include Google, Yandex, and Bing, where thousands of my pages got de-indexed over time and tag pages took priority over proper pages.

How am I supposed to see this in the screenshots above? Please clarify.

We don't have such a problem on other websites I manage.

(screenshot)


I'd be happy to improve SEO for Docusaurus.

This includes providing APIs to alter the SEO behavior, and/or providing more sensible defaults.

However, I do not take this lightly. Changing the SEO profile of thousands of existing Docusaurus websites is risky and could backfire. That's why I'm going to push back a lot and ask you to back your claims better.

We need to run experiments, and measure the SEO impact before/after to see if a change is worth generalizing, or an opt-in feature worth implementing.

I also usually ask other community members with SEO expertise, or who care about SEO, to confirm such a change is welcome, such as @jdevalk or @johnnyreilly.

Afaik @johnnyreilly uses tags, regularly monitors his website's SEO while working with an SEO agency, and has not reported this problem:
https://johnnyreilly.com/tags

undergroundwires (Author) commented Jan 16, 2025

Hi,

Thank you for your detailed response and for diving deeper into this topic.
I've had a great experience with Docusaurus maintainers, and among the static site generators I've used, Docusaurus is by far the best with regard to SEO features and extensibility.
A great work of engineering.


Author pages

Good point.
They should follow the same paradigm we choose for tag pages.

E-E-A-T (Experience, Expertise, Authoritativeness, Trust) signals matter for user trust and for SEO.
So I think, from an SEO point of view alone, docs pages should allow specifying authors.
I guess this is tracked in #6218.


New idea for evaluation: Use ItemList structured data:

To help search engines understand that category pages with DocCardList are summaries of other pages, I've implemented ItemList structured data.
Google refers to these as carousel pages: https://developers.google.com/search/docs/appearance/structured-data/carousel

See source code:

import {
  useCurrentSidebarCategory,
  filterDocCardListItems,
  findFirstSidebarItemLink,
} from '@docusaurus/plugin-content-docs/client';
import { PageType } from '@site/src/components/PageType';
import { toAbsoluteUrl } from '@site/src/components/Utilities/ToAbsoluteUrl';

// Read by:
//  - Google: https://developers.google.com/search/docs/appearance/structured-data/carousel
//    However, according to the URL Inspector Tool (tested Jan 12, 2025), the data was not read.
//    According to the docs, it only results in rich results for specific types of pages, but
//    we still use it to "hint" Google how the page works.
export function getChildrenItemListStructuredData(pageType: PageType): ItemList | null {
  if (pageType !== 'collection' && pageType !== 'category') {
    return null;
  }
  return {
    '@context': 'https://schema.org',
    '@type': 'ItemList',
    itemListElement: collectAllChildrenUrls().map((href, idx): ListItem => ({
      '@type': 'ListItem',
      position: idx + 1,
      url: toAbsoluteUrl(href),
    })),
  };
}

function collectAllChildrenUrls(): string[] {
  const category = useCurrentSidebarCategory();
  const filteredItems = filterDocCardListItems(category.items);
  return filteredItems
    .map(findFirstSidebarItemLink)
    .filter((href): href is string => href !== undefined);
}

interface ListItem {
  readonly '@type': 'ListItem';
  readonly position: number;
  readonly url: string;
}

interface ItemList {
  // Format:
  //  - https://schema.org/ItemList
  //  - https://developers.google.com/search/docs/appearance/structured-data/carousel#summary
  readonly '@context': 'https://schema.org';
  readonly '@type': 'ItemList';
  readonly itemListElement: readonly ListItem[];
}

Adding this structured data to tag/author pages could be a less disruptive approach, considering your point about changing SEO profiles of existing sites.
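
For completeness, a minimal sketch of how this structured data could be injected into a page's <head> (the component name and the '@site/src/components/StructuredData' import path are hypothetical; getChildrenItemListStructuredData and PageType come from the snippet above):

import React, {type ReactNode} from 'react';
import Head from '@docusaurus/Head';
// Hypothetical paths: adjust to wherever the snippet above lives in your site.
import { getChildrenItemListStructuredData } from '@site/src/components/StructuredData';
import type { PageType } from '@site/src/components/PageType';

// Renders the ItemList JSON-LD into <head> when the page qualifies.
export function ItemListStructuredData({ pageType }: { pageType: PageType }): ReactNode {
  const structuredData = getChildrenItemListStructuredData(pageType);
  if (structuredData === null) {
    return null;
  }
  return (
    <Head>
      <script type="application/ld+json">{JSON.stringify(structuredData)}</script>
    </Head>
  );
}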


sitemap plugin does not need to be modified

Confirmed this works, thank you for the suggestion.
I've removed my custom code handling this.


About nofollow, noindex on tag anchors:

After reading more on this, I think we may still want to add them so as not to consume crawl budget. If a page has noindex, links to it should have nofollow so Google does not need to open the page, see that it has noindex, and leave, which saves crawl budget. Apart from consuming crawling resources, there seem to be no other downsides to not having these attributes (a sketch of such a tag link follows the list):

  • We do not need them because Google has confirmed that noindex pages don't pass link signals anyway
  • Adding nofollow wouldn't change anything in terms of link equity
  • The internal navigation links are legitimate site structure
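
For reference, a hypothetical sketch of a swizzled Tag component that adds the attribute (prop names follow theme-classic's Tag, but treat the exact Props shape as an assumption; this simplified rendering also drops the count badge):

import React, {type ReactNode} from 'react';
import Link from '@docusaurus/Link';
import type {Props} from '@theme/Tag';

// Hypothetical: render tag links with rel="nofollow" so crawlers skip
// pages that carry noindex anyway, saving crawl budget.
export default function Tag({permalink, label}: Props): ReactNode {
  return (
    <Link href={permalink} rel="nofollow">
      {label}
    </Link>
  );
}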

Afaik we are using canonical URLs and structured data

This can be double-edged.
Checking the "Duplicate, Google chose different canonical than user" report in Search Console shows that Google has never chosen a tag list page over the actual document.
However, this may signal to Google that tag pages own their content rather than being summary pages.


Can you prove it? Show your SEO score before/after the change.

I was wrong to sound certain here.
It was speculation that this was one of the reasons for my unsuccessful SEO.
Unfortunately, I cannot prove it because Google's ranking/indexing takes a long time to change, and during this time I've been making a lot of changes to the site's structured data and content (I also deindexed about 50% of the website), so I wouldn't know which change leads to what.

Because Google’s ranking and indexing system is a complex, ever-changing “black box” (due to non-linear AI analysis), any data we show would have limited long-term value. That's why I quoted the Google guidelines: they are the only source of truth we have, rather than speculation. And we know for sure that tag pages have zero "meaningfully unique content", as you put it.

Regarding tag page indexing: I observed tag pages being indexed and ranked higher than main content pages, while the main content was either not indexed or ranked lower:

(screenshot)


using frontMatter.description and doc tags

You're correct that I only use docs due to the technical nature of the content.
And thank you a lot for sharing that page.
It looks terrible.
I'll add a processing step to humanize and simplify the tables, or just remove them.

While I could add manual descriptions, my website in question pulls docs from an external source (privacy.sexy) that updates frequently, making separate description maintenance challenging.
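
As a sketch of what such a processing step could look like (this is a hypothetical helper, not existing code; it derives a short plain-text description from the first non-table, non-heading paragraph of an external doc):

// Hypothetical build-time helper: derive frontMatter.description from
// externally sourced markdown, skipping tables and headings.
function deriveDescription(markdown: string, maxLength = 160): string {
  const firstParagraph = markdown
    .split(/\n{2,}/)
    .map((block) => block.trim())
    .find((block) => block.length > 0 && !block.startsWith('|') && !block.startsWith('#'));
  // Strip basic markdown syntax to get plain text.
  const plain = (firstParagraph ?? '').replace(/[*_`[\]()>#]/g, '').trim();
  return plain.length > maxLength ? `${plain.slice(0, maxLength - 1)}…` : plain;
}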

P.S.: The privacy.sexy community recommended Docusaurus, and I'm increasingly appreciating its clean architecture and documentation.


My workaround to deindex tag pages:

It may be helpful for others; this is how I noindex tag pages to resolve the issue in this thread:

See my workaround
  1. Create src/components/NoIndexMetadata/index.tsx:
import Head from '@docusaurus/Head';
import React, {type ReactNode} from 'react';

// To solve SEO issues, see: https://github.com/facebook/docusaurus/issues/10835
export function NoIndexMetadata(): ReactNode {
  return (
    <Head>
      <meta name="robots" content="noindex" />
      {/*
        No need for nofollow:
          - Google has confirmed that noindex pages don't pass link signals anyway
          - Adding nofollow wouldn't change anything in terms of link equity
          - The internal navigation links are legitimate site structure
      */}
    </Head>
  );
}
  2. Swizzle & wrap DocTagDocListPage: npm run swizzle @docusaurus/theme-classic DocTagDocListPage -- --wrap, then add the NoIndexMetadata:
// Wrapped to solve SEO issues, see: https://github.com/facebook/docusaurus/issues/10835
// Root component of the "docs containing tag X" page.

import React, {type ReactNode} from 'react';
import DocTagDocListPage from '@theme-original/DocTagDocListPage';
import type DocTagDocListPageType from '@theme/DocTagDocListPage';
import type {WrapperProps} from '@docusaurus/types';
import { NoIndexMetadata } from '@site/src/components/NoIndexMetadata';

type Props = WrapperProps<typeof DocTagDocListPageType>;

export default function DocTagDocListPageWrapper(props: Props): ReactNode {
  return (
    <>
      <NoIndexMetadata />
      <DocTagDocListPage {...props} />
    </>
  );
}
  3. Swizzle & wrap DocTagsListPage: npm run swizzle @docusaurus/theme-classic DocTagsListPage -- --wrap, then add the NoIndexMetadata:
// Wrapped to solve SEO issues, see: https://github.com/facebook/docusaurus/issues/10835
// Root component of the tags list page

import React, {type ReactNode} from 'react';
import DocTagsListPage from '@theme-original/DocTagsListPage';
import type DocTagsListPageType from '@theme/DocTagsListPage';
import type {WrapperProps} from '@docusaurus/types';
import { NoIndexMetadata } from '@site/src/components/NoIndexMetadata';

type Props = WrapperProps<typeof DocTagsListPageType>;

export default function DocTagsListPageWrapper(props: Props): ReactNode {
  return (
    <>
      <NoIndexMetadata />
      <DocTagsListPage {...props} />
    </>
  );
}

slorber (Collaborator) commented Jan 16, 2025

Thanks for the feedback.

I think we are only using structured data for blog paginated pages, but we don't for blog authors/tags, nor for docs tags (which I haven't seen used that often in practice) or category index pages. That may explain why some of your docs tag pages are ranking higher than the actual docs pages.

It's reasonable to add more structured data that we don't have today, and see what happens to our own website's SEO and to the sites willing to adopt this in canary.
