Support for new Notion URL format #69

Open
bhchiang opened this issue Jul 21, 2021 · 20 comments
Labels
enhancement New feature or request

Comments

@bhchiang
Contributor

Not 100% sure, but I believe the URL format for Notion shared pages recently changed.

It's now notion.site instead of notion.so:

Editing view: https://www.notion.so/bryanchiang/Bryan-Chiang-fc01c67a1ed9402e83eb8efd5c99a216
Shared view: https://bryanchiang.notion.site/Bryan-Chiang-fc01c67a1ed9402e83eb8efd5c99a216

I get a parser error with the second one.

Ito-MacBook:loconotion bryanhpchiang$ python3 loconotion https://www.notion.so/bryanchiang/fc01c67a1ed9402e83eb8efd5c99a216
[23:09:54] INFO Initialising parser with simple page url
[23:09:54] INFO Setting output path to 'dist/bryanchiang/fc01c67a1ed9402e83eb8efd5c99a216'
[23:09:54] INFO Initialising chromedriver at /usr/local/lib/python3.9/site-packages/chromedriver_autoinstaller/91/chromedriver
[23:09:56] INFO Parsing page 'https://www.notion.so/bryanchiang/fc01c67a1ed9402e83eb8efd5c99a216'
[23:10:57] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/__main__.py", line 144, in <module>
    main()
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/__main__.py", line 123, in main
    Parser(config=config, args=vars(args))
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/notionparser.py", line 85, in __init__
    self.run(url)
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/notionparser.py", line 667, in run
    f"Finished!\n\nProcessed {len(tot_processed_pages)} pages in {formatted_time}"
TypeError: object of type 'NoneType' has no len()

I will try modifying the check for a valid notion.so website.
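A minimal sketch of what such a check could look like, assuming the goal is to accept both the old notion.so host and the new `<workspace>.notion.site` hosts. The function name `is_notion_url` is hypothetical and is not taken from loconotion's code.

```python
from urllib.parse import urlsplit

# Hypothetical helper for illustration; not loconotion's actual validation code.
def is_notion_url(url: str) -> bool:
    """Return True for notion.so URLs and for <workspace>.notion.site URLs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Old format: notion.so or www.notion.so
    if host in ("notion.so", "www.notion.so"):
        return True
    # New shared format: a non-empty workspace label followed by .notion.site
    suffix = ".notion.site"
    return host.endswith(suffix) and len(host) > len(suffix)
```

Comparing the parsed hostname, rather than searching the whole URL for a substring, avoids accepting unrelated URLs that merely mention notion.site in their path.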

@tomreitz

@bryanhpchiang it looks like the page you linked isn't publicly shared... unless you recently un-shared it, that would explain why loconotion cannot load the content.

There's definitely an issue here though: shared pages have the new Notion URL format of https://example.notion.site/Page-1F29BC48EA1A029FC481B, but sub-pages still have the old format https://notion.so/example/Page-1F29BC48EA1A029FC481B, which is a redirect page.

For me, loconotion correctly loads the primary public Notion page URL, but times out on any subpages. I think the logic at lines 582-584 of loconotion/notionparser.py needs to be updated to rewrite page URLs from the old format to the new one before attempting to fetch them.

@leoncvlt
Owner

@tomreitz I merged a pull request from @bryanhpchiang earlier today which should address this. Want to pull it and check it's all good?

@tomreitz

@leoncvlt thanks for the quick response (and an awesome project!). Subpages are still not working for me: this public page converts fine, but the subpages in the table time out, per the logs below.

[21:47:02] INFO Initialising parser with configuration file
[21:47:02] INFO Setting output path to 'dist/wiwebsites.com'
[21:47:02] INFO Initialising chromedriver at /usr/bin/chromedriver
[21:47:03] INFO Parsing page 'https://tomreitz.notion.site/Wisconsin-Websites-ecdb3dc4cd1e40f280b7512a23ca2006'
[21:47:17] INFO Downloading 'https://www.notion.so/print.b31f28aa.css'
[21:47:17] INFO Downloading 'https://www.notion.so/app-7d82edb35207a8a8b776.css'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-regular-3be84b20b1d9ff1e3456b0a220ae449b.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-regular-italic-437d32a42fc5b8268bb4a1e0cc8b363f.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-semibold-acb7f110189034ff6a1afa4b730be0ed.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-semibold-italic-1f81a2f93060f05edd7f078ac91f25e6.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-regular-4b73d071988a4f1cd2283524716ad970.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-italic-d5d3224c1377168e261efc6aa0ce89c6.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-bold-eb96a5e539892d26cf8b0cb2367e3580.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-bold-italic-743b231fa82483406c79a00fa1f12fe8.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-regular-3ae6a7d3890c33d857fc00bd2e4c4820.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-medium-95b8a98959d1af9ab432d7ffe295ef94.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-semibold-19b57197b819695d334b9961ee41910e.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-bold-001893789f7f342b520f29ac8af7d6ca.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/permanent-marker-a6d62939e7c920a184ddddcf4149e62c.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/katex.88defe76.min.css'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_AMS-Regular.342a61e0.ttf'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Caligraphic-Bold.b27e354b.ttf'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Caligraphic-Regular.bd18bae2.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Fraktur-Bold.359e1e97.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Fraktur-Regular.6b53a2db.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-Bold.ed829b5f.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-BoldItalic.ca23ba4b.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-Italic.14ff9c98.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-Regular.c89c6436.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Math-BoldItalic.7b481bb8.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Math-Italic.f677173e.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_SansSerif-Bold.362d94c6.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_SansSerif-Italic.2c742978.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_SansSerif-Regular.6087fc04.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Script-Regular.781730b2.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size1-Regular.54a80b37.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size2-Regular.24cbe093.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size3-Regular.ee3e5bf4.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size4-Regular.b78c75bb.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Typewriter-Regular.90f78c10.ttf'
[21:47:19] INFO Exporting page 'https://tomreitz.notion.site/Wisconsin-Websites-ecdb3dc4cd1e40f280b7512a23ca2006' as 'index.html'
[21:47:19] INFO Parsing page 'https://www.notion.so/7514e88c4042418997665b5ecf11733b?v=703812ea01fe4ee6bc010fd72be278f8'
[21:48:20] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:48:20] INFO Parsing page 'https://www.notion.so/80f1c747841641e2a729fb0286390da2'
[21:49:21] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:49:21] INFO Parsing page 'https://www.notion.so/e861fdd6a0c247ca8bad342d2cdb05b6'
[21:50:22] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:50:22] INFO Parsing page 'https://www.notion.so/d5c95ef2e77349e98691b8925de7d119'
[21:51:23] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:51:23] INFO Finished!

Processed 1 pages in 00:04:19

If you go to a subpage directly, you'll see that it is public, but is a redirect page from Notion.

@bhchiang
Contributor Author

Thanks for pointing that out - my PR doesn't handle subpages. When parsing the subpages (sub_page_href), the www.notion.so part should be replaced with {site_name}.notion.site.

I tried a quick fix but there are a few edge cases in the code that I am probably missing, so not submitting a PR yet.
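The replacement described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the fix from any PR: the helper name `to_site_url` is hypothetical, `site_name` is assumed to be known from the root URL, and a leading `/{site_name}/` path segment on an old-format link is assumed to be the workspace name and is dropped.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOSTS = {"notion.so", "www.notion.so"}

# Hypothetical helper for illustration; not part of loconotion.
def to_site_url(sub_page_href: str, site_name: str) -> str:
    """Rewrite an old-format notion.so link to {site_name}.notion.site."""
    parts = urlsplit(sub_page_href)
    if (parts.hostname or "").lower() not in OLD_HOSTS:
        # Already new-format, or not a Notion link at all: leave untouched.
        return sub_page_href
    path = parts.path
    # Assumption: old links may carry the workspace as the first path segment.
    prefix = f"/{site_name}/"
    if path.startswith(prefix):
        path = path[len(prefix) - 1:]  # keep the leading slash
    return urlunsplit(
        ("https", f"{site_name}.notion.site", path, parts.query, parts.fragment)
    )
```

The query string is preserved so that database view links such as `?v=...` survive the rewrite. Whether every rewritten URL resolves without a redirect is not verified here; those are the edge cases mentioned above.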

@leoncvlt leoncvlt added the enhancement New feature or request label Aug 4, 2021
@EveraertJan

Hi,

I'm currently running into the same issue. Is there any fix available?

@specbug

specbug commented Aug 16, 2021

> Thanks for pointing that out - my PR doesn't handle subpages. When parsing the subpages (sub_page_href), the www.notion.so part should be replaced with {site_name}.notion.site.
>
> I tried a quick fix but there are a few edge cases in the code that I am probably missing, so not submitting a PR yet.

@bryanhpchiang can you post the partial fix here? Others can try it out and help in fixing the edge cases.

@joshkmartinez

I'm having this issue as well. Would appreciate the partial fix if possible @bryanhpchiang

@PiktCai

PiktCai commented Aug 19, 2021

I am not a developer, but there is a quick way to make it work. Simply editing the URLs at lines 582-584 of loconotion/notionparser.py does the trick.
Before editing:

            if sub_page_href.startswith("/"):
                sub_page_href = "https://www.notion.so" + a["href"]
            if sub_page_href.startswith("https://www.notion.so/"):
                if parse_links or not len(a.find_parents("div", class_="notion-scroller")):

After editing:

            if sub_page_href.startswith("/"):
                sub_page_href = "https://xxxx.notion.site" + a["href"]
            if sub_page_href.startswith("https://xxxx.notion.site/"):
                if parse_links or not len(a.find_parents("div", class_="notion-scroller")):

When running the program, I used python loconotion https://xxxx.notion.site/xxxx/{page-id}.
Hope this helps.
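To avoid hardcoding the workspace name, the same two checks could derive the base from whatever root URL was passed on the command line. A minimal sketch, assuming a hypothetical helper `resolve_sub_page_href` that receives the raw `href` and the root URL (neither the helper nor its parameters come from loconotion):

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical helper for illustration; not part of loconotion.
def resolve_sub_page_href(href: str, root_url: str) -> Optional[str]:
    """Return an absolute sub-page URL on the root page's host, else None."""
    root = urlsplit(root_url)
    base = f"{root.scheme}://{root.netloc}"  # e.g. https://xxxx.notion.site
    # Relative links ("/Page-...") are resolved against the root page's host.
    if href.startswith("/"):
        href = base + href
    # Only links on the same host are treated as sub-pages.
    if href.startswith(base + "/"):
        return href
    return None
```

With a www.notion.so root URL this behaves like the original two checks, and with an xxxx.notion.site root URL it behaves like the edited ones.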

@sunz1e
Contributor

sunz1e commented Aug 23, 2021

Created a PR to use the new custom Notion URL format https://xxxx.notion.site instead of the default one.
I also saw an issue where a subfolder is expected for links of the format https://xxxx.notion.site/xxxx (I hit this while parsing my website). Fixed that as well.

@sunz1e
Contributor

sunz1e commented Aug 25, 2021

@bryanhpchiang could you please pull the PR and verify that it's working for you as well?

@bhchiang
Contributor Author

bhchiang commented Aug 25, 2021

@meSunnySrivastava

Thanks for putting together this PR. Confirming that it parsed the subpages on my website.

The only issue is that bullet points are now missing.

image

EDIT:
I see that this was supposed to be fixed by #73, and that your PR merged those changes as well.

EDIT:
Deleting my dist/ + regenerating fixed the issue. The PR looks good to me, thanks!

@sunz1e
Contributor

sunz1e commented Aug 26, 2021

Sorry I had to close the old PR because I pushed to my master directly. :)

@leoncvlt
Owner

leoncvlt commented Sep 7, 2021

PR has been merged, thanks all!

@leoncvlt leoncvlt closed this as completed Sep 7, 2021
@jamesdeluk

I'm still getting the timeout issue. Exact same as the original post above.

The page is set to public:

image

The link is https://jamesdeluk.notion.site/James-IT-Notes-9969909992c04b5ba3a734cdf0a74530

(The Copy Link button gives https://www.notion.so/jamesdeluk/James-IT-Notes-9969909992c04b5ba3a734cdf0a74530, which forwards to the above).

@leoncvlt leoncvlt reopened this Sep 14, 2021
@jamesdeluk

Thought I'd try this again with the new Notion update. A couple things:

Trying to access the .site page itself fails:

image

And webdrive.log loops this:

[1632288655.436][INFO]: Waiting for pending navigations...
[1632288655.437][INFO]: Done waiting for pending navigations. Status: ok
[1632288655.445][INFO]: Waiting for pending navigations...
[1632288655.447][INFO]: Done waiting for pending navigations. Status: ok
[1632288655.447][INFO]: [edc259a3fc220da0c2d6ba0789803d04] RESPONSE FindElements [  ]
[1632288655.957][INFO]: [edc259a3fc220da0c2d6ba0789803d04] COMMAND FindElements {
   "using": "css selector",
   "value": ".notion-presence-container"
}

@leoncvlt
Owner

Well, that's not gonna work regardless because you're not logged in, so the script is unable to find the notion-presence-container div which is present on every notion page - it's gonna work with public pages only.
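For readers unfamiliar with this kind of wait, here is a rough stdlib-only sketch of polling for an element until a timeout expires. It is not loconotion's code: `wait_for_elements` and its `find_elements` callable are hypothetical stand-ins for a WebDriver lookup of `.notion-presence-container`, and the 60 second timeout and 0.5 second interval are assumptions loosely based on the timestamps in the logs in this thread.

```python
import time
from typing import Callable, Optional

# Hypothetical helper for illustration; a real script would more likely rely
# on Selenium's own wait utilities.
def wait_for_elements(
    find_elements: Callable[[], list],
    timeout: float = 60.0,   # assumed; the logs show roughly a minute per timeout
    interval: float = 0.5,   # assumed polling interval
) -> Optional[list]:
    """Poll find_elements() until it returns a non-empty list or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        found = find_elements()
        if found:
            return found
        if time.monotonic() >= deadline:
            # Signal the timeout explicitly; callers must check for None.
            return None
        time.sleep(interval)
```

Returning `None` on timeout means the caller has to handle that case before using the result, which is the sort of unchecked value behind the `NoneType` error in the original report.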

@jamesdeluk

That's my confusion though. I am logged in, and the page is public.

@leoncvlt
Owner

leoncvlt commented Mar 6, 2022

Just checking @leshchenko1979, is this fixed by #92?

@2m
Contributor

2m commented Mar 23, 2022

I am using current master version of loconotion with the new style URLs and it seems to work fine: https://github.com/2m/nemunasring/blob/main/nemunasring.toml#L2

@sueszli

sueszli commented Mar 27, 2023

Since Notion updated all URLs for hosted pages (see: #134) this ticket is no longer an enhancement, but a permanent bug.

We resolved it in our fork here: https://github.com/sueszli/notionSnapshot/
