[Bug] important: non-English domains are not recognized (including punycode format) #702

Hitreno · 2024-09-24T23:23:40Z

Describe the Bug
The Firecrawl server appears to treat non-English (internationalized) domain names as invalid and does not even attempt to load them. As a result, I can't scrape links written in Cyrillic, and the same links written in punycode are rejected as well.

To Reproduce
Steps to reproduce the issue:

1. Go to https://firecrawl.dev
2. Enter the Cyrillic URL https://дом.рф
3. Get the response "Please enter a valid URL"
4. Enter the punycode URL https://xn--d1aqf.xn--p1ai
5. Get the response "Please enter a valid URL"
6. Repeat both requests through the API and get the same errors
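
For reference, the two URLs above point at the same host. A minimal sketch using only the Python standard library shows the conversion; `to_punycode_host` is a hypothetical helper name, and the built-in `idna` codec implements the older IDNA 2003 rules, which I assume is sufficient for this domain:

```python
from urllib.parse import urlsplit


def to_punycode_host(url: str) -> str:
    """Return the URL's hostname in ASCII (punycode) form."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no hostname in {url!r}")
    # The stdlib "idna" codec converts each Unicode label to its "xn--" form
    # and passes labels that are already ASCII through unchanged.
    return host.encode("idna").decode("ascii")


cyrillic = to_punycode_host("https://дом.рф")
punycode = to_punycode_host("https://xn--d1aqf.xn--p1ai")
print(cyrillic, punycode, cyrillic == punycode)
```

Both calls should yield the same ASCII hostname, so a validator that accepts one form has no obvious reason to reject the other.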

Expected Behavior
Both links are recognized as valid and processed.
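
To illustrate what "recognized as valid" could look like, here is a rough sketch of a TLD check that accepts both forms. This is not Firecrawl's actual validation code, which I have not inspected; `has_valid_tld` and the pattern `_TLD_RE` are hypothetical, and the rule itself (ASCII letters or an `xn--` label) is an assumption about what a reasonable check might allow:

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule: a TLD is either 2+ ASCII letters or a punycode "xn--" label.
_TLD_RE = re.compile(r"^(?:[a-z]{2,63}|xn--[a-z0-9-]{1,59})$")


def has_valid_tld(url: str) -> bool:
    """Return True if the URL's host ends in a plausible ASCII or IDN TLD."""
    host = urlsplit(url).hostname
    if not host or "." not in host:
        return False
    try:
        # Normalize Unicode labels to punycode before matching the TLD.
        ascii_host = host.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    tld = ascii_host.rsplit(".", 1)[-1]
    return bool(_TLD_RE.match(tld))


for candidate in ("https://дом.рф", "https://xn--d1aqf.xn--p1ai", "https://example.com"):
    print(candidate, has_valid_tld(candidate))
```

The key step is converting the hostname to its ASCII form first, so Cyrillic and punycode input go through the same check.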

Environment (please complete the following information):

https://firecrawl.dev

Logs
Example from the Python SDK:

  File "C:\Users\user\Desktop\migr_web2table.venv\Lib\site-packages\firecrawl\firecrawl.py", line 88, in scrape_url
    self._handle_error(response, 'scrape URL')
  File "C:\Users\user\Desktop\migr_web2table.venv\Lib\site-packages\firecrawl\firecrawl.py", line 391, in _handle_error
    raise requests.exceptions.HTTPError(message, response=response)
requests.exceptions.HTTPError: Unexpected error during scrape URL: Status code 400. Bad Request - [{'code': 'custom', 'message': 'URL must have a valid top-level domain or be a valid path', 'path': ['url']}]

Additional Context
We parse a large number of sites with many different domain names, so it is critical for us that all kinds of domains are recognized.

Hitreno added the "bug" (Something isn't working) label on Sep 24, 2024
