sitemaps.py does not work with some .gz sitemaps #271
Thanks for letting me know, @internetyev. In this case the error seems to be in the sitemap itself. The full traceback for this particular sitemap shows that the error was raised by the
Double-checking here, I also found that the sitemap is invalid. So, this is more of a business decision than a coding one. Sometimes, as in this case, you are more interested in analyzing a competitor's content and don't care much whether or not the sitemap is valid. That is the price to pay for failing on invalid sitemaps. More than happy to discuss further if you have other suggestions for making this user-friendly and useful while still making sure it is correct.
@eliasdabbas thank you for the quick reply! I agree that this is more of a business decision: whether or not to parse sitemaps with errors. I've noticed that you already give warnings if a sitemap does not follow the guidelines. In this particular case, I would consider allowing erroneous sitemaps if they are still parseable (e.g. wrong HTTP header, wrongly specified encoding or file format, etc.) and the URLs can be distilled with confidence. Essentially, the same way Screaming Frog does.
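To make the suggestion concrete, here is a minimal sketch of such a lenient mode. It is not how advertools or Screaming Frog is implemented; `extract_locs` and `LOC_PATTERN` are hypothetical names. It assumes that a strict XML parse is tried first and that a plain pattern match on `<loc>` tags is an acceptable fallback, with a flag so the caller can still warn that the sitemap was invalid.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Hypothetical fallback pattern: matches un-prefixed <loc>...</loc> elements only.
LOC_PATTERN = re.compile(rb"<loc>\s*(.*?)\s*</loc>", re.IGNORECASE | re.DOTALL)


def extract_locs(xml_bytes: bytes) -> tuple[list[str], bool]:
    """Return (urls, parsed_strictly).

    parsed_strictly is False when the XML was not well-formed and the
    URLs were recovered with the regex fallback instead.
    """
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        # Lenient path: the document is not well-formed, so pull the
        # <loc> values out as text and undo the basic XML escapes.
        urls = [
            unescape(match.decode("utf-8", errors="replace"))
            for match in LOC_PATTERN.findall(xml_bytes)
        ]
        return urls, False
    # Strict path: collect every <loc> element, whatever its namespace.
    urls = [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
    return urls, True
```

The boolean lets the caller keep the current warning behavior: the URLs are returned either way, but the user can be told that they came from an invalid sitemap.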
Hi @eliasdabbas!
I got stuck on an error while trying to parse a booking.com sitemap.
sitemaps.py raised the error "not well-formed (invalid token)" when trying to process a sitemap, e.g.
https://www.booking.com/sitembk-beaches-pt-br.0000.xml.gz
I was able to work around this error by modifying the original code of sitemaps.py:
original code - lines 483-493:
modified code:
To make it work, I had to add import gzip at the beginning and introduce a new variable, file_content, because I didn't want to risk breaking anything else that might depend on the existing variables. Also, I haven't tested this code with non-gz sitemaps yet.
Definitely, the code is not optimal, but at least it got me what I wanted - to parse the booking.com .gz sitemaps with no issues.
I still don't know the exact reason why the .gz sitemap wasn't parsed properly, and would appreciate some advice.
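For anyone trying something similar, here is a minimal, self-contained sketch of the general idea. It is not the advertools code and not the exact modification described above. It assumes one possible cause, namely that a URL ending in .gz does not always deliver gzip-compressed bytes, so it checks the gzip magic number instead of trusting the file extension. `decode_sitemap_bytes` and `sitemap_urls` are hypothetical helper names.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_sitemap_bytes(raw: bytes) -> bytes:
    """Return XML bytes, decompressing only if the payload is really gzip."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def sitemap_urls(raw: bytes) -> list[str]:
    """Parse a (possibly gzipped) sitemap and return its <loc> values."""
    root = ET.fromstring(decode_sitemap_bytes(raw))
    # Compare the local tag name so the sitemap namespace does not matter.
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
```

The raw bytes could come from something like urllib.request.urlopen(url).read(). If the XML inside the archive is itself malformed, ET.fromstring will still raise a ParseError, so this only helps when the problem is in the compression layer.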