Saving scraped items in a feed #147

Open
runa opened this issue Jun 2, 2023 · 1 comment
Labels
more info needed: original poster should provide more details to allow us to identify the problem

Comments


runa commented Jun 2, 2023

Hi! Thanks for your work on ScrapyRT!

I've discovered that spiders served by ScrapyRT don't save their output to the feeds configured under FEEDS in the spider's custom_settings. Is it possible to change this behavior so that spiders served by ScrapyRT respect this setting?

Thanks!

Member

pawelmhm commented Feb 23, 2024

@runa can you add some sample code that reproduces this, along with more details? I tested with this simple spider:

import scrapy


class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]
    custom_settings = {
        'FEEDS': {
            'items.json': {
                'format': 'json'
            }
        }
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract()
            }

        next_page_url = response.css("li.next > a::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

and when it is scheduled with ScrapyRT:

curl --location 'http://localhost:9080/crawl.json' \
--header 'Content-Type: application/json' \
--data '{
    "request": {
        "url": "https://quotes.toscrape.com/"
    },
    "spider_name": "toscrape-css"
}'
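
For anyone reproducing this from Python rather than curl, here is a minimal sketch of the same request using only the standard library. The helper names `build_crawl_request` and `run_crawl` are hypothetical (they are not part of ScrapyRT), and the endpoint assumes ScrapyRT is running locally on port 9080, as in the curl call above.

```python
import json
import urllib.request

# Assumption: ScrapyRT is listening on localhost:9080, as in the curl call above.
CRAWL_ENDPOINT = "http://localhost:9080/crawl.json"


def build_crawl_request(spider_name, url, endpoint=CRAWL_ENDPOINT):
    """Build (but do not send) the POST request that the curl command issues."""
    payload = {"request": {"url": url}, "spider_name": spider_name}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_crawl(spider_name, url, timeout=60):
    """Send the request and return the decoded JSON response body."""
    request = build_crawl_request(spider_name, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs a running ScrapyRT instance, so it is left commented out):
# result = run_crawl("toscrape-css", "https://quotes.toscrape.com/")
# print(result.get("status"), len(result.get("items", [])))
```

The items should come back in the HTTP response regardless of any feed settings, so a successful response on its own does not show whether FEEDS was respected; the feed file has to be checked separately.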

I see that an items.json file is generated in the spider project's directory. Is there a specific feed that is failing for you?
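
To check the result without opening the file by hand, a small helper along these lines could be used. It is only a sketch: `load_feed_items` is a hypothetical name, and it assumes the feed is the `items.json` file from the spider above, written as a single JSON array. A relative path is resolved against the current working directory, so the script should be run from the directory in which ScrapyRT was started.

```python
import json
from pathlib import Path


def load_feed_items(feed_path="items.json"):
    """Return the items stored in a JSON feed file, or None if the file is absent."""
    path = Path(feed_path)
    if not path.is_file():
        # No file at this location: the feed was not written here.
        return None
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        # The file exists but holds no data.
        return []
    # Raises json.JSONDecodeError if the file is not one valid JSON document,
    # which can happen if several runs appended to the same local file.
    return json.loads(text)


# Example usage after a crawl has finished:
# items = load_feed_items("items.json")
# print("feed missing" if items is None else f"{len(items)} items in feed")
```

A return value of `None` distinguishes "no feed file was written at this path" from "the feed was written but is empty", which is the detail that matters for this report.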

pawelmhm added the "more info needed" label Mar 4, 2024