---
title: Web scraping: Extract structured data from websites
authors:
- Markus Mandalka
---
This connector integrates the open-source scraping framework Scrapy, a Python framework for ETL (extract, transform, load), so that you can build a customized crawler, parser, data scraper and converter for extracting structured data from websites.
It is meant for websites that don't yet offer their data in structured open standard formats (like RDF, XML, JSON or CSV, for which convenient and easy-to-use connectors already exist), so all data has to be extracted from HTML pages with a more complicated, site-specific scraper.
If you want to use fields other than the standard ones like title, author and content, you don't have to change the config file schema.xml (which defines the fields of the Solr index).
Our Solr server is preconfigured with dynamic fields, so you can fill standard fields like title or content as well as additional dynamic fields: yourfield_b for one boolean, yourfield_bs for multiple booleans, yourfield_s for a string, yourfield_ss for multiple strings, yourfield_t for a text and yourfield_tt for multiple texts.
Have a look at schema.xml or managed-schema to see all possible data types of dynamic fields, like boolean, string, dates, text and so on.
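
The naming convention for dynamic fields can be sketched in a few lines of Python. The helpers `dynamic_field_name` and `to_solr_document` are hypothetical and cover only the suffixes named above (`_b`, `_bs`, `_s`, `_ss`, `_t`, `_tt`). Which fields count as long text is an assumption that the caller passes in.

```python
from typing import Any, Dict, Iterable

# Standard fields named in this document; they keep their plain names.
STANDARD_FIELDS = {"title", "author", "content"}


def dynamic_field_name(base: str, value: Any, full_text: bool = False) -> str:
    """Append the dynamic-field suffix that matches the type of value."""
    multi = isinstance(value, (list, tuple))
    if multi and not value:
        raise ValueError("cannot infer a type from an empty list")
    sample = value[0] if multi else value
    if isinstance(sample, bool):
        suffix = "_bs" if multi else "_b"
    elif isinstance(sample, str):
        if full_text:
            suffix = "_tt" if multi else "_t"
        else:
            suffix = "_ss" if multi else "_s"
    else:
        # Other types (dates, numbers, ...) are listed in the schema file.
        raise TypeError(f"no suffix known in this sketch for {type(sample).__name__}")
    return base + suffix


def to_solr_document(scraped: Dict[str, Any], text_fields: Iterable[str] = ()) -> Dict[str, Any]:
    """Rename non-standard keys of a scraped item to dynamic field names."""
    text_fields = set(text_fields)
    doc = {}
    for key, value in scraped.items():
        if key in STANDARD_FIELDS:
            doc[key] = value
        else:
            doc[dynamic_field_name(key, value, key in text_fields)] = value
    return doc
```

For example, a scraped item with the keys `title`, `keywords` (a list of strings) and `in_stock` (a boolean) would end up with the field names `title`, `keywords_ss` and `in_stock_b`.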
If you use your own fields instead of preconfigured ones like tags (field name tag_ss) and want them to serve not only for finding data but also as interactive filters (facets) for navigation, you have to enable them as facets.
To enable your own additional fields as facets in the user interface, just map the technical Solr field names to user-friendly labels in the config of the user interface with the option $cfg[facet].