---
title: "Web scraping: Extract structured data from websites"
authors:  
    - Markus Mandalka
---

Web scraping: Extract structured data from websites

Import structured data scraped from websites to the search server

This connector integrates the open-source scraping framework Scrapy, a Python framework for ETL (extract, transform and load), so that you can build a customized crawler, parser, data scraper and converter for extracting structured data from websites.

It is meant for websites that do not yet offer their data in structured open standard formats such as RDF, XML, JSON or CSV (for which comfortable, easy-to-use connectors already exist), so that all data has to be extracted from HTML pages with a more complicated, site-specific scraper.
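To illustrate what "extracting structured data from HTML" means in the simplest case, here is a minimal sketch. It deliberately uses Python's standard library `html.parser` instead of Scrapy so that it runs without any installation; a real scraper for this connector would be written as a Scrapy spider. The class name, the function name, the sample page and the choice of `<title>` and `<h2>` as the data of interest are all assumptions made for this example.

```python
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collect the <title> text and all <h2> headings of one HTML page."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag whose text is currently being collected
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h2"):
            self._current = tag
            if tag == "h2":
                # Start a new, still empty heading entry
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h2":
            self.headings[-1] += data


def extract_record(html):
    """Turn one HTML page into a small structured record (hypothetical helper)."""
    parser = TitleAndHeadingParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "headings": [heading.strip() for heading in parser.headings],
    }


# Made-up sample page, used only for illustration
SAMPLE_HTML = """
<html><head><title> Annual report 2020 </title></head>
<body><h2>Income</h2><p>...</p><h2>Expenses</h2></body></html>
"""

record = extract_record(SAMPLE_HTML)
print(record)
```

The point of the sketch is the shape of the result: an unstructured page goes in, a dictionary of named fields comes out, and that dictionary is what gets imported into the search index.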

Where to write the data: Solr dynamic fields

If you want to use fields other than the standard fields like title, author and content, you don't have to change the config file schema.xml (which defines the fields of the Solr index).

Our Solr server is preconfigured with dynamic fields, so you can fill standard fields like title or content as well as additional dynamic fields: yourfield_b for one boolean, yourfield_bs for several booleans, yourfield_s for one string, yourfield_ss for several strings, yourfield_t for one text and yourfield_tt for several texts.

See schema.xml or managed-schema for all available data types of dynamic fields, such as boolean, string, date and text.
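The naming scheme for these dynamic fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `dynamic_field_name`, the field names `reviewed`, `category` and `summary`, and the sample values are hypothetical; only the suffixes `_b`, `_bs`, `_s`, `_ss`, `_t` and `_tt` come from the description above.

```python
# Suffixes of the dynamic fields named in the text above.
# Key: (data type, multi-valued?)
SUFFIXES = {
    ("boolean", False): "_b",
    ("boolean", True): "_bs",
    ("string", False): "_s",
    ("string", True): "_ss",
    ("text", False): "_t",
    ("text", True): "_tt",
}


def dynamic_field_name(name, kind, multi=False):
    """Build a dynamic field name such as 'category_ss' (hypothetical helper)."""
    try:
        return name + SUFFIXES[(kind, multi)]
    except KeyError:
        raise ValueError(f"unsupported field type: {kind!r}") from None


# A scraped record using two standard fields plus some dynamic fields.
# All field names and values here are made up for illustration.
doc = {
    "title": "Annual report 2020",
    "content": "Full text of the page ...",
    dynamic_field_name("reviewed", "boolean"): True,
    dynamic_field_name("category", "string", multi=True): ["finance", "report"],
    dynamic_field_name("summary", "text"): "Short summary of the report.",
}

print(sorted(doc))
```

Because the suffix already tells Solr the data type and whether the field is multi-valued, a record like `doc` can be indexed without touching schema.xml.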

Enable new fields or facets in the user interface

If you used your own fields instead of preconfigured fields like tags (field name tag_ss) and want to use them not only to find data but also as interactive filters (facets) for navigation, you have to enable them in the user interface: map the technical Solr field names to user-friendly labels in the config of the user interface with the option $cfg[facet].