Web Crawler
This section describes how to use a web crawler within MindsDB.
A web crawler is an automated program that browses the internet and navigates through websites and web pages to gather data. In MindsDB, you can use a web crawler to collect data for training models, building domain-specific chatbots, or fine-tuning LLMs.
Prerequisites
Before proceeding, ensure the following prerequisites are met:
- Install MindsDB locally via Docker or use MindsDB Cloud.
- To connect Web Crawler to MindsDB, install the required dependencies by following the handler installation instructions.
- Install or ensure access to Web Crawler.
Connection
This handler does not require any connection parameters.
Here is how to initialize a web crawler:
CREATE DATABASE my_web
WITH ENGINE = 'web';
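To confirm that the data source was created, one option is to list the connected databases and the tables the new connection exposes. This is a minimal sketch that assumes the my_web name used above; the exact output depends on your MindsDB version.

```sql
-- List all connected data sources; my_web should appear in the result.
SHOW DATABASES;

-- List the tables exposed by the my_web connection.
SHOW TABLES FROM my_web;
```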
Usage
Get Websites Content
Here is how to get the content of docs.mindsdb.com:
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
You can also get the content of internal pages. The LIMIT clause controls how many pages are crawled. Here is how to fetch the content from 10 internal pages:
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
Another option is to get the content from multiple websites:
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
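If you only need the page text, you can select individual columns instead of using SELECT *. The sketch below assumes the crawler table exposes columns named url and text_content. The handler's schema is not stated in this section and may differ between versions, so run a SELECT * query first to check the actual column names.

```sql
-- Assumed column names: url, text_content. Verify them with SELECT * first.
SELECT url, text_content
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 5;
```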
Get PDF Content
MindsDB accepts file uploads of csv, xlsx, xls, sheet, json, and parquet. However, you can utilize the web crawler to fetch data from pdf files.
SELECT *
FROM my_web.crawler
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
For example, you can provide a link to a pdf file stored in Amazon S3.
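As an illustration, a query against an S3-hosted file might look like the following. The bucket name and object key are hypothetical placeholders. The sketch also assumes the file is reachable over HTTPS without extra credentials, for example through a public object or a presigned URL.

```sql
-- Hypothetical bucket and object key; replace with the URL of your own pdf file.
SELECT *
FROM my_web.crawler
WHERE url = 'https://example-bucket.s3.amazonaws.com/reports/example.pdf'
LIMIT 1;
```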