DocsScraper.jl

Efficient RAG knowledge pack creator from online Julia documentation
Author: JuliaGenAI
Popularity: 6 stars
Last updated: 4 months ago
Started in: May 2024



DocsScraper is a package designed to create "knowledge packs" from online documentation sites for the Julia language.

It scrapes and parses the given URLs and, with the help of PromptingTools.jl, creates an index of chunks and their embeddings that can be used in RAG applications. It integrates with AIHelpMe.jl and PromptingTools.jl for query retrieval, so that the generated responses are grounded in the content of the created knowledge pack.

Features

  • URL Scraping and Parsing: Automatically scrapes and parses input URLs to extract relevant information, paying particular attention to code snippets and code blocks. Chunk sizes are customizable.
  • URL Crawling: Optionally crawls the input URLs to look for multiple pages in the same domain.
  • Knowledge Index Creation: Leverages PromptingTools.jl to create embeddings with a customizable embedding model, size, and type (Bool or Float32).

Installation

To install DocsScraper, use the Julia package manager with the repository URL (the package is not registered yet):

using Pkg
Pkg.add(url="https://github.com/JuliaGenAI/DocsScraper.jl")

Prerequisites:

  • Julia (version 1.10 or later).
  • Internet connection for API access.
  • OpenAI API keys with available credits. See How to Obtain API Keys.
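
Before building an index, it can help to confirm that an API key is visible to the Julia session. The following is a minimal sketch, assuming the key is supplied through the `OPENAI_API_KEY` environment variable (the variable PromptingTools.jl reads); the helper name `has_openai_key` is hypothetical and not part of DocsScraper.

```julia
# Hypothetical helper (not part of DocsScraper): returns true when a
# non-empty OpenAI key is present in `env`. `env` defaults to the process
# environment, but any dictionary-like object with string keys works.
has_openai_key(env = ENV) = !isempty(strip(get(env, "OPENAI_API_KEY", "")))

if !has_openai_key()
    @warn "OPENAI_API_KEY is not set; creating embeddings will likely fail."
end
```

The key can also be set for the current session with `ENV["OPENAI_API_KEY"] = "<your key>"` before loading the package.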

Building the Index

using DocsScraper
crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]

index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true, target_path="knowledge_packs")
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev/home/
[ Info: robots.txt unavailable for https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping link: https://juliagenai.github.io:/DocsScraper.jl/dev
. . .
[ Info: Processing https://juliagenai.github.io:/DocsScraper.jl/dev...
[ Info: Parsing URL: https://juliagenai.github.io:/DocsScraper.jl/dev
[ Info: Scraping done: 44 chunks
[ Info: Removed 0 short chunks
[ Info: Removed 1 duplicate chunks
[ Info: Created embeddings for docsscraper. Cost: $0.001
a docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: ARTIFACT: docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.tar.gz
┌ Info: sha256:
└   sha = "977c2b9d9fe30bebea3b6db124b733d29b7762a8f82c9bd642751f37ad27ee2e"
┌ Info: git-tree-sha1:
└   git_tree_sha = "eca409c0a32ed506fbd8125887b96987e9fb91d2"
[ Info: Saving source URLS in Julia\knowledge_packs\docsscraper\docsscraper_URL_mapping.csv      
"Julia\\knowledge_packs\\docsscraper\\Index\\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5"

make_knowledge_packs is the entry point to the package. This function takes the URLs to parse, builds the index, and returns the path to the saved index file (as shown in the last line of the output above). This path can be passed to AIHelpMe.jl to answer queries on the built knowledge packs.

Default make_knowledge_packs Parameters:

  • Default embedding type is Float32. Switch to Bool with the optional parameter: embedding_bool = true.
  • Default embedding size is 3072. Set a custom size with the optional parameter: embedding_dimension = custom_dimension.
  • Default model being used is OpenAI's text-embedding-3-large.
  • Default max chunk size is 384 and min chunk size is 40. Change by the optional parameters: max_chunk_size = custom_max_size and min_chunk_size = custom_min_size.

Note: For everyday use, embedding size = 1024 and embedding type = Bool is sufficient. This is compatible with AIHelpMe's :bronze and :silver pipelines (update_pipeline(:bronze)). For better results, use embedding size = 3072 and embedding type = Float32. This requires the :gold pipeline (see ?RAG_CONFIGURATIONS for more details).
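
The two configurations in the note can be kept as reusable keyword presets. This is a small sketch, not part of the package: the names `EVERYDAY`, `BEST`, and `pack_settings` are hypothetical, and only the keyword names (embedding_dimension, embedding_bool, max_chunk_size, min_chunk_size) come from the parameter list above.

```julia
# Hypothetical presets mirroring the two setups described in the note.
const EVERYDAY = (; embedding_dimension = 1024, embedding_bool = true)   # :bronze / :silver
const BEST     = (; embedding_dimension = 3072, embedding_bool = false)  # :gold

# Combine a preset with optional overrides; later keywords win on conflict.
pack_settings(preset; overrides...) = (; preset..., overrides...)

settings = pack_settings(EVERYDAY; max_chunk_size = 256)

# The result can then be splatted into the call shown earlier, e.g.:
#   make_knowledge_packs(crawlable_urls; index_name = "docsscraper", settings...)
```

Keeping the presets as named tuples means the same settings can be reused across several knowledge packs without repeating the keyword list.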

Using the Index for Questions

using AIHelpMe
using AIHelpMe: pprint, load_index!

# set it as the "default" index, then it will be automatically used for every question
load_index!(index_path)

aihelp("what is DocsScraper.jl?") |> pprint
[ Info: Updated RAG pipeline to `:bronze` (Configuration key: "textembedding3large-1024-Bool").
[ Info: Loaded index from packs: julia into MAIN_INDEX
[ Info: Loading index from Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5
[ Info: Loaded index a file Julia\DocsScraper.jl\docsscraper\Index\docsscraper__v20240823__textembedding3large-1024-Bool__v1.0.hdf5 into MAIN_INDEX
[ Info: Done with RAG. Total cost: $0.009
--------------------
AI Message
--------------------
DocsScraper.jl is a Julia package designed to create a vector database from input URLs. It scrapes and parses the URLs and, with the assistance of      
PromptingTools.jl, creates a vector store that can be utilized in RAG (Retrieval-Augmented Generation) applications. DocsScraper.jl integrates with     
AIHelpMe.jl and PromptingTools.jl to provide efficient and relevant query retrieval, ensuring that the responses generated by the system are specific to the content in the created database.

Tip: Use pprint for nicer outputs with sources and last_result for more detailed outputs (with sources).

using AIHelpMe: last_result
# last_result() returns the last result from the RAG pipeline, i.e., the same as running aihelp(; return_all=true)
print(last_result())

Output

make_knowledge_packs creates the following files:

index_name\
│
├── Index\
│   ├── index_name__artifact__info.txt
│   ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
│   └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz 
│
├── Scraped_files\
│   ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
│   ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
│   └── . . .
│
└── index_name_URL_mapping.csv

  • Index: contains the .hdf5 and .tar.gz files along with artifact__info.txt. The artifact info contains the sha256 and git-tree-sha1 hashes.
  • Scraped_files: contains the scraped chunks and sources, separated by the hostnames of the URLs.
  • URL_mapping.csv: maps each scraped URL to its estimated package name.
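
Since the artifact info records a sha256 hash, a downloaded or copied .tar.gz can be checked against it. Below is a minimal sketch using Julia's SHA standard library; the helper names `file_sha256` and `matches_sha256` are hypothetical, and the expected hash is assumed to be copied by hand from the build log or the artifact info file (its exact layout is not described here).

```julia
using SHA  # standard library

# Hypothetical helper: hex-encoded SHA-256 digest of the file at `path`.
file_sha256(path::AbstractString) = bytes2hex(open(sha256, path))

# Hypothetical helper: compare a file's digest with an expected hex string,
# ignoring case and surrounding whitespace in the expected value.
matches_sha256(path::AbstractString, expected::AbstractString) =
    file_sha256(path) == lowercase(strip(expected))
```

For example, `matches_sha256(tarball_path, "977c2b9d...")` would be called with the full hash printed during the build, where `tarball_path` is wherever the .tar.gz was saved.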

Google Summer of Code 2024

This project was developed as part of the Google Summer of Code (GSoC) program. GSoC is a global program that offers student developers stipends to write code for open-source projects. We are grateful for the support and opportunity provided by Google and the open-source community through this initiative.