This library combines HTTP, Gumbo, and Cascadia to provide a simpler way to scrape data.
Based on tidyverse/rvest.
The package and its maintenance will be moved to TidierOrg/TidierVest.jl.
TidierOrg provides a tidyverse for Julia.
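For context, here is a minimal sketch of the kind of work the underlying libraries do when used directly. It uses only Gumbo and Cascadia on an inline HTML string, so it runs without network access. The HTML snippet and variable names are made up for illustration, and this is not necessarily how this package implements its functions internally.

```julia
using Gumbo, Cascadia

# A small inline document (illustrative only; a real page would first be
# downloaded, for example with HTTP.get, and its body converted to a String).
page = """
<html><body>
  <section><h2>The Phantom Menace</h2></section>
  <section><h2>Attack of the Clones</h2></section>
</body></html>
"""

# Gumbo parses the string into a document tree.
doc = parsehtml(page)

# Cascadia selects nodes with a CSS selector: every <h2> inside a <section>.
nodes = eachmatch(Selector("section h2"), doc.root)

# nodeText (from Cascadia) extracts the text content of each matched node.
titles = String[nodeText(n) for n in nodes]
```

The wrapper functions shown in the examples that follow save you from writing these parse, select, and extract steps by hand.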
```julia
using Harbest

starwars = read_html("https://rvest.tidyverse.org/articles/starwars.html")
titles = html_elements(starwars, ["section", "h2"]) |> html_text3
titles
# 7-element Vector{String}:
#  "The Phantom Menace"
#  "Attack of the Clones"
#  "Revenge of the Sith"
#  ⋮
#  "Return of the Jedi"
#  "The Force Awakens"
```
```julia
html = read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")
table = html_elements(html, ".tracklist") |> html_table
table
# 28×4 DataFrame
#  Row │ No.     Title                              Performer(s)                       Length
#      │ String  String                             String                             String
# ─────┼──────────────────────────────────────────────────────────────────────────────────────
#    1 │ 1.      "Everything Is Awesome"            Tegan and Sara featuring The Lon…  2:43
#    2 │ 2.      "Prologue"                                                            2:28
#    3 │ 3.      "Emmett's Morning"                                                    2:00
#    4 │ 4.      "Emmett Falls in Love"                                                1:11
#    5 │ 5.      "Escape"                                                              3:26
#    ⋮ │ ⋮       ⋮                                  ⋮                                  ⋮
#   25 │ 25.     "Everything Is Awesome"            Jo Li (Joshua Bartholomew and Li…  1:26
#   26 │ 26.     "Everything Is Awesome (unplugge…  Shawn Patterson and Sammy Allen    1:24
#   27 │ 27.     "Untitled Self Portrait"           Will Arnett                        1:08
#   28 │ 28.     "Everything Is Awesome (instrume…                                     2:41
#                                                                        19 rows omitted
```
Read a URL.
Get the elements you want from an HTML document.
Get the text. You can also use html_text2 or html_text3 for cleaner text.
Get the content of an attribute. If no attribute string is provided, it tries to get an attribute for you.
Create a DataFrame from an HTML table node.
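To illustrate what converting a table node into a DataFrame involves, here is a rough sketch written directly against Gumbo, Cascadia, and DataFrames. It assumes a simple table whose first row is the header and whose cells have no row or column spans. The inline HTML and the `cells` helper are hypothetical, and this package's own table conversion may work differently.

```julia
using Gumbo, Cascadia, DataFrames

# Illustrative inline table; the first row holds the column names.
page = "<table>" *
       "<tr><th>No.</th><th>Title</th></tr>" *
       "<tr><td>1.</td><td>Prologue</td></tr>" *
       "<tr><td>2.</td><td>Escape</td></tr>" *
       "</table>"

doc = parsehtml(page)

# Collect every table row in document order.
rows = eachmatch(Selector("tr"), doc.root)

# Hypothetical helper: the text of each element child (<th> or <td>) of a row.
cells(row) = String[nodeText(c) for c in row.children if c isa HTMLElement]

header = cells(rows[1])
body = [cells(r) for r in rows[2:end]]

# hcat stacks each row as a column, so transpose back to one row per record.
mat = permutedims(reduce(hcat, body))
df = DataFrame(mat, header)
```

Every column comes out as `String`, which matches the column types in the example output above; converting columns to numbers or durations would be a separate step.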
- I'm actively accepting suggestions