Scrape a Website on a Schedule with Script Kit

Instructor: John Lindquist

When you want to collect news sources, airline ticket prices, or any events from sites that don't offer APIs, you can use scrapers to grab elements from the page.

Script Kit includes a scrapeSelector() helper that takes the URL you want to scrape and the selector you want from the page. Add the // Schedule metadata comment and Script Kit will also run the script in the background on a cron schedule, collecting the data for you.
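
The // Schedule value is a standard five-field cron expression. As a rough illustration of how to read the "0 11 * * *" used in this lesson's script, here is a minimal sketch in plain JavaScript. The describeCron function is a hypothetical helper written for this example, not part of Script Kit:

```javascript
// Hypothetical helper (not part of Script Kit): label the five fields
// of a standard cron expression so the schedule is easier to read.
function describeCron(expression) {
  let fields = expression.trim().split(/\s+/)
  if (fields.length !== 5) {
    throw new Error(`Expected 5 cron fields, got ${fields.length}`)
  }
  let [minute, hour, dayOfMonth, month, dayOfWeek] = fields
  return { minute, hour, dayOfMonth, month, dayOfWeek }
}

// minute "0", hour "11", and "*" (any value) for the remaining fields,
// so the script runs once a day at 11:00
console.log(describeCron("0 11 * * *"))
```

Changing the second field to "*/6", for example, would request a run every six hours instead of once a day.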

Install scrape-tech-news

// Name: Scrape Tech News
// Schedule: 0 11 * * *

import "@johnlindquist/kit"

let h3s = await scrapeSelector(
  "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen",
  "h3"
)

let filePath = home("tech.md")
await ensureFile(filePath)
let contents =
  `

## ${new Date()}

` + h3s.map(h3 => `### ${h3}`).join("\n")
await appendFile(filePath, contents)
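
The string-building step above can be pulled into a small pure function, which makes it easy to check without running a scrape. This is only a sketch: formatSection is a hypothetical name, and the trimming and de-duplication are an assumption about what a scraped page might return (blank or repeated headlines), not something the original script does.

```javascript
// Hypothetical helper: build the Markdown section that gets appended to tech.md.
// Trimming, dropping blanks, and de-duplicating are assumptions added for this sketch.
function formatSection(headlines, date = new Date()) {
  let unique = [...new Set(headlines.map(h => h.trim()).filter(Boolean))]
  return `\n\n## ${date}\n\n` + unique.map(h => `### ${h}`).join("\n")
}

// Example with made-up headlines and a fixed date string
console.log(
  formatSection(["First headline", "First headline", " Second headline "], "2024-01-01")
)
```

In the script, the result would be passed to appendFile in place of the contents variable.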

Christine Wilks
~ 3 years ago

Strange, I couldn't get this to work with Google tech news: "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSFFpZ0FQAQ?hl=en-GB&gl=GB&ceid=GB%3Aen". But it worked fine with "reddit.com".

John Lindquist (instructor)
~ 3 years ago

I imagine it's timing out. You can try increasing the timeout to 10 seconds (it defaults to 5):

let h3s = await scrapeSelector(
  "https://news.google.com/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB?hl=en-US&gl=US&ceid=US%3Aen",
  "h3",
  el => el.innerText,
  {
    timeout: 10000,
  }
)
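
If a single longer timeout is still not enough, one option is to retry with progressively longer ones. The sketch below is plain JavaScript and makes two assumptions: that a timed-out scrape either rejects or returns an empty array, and that the caller wraps the scrapeSelector call in a function taking the timeout in milliseconds. The scrapeWithLongerTimeouts helper and its default timeouts are hypothetical, not part of Script Kit.

```javascript
// Hypothetical helper: call `scrape` with each timeout in turn until one
// attempt returns at least one result.
// `scrape` is any async function that accepts a timeout in milliseconds, e.g.
//   timeout => scrapeSelector(url, "h3", el => el.innerText, { timeout })
async function scrapeWithLongerTimeouts(scrape, timeouts = [5000, 10000, 20000]) {
  let lastError = new Error("No timeouts were provided")
  for (let timeout of timeouts) {
    try {
      let results = await scrape(timeout)
      if (results.length > 0) return results
      // Assumption: an empty result means the page had not rendered in time
      lastError = new Error(`No matches within ${timeout}ms`)
    } catch (error) {
      lastError = error // remember the failure and try the next timeout
    }
  }
  throw lastError
}
```

On a scheduled script, keep the list short so one slow page does not hold up the run for long.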