Web Scraping Images with Node, Xray, and Download

InstructorJohn Lindquist

Share this video with your friends

Send Tweet

Node makes scraping images off the web extremely easy using a couple handy packages: Xray and Download. Simple scrape the img tag, grab all the src attributes, filter out images you don't want, then hand them over to Download to grab them.

Nabil Makhout
~ 8 years ago

Let's say I put my application on an application server, how will things download then? Won't it download the images on the server? If so how would I be able to do it on the clients pc?

Paul
~ 8 years ago

Hi. You can't create(and download) any files on a client machine because of the security issues - https://en.wikipedia.org/wiki/JavaScript#Security

Sequoia McDowell
~ 8 years ago

FYI if you're scraping large files like mp3s rather than small images you might not want to start downloads in a simple forEach. I don't know exactly what happens if you attempt to download 250 large files at once, but it probably isn't good! :) Another reason to avoid this would be to not accidentally DOS the site if it's a small mom & pop server rather than google.

A function like async's parallelLimit will allow you to say "download in parallel, but only 5 at a time" which may work better for you and the site operator.

John Lorrey
~ 8 years ago

hmm this download npm module doesn't seem to want to work for that for loop. I removed the for loop and just used url-download module passing in the whole arrary to download. var download = require("url-download"); download(results, './images').on('close', function (err, url) { console.log(url + ' has been downloaded.'); How this helps someone reading this.

izdb
~ 8 years ago

This lesson doesn't appear to work at all anymore, copied the code to node v5.3.0. Fails with no errors.

{ "name": "xray-tuts", "version": "1.0.0", "description": "", "main": "app.js", "scripts": { "test": "echo "Error: no test specified" && exit 1" }, "author": "", "license": "ISC", "devDependencies": { "download": "^5.0.2", "x-ray": "^2.3.1" } }

Vinny
~ 8 years ago

Well, apparently the Download package has been updated. I checked their docs and fixed the code accordingly:

// ... ^^ imports and xray config

(function(err, result) {
  var images = result.filter(function(img) { 
    return img.width > 100; 
  })
  .map(function(img) {
    // Here is the new download code.
    // Download takes asset url and download destination.
    // I used map() here, but forEach would provide the 
    // same output
    Download(img.src, './images');
   });

   // Write the original return result to JSON file
   s.writeFile('./results.json', JSON.stringify(result, null, '\t'));
});
Alonso Lamas
~ 7 years ago

Thanks Vinny!