NodeJS and Cheerio web scraping
I made an application that scrapes a page. That page contains a script like this:
<script> var myData = { Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' }; </script>
With the cheerio and request Node modules I get the script, but I need to get the values of car1, car2 and car3.
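var request = require('request');
var cheerio = require('cheerio');

request('http://my-url.com', function(error, response, body) {
    var $ = cheerio.load(body);
    // Text content of the <script> element that holds myData
    var htmlData = $('body script').last().prev().html();
    console.log(htmlData);
});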
I've tried to use JSON.parse(htmlData), but I get the following error: SyntaxError: Unexpected token T.
Is there any way to parse the JavaScript from the script, or can someone explain how to get the values of car1 and car2 via regex?
Answer
I would recommend doing a series of string replacements and then calling JSON.parse to get the JavaScript object, like this:
var data = "{ Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' };";
var obj = JSON.parse(data
    .replace(/((?:[A-Za-z_][\w\d])+):/g, '"$1":')
    .replace(/'/g, '"')
    .replace(/;\s*$/, ''));
console.log(obj.car1, obj.car2, obj.car3);
// Volvo Ferarri VW
Here, .replace(/((?:[A-Za-z_][\w\d])+):/g, '"$1":') finds every string of the form (?:[A-Za-z_][\w\d])+ that is followed by : and replaces it with the same matched string surrounded by " and followed by : (that is what "$1": does).
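To see what the key-quoting step does in isolation, here is a small sketch. The pattern from above consumes key characters two at a time, so it fits keys like Time and car1; the second pattern, anyLength, is not part of the answer and is only an assumed alternative for keys with an odd number of characters.

```javascript
// The key-quoting step applied on its own to a shortened sample literal.
var sample = "{ Time: '10:46:29 am', car1: 'Volvo' }";

// Pattern from the answer: matches key characters in pairs.
var pairwise = /((?:[A-Za-z_][\w\d])+):/g;
// Hypothetical alternative: one letter or underscore, then any word characters.
var anyLength = /([A-Za-z_]\w*):/g;

var quoted = sample.replace(pairwise, '"$1":');
console.log(quoted);
// { "Time": '10:46:29 am', "car1": 'Volvo' }

// A three-letter key shows where the two patterns differ.
var oddPairwise = "{ car: 'Volvo' }".replace(pairwise, '"$1":');
var oddAnyLength = "{ car: 'Volvo' }".replace(anyLength, '"$1":');
console.log(oddPairwise);  // { c"ar": 'Volvo' }
console.log(oddAnyLength); // { "car": 'Volvo' }
```

Note that the colons inside the time value are left alone by both patterns, because the text before them starts with a digit rather than a letter or underscore.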
And then .replace(/'/g, '"') will replace every ' with " (assuming your data does not contain ' inside the values).
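A quick sketch of why that assumption matters. The value "Henry's Volvo" is made up for illustration and is not from the question's data.

```javascript
// Replacing every single quote works for the sample values...
var fine = "{ \"car1\": 'Volvo' }".replace(/'/g, '"');
console.log(fine); // { "car1": "Volvo" }

// ...but an apostrophe inside a value (hypothetical data) is converted too,
// which leaves a stray double quote in the middle of the string.
var broken = '{ "car1": "Henry\'s Volvo" }'.replace(/'/g, '"');

var parsed = true;
try {
    JSON.parse(broken);
} catch (e) {
    parsed = false; // JSON.parse rejects the unbalanced quotes
}
console.log(parsed); // false
```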
And then .replace(/;\s*$/, '') will replace the ; followed by any whitespace characters at the end with an empty string (basically we remove them).
At this point, the string will look like this:
{ "Time": "10:46:29 am", "car1": "Volvo", "car2": "Ferarri", "car3": "VW" }
and now we simply parse it as a JSON string with JSON.parse to get the JavaScript object.
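In the question, htmlData holds the whole statement (var myData = { ... };), not only the object literal, so the literal has to be cut out before the replacements run. Below is a rough sketch of one way to do that. The helper names literalToJson and extractObject are made up for this example, and it assumes a flat object with no nested braces and a plain identifier as the variable name.

```javascript
// Turn a JavaScript-style object literal (unquoted keys, single-quoted
// values) into JSON text using the three replacements described above.
function literalToJson(literal) {
    return literal
        .replace(/((?:[A-Za-z_][\w\d])+):/g, '"$1":') // quote the keys
        .replace(/'/g, '"')                            // single to double quotes
        .replace(/;\s*$/, '');                         // drop a trailing semicolon
}

// Hypothetical helper: find "var <name> = { ... }" in the script text and
// parse the literal. Assumes the object contains no nested { } pairs.
function extractObject(scriptText, varName) {
    var pattern = new RegExp('var\\s+' + varName + '\\s*=\\s*(\\{[^}]*\\})');
    var match = pattern.exec(scriptText);
    return match ? JSON.parse(literalToJson(match[1])) : null;
}

// Stand-in for the htmlData string obtained with cheerio in the question.
var htmlData = " var myData = { Time: '10:46:29 am', car1: 'Volvo', car2: 'Ferarri', car3: 'VW' }; ";
var cars = extractObject(htmlData, 'myData');
console.log(cars.car1, cars.car2, cars.car3); // Volvo Ferarri VW
```

If the variable is not found, extractObject returns null, so the caller can check for that before reading car1 and the others.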