Friday, September 2, 2016

Getting Data out of Open Context & Doing Useful Things With It: Coda

Previously, on tips to get stuff out of Open Context…

In part 1, I showed you how to generate a list of URLs that you could then feed into `wget` to download information.

In part 2, I showed you how to use `jq` and `jqplay` – via the amazing Matthew Lincoln, from whom I’ve learned whatever small things I know about the subject – to examine the data and to filter it for exactly what you want.

Today – combining wget & jq

Today, we use wget to pipe the material through jq to get the CSV of your dreams. Assuming you’ve got a list of URLs (generated with our script from part 1), you point your firehose of downloaded data directly into jq. The crucial thing is to flag wget with `-qO-`: `-q` keeps wget quiet, and `-O-` sends what it downloads to standard output instead of to a file, so that it can be *piped* to another program. At the terminal prompt or command line, you would type:
wget -qO- -i urls2.txt | jq -r '.features[] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv' > out.csv
Which in Human says, “Hey wget, grab all of the data at the URLs listed in urls2.txt and pipe that information into jq. jq, you’re going to look inside properties (which is within features) and pull out these fields in particular, giving me raw output. Format each row as comma-separated values, and write everything to a new file called out.csv.”
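
If it helps to see what that jq filter is doing step by step, here is a rough Python rendering of the same idea. It is only a sketch: the field names are copied from the command above, but the function names (`rows_from_response`, `to_csv`) and the sample record are made up for illustration, and Python's `csv` module quotes values a little differently than jq's `@csv` does.

```python
import csv
import io
import json

# Field names copied from the jq filter above.
FIELDS = ["label", "href", "context label", "early bce/ce",
          "late bce/ce", "item category", "snippet"]


def rows_from_response(doc):
    """Yield one list of values per feature, like `.features[] | .properties | [...]`.

    Unlike jq, this quietly skips a response with no "features" key
    instead of raising an error.
    """
    for feature in doc.get("features", []):
        props = feature.get("properties", {})
        # A missing field becomes None, which csv writes as an empty cell.
        yield [props.get(name) for name in FIELDS]


def to_csv(docs):
    """Turn a sequence of parsed JSON responses into one CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for doc in docs:
        writer.writerows(rows_from_response(doc))
    return buf.getvalue()


# A made-up record shaped like the fields the filter expects (not real data).
sample = json.loads("""
{"features": [{"properties": {
    "label": "Coin 1",
    "href": "https://example.org/1",
    "context label": "Trench A",
    "early bce/ce": -100,
    "late bce/ce": 50,
    "item category": "Object",
    "snippet": "bronze, worn"}}]}
""")

print(to_csv([sample]), end="")
```

The one-liner with wget and jq is still the quicker route; this is just the same filter spelled out.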

…Extremely cool, eh? (Word to the wise: read Ian’s tutorial on wget to learn how to form your wget requests politely so that you don’t overwhelm the servers. Wait a moment between requests; look at how the wget command was formed in the Open Context part 1 post.)
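
To make the "wait a moment between requests" idea concrete, here is a minimal Python sketch of the same courtesy. It is not the wget command from part 1; the function name `fetch_politely` and the two-second default are my own assumptions, so pick a delay that suits the server you are asking.

```python
import time
import urllib.request


def fetch_politely(urls, delay_seconds=2.0, fetch=None, sleep=time.sleep):
    """Return the body of each URL in order, pausing between requests.

    `fetch` and `sleep` can be swapped out, which makes it possible to try
    the function without touching the network.
    """
    if fetch is None:
        def fetch(url):
            # Default behaviour: a plain HTTP GET, decoded as UTF-8 text.
            with urllib.request.urlopen(url) as response:
                return response.read().decode("utf-8")

    bodies = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request after the first
        bodies.append(fetch(url))
    return bodies
```

To use it, you would read the lines of urls2.txt into a list, pass that list to `fetch_politely`, and parse each returned body with `json.loads`.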
