I’ve been busy lately harvesting LOTS of full text data from @TroveAustralia’s digitised journals – so many opportunities for research! You should be able to get to all the code & data from the new Trove journals section of my GLAM Workbench. glam-workbench.github.io/trove

Ok, so I’ve downloaded the OCRd text from 27,426 issues of 358 digitised journals/series in @TroveAustralia. That’s 6.6 GB of full text. Tune in tomorrow for full details…

All 9,738 OCRd text files harvested from books, pamphlets and leaflets in @TroveAustralia’s ‘book’ zone have been uploaded to @aarnet’s CloudStor for easy browsing/download. There’s also a 400 MB zip file if you want the whole lot.

The harvesting method and code are available in this not... updates.timsherratt.org/2019/0

So @TroveAustralia includes more than 370,000 press releases, speeches, and interview transcripts issued by Aust federal politicians & saved by the Parliamentary Library. Learn how to harvest metadata & full text to create your own datasets in this notebook. nbviewer.jupyter.org/github/GL

Among the OCRd texts I’m currently harvesting from Trove’s journals zone are things like the NSW Post Office Directories from 1886 onwards. Useful sources for compiling data about occupations, locations etc?

Wow, there are now over 371,000 press releases, interview transcripts and more from the @ParlLibrary available through @TroveAustralia. Just working on a new notebook to harvest sets for research… trove.nla.gov.au/article/resul

Another collection of OCRd text from @TroveAustralia is on its way…

Newspaper articles in @TroveAustralia with ‘White Australia Policy’ in their titles – 3,600 thumbnails in one big, zoomable image. Zoom in for the article ids… easyzoom.com/imageaccess/3df68

Playing with @TroveAustralia newspaper results. Here’s illustrated articles with ‘White Australia Policy’ in their title…

The final tally – after much tweaking I’ve downloaded OCRd text from 9,738 works in the @TroveAustralia books zone. This includes ephemera such as pamphlets and posters as well as more booky books. Here’s the full metadata, all the text files, & harvesting code. nbviewer.jupyter.org/github/GL

I’m looking for books in @TroveAustralia, but there’s lots of ephemera (pamphlets, posters etc) in the book zone. So I tried grabbing the images of ‘books’ with one page & found some nice stuff including this collection of playbills. trove.nla.gov.au/book/result?q
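
A rough sketch of the one-page filter, assuming the harvested metadata is a CSV with a page count per work. The sample rows and the column names `title` and `pages` are made up for illustration and are not the real file's headers.

```python
import csv
import io

# Made-up rows standing in for a harvested metadata CSV.
# The column names `title` and `pages` are assumptions for this sketch.
sample_csv = """title,pages
An example playbill,1
An example novel,312
An example poster,1
"""


def single_page_works(csv_text):
    """Return the rows whose page count is exactly one."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["pages"].strip() == "1"]


ephemera = single_page_works(sample_csv)
titles = [row["title"] for row in ephemera]
print(titles)
```

Works with a single page are likely candidates for ephemera such as playbills and posters, though a page count alone won't separate them perfectly.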

Text of over 3 thousand digitised books and pamphlets downloaded so far from @TroveAustralia…

After talking to @PrimahadiWijaya today about work at @MonashLing, I started harvesting metadata & full text from digitised books in @TroveAustralia. OCRd text from about 2,000 books downloaded so far. More soon… github.com/GLAM-Workbench/trov

What I did at ! Here’s a CSV with basic details of 7,719 digitised books available through @TroveAustralia. I’m not sure if they all have OCRd text available, but if they do I’ll attempt to download it once I’m back home. github.com/GLAM-Workbench/trov

TIL that the web pages for digitised works (like books and journal issues) on @TroveAustralia embed a lot of useful metadata that you can’t get through the API. Here’s how to extract it. nbviewer.jupyter.org/github/GL
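
A minimal sketch of the general idea, not the notebook's actual code. It assumes the page embeds its metadata as a JavaScript assignment of the form `var work = JSON.parse(JSON.stringify({...}));`. That pattern, the function name, and the sample HTML below are all assumptions for illustration, so check them against a real page before relying on this.

```python
import json
import re

# Assumed pattern: metadata embedded in a <script> block as
# `var work = JSON.parse(JSON.stringify({...}));`
WORK_PATTERN = re.compile(
    r"var work = JSON\.parse\(JSON\.stringify\((\{.*?\})\)\);", re.DOTALL
)


def extract_embedded_metadata(html):
    """Return the embedded JSON object as a dict, or None if it isn't found."""
    match = WORK_PATTERN.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Made-up sample page; a real page would be fetched over HTTP first.
sample_html = """
<html><script>
var work = JSON.parse(JSON.stringify({"title": "An example pamphlet", "children": {"page": [{"pid": "nla.obj-0"}]}}));
</script></html>
"""

metadata = extract_embedded_metadata(sample_html)
print(metadata["title"])
```

Parsing the embedded JSON rather than scraping visible HTML keeps the nested structure (such as a list of pages) intact.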

Just posting the link to my ‘Introducing APIs’ slides for again, so that they show up in my MicroBlog feed… slides.com/wragge/introducing-

Hmm, it occurs to me that the method I used to generate newspaper article thumbnails from Trove could also be used to extract illustrations (cartoons, drawings, photos etc)…

So, I’ve finally figured out a way to automatically generate nice-looking thumbnails from @TroveAustralia newspaper articles. Demo notebook here. mybinder.org/v2/gh/GLAM-Workbe

So apparently our cultural institutions need to ‘articulate a shared narrative that directly connects them with Australia’s story’... aph.gov.au/Parliamentary_Busin

