Anyone got a link / FYI re whatever the "masto data scraping" thing is?

@hugh ...which I realised I didn't actually boost at any point. 🤦‍♀️

@virtualwolf I must admit I'm having a hard time cranking up the rage machine on this. Mastodon has multiple user-level tools for people who don't want their toots scraped.

@hugh @virtualwolf seconded. I mean, you have to know that anything you post in public can and will be mined, right?

@fortescue @hugh @virtualwolf in this case it was academics who should have known better and hand-waved over their dismal anonymisation efforts.

I mean, yes, public posts should be anticipated to be accessed by the public in ways you don't expect, but that's no reason not to call out anyone who's being a dickhead with your stuff.

@mike @fortescue @virtualwolf

Genuinely naive question - how is this different to a search engine spidering sites and maintaining an index? Is it because of what (meta)data they are storing? Or is it the research that is dubious rather than the scraping per se?

@hugh @fortescue @virtualwolf I only skimmed the paper this morning (and it's now (shock!) been pulled due to privacy issues) but it seemed at first glance that they were using API queries to pull the instance timelines directly, where a reputable search engine would view the pages as a real user would, which include things like, oh for instance, a user's "do not index this content" directives, if present.

They seemed to think this was fine because they did a tiny bit of scrubbing of usernames.
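
For anyone wondering what "honouring a do-not-index directive" looks like from the crawler's side, here's a rough sketch. It assumes the opt-out shows up in the page HTML as a standard `<meta name="robots" content="noindex">` tag. The names `RobotsMetaParser` and `may_index` are made up for illustration and are not from the paper or from Mastodon itself.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def may_index(html: str) -> bool:
    """Return False if the page asks robots not to index it.

    Hypothetical helper: only looks at the robots meta tag, not at
    robots.txt or HTTP headers, which a real crawler would also check.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & parser.directives)
```

A polite crawler would call something like `may_index(page_html)` before storing anything. Pulling timelines straight from the API never sees that tag at all, which is the point above.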

@hugh @fortescue @virtualwolf also while it's HARD to argue that short posts such as toots are subject to copyright, it's not impossible. There are legit creative works in there.

While academic institutions have pretty broad fair use rules in their favour, it's not absolute and there was no indication this group even considered this possibility.

Just because you CAN download a given thing quite legitimately via a website, it's not automatically public domain. There's still ownership.

@mike @fortescue @virtualwolf 👍🏻 thanks! Makes sense. There are 2 settings admins can turn on to help with this. Based on a paste I saw, it looks like 4 toots from my instance got harvested, but happily I appear to have the admin settings right now.

@fortescue @hugh @virtualwolf Looks like Harvard Dataverse have already taken it down though. Too much non-consented personal information for any ethical review to approve, so essentially unusable as a dataset for its intended purpose.

