Anyone know if there is some paper, blog, whatever explaining why academic/scholarly journal articles continue to be ~99% published in PDF instead of HTML?
I'm not being snarky, I'm genuinely baffled and interested in knowing the answer.
@hugh I have no empirically-based answer but I think it might have to do with PDF ~feeling more scholarly~ or something? Or that PDFs are easier to save to desktop or reference managers for future use, or that people still like the look of PDFs as a surrogate for ye olde paper
@hugh Could it be a holdover from printing? I've heard PDF is preferred by most professional printers.
Could also be DRM 😷
@hugh possibly just a hangover from when not every desktop computer had an internet connection and web browser installed?
@hugh No reference, but I vaguely recall something about pagination maybe being involved: if you aren't directly quoting someone (allowing a text search) but just paraphrasing a claim, or adding a 'see also', or something like that, then citations with page numbers direct readers to the relevant text. That can be solved with e.g. numbered paragraphs, but style guides are a possible source of inertia...
@GardenOfForkingPaths I have a LOT of anecdotal evidence supporting this, though I suspect they feed into each other (also some PDF articles have no page numbers!)
@vi This is precisely the reason PDFs are a problem.
@vi ever tried reading a scholarly article PDF on a phone?
@hugh No reference, but given some experiences teaching faculty to use co-editing tools (wikis, etc.), I think PDFs represent "fixed" to many people. Non-fixed documents seem to make some people very, very nervous. And PDFs are fixed--pretty much what you see on one machine is what you see on another. NOT true for HTML.
Also, inertia. 😀
@hugh It's much easier to go from a print-ready file to a PDF than from HTML to a print-ready file, and that kind of conversion is work the publisher pays for. Ask about formats around some small-press/self-publishing communities who sell both ebooks and printed books and you will get an earful. Also, a PDF is one file, an HTML page is a folder, so if the HTML and the images get separated it will no longer work.
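On the "one file vs. a folder" point: an HTML page can be made self-contained by embedding its images as `data:` URIs, so nothing can get separated. A minimal Python sketch, assuming the images are already in memory; the function name `inline_images` and the `assets` mapping are made up for illustration, and the regex only handles double-quoted `src` attributes on `<img>` tags:

```python
import base64
import mimetypes
import re


def inline_images(html, assets):
    """Return html with <img src="..."> references replaced by data: URIs.

    assets maps a relative path (as written in the HTML) to the image bytes.
    References that are not in assets are left untouched.
    """
    def replace(match):
        path = match.group(2)
        if path not in assets:
            return match.group(0)
        # Guess the MIME type from the file extension, with a generic fallback.
        mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
        payload = base64.b64encode(assets[path]).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    # Hypothetical, deliberately simple pattern: <img ... src="path" ...>
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace, html)
```

The trade-off is size: base64 inflates each image by roughly a third, which is one reason publishers may not bother.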
@hugh In addition, 90% of scholarly literature was published before 1995, so it is digitized by scanning. Most scanners output PDFs but not clean HTML. And PDFs have stable page numbers for citation, whereas citing inside long flowing text files is a hard problem (nobody wants to cite paragraph 43, they want page 12).
@hugh this is 100% anecdotal, but ever since I started grad school, I went from hating PDFs to loving them. If I ever have the option to read a document as PDF or HTML (or something else) I always choose PDF now. There are a few reasons why:
1. PDFs fit into my workflow. I can easily download a bunch of documents to my Dropbox, organize them into folders, read them on my iPad, annotate them, and send them to people. There’s zero friction.
@hugh 2. Reading PDFs is easy. On an iPad-size screen, it’s very similar to reading paper. This also breaks down to two similar points:
2a. When you’re reading for hours at a time, pagination is far superior to scrolling.
2b. When a journal database gives you HTML, there’s typically zero effort put into its readability. The font will be tiny, the lines will be too long. Images won’t display properly. It’s awful.
@hugh 3. Every PDF reader comes with annotation features. I can always mark up my documents, no matter what system I’m using. I can also send them to a colleague without ever worrying about whether they’ll be able to read them.
Now, imagine trying to replicate one of these points with HTML. Even when it’s technically possible, there’s too much friction. Software is too inconsistent. The HTML ecosystem just isn’t built for the same things that the PDF ecosystem is.
@John thanks this is a useful perspective.
More of a response to other responses:
Poor formatting of HTML articles provided by journals: this may change with increased demand for HTML articles, and it isn't necessarily a problem if a `Pocket/Readability/Instapaper for Academic Articles` emerged.
Annotation systems could be set up fairly easily, I suspect, for HTML content. Especially with a reader program. [Annotation of PDFs isn't super available on #Linux readers, I think.]
@bthall yeah, hypothes.is is working in the annotation space, and you're 100% right on formatting going where the demand is (i.e. it's a circular argument). The most compelling explanation for me was John@glammr.us's meta point about portability (I guess it's right there in the name!). The offline portability piece seems to be the real issue. I'm wondering if ePub or something like it might be part of the solution.
Yeah, I had ePub in mind when writing my post, too. I think that it would work fairly well, especially with an ePub reader reconfigured to fit articles rather than books and to sync one's library across devices. HTML to ePub is fairly trivial, LaTeX is at least sometimes rendered in ePub readers, and SVG is as well for equations, graphs and diagrams (I don't know why I haven't heard of LaTeX to SVG being a thing).
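To give a rough idea of why HTML to ePub is close to trivial: an ePub is essentially a zip archive of XHTML plus a few small manifest files. A minimal sketch using only the Python standard library, assuming the article body is already well-formed XHTML. The function name `build_epub`, the file layout under `OEBPS/`, and the fixed default `modified` timestamp are choices made for illustration, and the output has not been run through a validator such as EPUBCheck:

```python
import io
import zipfile
from html import escape

CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

PACKAGE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">{identifier}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="article" href="article.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="article"/>
  </spine>
</package>
"""

NAV = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>{title}</title></head>
  <body>
    <nav epub:type="toc">
      <ol><li><a href="article.xhtml">{title}</a></li></ol>
    </nav>
  </body>
</html>
"""

ARTICLE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body>
{body}
  </body>
</html>
"""


def build_epub(title, body_xhtml, identifier, modified="2020-01-01T00:00:00Z"):
    """Wrap one XHTML article body in a single-file ePub and return its bytes."""
    safe_title = escape(title)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry has to come first and be stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER)
        archive.writestr("OEBPS/content.opf", PACKAGE.format(
            identifier=escape(identifier), title=safe_title, modified=modified))
        archive.writestr("OEBPS/nav.xhtml", NAV.format(title=safe_title))
        archive.writestr("OEBPS/article.xhtml",
                         ARTICLE.format(title=safe_title, body=body_xhtml))
    return buffer.getvalue()
```

Writing the returned bytes to a file ending in `.epub` should give something a reader app can open. Images, stylesheets and math would each need their own manifest entries, which is where the real work starts.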
@bthall the main thing holding ePub back is its association with books and DRM I suspect. If you called it a ‘bundled webpage’ or something it might be looked at differently.
@hugh I just found this, which seems to be a means of distributing ePub documents (books, articles): https://wiki.mobileread.com/wiki/OPDS#OPDS_Catalog_Generation
It's supported by a number of apps. End users would need to configure their reader devices/apps for a given document provider's catalog, but among academics that might not be a big thing to ask.