What if we used AI as an excuse to provide structured open data in plain text to everyone?

§1: This is not an ordered list

I am a fan of HTML.
I am a fan of standards.
I am a fan of lists.

But now that I’m a law librarian, I am *less of a fan* of HTML. This is because the W3C’s HTML standard for ordered lists means that my work will always involve PDFs.

From I Blame the W3C’s HTML Standard for Ordered Lists [tech, soc, Patreon]:

I have one particular favorite hobby horse example of this, which really captures how apparently trivial errors can have far-reaching consequences.

That example is the Ordered List (<ol>).

HTML’s Ordered List is not an ordered list. Instead of implementing Ordered List correctly, HTML gave us a dynamic automatically numbered list and called it an ordered list.

This is self-evidently insane. It is predicated on a fundamental misunderstanding of why human beings assign numbers to things such as list items. The chief reason humans assign ordinal numbers to list items is to be able to refer to those list items by number. Consequently, while it’s fine if a document generator auto-assigns numbers to list items, as a convenience to the document drafter, once the document is “published”, the assignment of numbers to list items must be stable.

This is because the ordinal numbers of list items in an ordered list are content not presentation.

The W3C spent literal decades dying on the hill that the ordinal numbers of the items in an ordered list are not content, only presentation.

It caved enough to allow the start number attribute, which – get this – it deprecated in HTML4, and then, in HTML5 (officially released in 2014) it un-deprecated, with the note that the numbers assigned list items in an ordered list are, in fact, content not presentation…

Content not presentation. Remember this!

… There is one particular type of document in which the correct handling of the ordinal numbers of lists is paramount. A document type in which the ordinal numbers of the lists cannot be arbitrarily assigned by computer, dynamically, and in which the ordinal numbers of the lists are some of the most important content in the document.

I’m referring of course to law.

HTML, famously, was developed to represent scientific research papers, particularly physics papers. It should come as no surprise that it imagines documents to have things like headings and titles, but fails to imagine documents to have things like numbered clauses, the ordinal numbers of which were assigned by, for example, an act of the Congress of the United States of America.

Of course this is not specific to any one body of law – pretty much all law is structured as nested ordered lists where the ordinal numbers are assigned by a government body.

It is just as true for every state in the Union, every country, every province, every municipality, every geopolitical subdivision in the world.

HTML, from the first version right up to the present version, is fundamentally inimical to being used for marking up and serving legal codes as web pages. It can be done, of course – but you have to fight the HTML every step of the way. You have no access to any semantic markup for the task, because the only semantic markup for ordered lists is OL, which treats the ordinal numbers of ordered lists as presentation not content.

And why is this important?

But the fact that HTML’s Ordered Lists are but misnamed dynamic auto-numbered lists is only one problem with them; there’s another problem following from the insane decision to consider the ordinal item numbers of Ordered Lists “presentation” instead of “content”.

In the browser, copy and paste does not work on what HTML relegates to “presentation”.

You can only copy and paste the content of a webpage.

So one of the functional consequences of HTML treating the ordinal numbers of ordered lists as presentation not content is that the user can’t cut and paste them from the browser window.

When a user tries to copy and paste from an ordered list in an HTML page, the ordinal numbers assigned by the OL tag are not included – the numbers are left behind. What the user winds up pasting into the target document is the copied list items – without their numbers!

It seems self-evidently wild to me that any characters in a text document, such as an HTML document, would ever be considered “presentation”, and not content, since they’re, you know, characters in a text document signifying information. It is a massive violation of user expectations. But that is exactly what the W3C did with Ordered Lists.
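To make the distinction concrete, here is a minimal Python sketch. It is only an analogy, not how any browser implements copy and paste: it uses the standard library's `html.parser` to collect the character data of two small fragments, roughly the way a plain-text copy keeps content and drops generated markers. The bylaw clauses and the names `TextOnly` and `visible_text` are made up for illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html: str) -> list[str]:
    """Return the non-empty text nodes of an HTML fragment, in order."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical clauses: the numbers 3 and 4 are generated by the <ol>,
# so they never appear as characters in the document.
AUTO_NUMBERED = (
    "<ol start='3'>"
    "<li>No person shall litter.</li>"
    "<li>No person shall loiter.</li>"
    "</ol>"
)

# The same clauses with the numbers written into the text itself.
EXPLICIT = (
    "<ul>"
    "<li>3. No person shall litter.</li>"
    "<li>4. No person shall loiter.</li>"
    "</ul>"
)

auto_text = visible_text(AUTO_NUMBERED)
explicit_text = visible_text(EXPLICIT)

print(auto_text)      # clause text only, the numbers are absent
print(explicit_text)  # clause text with the numbers included
```

In the first fragment the numbers exist only as something the renderer generates, so there is nothing in the text to extract. In the second they are ordinary characters and travel with the clause, which is the workaround anyone publishing legal text in HTML ends up reaching for.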

And this is why when you deal with legislation and bylaws and legal text, you usually end up having to deal with PDFs, lots and lots of PDFs.

§2: When the information you want is within a set of 15,000 PDFs

If you have been following my work, you might vaguely recall that I have occasionally alluded to a particular project of mine: to add all the Supreme Court of Canada decisions into Wikidata and to include the relevant details when a decision has an intervener involved.

SCC documents list interveners if they are part of a decision, such as in the example below:

The Supreme Court of Canada publishes its decisions both in HTML and in PDF, and with this kind of document, we don’t have any problems when we try to cut and paste text. But for my particular project, I want to extract this information from thousands of documents, and not one at a time.

Now I could try to scrape the Supreme Court of Canada website in order to download a personal copy of each decision, but I don’t want to do that. And I don’t have to do that, because someone has already done this and has shared the work.

That someone is Sean Rehaag of York University’s Refugee Law Lab:

The bulk data set of canadian-legal-data is available on Hugging Face.

The Hugging Face Dataset interface allows us to run SQL queries to limit and refine the SCC dataset to what we might choose to download from the whole:
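As a rough illustration of that kind of narrowing query, here is a small self-contained sketch. It does not touch the real dataset: it loads a few invented rows into an in-memory SQLite table and filters them with SQL. SQLite is a stand-in for whatever engine the Hugging Face interface runs, and the table name `scc`, the column names (`citation`, `year`, `language`, `unofficial_text`), and every row are assumptions made for the example.

```python
import sqlite3

# Invented rows standing in for court decisions. The citations, parties,
# and column layout are placeholders, not real data from the dataset.
ROWS = [
    ("EX-1", 1995, "en", "A v. B. Intervener: C."),
    ("EX-2", 2005, "en", "D v. E."),
    ("EX-3", 2010, "en", "F v. G. Interveners: H and I."),
    ("EX-4", 2010, "fr", "J c. K. Intervenant : L."),
    ("EX-5", 2021, "en", "M v. N. Intervener: O."),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scc ("
    "citation TEXT, year INTEGER, language TEXT, unofficial_text TEXT)"
)
conn.executemany("INSERT INTO scc VALUES (?, ?, ?, ?)", ROWS)

# Keep only English decisions from 2000 onward whose text mentions
# an intervener, so that only those rows need to be downloaded.
QUERY = """
SELECT citation, year
FROM scc
WHERE language = 'en'
  AND year >= 2000
  AND unofficial_text LIKE '%Intervener%'
ORDER BY year, citation
"""

matches = conn.execute(QUERY).fetchall()
for citation, year in matches:
    print(citation, year)
```

A text match like this only flags candidate decisions. Pulling out the actual names of the interveners would still need a closer look at how each decision lays out its parties.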

Having access to this dataset has saved me many hours of labour. I am so very grateful that the Refugee Law Lab has shared their work openly.

§3: Non-Profit

Most libraries that used to collect printed government documents don’t really “collect” copies of this kind of government-produced information anymore, now that the documents are published online. While some larger institutions do capture preservation copies of online government documents, many libraries tend to point their readers to the source: government websites where users can find the legislation, council minutes, and meeting agendas that they might need. If they can find it.

I think there’s a role that libraries could fill here. The large legal database companies don’t collect legislation and reports at the local level. You can’t use their expensive AI to ask questions about how to interpret a local bylaw.

Libraries could collect government and legal information at the local level. Librarians and library staff could collect, clean, standardize, describe, and share local government meetings, reports, and minutes. With help from transcription software, the library could provide transcripts of meetings that are available on video. People in the community could use these datasets to support civic engagement projects like Toronto’s Civic Dashboard and local journalism efforts like The Documenters Network. And yes, some people would use these datasets for their own personal or commercial LLM use.

Years ago, I wrote an essay called Why Libraries Should Maintain the Open Data of Their Communities. It’s almost ten years later, and I guess that I am still beating the same drum.

Librarian of Things
