Boolean is Dead AND I feel fine

I have an unpopular opinion about librarianship. Well, I have several but the one that I feel comfortable in sharing today is this: for years, the profession has been overly reliant on teaching boolean logic as the means to find research and known works in our library catalogues, subject databases, and the internet.

I’m no longer worried about sharing this opinion because, while boolean has usually functioned (though not serving us well), the growing number of services that are replacing indexes with machine learning suggests a future in which boolean will no longer work*.

Vox on YouTube: We’re already using AI more than we realize [06:31]

Opening a DIALOG, pre-Chatbot

For years, I used to ask student librarians if they were still teaching DIALOG at library school. That’s because even though DIALOG had long fallen from common use, it was still commonly taught. I think the reason why it persisted for so long was because DIALOG presented the last remaining compelling reason to use complicated nested boolean logic in one’s search.

Dialog (https://en.wikipedia.org/wiki/ProQuest_Dialog) is an online information service owned by ProQuest, who acquired it from Thomson Reuters in mid-2008.[1][2]

Dialog was one of the predecessors of the World Wide Web as a provider of information, though not in form.[3][4] The earliest form of the Dialog system was completed in 1966 in Lockheed Martin under the direction of Roger K. Summit.[5] According to its literature,[6] it was “the world’s first online information retrieval system to be used globally with materially significant databases”. In the 1980s, a low-priced dial-up version of a subset of Dialog was marketed to individual users as Knowledge Index.[7] This subset included INSPEC, MathSciNet, over 200 other bibliographic and reference databases, as well as third-party retrieval vendors who would go to physical libraries to copy materials for a fee and send it to the service subscriber.

DIALOG was not cheap and was generally only available at some universities and at large companies within industries that had R&D budgets. To access DIALOG, you would have to dial in, log into your account, and you (or your employer) would be charged by the number of records that were returned from each of your searches. Researchers who were able to distill their complex research needs into the form of Boolean had a financial incentive to be as precise as possible.

Figure 2. An example search strategy for ‘oral protein-calorie supplementation for children with chronic disease’ from “Think outside the search box: A comparative study of visual and form-based query builders”

When I went to library school, we were given free access to DIALOG and I was taught how to generate searches that would return a reasonable number of results to download and review, to see if they contained the answer to the research question that we were assigned. The goal was to end up with a set of between 30 and 50 results.

It was a very procedural and mechanical approach to the act of researching and best suited for the literature of the sciences and engineering.

It was only after I became a professional librarian that I learned to supplement my teaching and research with other ways of finding written works. I would recommend that students use our woefully under-utilized Annual Reviews subscription or seek out the first chapters of recent dissertations to find a literature review when they came to the Desk with a research question that they didn’t realize was way too broad for their first university-level paper. I would show students how to use the works cited in their textbooks and their readings as a jumping-off point to find other relatable and reliable sources. I pointed out that while recency could be important, they should also pay attention to how many times a work was cited by others. I wanted to ground their knowledge in people and not just in a complicated search string in a commercial product.

Using Boolean as a technique was appropriate at the time of library school because within DIALOG, there were very specialized, domain-specific indexes that you would select, and while each of these contained a great many items, it was in no way comparable to the scale of the internet or even our federated, cross-disciplinary academic discovery systems.

Figure 1. A traditional form-based query builder (PubMed) from “Think outside the search box: A comparative study of visual and form-based query builders”

When I designed library websites, I would suggest to my colleagues that linking our websites to the advanced search screens of the databases we subscribed to was inappropriate according to much of the evidence-based, peer-reviewed user-experience research that existed. Even the vendors who sold us these research tools did not recommend defaulting to a search screen that required the user to think in terms of search sets connected by boolean logic, as their research suggested that it was better for most users to refine broad searches with facets. And reader, I was harshly rebuked for suggesting that we leave the comfort of Boolean searching. I was not the only one.

Title screen of a powerpoint presentation reading, Web Librarians who do Ux: we are so sad, we are so very very sad

This makes me wonder: What are librarians going to do now when we discover that our advanced searching techniques no longer work in Google?

And what are librarians going to do when they find that the advanced search screen has been replaced with an AI-driven chatbot?

“There is More to Reliable Chatbots than Providing Scientific References: The Case of ScopusAI”, by Teresa Kubacka, Feb 21, 2024

NOT Hallucination but Tokenization

[(Boolean searching does not work with machine-learning search systems) OR (Boolean searching kind of works with LLMs)] AND [(I don’t believe many (people OR librarians) know the technical reason why)].

From the post Google Search is Dying comes this passage:

For example, one proposed failing query was "quotes don't give". But HN user saalweachter has pointed out that it is a very intricate punctuation problem.

    For instances, on the ["quotes don't give"] example, the first result I get is https://www.goodreads.com/quotes/tag/never-give-up

    If I do a find-in-page for "quotes don't give", I get zero results. Oh no! Perfidy!

    ... but, if you look more closely, you'll find this string waaaaay down at the bottom:

    tags: don-t-give-up, don-t-give-up-on-your-dreams, don-t-give-up-on-yourself, don-t-give-up-quotes, don-t-give-up-the-fight, encouragement, ...

    Thanks to the wonders of tokenization, that "don-t-give-up-quotes, don-t-give-up-the-fight" gives you the string of tokens, "don t give up quotes don t give up the fight", which contains the exact phrase "quotes don t give", which is the tokenization of the phrase "quotes don't give".
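The effect described in that comment can be reproduced with a few lines of Python. This is only a minimal sketch: the `tokenize` rule below (lowercase, then split on anything that is not a letter or digit) is an assumption made for illustration, not a description of how Google actually tokenizes pages, and the function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits.

    Assumed rule for illustration: hyphens, apostrophes, commas and
    spaces all act as separators and are thrown away.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(doc_tokens, phrase_tokens):
    """Return True if phrase_tokens occurs as a consecutive run in doc_tokens."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

page_text = "tags: don-t-give-up-quotes, don-t-give-up-the-fight, encouragement"
query = "quotes don't give"

doc_tokens = tokenize(page_text)
query_tokens = tokenize(query)

# A find-in-page style check on the raw characters fails...
literal_match = query in page_text

# ...but the phrase check on the token streams succeeds.
token_match = contains_phrase(doc_tokens, query_tokens)

print(query_tokens)
print("literal match:", literal_match)
print("token match:", token_match)
```

Under these assumptions the quoted phrase is never present on the page as typed, yet it matches once both sides have had their punctuation stripped, which is the behaviour the commenter describes.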

What are the wonders of tokenization? To find out, Let’s build the GPT Tokenizer with Andrej Karpathy.

If you don’t want to watch this 2 hour and 13 minute technical video, you are going to have to suffer from my so-reductive-it’s-probably-harmful distillation of the concept: tokenization turns words into numbers so that LLMs can calculate distances between words in a spatial array.

You could have each letter of the alphabet be represented by an integer (e.g. a=1) but this is not very efficient. Tokenization makes things more manageable by associating an integer with a chunk of text called a token, such as ‘alpha=1’.
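To make the difference concrete, here is a toy comparison of the two approaches. It is a sketch under loud assumptions: both vocabularies are invented for this example, and the chunk encoder uses a simple greedy longest-match rule rather than the byte pair encoding that Karpathy builds in the video.

```python
# Hypothetical character-level vocabulary: a=1, b=2, ... z=26.
char_vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz", start=1)}

# Hypothetical chunk-level vocabulary; real tokenizers learn theirs from data.
chunk_vocab = {"alpha": 1, "bet": 2}

def encode_chars(text):
    """Encode one integer per letter."""
    return [char_vocab[ch] for ch in text]

def encode_chunks(text, vocab):
    """Encode by repeatedly taking the longest chunk found in the vocabulary.

    Greedy longest-match is a simplification chosen for readability;
    it is not how GPT tokenizers pick their chunks.
    """
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers text starting at position {i}")
    return ids

word = "alphabet"
print(encode_chars(word))                # eight integers, one per letter
print(encode_chunks(word, chunk_vocab))  # two integers, one per chunk
```

The same word takes eight integers at the letter level and only two at the chunk level. The trade-off is that the model then sees "alpha" and "bet" rather than individual letters.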

From Karpathy, I learned of the website Tiktokenizer, which clearly illustrates how various LLM systems tokenize text. As you can see below, while “Boolean” is a token represented by 7035, the integer 2586 maps to “ Please don”. Tokenization is why we can’t have nice things with LLMs.

Congratulations. You now know one of the main reasons why LLM systems tend to fail at spelling, ordering items in reverse, and other points of failure:

Boolean is NOT dead AND I’m okay with it

Can we still use precise language and boolean searches in systems that are made accessible by a model generated by machine learning and not by indexing? Can we provide systems that offer both “traditional searching” and “natural language searching”? I don’t have good answers to these questions, but I don’t think the answer is a simple no.

*Despite what I had said earlier about how ‘boolean won’t work in the future’, I don’t really believe it.

I don’t believe it yet because we still need alternative systems that are understandable and confirmable to compare our machine learning systems against, if we want to really understand and benchmark the capabilities and limits of LLMs in terms of accuracy and effectiveness.

And I don’t believe boolean (or as I prefer to call it, SQL) is dead because there is still a real need for replicable and reliable tools for systematic literature searches.

On that note, did you know that…

DIALOG still exists?!?

DIALOG is owned by Clarivate.

Evidently, there is still a market for high-quality sources of valuable research that provides precise and replicable searching to support essential systematic review research, which is an essential methodology in the health sciences, untainted by the presence of Reddit threads in the search results.

Which makes me think… DIALOG is going to make a comeback, isn’t it?


2 Responses to “Boolean is Dead AND I feel fine”

  1. Interesting.

    I think the future of academic search is RAG (retrieval augmented generation) systems. As you know, this uses your input to search for relevant sources (chunks of text) and these are used by the LLM to generate an answer.

    Those can be, and mostly still are, a blend of lexical search (usually BM25) and “natural language/semantic search” (typically a bi-encoder dense embedding). These should in theory be capable of boolean searching, as there is an inverted index for at least the lexical search part (but boolean may not be implemented).

    See also https://journal.code4lib.org/articles/17443 .

    Some, like Scite.ai assistant and the upcoming Primo CDI research assistant, even use a language model to convert user input (natural language) into a search string.

    I tried with some odd ball input strings, Scite.ai assistant happily searched with those strings in their scite.ai index, bypassing the subword tokenization issue.
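The commenter's point that an inverted index makes boolean searching possible can be sketched as follows. This is an illustration only: the three documents are made up, terms are split on whitespace, and no ranking such as BM25 is attempted. The `postings` helper is a hypothetical name.

```python
from collections import defaultdict

# Made-up documents for illustration.
docs = {
    1: "boolean searching in library catalogues",
    2: "machine learning search systems",
    3: "boolean logic and machine learning",
}

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_index(docs)

def postings(term):
    """Return the set of document ids for a term (empty set if unseen)."""
    return index.get(term.lower(), set())

# Boolean operators become set operations on the postings.
both = postings("boolean") & postings("machine")    # boolean AND machine
either = postings("boolean") | postings("machine")  # boolean OR machine
without = postings("boolean") - postings("machine") # boolean NOT machine

# Nested query: (boolean OR search) AND learning
nested = (postings("boolean") | postings("search")) & postings("learning")

print(both, either, without, nested)
```

Because every operator reduces to a set operation on exact terms, the same query returns the same documents every time, which is the replicability that systematic literature searches rely on.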
