I have an unpopular opinion about librarianship. Well, I have several but the one that I feel comfortable in sharing today is this: for years, the profession has been overly reliant on teaching boolean logic as the means to find research and known works in our library catalogues, subject databases, and the internet.
I’m no longer worried about sharing this opinion because, while Boolean has usually functioned (if not served us well), the growing number of services replacing indexes with machine learning suggests a future in which Boolean will no longer work*.
Opening a DIALOG, pre-Chatbot
For years, I used to ask student librarians whether DIALOG was still being taught at library school. That’s because even though DIALOG had long fallen from common use, it was still commonly taught. I think it persisted for so long because DIALOG presented the last remaining compelling reason to use complicated nested Boolean logic in one’s search.
Dialog (https://en.wikipedia.org/wiki/ProQuest_Dialog) is an online information service owned by ProQuest, which acquired it from Thomson Reuters in mid-2008.
Dialog was one of the predecessors of the World Wide Web as a provider of information, though not in form. The earliest form of the Dialog system was completed in 1966 at Lockheed under the direction of Roger K. Summit. According to its literature, it was “the world’s first online information retrieval system to be used globally with materially significant databases”. In the 1980s, a low-priced dial-up version of a subset of Dialog was marketed to individual users as Knowledge Index. This subset included INSPEC, MathSciNet, and over 200 other bibliographic and reference databases, as well as third-party retrieval vendors who would go to physical libraries to copy materials for a fee and send them to the service subscriber.
DIALOG was not cheap and was generally available only at some universities and at large companies in industries with R&D budgets. To access DIALOG, you had to dial in and log into your account, and you (or your employer) would be charged by the number of records returned from each of your searches. Researchers who could distill their complex research needs into Boolean form had a financial incentive to be as precise as possible.
When I went to library school, we were given free access to DIALOG, and I was taught how to construct searches that would return a reasonable number of results to download and review to see if they contained the answer to the research question we were assigned. The goal was to end up with a set of somewhere between 30 and 50 results.
It was a very procedural and mechanical approach to the act of researching and best suited for the literature of the sciences and engineering.
It was only after I became a professional librarian that I learned to supplement my teaching and research with other ways of finding written works. I would recommend that students use our woefully under-utilized Annual Reviews subscription, or seek out the first chapters of recent dissertations to find a literature review, when they came to the Desk with a research question that they didn’t realize was way too broad for their first university-level paper. I would show students how to use the works cited in their textbooks and their readings as a jumping-off point to find other related and reliable sources. I pointed out that while recency could be important, they should also pay attention to how many times a work had been cited by others. I wanted to ground their knowledge in people and not just in a complicated search string in a commercial product.
Using Boolean as a technique was appropriate at the time of library school because within DIALOG there were very specialized, domain-specific indexes that you would select, and while each of these contained a great many items, it was in no way comparable to the scale of the internet or even of our federated, cross-disciplinary academic discovery systems.
When I designed library websites, I would suggest to my colleagues that linking our websites to the advanced search screens of the databases we subscribed to was inappropriate according to much of the peer-reviewed, evidence-based user-experience research that existed. Even the vendors who sold us these research tools did not recommend defaulting to a search screen that required the user to think in terms of search sets connected by Boolean logic; their research suggested that most users were better served by refining broad searches with facets. And reader, I was harshly rebuked for suggesting that we leave the comfort of Boolean searching. I was not the only one.
This makes me wonder: What are librarians going to do now when we discover that our advanced searching techniques no longer work in Google?
NOT Hallucination but Tokenization
[(Boolean searching does not work with machine-learning search systems) OR (Boolean searching kind of works with LLMs)] AND [(I don’t believe many (people OR librarians) know the technical reason why)].
From the post Google Search is Dying comes this passage:
What are the wonders of tokenization? To find out, let’s watch Let’s build the GPT Tokenizer with Andrej Karpathy.
If you don’t want to watch this 2-hour-and-13-minute technical video, you are going to have to suffer through my so-reductive-it’s-probably-harmful distillation of the concept: tokenization turns words into numbers so that LLMs can calculate distances between words in a spatial array.
You could have each letter of the alphabet represented by an integer (e.g. a=1), but this is not very efficient. Tokenization makes things more manageable by associating an integer with a chunk of text called a token, so that, say, the whole string ‘alpha’ maps to a single integer.
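To make the idea concrete, here is a minimal sketch of a tokenizer that greedily matches chunks of text against a vocabulary. This is not the actual byte-pair-encoding algorithm GPT models use (that is learned from data), and the vocabulary and token ids below are invented for illustration (I have reused the post’s “Boolean = 7035” pairing for flavour):

```python
# A toy tokenizer: greedy longest-prefix matching against a tiny vocabulary.
# Real systems use byte-pair encoding (BPE) learned from a corpus; this
# sketch only illustrates the idea of mapping text chunks to integers.
# The vocabulary and ids below are invented for illustration.

TOY_VOCAB = {
    "Boolean": 7035,   # a whole word can be a single token
    " search": 412,    # tokens often include a leading space
    "ing": 88,         # ...or cover only part of a word
    " is": 9,
    " dead": 1201,
}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for chunk, token_id in TOY_VOCAB.items():
            if text.startswith(chunk, i) and (match is None or len(chunk) > len(match[0])):
                match = (chunk, token_id)
        if match is None:
            raise ValueError(f"no token covers text at position {i}: {text[i:]!r}")
        tokens.append(match[1])
        i += len(match[0])
    return tokens

print(tokenize("Boolean searching is dead"))  # [7035, 412, 88, 9, 1201]
```

Notice that the word “searching” disappears into two tokens, ` search` and `ing`, neither of which is the word itself. The model never sees your words, only these integers.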
From Karpathy, I learned of the website Tiktokenizer, which clearly illustrates how various LLM systems tokenize text. For example, while “Boolean” is a token represented by 7035, the integer 2586 maps to ” Please don”. Tokenization is why we can’t have nice things with LLMs.
Congratulations. You now know one of the main reasons why LLM systems tend to fail at spelling, ordering items in reverse, and other tasks that depend on seeing individual characters rather than tokens.
Boolean is NOT dead AND I’m okay with it
Can we still use precise language and Boolean searches in systems that are made accessible by a model generated by machine learning rather than by indexing? Can we provide systems that offer both “traditional searching” and “natural language searching”? I don’t have good answers to these questions, but I don’t think the answer is a simple no.
*Despite what I said earlier about how ‘Boolean won’t work in the future’, I don’t really believe it.
I don’t believe it yet because we still need alternative, understandable, and confirmable systems to compare our machine-learning systems against if we really want to understand and benchmark the capabilities and limits of LLMs in terms of accuracy and effectiveness.
And I don’t believe boolean (or as I prefer to call it, SQL) is dead because there is still a real need for replicable and reliable tools for systematic literature searches.
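Part of why Boolean retrieval is so replicable is that, at bottom, it is just set algebra over an inverted index: the same query against the same index always returns exactly the same set. A minimal sketch, with a made-up three-document corpus:

```python
# Boolean retrieval as set algebra over an inverted index.
# The tiny "corpus" below is invented for illustration.

documents = {
    1: "systematic review of machine learning in health sciences",
    2: "boolean logic for library catalogues",
    3: "machine learning replaces boolean search in library systems",
}

# Build the inverted index: term -> set of document ids containing it.
index: dict[str, set[int]] = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(documents)

# (boolean AND library) NOT catalogues
result = (index["boolean"] & index["library"]) - index["catalogues"]
print(sorted(result))  # [3]

# NOT on its own is just set difference from the whole collection.
not_boolean = all_docs - index["boolean"]
print(sorted(not_boolean))  # [1]
```

AND is set intersection, OR is union, NOT is difference. This determinism is exactly what systematic reviews depend on, and exactly what a probabilistic language model does not give you.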
On that note, did you know that…
DIALOG still exists?!?
Evidently, there is still a market for high-quality sources of valuable research that provide precise and replicable searching, untainted by the presence of Reddit threads in the search results, to support systematic review research, an essential methodology in the health sciences.
Which makes me think… DIALOG is going to make a comeback, isn’t it?
Further Reading
- Exploring ChatGPT for Next-generation Information Retrieval: Opportunities and Challenges, Web Intelligence, vol. Pre-press, no. Pre-press, pp. 1-14, 2024, https://doi.org/10.48550/arXiv.2402.11203
- Situating Search, CHIIR ’22: Proceedings of the 2022 Conference on Human Information Interaction and Retrieval, March 2022, Pages 221–232, https://doi.org/10.1145/3498366.3505816
2 Responses to “Boolean is Dead AND I feel fine”
[…] research team uses machine learning to do rapid systematic literature reviews (which means my post, Boolean is Dead AND I feel fine needs an […]
Interesting.
I think the future of academic search is RAG (retrieval augmented generation) systems. As you know, this uses your input to search for relevant sources (chunks of text) and these are used by the LLM to generate an answer.
Those are still mostly a blend of lexical search (usually BM25) and “natural language”/semantic search (typically a bi-encoder dense embedding). In theory these should be capable of Boolean searching, since there is an inverted index for at least the lexical part (though Boolean operators may not be implemented).
See also https://journal.code4lib.org/articles/17443 .
Some, like the Scite.ai assistant and the upcoming Primo CDI research assistant, even use a language model to convert user input (natural language) into a search string.
When I tried some oddball input strings, the Scite.ai assistant happily searched with those strings in the scite.ai index, bypassing the subword tokenization issue.
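The hybrid lexical + semantic blend the comment above describes can be sketched in a few lines. Everything here is a toy illustration, not any vendor’s implementation: the BM25 function is the standard formula over a three-document corpus I invented, and the “embedding” is a bag-of-words stand-in for a real bi-encoder model:

```python
# Hybrid ranking pattern used by RAG retrievers: blend a lexical score
# (BM25, simplified) with a "semantic" similarity score. The embedding
# function below is a toy stand-in for a real dense model.
import math
from collections import Counter

docs = [
    "boolean search over an inverted index",
    "neural embeddings for semantic retrieval",
    "hybrid retrieval blends boolean and semantic search",
]
tokenized = [d.split() for d in docs]
avgdl = sum(len(d) for d in tokenized) / len(tokenized)

def bm25(query: list[str], doc: list[str], k1: float = 1.5, b: float = 0.75) -> float:
    """Sum of per-term BM25 contributions for one document."""
    score, tf = 0.0, Counter(doc)
    for term in query:
        df = sum(term in d for d in tokenized)
        if df == 0:
            continue
        idf = math.log(1 + (len(tokenized) - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def toy_embedding(text: str) -> Counter:
    # Stand-in for a dense model: a bag-of-words "vector".
    return Counter(text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query: str, alpha: float = 0.5) -> list[int]:
    """Rank documents by a weighted blend of normalized lexical and semantic scores."""
    lex = [bm25(query.split(), d) for d in tokenized]
    sem = [cosine(toy_embedding(query), toy_embedding(doc)) for doc in docs]
    norm = lambda xs: [x / (max(xs) or 1.0) for x in xs]
    blended = [alpha * l + (1 - alpha) * s for l, s in zip(norm(lex), norm(sem))]
    return sorted(range(len(docs)), key=lambda i: -blended[i])

print(hybrid_rank("boolean semantic search"))  # [2, 0, 1]
```

The `alpha` weight is the design knob: at 1.0 you have pure lexical (Boolean-friendly, replicable) search, at 0.0 pure semantic search, and production systems sit somewhere in between.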