I worked on LLMs


Generative AI sure is the talk of the town these days. Controversial, is it not? It’s with some trepidation that I disclose that I spent the last six months becoming an expert in large language models (LLMs). It began earlier this year, when I moseyed through the foundational LLM paper.

I’d like to start talking about this more, because I’ve been frustrated with the public conversation. Among both anti-AI folks and AI enthusiasts, people have weird, impossible expectations of LLMs, while being ignorant of the capabilities they do have. I’d like to provide a reality check, so that readers can be more informed as they argue about it.

Scope of expertise

First of all, many people today use “generative AI” to denote a category of models that generate text or images. I don’t like this category, because text generators and image generators have very little in common. They use completely different techniques, operate at different scales, serve different use cases, and have different social consequences. My expertise covers LLMs only, and does not include image models.

Second, I’m not an AI scholar. I did not build LLMs, and did not study AI ethics. What my expertise amounts to is that I took courses and workshops, and applied that knowledge toward practical business use cases. I spent six months on it, so let’s not exaggerate my expertise; I just know more than the average person.

Third, you know me, I’m opinionated. I am not here to offer a purely informational account. But I will take pains to distinguish between my personal opinion and generally accepted wisdom.

I am not at liberty to say what I did with LLMs.  But I mostly do other things, and my career does not rely on LLMs. I don’t have any financial interest in persuading you one way or another about LLMs.

The big questions

The central questions about LLMs are essentially: What are they good for? Are they good for society? Are they overhyped?

Right now, LLMs are definitely a hot topic among corporations and investors. Investors are asking, what are you doing to leverage new technology? Corporations are throwing everything at the wall so they can tell investors that they’re doing something AI. And nobody knows which things will stick. If you could truly predict which things LLMs will be good at and which things they will be bad at, then you could go into venture capital and make a lot of money.

In my personal (baseless) opinion, LLMs are overhyped. What I mean by that is that I believe investors will, on the whole, be disappointed. They’re investing a lot of resources into developing LLM-powered products, and I predict the return on investment will fall short of expectations.

But that’s very different from saying LLMs are totally useless. There’s a world of space between “LLMs will not provide sufficient return on investment” and “LLMs are completely useless.” LLMs will definitely find some use cases, I guarantee it. They already have.

But LLM use cases won’t necessarily be obvious when they arrive. Many use cases will occur within businesses or between businesses; they won’t be customer-facing. And when an LLM is used in a customer-facing product, you may not be able to recognize it unless you know what to look for.

You already use LLMs

I was a bit surprised to learn this myself, but you already use LLMs in your daily life. As of 2020, Google Translate uses an LLM. I’m defining an “LLM” as a natural language processing model that uses the transformer architecture. While Google doesn’t explicitly use the term “LLM”, the description absolutely fits. Perhaps this is not surprising, if you know that the foundational paper on LLMs was published by Google researchers, with the express purpose of doing language translation.

I was even more surprised to learn that as of 2019, Google Search also uses an LLM. Google uses BERT to understand user queries. (This is not to be confused with a more recent Google feature, which uses an LLM to summarize results.) Google explains what this means:

Let’s look at another query: “do estheticians stand a lot at work.” Previously, our systems were taking an approach of matching keywords, matching the term “stand-alone” in the result with the word “stand” in the query. But that isn’t the right use of the word “stand” in context. Our BERT models, on the other hand, understand that “stand” is related to the concept of the physical demands of a job, and displays a more useful response.

You might look at this and say, that’s not a real LLM, it’s not a chatbot, and it’s not generating text. But that’s part of the point I’m trying to make. You need to revise your assumptions about what an LLM is. BERT is one of the canonical examples of an LLM, and it generates vectors rather than text. It’s arguable whether BERT counts as “generative” AI, but it ultimately uses the same technology.
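
To make “generates vectors rather than text” concrete, here’s a minimal sketch using the openly available bert-base-uncased model from the Hugging Face transformers library. This is just an illustration of what a BERT-style model does; it is not Google’s actual search pipeline.

# Illustration: a BERT-style model maps text to vectors, not to generated text.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

query = "do estheticians stand a lot at work"
inputs = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; average them into a single query embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) -- a vector, not a sentence

A search engine can then compare query vectors against document or passage vectors. I can’t tell you exactly how Google wires BERT into Search, but this is the general kind of signal a model like BERT provides.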

Now, it’s very much in character for tech startups to “move fast and break things”, to pursue the hot new technology and fall flat on their faces. But I really must impress upon you, Google is hardly a tech startup. Google is extremely cautious about its big money-making product, Google Search. Google would not integrate LLMs into its search engine unless it had collected enough quantitative evidence to convince a dozen committees. So LLMs may or may not be overhyped, but they’re not useless.

What’s next?

So, how’s that for an opening statement about LLMs?  My goal is not to be the local AI apologist, nor AI skeptic, nor to present a “fair and balanced” perspective between the two extremes.  I want people to have a better understanding of what they’re talking about, and that still leaves plenty of room for reasonable disagreement.

I’m thinking of a few different topics I might write about in the future. Is there anything that readers are particularly interested to hear about?

Comments

  1. says

    only read partway thru your article and im gonna guess your job was to wet blanket some ludicrous business ideas for AI and then be promptly ignored by the corpos who hired you. no need to confirm or deny…

  2. says

    given that u previously mentioned working in the financial sector, which to my pov is an out of control fuckpile dedicated to engineering conjobs so complicated they take super-geniuses to understand then having them enacted by a billion coked up nepo baby college football alumni, i hafta be a bit thankful the AI ain’t gonna be as useful to them as they hope…

  3. says

    i wonder if google translate’s latin is as broken and nonsensical as last time i tried it, a few years ago. was i catching the tail end of the old school or the buggy bleeding edge of AI at that moment?

  4. says

finished the article. the thing i’m most curious about at the moment is the environmental impact of LLMs? I can easily think of a reason press on it may be wildly inaccurate, that is, how does it compare to the energy use of any other human activity online? netflix and other video streamers may well vastly exceed the environmental impact of chatGPT, in which case they would be a better target for environmental regulation.

    not that anyone in tech should get a pass, just that it’s easy for AI to be a whipping boy for everybody’s frustrations and fears, which would make the focus on its environmental impact into yellow journalism, if it really is not significantly greater than playing world of warcraft.

    my thought on reducing the environmental impact is to shift end use to specialized hardware, like video game platforms, but instead it’s an AI box. that way you don’t have everybody essentially streaming the use of LLMs, but can run it more feasibly than you could on a stock commercial PC. but idk, i know very little about any of this shit, at the moment.

  5. says

    @GAS,
    I still work for the same employer, so I won’t say anything about them.

    Google Translate probably still isn’t very good at translating stuff, but I’m sure it’s a measurable improvement over what they had before. Something you really need to think about in any LLM application is: What is the error rate? What is the error tolerance? Google Translate doesn’t need to achieve 0% error rate, it just needs to beat out the error rate of the competition (old Google Translate).

I don’t know much about environmental impact, I’d have to look into that for you. Something I do know is pricing. Through AWS, Claude Opus (a very large model!) currently costs 7 cents to generate 1000 tokens (roughly 1000 words). This is comparable to renting 8 GB RAM for one hour, or transcribing 3 minutes of audio. I don’t think that’s an exorbitant amount. Obviously if people are doing a lot of it, the total costs could be large.
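
For a rough sense of scale, here’s a back-of-envelope calculation using that 7 cent figure. The usage numbers are invented purely for illustration.

price_per_1k_tokens = 0.07   # USD, the AWS Claude Opus figure quoted above
tokens_per_request = 1_000   # roughly a one-page response
requests_per_day = 50_000    # invented traffic for a mid-sized product

daily_cost = requests_per_day * (tokens_per_request / 1_000) * price_per_1k_tokens
print(f"${daily_cost:,.0f}/day, ${daily_cost * 365:,.0f}/year")
# -> $3,500/day, $1,277,500/year: cheap per request, large in aggregate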

  6. invivoMark says

    A question that’s been bothering me for a while is, how big of a leap is it to give a generative LLM a new “ability?” For instance, ChatGPT infamously will make up fake references. If OpenAI wanted to address this specific issue, can they just add some (relatively) small amount of code around the use of references, and give it the ability to cite accurate references where certain information can be found? Or is it a huge task that would take another decade of ChatGPT development? Or is it just not even in the realm of what ChatGPT is, and an entirely different LLM would need to be built from the ground up?

    I’m also curious, since Google search has been using LLM code, is that partly why many people seem to report that Google search is so much more useless than it used to be in the early 2000s?

    I’m also a little bit skeptical about Google Translate’s use of a more sophisticated LLM being necessarily better. If a simple translator gives me a sentence that doesn’t make sense, I can assume it used an incorrect homonym or something and I can pretty reliably figure out what went wrong. But if it’s programmed to make translations that look right, even when they’re not, that makes it harder to check!

  7. says

    @invivoMark

    A question that’s been bothering me for a while is, how big of a leap is it to give a generative LLM a new “ability?”

    It’s within reach of individual data scientists.

Building a new LLM is very complex. For instance, the large Llama 3 model has 70B parameters. At roughly 16 bytes per parameter, that would take about 16 × 70 GB, or 1.1 terabytes, of RAM to train. Not to mention gathering and filtering all the training data. But if you already have a trained model, you can fine-tune it with a much smaller data set, and a much smaller number of parameters. It’s not an easy process, but there are tools to help you do it. I think a lot of chatbots already have a layer of fine-tuning to make them good at basic problem solving.
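
To give a flavor of what those tools look like, here’s a minimal sketch of parameter-efficient (LoRA-style) fine-tuning using the Hugging Face transformers and peft libraries. It uses the smaller 8B Llama 3 variant and is purely illustrative; it’s not the specific process I used at work.

# Sketch of parameter-efficient fine-tuning (LoRA) on a pretrained model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # illustrative; requires access to the weights
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Instead of updating all 8B weights, train small low-rank adapter matrices
# attached to the attention projections.
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the total

From there you train on your (much smaller) domain-specific data set with an ordinary training loop or the Trainer API.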

    For instance, ChatGPT infamously will make up fake references. If OpenAI wanted to address this specific issue, can they just add some (relatively) small amount of code around the use of references, and give it the ability to cite accurate references where certain information can be found?

That’s known as the hallucination problem. It’s really two problems. The first problem is that LLMs tend to make stuff up when they don’t know things. The second problem is that they don’t know things. I’m not sure about the first problem, but the way to solve the second problem is to actually hook the model up to some data, e.g. give it access to a database, and under the hood ask it to generate a SQL query. You can’t just put references in its training data and expect it to remember any of that; that’s not what LLMs are for, that’s what databases are for.
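
As a rough sketch, “hook the model up to some data” can look something like the following. The ask_llm helper is a hypothetical stand-in for whatever chat API you’re calling, and the papers table is made up.

# Rough sketch of grounding an LLM answer in a real database.
import sqlite3

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder: call your LLM provider of choice here.
    raise NotImplementedError

def answer_with_database(question: str, db_path: str) -> str:
    schema = "papers(title TEXT, authors TEXT, year INTEGER, doi TEXT)"
    # Step 1: have the model translate the question into SQL.
    sql = ask_llm(f"Schema: {schema}\nWrite one SQLite query answering: {question}\n"
                  "Return only the SQL.")
    # Step 2: run the query; the facts come from the database, not the model.
    rows = sqlite3.connect(db_path).execute(sql).fetchall()
    # Step 3: have the model phrase an answer constrained to the retrieved rows.
    return ask_llm(f"Question: {question}\nQuery results: {rows}\n"
                   "Answer using only these results.")

Any references in the answer then come from the database lookup, so they can at least be checked against something real.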

    I’m also curious, since Google search has been using LLM code, is that partly why many people seem to report that Google search is so much more useless than it used to be in the early 2000s?

    I really don’t think so. I don’t entirely understand what people complain about with Google, but I think the complaints are about search results, which is entirely different from query interpretation. Query interpretation works so well that hardly anyone even thinks about it. The way it used to be, people would type questions into search bars, and we had to teach people not to do that because that’s not how search engines work. Except, now that is how Google works, and typing in a question basically works fine.

    I’m also a little bit skeptical about Google Translate’s use of a more sophisticated LLM being necessarily better. If a simple translator gives me a sentence that doesn’t make sense, I can assume it used an incorrect homonym or something and I can pretty reliably figure out what went wrong. But if it’s programmed to make translations that look right, even when they’re not, that makes it harder to check!

I think it’s fair to speculate that LLM translation errors are less transparent. I wouldn’t know the details of how Google decided to make the change; I’m just sure there was a lot of quantitative testing & committees, because that’s Google’s MO.

  8. invivoMark says

    The way it used to be, people would type questions into search bars, and we had to teach people not to do that because that’s not how search engines work.

    It’s supposedly how AskJeeves worked. Then they lost the search engine battle to Google, which didn’t work that way. Maybe there’s a lesson to be learned there, but I don’t feel like speculating at this hour of night.

  9. says

    @j-r,
    I don’t understand what you mean. Proving theorems seems unrelated to the problem of generating accurate references.

  10. another stewart says

I now use Google Search as a backup when DuckDuckGo doesn’t give me what I want. (I tend to search on relatively obscure topics, and I think that Google Search wins out because of having a bigger dataset.) What now tends to happen is that if I give it a couple of search terms, one common and one obscure, I get results where the obscure term isn’t present, which isn’t what I want – the obscure term was included to narrow the field. I can use quotes to require a term, or use the AND operator, but the default behaviour is less than ideal. Since you tell me that Google is now using an LLM to interpret queries, perhaps I should stop using lists of keywords, and rephrase my queries as questions.

    Sometimes I just go straight to Google Scholar.

Nowadays search engines try to correct misspellings in the query. This is a two-edged sword. I don’t always type in the words I wanted correctly. On the other hand they “correct” words that I have typed in correctly. My irritation at the latter exceeds my gratitude for the former.

    I wonder to what degree the lower performance of Google Search is due to choices made by Google, and to what degree it is due to the proliferation of low quality content on the internet.

    More broadly, “AI” performs best in narrow fields (go, chess, and other games, protein folding, …). Automated translation between some language pairs seems to be working well enough, and perhaps the audience for that application isn’t going to be blindly trusting of the result. (I use it on German botanical books and papers; I don’t need it to the same degree for French, Spanish and Latin.) But generative text seems to be a step too far; many applications require accuracy and the likes of ChatGPT can’t be relied on to produce it.

  11. says

    @another stewart,
Have you heard of Google Web Search? It strips away many of the “features” that Google has added over the years.

    Google has definitely made a lot of decisions that one might disagree with, although I don’t know how important that is compared to the changes in internet content. I’ve heard lots of people complain about Google Search, saying it’s worse than it used to be, but it’s a subjective problem, and one that I do not myself perceive.

    A lot of what I’m saying about Google is based on talking to someone who works there. One thing I’ve heard is they consider TikTok to be a big competitor, because people find stuff like restaurants… on TikTok? That’s wild to me. That doesn’t sound to me like people are leaving Google for want of quality search results.

  12. another stewart says

    I’ve tried https://udm14.org/ which would appear to be the same thing. Now you mention Google Web Search, I see that it’s available directly (hidden under more). (Google Scholar isn’t even mentioned under more anymore.)

    My initial impression (based on very little data) is that it gets rid of the clutter, but doesn’t improve the underlying search performance.

    Google probably worries more about the mass market (the type of people who primarily use search to do things like find restaurants) than “power users”. It’s among the latter that one sees reports of people leaving Google. Subjectively, I feel that it’s harder to get useful results from a Google Search than it used to be. (A recent feature that’s been lost is a statement of the number of results; while I didn’t believe the numbers it was useful to get an idea of which of alternate terms/spellings predominate – it got used as input to article title disputes in Wikipedia. Perhaps Google Ngrams would serve as an alternative, but I suspect that has a more restrictive – and biased – corpus.)

    I’ve tried Bing Copilot. It has the hallucination/poisoned data problem (if I ask it something I know the answer to it’s often obviously wrong), and getting a useful answer tends to be like getting blood out of a stone. One thing that has been noticed about LLMs is that if you ask for a description of a plant species they give you information that applies to other species of the genus.

  13. j-r says

    Sorry, I got delayed.

The idea is to have the LLM not only guess a possibly correct answer, but also produce enough additional context information to enable some level of independent validation of the answer.

Let’s assume we have a theorem prover with a sufficiently rich formal model of the problem domain that provides a prover function

boolean P(proposition, proof_hints, timeout)

that returns true iff the proposition can be proven within the timeout. Note that generally verifying a proof has much lower computational complexity than coming up with one, so we let the LLM do the complex part of the work. E.g. a self-driving system’s main loop would then look like the following pseudocode:

while (true) {
    sensorData = acquireSensorData()
    // The LLM guesses an interpretation of the data plus a proof sketch for it.
    sensorState, reasoning = getGuessFromLLM(sensorData)
    // The prover independently checks the guess within a 0.1 s timeout.
    if (P("${sensorData} indicates ${sensorState}", reasoning, 0.1s))
        handleState(sensorState)
    else
        emergencyBrake()
}

A lot of work obviously is hidden behind the “Data indicates State” proposition, but IMO this method has advantages: regulatory bodies have something concrete to check during certification, namely the logic of the domain model (e.g. whether it includes Asimov’s three laws of robotics :-). Also we get much more insight into what the system ‘thinks’ (much like the rationalization human brains do).

    I think this will work well for use cases like the above where formalization of the problem is fully under the system creator’s control.

For natural language applications the advantage is weaker, because the LLM also has to guess the formalization of the prompt, e.g. like this for a programming companion (where S(program) is the semantic function of the programming language):

prompt = getPrompt()
program, formalization, reasoning = getGuessFromLLM(prompt)

if (P("S(${program}) implies ${formalization}", reasoning, 10min))
    print("I think you want a program with semantics ${formalization} which is fulfilled by ${program}")
else
    print("I think you want a program with semantics ${formalization}, but I couldn't come up with one in 10 minutes")

    This could still be useful even if no suitable program can be found if the LLM was good at guessing reasonable formalizations of the problem stated in the prompt, because getting a good formalization can be an important step for finding a solution.

  14. j-r says

Messed up the non-existent markup: the prover signature is meant to look like P(proposition, proof_hints, timeout)

  15. says

    @j-r,
I don’t think I’m sufficiently well-versed in AI research to really engage with what you’re saying. But proofs start with a certain set of base facts, and reason to conclusions. The ability to reason to a conclusion absolutely does not help if you don’t have base facts to begin with. If we have the name of a scholarly paper, is it a real paper, or is it not? There is no way to reason your way to a conclusion; you have to look it up.

  16. j-r says

Yes, certainly. That is the problem when ‘understanding’ natural language input is part of the task. Someone or something will have to verify that the meaning of the natural language content has been adequately assimilated. Splitting this task in two, by asking the AI not just to generate a natural language result but also to produce a formalization of its inputs, could reduce the verification effort required of the user, because some of that work could be supported by tools that operate deterministically and efficiently on the formalized output.

I’m not yet convinced that LLMs are or will be capable of generating such output, but if they could with some reasonable success rate then IMO that would imply that – for some suitable meaning of the word – they do ‘understand’. And – also IMO – only then will LLMs actually be useful for anything besides doing horoscopes.
