

Turns out that LLM summaries are actually useful.

Not for *summarizing* text -- they're horrible for that. They're weighted statistical models and by their very nature they'll drop the least common or most unusual bits of things. Y'know, the parts of a message that are actually important.

No, where they're great is as a writing check. If an LLM summary of your work is accurate, that's a sign what you wrote doesn't really have much interesting information in it, and maybe you should try harder.
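To make the check concrete, here's a rough sketch (the `summarize` helper is a hypothetical stand-in for whatever LLM call you use, not a real API):

```python
# Rough sketch of the writing check. `summarize` is a hypothetical
# placeholder for whatever LLM you have wired up; swap in your own call.

def summarize(text: str) -> str:
    """Hypothetical LLM summarization call -- not a real API."""
    raise NotImplementedError("wire this up to your LLM of choice")

def writing_check(draft: str, key_points: list[str]) -> list[str]:
    """Return the key points that did NOT survive summarization."""
    summary = summarize(draft).lower()
    return [p for p in key_points if p.lower() not in summary]

# If nothing goes missing -- i.e. the summary is fully accurate -- that's
# the warning sign: the draft may not contain much the model finds unusual.
```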


in reply to Dan Sugalski

@Dan Sugalski I disagree, they're great for summarizing. It's akin to high-powered skimming, so yeah, it's not good for the little unusual details, but if you're using an AI to summarize text and have any sense, you're not looking at text where you're worried about those little exceptions.

A prime example I've leaned on often: Amazon now provides an AI summary at the top of its reviews. Odd exceptions are just that with reviews, but it'll definitely highlight the general patterns and let you know if you should dig deeper.

Also, I've honestly used it as described above once or twice just to see if my point comes across... most conversations aren't really about the fine details or unusual bits?

in reply to Shiri Bailem

@shiri This is where they're deceptive and you're setting yourself up for trouble.

If the backing model has seen "this product is useful" 100x more often than "this product is not useful", then when summarizing text containing "this product is not useful" or its long-winded analog, you are *far* more likely to get "this product is useful" out of the summarization, which inverts the meaning. That is probably not what you want.
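A toy illustration of the failure mode I mean (deliberately oversimplified; real LLMs are far more elaborate, but the pull of a strong frequency prior works the same way):

```python
# Toy model only: output is chosen by word overlap *times* how often the
# model saw each phrase in training. The prior can swamp a negation.
from collections import Counter

# Pretend training counts: "useful" seen 100x more often than "not useful".
prior = Counter({"this product is useful": 100,
                 "this product is not useful": 1})

def toy_summarize(text: str) -> str:
    words = set(text.lower().split())
    def score(phrase: str) -> int:
        overlap = len(set(phrase.split()) & words)
        return overlap * prior[phrase]
    return max(prior, key=score)

print(toy_summarize("honestly, this product is not useful at all"))
# -> "this product is useful": the negation adds one word of overlap,
#    but the 100x prior drowns it out, inverting the meaning.
```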

in reply to Dan Sugalski

@Dan Sugalski that's not at all how it works, and maybe you should be checking your bias on some of this?

What it does is highlight the points being brought up, some examples:

A dehumidifier I recently purchased with good reviews; the summary basically boils down to nothing of note (it's a very small dehumidifier, and I wasn't worried about water capacity because it has a drain hose and I planned to put it on the counter next to the sink):

Customers are satisfied with the dehumidifier's performance, quiet operation, and small size. They find it effective at removing moisture from the air without disturbing sleep. Many appreciate its ease of use and emptying process. However, opinions differ on its value for money and water capacity.

Now a different dehumidifier with worse reviews, if I was considering this one I'd check in deeper about those complaints:

Customers appreciate the dehumidifier's ability to remove moisture from basements and windows. They find it quiet and effective in controlling humidity levels. However, some customers report durability issues with the product breaking down after a short period of use. There are also complaints about error codes and vibrations. Opinions differ on functionality, noise level, pump performance, recharging time, and energy efficiency.

Despite what you said before, these are not weighted statistical models; an LLM is drastically different from that, and it does show debatable levels of comprehension of text (debatable only because where the line of comprehension sits is an argument for philosophers).

I'm not advocating for the abuses of AI (ranging from the ecological impact of poor tuning and overuse of high-end models to efforts by companies to replace artists), but this technology really does have meaning and value when used properly. Text summarization of large, bulk, non-sensitive text is absolutely one of those cases.

in reply to Shiri Bailem

@shiri

these are not weighted statistical models


So what would you say is an LLM if not a weighted statistical model of language?

in reply to Roufnichan LeGauf 🌈

@Robert Wire 🌈 ... they are a Large Language Model... There's a reason they're called LLMs and not WSMs...

They are their own thing. "Weighted statistical model" is far too simplistic a description of how an LLM works; that phrase better fits an old '90s spam filter.
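For contrast, that '90s-style weighted model looks roughly like this (a bare-bones naive-Bayes-ish score; the word counts are made up purely for illustration):

```python
# Roughly what a '90s weighted statistical spam filter does: each word
# carries an independent weight derived from counts, and the verdict is
# just a sum of those weights. (Counts here are made up for illustration.)
import math

spam_counts = {"viagra": 50, "free": 30, "meeting": 1}
ham_counts  = {"viagra": 1,  "free": 10, "meeting": 40}

def spam_score(text: str) -> float:
    score = 0.0
    for word in text.lower().split():
        s = spam_counts.get(word, 1)
        h = ham_counts.get(word, 1)
        score += math.log(s / h)  # positive pushes toward "spam"
    return score

print(spam_score("free viagra"))   # clearly positive: flagged as spam
print(spam_score("team meeting"))  # negative: passes as ham
```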

LLMs are very complex systems that go far beyond my patience to explain. It might be good to just review the Wikipedia article: en.wikipedia.org/wiki/Large_la…

in reply to Shiri Bailem

@shiri I checked the Wikipedia article and it confirms what Dan said. Not sure what your point is.
in reply to Shiri Bailem

@shiri @barubary I don't think you've developed an intuitive understanding of the universal approximation theorem. en.wikipedia.org/wiki/Universa…

I'm happy to explain further, if you don't understand how a weighted statistical model could demonstrate some behaviours we associate with understanding, while remaining wholly incapable of others.
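For reference, one classical statement of the theorem (the single-hidden-layer Cybenko/Hornik form, paraphrased; other versions swap in different conditions on the activation):

```latex
% Universal approximation, classical form (Cybenko 1989 / Hornik 1991):
% for any continuous f on a compact K \subset \mathbb{R}^n, a suitable
% nonlinear activation \sigma, and any \varepsilon > 0, there exist
% N and weights v_i, b_i \in \mathbb{R}, w_i \in \mathbb{R}^n such that
\left|\, f(x) - \sum_{i=1}^{N} v_i \,\sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
\quad \text{for all } x \in K.
```

i.e. enough weighted sums and nonlinearities can mimic any continuous behaviour on a bounded domain, which is exactly why "weighted statistical model" and "shows understanding-like behaviour" are not contradictory.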

in reply to Dan Sugalski

@shiri

How about an actual study...

crikey.com.au/2024/09/03/ai-wo…

tl;dr: they're shit at summarising. But it's obvious anyway - to summarise requires the ability to accurately identify the important points, which requires understanding.

LLMs have no understanding; therefore they cannot summarise.

And if you need proof they can't reason:

garymarcus.substack.com/p/llms…

in reply to adaddinsane (Steve Turnbull)

@adaddinsane (Steve Turnbull) @Dan Sugalski yeah... fantastic evidence showing bias and lack of reasoning in your response:

  • Government review? It wasn't asked to do general summaries; it was asked to reference page numbers, citations, etc., all of which are precision tasks, especially precision numerical tasks. That doesn't mean it's bad at summarizing, it means it's bad at the task given.
  • The "no formal reasoning" argument being cited has been going around for a while now and has nothing to do with reasoning: AI is bad at math and handles data like a quick skim, and a quick skim of a math problem with gotchas in it is going to catch just about everyone.

If you're going to argue they're bad at summarizing, then you need to actually show summarizing: not compiling paperwork, not citations, not math.

God, I feel like half this anti-AI drivel itself could be AI generated.

There are real problems with AI, but people like y'all smoke screen the shit out of them to the point where I sometimes wonder if corporations are paying for anti-AI sock puppets to just hide the real issues.

in reply to Shiri Bailem

@shiri @adaddinsane
I mean, sure. If you only count one extremely specific and restricted meaning of "summarizing" as "real summarizing", LLMs might be adequate for it. But this is far from the consensus meaning.

In this sense, your argument resembles the "no true Scotsman" fallacy (unless you _actually_ want everyone to lower their expectations of what an adequate summary should be, which I do not want to assume).

in reply to ben_tinc

@ben_tinc @adaddinsane (Steve Turnbull) @Dan Sugalski from Merriam-Webster: Summarize means to make a summary.

Summary:
(adjective) Covering the main points succinctly
(noun) an abstract, abridgment, or compendium especially of a preceding discourse

Summaries are not citations and definitely aren't math; they're basically just text paraphrased down.

Digging a little further into the government study being cited, here's a great bit of text from the study itself:

It is important to note that the results should not be extrapolated more widely as:
• the timeframe allowed to optimise the model was limited to one week as this was a short duration PoC; and
• the PoC's point-in-time results relate to the use of certain prompts using a specific LLM and selected for one specific use-case. This limits the generalisability of the findings to other use cases and LLMs.

In other words, the authors themselves said this isn't an indication that it's bad at summarizing, just that this solution doesn't look promising for their specific use case.

I get where you're feeling the no-true-Scotsman vibe, but it comes from my pointing out that requirements layered on top of summarizing aren't evidence that it's bad at summarizing. One was a government office testing a specific special use case rather than general summarization ability, and the other was entirely about its ability to work math problems.

in reply to Shiri Bailem

@shiri @adaddinsane
I think it is fair to say that LLMs might be adequate for certain tasks which go by the label summary, as your own positive experiences show very clearly.

And it is certainly proper of the study authors not to promote over-generalization of their results. However, they also say that while the model performed _especially_ badly wrt identifying references to ASIC, humans scored better across all metrics.

in reply to ben_tinc

@shiri @adaddinsane
[...] and this includes what I would call the absolute minimum: highlighting the central point of the summarized text in the 'general summary' portion of the test (section 5.1 in the full report).
in reply to Ben Aveling

@Ben Aveling @adaddinsane (Steve Turnbull) @Dan Sugalski already did: foggyminds.com/display/c6ef095…



in reply to Dan Sugalski

@shiri It would appear your anti-AI bias has tainted your ability to judge them fairly.

Amazon's product AI, environmental and other issues aside, provides valid high-level summarization of product reviews. At present, the model has not been tainted or tampered with to provide positive feedback; rather, it provides an accurate representation of the typical review feedback. It tends to cover likes, dislikes, concerns, and overall views.

in reply to ClickyMcTicker

@ClickyMcTicker @shiri I'm sorry, but how exactly would you back up those claims you're making? "Amazon said so" isn't very compelling.
in reply to Ted Mielczarek

@Ted Mielczarek @ClickyMcTicker @Dan Sugalski where did I say "Amazon said so"? I just cited Amazon's use case as a great example from personal experience.

I'm not digging around constantly for some article somewhere that says "AI has been tested to get the general gist right in some arbitrary scientific study"; I'm doubtful such a thing exists, given how fuzzy it is. But I can sure as hell poke holes in the bullshit y'all are posting.

If y'all are ever interested in the real problems of AI, hit me up; but until then, maybe enjoy screaming about how cars are killing the horse-drawn carriage business and are worse than horses at steering themselves, while ignoring all the other issues around them.

in reply to Shiri Bailem

@shiri No worries. It's possible to disagree on topics like this and still attempt to communicate effectively!
in reply to Shiri Bailem

@shiri
>most conversations aren't really about the fine details or unusual bits

I suspect this captures the crux of what the root post was pointing out when C/VP levels were mentioned: a criticism of conversations [between higher level employees and lower level] that are devoid of helpful/new information.

in reply to Shiri Bailem

@shiri Amazon is a case study of why AI summaries are useless. On books, just far too generic (readers say the pacing was good; readers say the characterization was good). Drops all the bits I'm most interested in (plot, themes, why the pacing was good). On other items, it just lifts phrases out of reviews with no idea of the context (on the negative side, some users report this item boils water very fast; this on a camping stove).
in reply to Clare Hooley

@Clare Hooley @Dan Sugalski so your argument that it's useless as a whole is that it's bad in areas where summarizations aren't so useful in general?

Books, where it's a general summary of everyone's comments about the book... in what world would that be useful?

And on a camping stove, you're citing the fact that people often said redundant things about it, and that those things showed up in the summary, as evidence that it's bad? You think a good summarization would edit out things it deems unnecessary, and you think there's any version in which that would be an improvement?

in reply to Shiri Bailem

a good summary would provide me information to allow me to assess what readers/users really thought. These do not (in the case of the stove, yes, it's obviously made a mistake with the 'on the negative side', but it's not always obvious to pick out things that are completely wrong but statistically likely).
in reply to Clare Hooley

@Clare Hooley @Dan Sugalski The problem is that that *is* what users really thought (aside from, as you said, it being listed as a negative, which is an honest goof); it's saying that because a lot of people apparently thought it was good information to include in the review.

The summary isn't trying to tell you whether or not it's a good product, and it's definitely not trying to extrapolate underlying intents; it's just trying to summarize what's talked about in the various reviews so you can have an idea whether there's something you might want to look into before you buy.

I gave a prime example earlier quoting two summaries from dehumidifiers, the first is one I bought myself and the second was me digging for one that was less than 4 stars.

in reply to Shiri Bailem

@shiri 🙂 The only positive thing I can say for the summaries is that at least it has picked up that 'we thank blah for an ARC in exchange for an honest review' isn't useful.
in reply to Dan Sugalski

this aligns with my theory about LLM investment and potential progress toward generalized intelligence, which is that further development of generative AI will just create models that are better and better at being thoroughly and completely mediocre.
in reply to Dan Sugalski

I could not agree more. I am just experimenting with converting a blog post to a thread using LLMs.

Check the thread: hachyderm.io/@mapache/11335991…

And the blog post: maho.dev/2024/09/my-coffee-his…

It got rid of the parts I like most. A shame.


My relationship with coffee probably began in my mother's womb. She may deny it, but I'm pretty sure she drank more coffee than advised or allowed in the 1980s.

#coffee #story


in reply to Shiri Bailem

it actually did a good job of summarizing and got the points; it did not miss the big parts. But as @shiri mentions, the points that make the post special (nuanced, human) got "washed".

For example, the blog post talks about coffee as a character: "Coffee in my childhood was always a supporting character in the main story."

Also, some of my fav phrases got removed (e.g. "as we say in Latin America, to be young and not revolutionary is almost a biological contradiction.")

in reply to tqwhite

@tqwhite @Walrus

Is the misprint that makes one of your key instructions meaningless also part of your standard prompt?

(I have no real idea, but the layout suggests that it is.)

in reply to flere-imsaho

@mawhrin @glc @Walrus that's weird since I actually use it all the time and it definitely does what I want it to do. Not theory, frequently observed fact.