
  • LLM image processing doesn’t work the same way reverse image lookup does.

    Tl;dr explanation: Multimodal LLMs turn pictures into a few hundred (200-500 or so) tokens, but reverse image lookups create perceptual hashes of images and look the hash of your uploaded image up in a database.

    Much longer explanation:

    Multimodal LLMs (technically, LMMs - large multimodal models) use vision transformers to turn images into tokens. They use tokens for text, too, but image tokens don't correspond to words. There are multiple ways this can be implemented, but a common approach is to break the image down into a grid, then transform each "patch" of a fixed size, e.g., 16x16 pixels, into a single token. The patches aren't transformed individually - the whole image is processed together, in context - but the model still comes out of it with basically 200 or so tokens that allow it to respond to the image the same way it would respond to text.
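    To make the token math concrete, here's a back-of-the-envelope sketch (the 224x224 input and 16x16 patch sizes are common ViT defaults, not any particular model's):

    ```python
    # Illustrative only: how a ViT-style tokenizer carves an image into patch tokens.
    image_size = 224   # pixels per side after preprocessing
    patch_size = 16    # each 16x16 patch becomes one token

    patches_per_side = image_size // patch_size  # 14
    num_tokens = patches_per_side ** 2           # 196 - the "basically 200 or so" above

    print(num_tokens)  # 196
    ```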

    Current vision transformers also struggle with spatial reasoning. They embed basic positional data into the tokens, but it's fragile and unsophisticated. Fortunately there's a lot to explore in that area, so I'm sure there will continue to be improvements.
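    To give a rough picture of what "basic positional data" means here, below is a toy 2D sinusoidal scheme - purely illustrative, since real models typically learn their position embeddings, and the dimensions are tiny for readability:

    ```python
    import numpy as np

    grid = 14  # the 14x14 patch grid from the sketch above
    dim = 8    # embedding width (real models use hundreds of dimensions)

    def pos_embedding(row, col):
        # Half the dims encode the row, half the column, using sinusoids of
        # increasing frequency - a crude 2D take on the original transformer scheme.
        freqs = 1.0 / (100.0 ** (np.arange(dim // 4) / (dim // 4)))
        return np.concatenate([
            np.sin(row * freqs), np.cos(row * freqs),
            np.sin(col * freqs), np.cos(col * freqs),
        ])

    # One embedding per patch; these get added to the patch tokens before attention.
    pos = np.stack([pos_embedding(r, c) for r in range(grid) for c in range(grid)])
    print(pos.shape)  # (196, 8)
    ```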

    One example improvement, beyond better spatial embeddings, would be a dynamic vision transformer that depends on the context, or one that can re-evaluate an image based on new information. Outside of vision transformers, simply training LMMs to use other tools on images when appropriate could address many of the current shortcomings of LMM image processing.

    Given all that, asking an LLM to find the album for you - assuming you've given it the ability and permission to search the web - is like showing the image to someone with no context and asking them to identify a music video they've never seen, by an artist whose appearance they can only describe in 10-20 generic words (none of which are the artist's name), and hoping those words happen to include the specific details that would put it in the top ten Google results. That's a convoluted way to say it's a hard task.

    By contrast, reverse image lookup uses a perceptual hash generated for each image. It's the tool that should be used for your particular problem, because it's well suited to it. LLMs were the hammer and this problem was a Torx screw.
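    As a sketch of the idea - this is aHash, one of the simplest perceptual hashes; real reverse-image services use more robust hashes plus large nearest-neighbor indexes:

    ```python
    from PIL import Image

    def average_hash(path, size=8):
        # Shrink to an 8x8 grayscale thumbnail, then record whether each
        # pixel is above or below the mean brightness: a 64-bit fingerprint.
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        avg = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p > avg else 0)
        return bits

    def hamming(a, b):
        # Similar images produce hashes that differ in only a few bits.
        return bin(a ^ b).count("1")

    # Lookup is then "find stored hashes within a few bits of the query's hash" -
    # a database problem, not a vision-model problem.
    ```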

    Suggesting a reverse image lookup tool - or better, using one itself - is what the LLM should do in this instance. But it would need to have been trained to suggest this, to be capable of using a tool that can do the lookup, and to have both access and permission to do so.
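    In tool-calling terms, that means registering something like this - a hypothetical schema in the JSON-schema style most tool-calling APIs use; the tool name and backing service are made up for illustration:

    ```python
    # The model can only delegate to a lookup service if something like this
    # is wired up and it was trained to reach for it when appropriate.
    reverse_image_lookup_tool = {
        "type": "function",
        "function": {
            "name": "reverse_image_lookup",  # hypothetical tool
            "description": "Find pages containing an image via perceptual-hash search.",
            "parameters": {
                "type": "object",
                "properties": {
                    "image_url": {
                        "type": "string",
                        "description": "URL of the image to look up",
                    },
                },
                "required": ["image_url"],
            },
        },
    }
    ```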

    Here's a paper that might help explain the gaps between LMMs and tools built for these specific purposes: https://arxiv.org/html/2305.07895v7


  • From the blog post referenced:

    We do not provide evidence that:

    AI systems do not currently speed up many or most software developers

    Seems the article should be titled “16 AI coders think they’re 20% faster — but they’re actually 19% slower” - though I guess making us think it was intended to be a statistically significant finding was the point.

    That all said, this was genuinely interesting and is in line with my understanding of the human psychology at play. It would be nice to see this at a wider scale, broken down across different methodologies/toolsets and models.


  • Is your goal to create things that can be published or used in a project, or to create audiobooks for yourself to listen to?

    For voiceovers for text, I use Kokoro-FastAPI, which has a web frontend. The frontend is only compatible with Chromium browsers on desktop or Android, which sucks since my daily drivers are Firefox and an iPhone (there are workarounds in the thread), but it supports voice mixing, speed changes, etc. It also has an issue where it keeps the models (about 3GB) in memory; I keep the CPU version loaded normally and swap to the GPU version if I need it to be faster. If you want something similar for Bark, check out Bark-GUI.
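    If you'd rather script it than use the frontend, the server also exposes an OpenAI-compatible speech endpoint. A minimal sketch - the port, voice names, and mixing syntax are from my setup and may differ on yours:

    ```python
    import requests

    resp = requests.post(
        "http://localhost:8880/v1/audio/speech",  # default port on my install
        json={
            "model": "kokoro",
            "input": "Hello from a self-hosted TTS server.",
            "voice": "af_bella+af_sky",  # "+" mixes voices on my setup
            "speed": 1.1,                # playback speed multiplier
            "response_format": "mp3",
        },
    )
    resp.raise_for_status()
    with open("out.mp3", "wb") as f:
        f.write(resp.content)
    ```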

    I've also dabbled a bit with some TTS projects that have Comfy nodes, though at this point mostly just in terms of getting them set up. For my purposes so far, Kokoro has been fine (and I prefer the FastAPI project over the Comfy nodes for most of my uses), but I've found nodes for Kokoro, Dia, F5-TTS, Orpheus, and Zonos.

    Autiobooks and audiblez both look promising. A few weeks ago, I used the Kokoro-FastAPI web frontend to create an audiobook for an ebook I worked on, which used entirely self-hosted AI generation for the outlining and prose. Audiblez, which I found out about two days later, looks like it would have simplified that process substantially. Still, I'd personally like something more like an audiobook studio, where I can more easily swap voices back and forth, add emotions, play with speed at a more granular level, etc. I'm thinking about building something like that myself at some point, but it'll be a minute - hopefully someone else will beat me to it.

    I posted a comment here a few weeks back on a similar topic. I’ve since used OpenReader-WebUI and like it, though that’s not for producing audiobooks, but for a read-along experience. Reproducing the comment below in case it’s helpful for you:

    If you want to generate audiobooks using your own / a hosted TTS server, check out one of these options:

    • OpenReader-WebUI - this has built-in read-along capability and can be deployed as a PWA, which lets you download the audiobooks to your phone and use them offline
    • p0n1/epub_to_audiobook
    • ebook2audiobook

    If you don't have a decent GPU, Kokoro is a great option, as it's fast enough to run on CPU and still sounds very good. If you're going to use Kokoro, Audiblez (posted by another commenter) looks like it makes it more of an all-in-one option.

    If you want something you can use without building the audiobook up front, of the above options only OpenReader-WebUI supports that. RealtimeTTS is a library that handles it, but I don't know if any apps out there integrate it yet.

    If you have the audiobook generation handled and just want to be able to follow along with text / switch between text and audio, check out https://storyteller-platform.gitlab.io/storyteller/
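    For what it's worth, here's roughly the loop that tools like Audiblez automate - a minimal sketch assuming the same hypothetical local Kokoro-FastAPI endpoint as in my earlier example, with the EPUB chapter extraction left out:

    ```python
    import requests

    def synthesize(text, voice="af_bella"):
        # Same assumed endpoint as above; returns MP3 bytes for one chunk of text.
        resp = requests.post(
            "http://localhost:8880/v1/audio/speech",
            json={"model": "kokoro", "input": text, "voice": voice,
                  "response_format": "mp3"},
        )
        resp.raise_for_status()
        return resp.content

    chapters = ["Chapter one text...", "Chapter two text..."]  # placeholder content
    for i, chapter in enumerate(chapters, start=1):
        with open(f"chapter_{i:03d}.mp3", "wb") as f:
            f.write(synthesize(chapter))
    ```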