Yesterday I had a brilliant idea: why not parse the wiki of my favorite tabletop roleplaying game into YAML via an LLM? I tried the same with BeautifulSoup a couple of years ago, but the pages are very inconsistent, which makes them quite difficult to parse using traditional methods.

However, my attempts to parse it with a local Mistral model (the one you get with ollama pull mistral) were not very successful. At first it insisted on writing more than just the YAML code, and later it had trouble with more complex pages (for example https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum). So I thought I had to give it some examples in the system prompt. One example helped a little, but when I included more, it sometimes just returned one of the examples from the system prompt instead of parsing the page.

To give some idea: the bold text should become the keys in the YAML structure, and the text that follows each one is the value. Sometimes values need to be parsed a bit further, like separating page numbers from book names. I would give examples for all of that.
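To make the "parsed a bit further" part concrete, here is a minimal sketch of the kind of splitting I mean. The input wording (a book name, then "page" or "Seite", then numbers) and the function name split_publication are assumptions for illustration, not the wiki's confirmed format.

```python
import re

# Assumed (hypothetical) value format: "<book name> page <numbers>".
# "Seite"/"Seiten" are included as a guess for the German wording.
PAGE_PATTERN = re.compile(
    r"^(?P<book>.+?)\s+(?:pages?|Seiten?)\s+(?P<pages>\d+(?:\s*[-,]\s*\d+)*)$"
)


def split_publication(value: str) -> dict:
    """Split a 'book name + page numbers' string into two separate fields."""
    text = value.strip()
    match = PAGE_PATTERN.match(text)
    if match is None:
        # No recognizable page part: keep the whole string as the book name.
        return {"book": text, "pages": None}
    return {"book": match.group("book"), "pages": match.group("pages")}
```

With a made-up value, split_publication("Example Rulebook page 12") gives {'book': 'Example Rulebook', 'pages': '12'}.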

Any idea what model to use for that or how to improve results?

  • HelloRoot@lemy.lol
    25 days ago

    it was too often deviating from the task talking about stuff or not sticking to the pattern

    Yeah that sounds like it can’t keep a large enough context. Maybe try a beefier model.

    I just suggested userjs because it runs directly in the browser and can use JS DOM parsers. A userscript could also inject a button that downloads the YAML. Idk if that's desired.

    The page doesn’t seem too complex. As you said, you just have to find the tag with the bold text and then the paragraphs that follow. Simple loop-based parser logic will do.
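    A rough sketch of that loop idea, written in Python with only the standard library rather than as a userscript. It assumes the keys sit in <b> or <strong> tags and that everything up to the next bold tag belongs to the value. I haven't checked this against the real page markup, and BoldKeyParser and parse_fields are just names I made up.

```python
from html.parser import HTMLParser


class BoldKeyParser(HTMLParser):
    """Treats each bold run as a key and the text after it as that key's value."""

    BOLD_TAGS = {"b", "strong"}            # assumption: keys are marked like this
    BLOCK_TAGS = {"p", "br", "div", "li"}  # tags that should act as word breaks

    def __init__(self):
        super().__init__()
        self.fields = {}         # key -> raw value text, in page order
        self.in_bold = False
        self.key_parts = []
        self.current_key = None

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS:
            self.in_bold = True
            self.key_parts = []
        elif tag in self.BLOCK_TAGS and self.current_key is not None:
            # Keep text from separate paragraphs from running together.
            self.fields[self.current_key] += " "

    def handle_endtag(self, tag):
        if tag in self.BOLD_TAGS and self.in_bold:
            self.in_bold = False
            key = "".join(self.key_parts).strip().rstrip(":").strip()
            if key:
                self.current_key = key
                self.fields[key] = ""

    def handle_data(self, data):
        if self.in_bold:
            self.key_parts.append(data)
        elif self.current_key is not None:
            self.fields[self.current_key] += data


def parse_fields(html: str) -> dict:
    """Return a dict mapping each bold key to the cleaned-up text that follows it."""
    parser = BoldKeyParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs and drop a colon that sat outside the bold tag.
    return {
        key: " ".join(value.split()).lstrip(":").strip()
        for key, value in parser.fields.items()
    }
```

    On a made-up snippet like <p><b>Name:</b> Example</p><p><b>Effect:</b> First paragraph.</p><p>Second paragraph.</p> this returns {'Name': 'Example', 'Effect': 'First paragraph. Second paragraph.'}. The resulting dict could then be written out with a YAML library such as PyYAML (yaml.safe_dump), and the inconsistent pages will probably still need special cases on top of this.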