Yesterday I had a brilliant idea: why not parse the wiki of my favorite table top roleplaying game into yaml via an llm? I had tried the same with beautfifulsoup a couple of years ago, but the page is very inconsistent which makes it quite difficult to parse using traditional methods.
- https://dsa.ulisses-regelwiki.de/Kul_Auelfen.html
- https://dsa.ulisses-regelwiki.de/erw_zauber_sf.html?erw_zaubersf=Alchimieanalytiker
- https://dsa.ulisses-regelwiki.de/KSF_Alter_Adersin.html
However, my attempts where not very successful to parse with a local mistral model (the one you get with ollama pull mistral) as it first insisted on writing more than just the yaml code and later had troubles with more complex pages like https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum So I thought I had to give it some examples in the system prompts, but while one example helped a little, when I included more, it sometimes started to just return an example from the ones I gave to it via system prompt.
To give some idea: the bold stuff should be keys in the yaml structure, the part that follows the value. Sometimes values need to be parsed a bit more like separating pages from book names - I would give examples for all that.
Any idea what model to use for that or how to improve results?
Yeah that sounds like it can’t keep a large enough context. Maybe try a beefier model.
I just suggested userjs because it runs directly in the browser and can use js dom parsers. Also userjs could inject a button that downloads the yaml. Idk if thats desired.
The page doesn’t seem too complex, as you said - you just have to find the tag with the bold text and then the following paragraphs. A simple loop based parser logic will do.
Oh, it is very inconsistent. For example: https://dsa.ulisses-regelwiki.de/Trick_Aus.html has no “Regel:” english “Rule” infront of the rule. These have a rule aspect without a bold part infront “(passiv)”: https://dsa.ulisses-regelwiki.de/ESF_Aufliegende_Klinge.html These differ wildly: https://dsa.ulisses-regelwiki.de/Her_Alraune.html
But yea maybe a beefier model. Anything you would recommend?