Yesterday I had a brilliant idea: why not parse the wiki of my favorite tabletop roleplaying game into YAML via an LLM? I had tried the same with BeautifulSoup a couple of years ago, but the page is very inconsistent, which makes it quite difficult to parse using traditional methods. Some example pages:
- https://dsa.ulisses-regelwiki.de/Kul_Auelfen.html
- https://dsa.ulisses-regelwiki.de/erw_zauber_sf.html?erw_zaubersf=Alchimieanalytiker
- https://dsa.ulisses-regelwiki.de/KSF_Alter_Adersin.html
However, my attempts with a local Mistral model (the one you get with ollama pull mistral) were not very successful: it first insisted on writing more than just the YAML code, and later had trouble with more complex pages like https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum So I thought I had to give it some examples in the system prompt, but while one example helped a little, when I included more, it sometimes started to just return one of the examples I had given it via the system prompt.
To give some idea: the bold parts should become keys in the YAML structure, and the text that follows each of them the value. Sometimes values need a bit more parsing, like separating page numbers from book names - I would give examples for all of that.
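For illustration, the target might look something like this (the key names here are invented by me for the example, not the wiki's actual labels):

```yaml
# illustrative only - the real keys would mirror the bold labels on the page
name: Alter Adersin
rule: "the text that follows the bold label"
publication:
  book: Book Name   # book name separated out
  page: 42          # from the page number
```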
Any idea what model to use for that or how to improve results?
Did you give the raw html as input to the model?
I assume the context window might be too small for smaller open-source models. I tried it with ChatGPT 5 just for testing and it did pretty well: https://chatgpt.com/share/68a36599-2f5c-8003-9214-fdd693b53b72
Maybe instead of the raw HTML you could convert it to Markdown first via pandoc to reduce the token count, or split a page into multiple sections and then combine the resulting partial YAML files.
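For example (a minimal sketch, assuming pandoc is installed and the page is already downloaded; the file name is taken from one of your example URLs):

```python
import subprocess
from pathlib import Path

def html_to_markdown(html_path: str) -> str:
    """Convert a downloaded wiki page to Markdown via pandoc to save tokens."""
    result = subprocess.run(
        ["pandoc", "-f", "html", "-t", "gfm", html_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

md = html_to_markdown("Kul_Auelfen.html")
Path("Kul_Auelfen.md").write_text(md)
```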
But tbh an LLM seems like the wrong tool for this task… A simpler way would be to write a userjs (or Python) script for this job, since the input is a lot of well-structured data - LLMs are bad at handling a lot of data at once, and scripts are good at wrangling well-structured data, no matter how much of it there is.
I tried feeding the HTML, which didn't work at all, and then just the raw text of the tag with id main (so the text in the white area, but no HTML tags). It didn't feel like the task was too difficult in the sense that it never produced good results, but that it was too often deviating from the task, talking about other stuff or not sticking to the pattern once more than one pattern was introduced.
Could you elaborate on how userjs might help? I hadn't heard of it before, and a quick Google search didn't make it immediately obvious. As I hinted before, I tried using a Python script with BeautifulSoup, but due to the page being inconsistent, my results were debatable.
> it was too often deviating from the task, talking about other stuff or not sticking to the pattern
Yeah that sounds like it can’t keep a large enough context. Maybe try a beefier model.
I just suggested userjs because it runs directly in the browser and can use the JS DOM parser. Also, a userjs script could inject a button that downloads the YAML. Idk if that's desired.
The page doesn’t seem too complex, as you said - you just have to find the tag with the bold text and then the following paragraphs. A simple loop-based parser will do, something like the sketch below.
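In Python with BeautifulSoup, since you mentioned it, the loop might look roughly like this (a sketch, assuming the content sits in the element with id `main` and the keys are rendered as `b`/`strong` tags - untested against the real pages):

```python
import requests
from bs4 import BeautifulSoup

def parse_page(url: str) -> dict:
    """Treat each bold tag inside #main as a key and collect the text
    up to the next bold tag as its value."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    main = soup.find(id="main")  # the white content area mentioned above
    data, key, chunks = {}, None, []
    for node in main.descendants:
        if node.name in ("b", "strong"):
            if key:  # close out the previous key before starting a new one
                data[key] = " ".join(chunks).strip()
            key, chunks = node.get_text(strip=True).rstrip(":"), []
        elif isinstance(node, str) and key and node.parent.name not in ("b", "strong"):
            chunks.append(node.strip())
    if key:  # flush the last key/value pair
        data[key] = " ".join(chunks).strip()
    return data

print(parse_page("https://dsa.ulisses-regelwiki.de/KSF_Alter_Adersin.html"))
```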
Oh, it is very inconsistent. For example, https://dsa.ulisses-regelwiki.de/Trick_Aus.html has no bold “Regel:” (English: “Rule”) in front of the rule. This one has a rule aspect, “(passiv)”, without any bold part in front of it: https://dsa.ulisses-regelwiki.de/ESF_Aufliegende_Klinge.html And these differ wildly: https://dsa.ulisses-regelwiki.de/Her_Alraune.html
But yeah, maybe a beefier model. Anything you would recommend?
I am doing something very similar, but for a different kind of source (PDFs) and converting to JSON (JSON/YAML does not matter).
What I have done is:
- Create a good enough template - this is very important. I cannot show my template exactly as it is work-related, but it is simple: define various key-value pairs and how they are meant to be presented. Something like:
```jsonc
{
  // character description
  "name": "NAME_OF_CHARACTER",
  "powers": [
    { "name": "fly" },
    { "name": "see_through_walls" }
  ]
}
```
And so on - try to cover as many cases as you think can occur.
- Install llama.cpp (or Ollama works too). I am using SmolLM3 3B (more knowledge, but slower, 15-18 tps) and Qwen3 1.7B (less knowledge, faster, 25 tps); I am currently just running stuff on my laptop iGPU.
- Here is my simplified code (I have removed some important work-related bits from the prompt, but just imagine a detailed prompt asking the model to do something):
```python
# assuming a pdf with a text layer - if it does not have text, then we might have to perform ocr first
import sys

import pdftotext
# using openai's python lib to call, but I am not calling openai's servers.
# instead I am using a locally hosted openai-api-compatible server (llama.cpp-server ftw)
from openai import OpenAI

input_file = sys.argv[1]

# load your pdf and join the pages into one text blob
with open(input_file, "rb") as f:
    pdf = pdftotext.PDF(f)
pdf_text = "\n\n".join(pdf)
# print(pdf_text)

# read the jsonc template
with open("./sample-json-skeleton.jsonc", "r") as f:
    template = f.read().strip()
# print(template)

# create the prompt - we want to ask the model to fit the given pdf_text
# into the format given by the json template
prompt = (
    "/no_think You have to parse a given text according to a given json template. "
    "You must not generate false data or alter sentences much, and must try to keep "
    "most things verbatim.\nHere is the json template. Note that the template "
    "currently contains comments, but you should not generate any comments. Stick "
    "very closely to the structure of the template, and do not create any new "
    "headers. Do not create keys which do not exist in the template. If you find a "
    "key or title in the source, try to fit it to keys/titles from the template. "
    "Stick with the format. If you are unable to fit something into the given "
    "template, add it to the additional section, as that is the catch-all section. "
    "Stick to the template.\n\n```\n" + template + "\n```\n\n"
    "and here is the data that you have to parse\n\n```\n" + pdf_text + "\n```"
)
# print(prompt)

# ask the llm to parse
client = OpenAI(base_url="http://127.0.0.1:11737/", api_key="sk-xxx")
response = client.chat.completions.create(
    model="",  # llama.cpp-server ignores the model name
    temperature=0.4,  # was a separate config dict; pass it to the call directly
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```
It is not perfect, but it gets me 85+% of the way, and it is simple enough. If you need some more help, please ask.
Thank you! I have to find the time to try this out, but one issue I see right now: how do you communicate that some specific key-value pairs are optional?
I have honestly not been able to ensure that. It partially works by just noting in the jsonc that a particular key is optional, but that is not a guarantee. More generally, I try to avoid adding optional keys and mostly leave it up to the LLM to put any such line in a catch-all miscellaneous section. We do manual checking afterwards, so some inaccuracy is accepted.
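In template terms it looks roughly like this (a sketch, not my actual work template; `nickname` is just a stand-in for an optional field):

```jsonc
{
  "name": "NAME_OF_CHARACTER",
  // optional - omit this key entirely if the source has no such field
  "nickname": "NICKNAME_IF_ANY",
  // catch-all miscellaneous section for anything that fits no other key
  "additional": ["UNMATCHED_TEXT"]
}
```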
On a separate note, larger, better models usually perform better.
Ah, I see. While I plan for manual checking, it has to be well above 90% right to be a viable solution. Anyway, when I find the time to try it out, I will come back with my results. If you have any additional ideas, feel free to share them!
And also, how are you getting the wiki? I would scrape it first. If it is something like Fandom, then do not scrape it directly: first host your own BreezeWiki (https://docs.breezewiki.com/Running.html), then use wget with a sensible rate limit. Using BreezeWiki will remove some junk, and you will get cleaner HTML to begin with.
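If you would rather stay in Python than use wget, a rate-limited fetch loop is only a few lines (a sketch; the page list is a placeholder you would build from the wiki's index):

```python
import time
import requests

# placeholder list - in practice, collect these from the wiki's index pages
pages = [
    "https://dsa.ulisses-regelwiki.de/Kul_Auelfen.html",
    "https://dsa.ulisses-regelwiki.de/KSF_Alter_Adersin.html",
]

for url in pages:
    html = requests.get(url, timeout=30).text
    with open(url.rsplit("/", 1)[-1], "w") as f:
        f.write(html)
    time.sleep(2)  # be polite: rate-limit so you do not hammer the wiki
```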
For small models, try to keep the total input (prompt plus data) small, as they generally cannot retain their smarts for long contexts (even if they advertise larger ones).