What to do with large data files?

Wanted to get some opinions, particularly from the devs, on a rather ridiculous idea that I have.

I’ve been thinking about word games lately, and I found myself down a rabbit hole this weekend. I thought, “what about a generic word lookup extension?” So, I created a proof of concept:

The program works just fine in a browser. The data file contains the entire YAWL (“Yet Another Word List”) with some 200,000+ entries. The data file itself is about 5 MB on my file system. In contrast, the word list that I used for What’s My Word, with 2,000+ entries, clocks in at around 30 kB. This clearly is ridiculous, and I really can’t think of a reason why I would need the entire YAWL loaded into a project. But, for giggles … what if I did?

The strings are already in hex format, so is it better to place these into buffers somehow and access the data that way? Or does the compiler already know to place constants into the game file and make them accessible to the runtime? Is the game file limit still 512 kB for hardware, or has that been expanded?

Would love your thoughts.

2 Likes

For a word list, if you’re OK with some non-words being accepted as words (false positives), you could use a bloom filter as a space-efficient alternative. It’s a one-way function where you can only check if a word is part of the set, so it would work for Wordle’s list of valid guesses but not for the list of target words.

According to an online calculator, a 2000-word list with a 0.1% false positive rate would need about 3.51 kiB. For 10000 words with 0.01% false positives, it would be 23.4 kiB.

2 Likes

Well, I suppose this answers one of my questions. :laughing:

image

1 Like

@kwx TIL! this is very cool

2 Likes

Also @AlexK, no need to pack strings into buffers! The compiler does indeed include strings in a space efficient way.

2 Likes

Maybe compress it and store in base85?

I am doing prj using word list too, recently.
It’s a words reciting game. First vacabulary has about 1500 words, each word with 2 fields(spelling, meaning) stored in a 2d string array.
When I download to device(Meobit) got compile error. After I commented last 600+ words, about 900 left, it works. But could got 021 error in high chance, especially, when connecting USB cable.
(Only Arcade Text / Sprite Text ext imported)

1 Like

So, I’ve been thinking about this one for quite some time. @kwx piqued my interest in Bloom filters with his response, and I just kept returning to this thread every few weeks. I decided to take a break from Monopoly this weekend and take a deep dive on this.

I stumbled upon a great little write up on universal hash functions here,[1] which took me back to the original paper on them by Carter and Wegman.[2] Those two references gave me the information that I needed to give Bloom filters a try.

I wrote a C# program that ingests a word list and spits out a TypeScript array that I can copy-and-paste into MakeCode Arcade. I used the Game Words list by Dana Bell;[3] YAWL[4] would work just as well. This first pass worked just fine for words up to five characters in length. Beyond that, the numbers got too big to fit into TypeScript’s number type. This was my motivation to port bigint to MakeCode.

It works, too! The complete Game Words dictionary up to 12-letter words fits in ~400kB. Take a look!

I’ll come back to this again another time. I’ll write a proper interface to my C# program and put that in GitHub so that y’all can use it with any word list you like. I’ll also refine the TypeScript code a bit and put that up on GitHub for anyone who wants it. I could wrap this into an extension, but because of the size of the dictionary, it might not work on hardware. You will want to strip out the filters that you don’t need to make the code optimal.

I’ll be adding Countdown to my list of future projects. That and Lingo. :slight_smile:

I’m just stoked that it works! Have fun!


[1] Mount, Dave. (2019). CSMC 420 Lecture 10: Hashing - Basic Concepts and Hash Functions. https://www.cs.umd.edu/class/fall2019/cmsc420-0201/Lects/lect10-hash-basics.pdf

[2] Carter, Larry; Wegman, Mark N. (1979). “Universal Classes of Hash Functions.” Journal of Computer and System Sciences . 18 (2): 143–154. https://doi.org/10.1016/0022-0000(79)90044-8 Conference version in STOC’77.

[3] Bell, Dana. (2020). Game Words 2020 (provisional). https://www.tylerhosting.com/gamewords/

[4] Cooper, Mendel Leo. Yet Another Word List (YAWL). Republished by Aaron Bull Schaefer. https://github.com/elasticdog/yawl

2 Likes

Super cool, Alex! It looks like the game file is missing at that URL, though. Can you relink the game?

3 Likes

Aww bummer! Wonder where it went…

Here’s a new link!

1 Like