i think your biggest issue on hardware is going to be memory rather than computational performance. we use 4 bits per pixel 100% because of memory; packing and unpacking two pixels per byte every time we change a pixel is noticeably slower than just using one byte per pixel.
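to make the packing cost concrete, here's a rough sketch of what reading and writing a 4-bit pixel looks like in C. this is not the actual firmware code: the names (`packed_screen`, `get_pixel`, `set_pixel`), the row-major layout, and the "even pixel in the low nibble" ordering are all assumptions i picked for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 160
#define SCREEN_H 120

/* Two pixels share each byte, so one screen is 160 * 120 / 2 = 9600 bytes.
 * Layout is an assumption for this sketch: row-major, with the
 * even-numbered pixel in the low nibble and the odd one in the high nibble. */
static uint8_t packed_screen[SCREEN_W * SCREEN_H / 2];

/* Read one 4-bit color index (0..15): load the byte, then shift or mask. */
uint8_t get_pixel(int x, int y)
{
    assert(x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H);
    int i = y * SCREEN_W + x;          /* pixel index */
    uint8_t b = packed_screen[i >> 1]; /* byte holding this pixel */
    return (i & 1) ? (uint8_t)(b >> 4) : (uint8_t)(b & 0x0F);
}

/* Write one 4-bit color index: a read-modify-write, because the
 * neighbouring pixel in the same byte has to be preserved. */
void set_pixel(int x, int y, uint8_t color)
{
    assert(x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H);
    int i = y * SCREEN_W + x;
    uint8_t *b = &packed_screen[i >> 1];
    if (i & 1)
        *b = (uint8_t)((*b & 0x0F) | ((color & 0x0F) << 4));
    else
        *b = (uint8_t)((*b & 0xF0) | (color & 0x0F));
}
```

with one byte per pixel, both functions would collapse to a single array load or store, which is where the speed difference comes from.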
but the memory saving makes it worth it. each screen is 160*120 pixels, which at 4 bits per pixel is 9,600 bytes, or roughly 10kb per screen. we keep two screens in memory at any given time (i.e. we double buffer the screen), so that’s 20kb total. the processor in the meowbit, the STM32F401RET6, has a whopping 96kb of RAM, so that’s not much left over for the user program!
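if you want to play with the numbers, here's the same arithmetic as a small C sketch. the helper names (`framebuffer_bytes`, `ram_left`) are made up for this example, and `ram_left` only subtracts the screen buffers; it ignores the stack, the runtime, and everything else that also lives in RAM, so treat it as an upper bound.

```c
#include <assert.h>
#include <stddef.h>

enum { WIDTH_PX = 160, HEIGHT_PX = 120, NUM_BUFFERS = 2 };

/* 96 KB of RAM on the STM32F401RET6, as stated above. */
#define TOTAL_RAM_BYTES (96u * 1024u)

/* Packing two pixels per byte assumes an even number of pixels. */
static_assert((WIDTH_PX * HEIGHT_PX) % 2 == 0, "pixel count must be even");

/* Bytes needed for all screen buffers at a given color depth. */
size_t framebuffer_bytes(unsigned bits_per_pixel)
{
    size_t bits = (size_t)WIDTH_PX * HEIGHT_PX * bits_per_pixel;
    return (bits + 7) / 8 * NUM_BUFFERS; /* round up to whole bytes */
}

/* RAM remaining after the screen buffers only (hypothetical upper bound). */
size_t ram_left(unsigned bits_per_pixel)
{
    return TOTAL_RAM_BYTES - framebuffer_bytes(bits_per_pixel);
}
```

at 4 bits per pixel the two buffers take 19,200 bytes; at one byte per pixel they would take 38,400 bytes, which is over a third of the 98,304 bytes available.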