Meowbit hardware floating point support, or "how I spent an entire weekend adding two numbers together"

paul · May 21, 2019, 12:51am

So, the first thing I did upon discovering MakeCode Arcade was to throw together a Mandelbrot hack, just for fun.

@frank_schmidt kindly ran it on real hardware (I did not have a Meowbit at the time - I now do!). It was so slow as to be effectively unusable, because Arcade is configured so that TypeScript numbers always use software-emulated double-precision floating point.

@peli pointed out that there is a fixed-point library, declaring the Fx8 type, which makes real-number operations faster than the slow double emulation. But I still figured there must be a way to use the FPU that the Cortex-M4 has.
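
As a rough illustration of why fixed point is cheaper than emulated doubles, here is a minimal sketch of arithmetic with 8 fraction bits. The helper names are hypothetical and this is not the Fx8 API itself; it only assumes that a value is stored as an integer scaled by 256, so that addition and multiplication reduce to integer operations.

```typescript
// Hypothetical helpers for illustration only; not the MakeCode Fx8 API.
// A value x is stored as the integer round(x * 256).
const FX_FRACTION_BITS = 8;
const FX_ONE = 1 << FX_FRACTION_BITS; // 256

function toFixed8(x: number): number {
    return Math.round(x * FX_ONE);
}

function fromFixed8(f: number): number {
    return f / FX_ONE;
}

function addFixed8(a: number, b: number): number {
    // Both operands share the same scale, so a plain integer add is enough.
    return a + b;
}

function mulFixed8(a: number, b: number): number {
    // The raw product carries 16 fraction bits; divide by 256 to get back to 8.
    return Math.floor((a * b) / FX_ONE);
}
```

For example, `fromFixed8(mulFixed8(toFixed8(1.5), toFixed8(2.25)))` gives 3.375. The cost is range and precision: anything finer than 1/256 is lost.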

Just as a test, I thought I could pass a pair of numbers, boxed inside a Buffer, to a C++ extension (which led to my other thread, "pxt target arcade" fails - #16 by paul; thanks are due to @mmoskal for his help there), write code to add them together using the ‘float’ type rather than ‘double’, and have this emit hardware floating point code. Effectively, my idea was to model the FPU as if it were a separate hardware device and communicate with it by passing Buffers back and forth to a native extension…
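
The boxing idea can be sketched in plain TypeScript. This is only a model under stated assumptions: a standard `ArrayBuffer` stands in for the MakeCode Buffer, the function names are hypothetical, and the 4-byte little-endian layout is an assumption rather than the layout the real extension uses.

```typescript
// Hypothetical model of "a float boxed in a buffer"; not the real extension code.
// Assumption: one IEEE 754 single-precision value, little-endian, at offset 0.
function boxFloat(x: number): ArrayBuffer {
    const buf = new ArrayBuffer(4);
    new DataView(buf).setFloat32(0, x, true); // rounds x to single precision
    return buf;
}

function unboxFloat(buf: ArrayBuffer): number {
    return new DataView(buf).getFloat32(0, true);
}

// Adds f2 into f1 in place, storing the result as a single-precision value.
function addToSelf(f1: ArrayBuffer, f2: ArrayBuffer): void {
    const sum = unboxFloat(f1) + unboxFloat(f2);
    new DataView(f1).setFloat32(0, sum, true);
}
```

Here the caller only ever handles opaque buffers, which is what lets the arithmetic move behind a native function without changing the calling code.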

However, disassembling the generated code with arm-none-eabi-objdump (from Yotta) showed that single-precision floating point operations were still compiled to software emulation. The code for the ‘add(Buffer f1, Buffer f2)’ function looked like:

push	{r4, lr}
ldr	r1, [r1, #8]
mov	r4, r0
ldr	r0, [r0, #8]
bl	0 <__aeabi_fadd>
str	r0, [r4, #8]
pop	{r4, pc}

The call to __aeabi_fadd is the offender: it is a call into the GCC floating point emulation library. That was not what I expected. So, as a test, I tried compiling for the samd51 target instead of the stm32f401 target. In this case, the emitted code used the expected hardware instructions (vldr, vadd, vstr):

vldr	s15, [r0, #8]
vldr	s14, [r1, #8]
vadd.f32	s15, s15, s14
vstr	s15, [r0, #8]
bx	lr
nop

As a further test, I tried inlining the above assembly language and compiling again for stm32f401, but the assembler complained that these instructions were invalid. Even writing a raw .s file gave the same result: the assembler would not accept the vldr, vadd, or vstr instructions. But I was sure that they were valid.

It turns out that whilst the samd51 target is configured to emit native floating point instructions for single-precision floats, the stm32f401 is not. These configs are in the relevant codal packages, i.e.:

https://github.com/lancaster-university/codal-big-brainpad/blob/master/target.json#L33

              "DEVICE_PANIC_HEAP_FULL":1,
              "DEVICE_DMESG_BUFFER_SIZE":2048,
              "CODAL_DEBUG":"CODAL_DEBUG_DISABLED",
              "DEVICE_USB":0,
              "CODAL_TIMESTAMP":"uint64_t",
              "PROCESSOR_WORD_TYPE":"uint32_t"
          },
          "definitions":"-DSTM32F4 -DSTM32F401xE -include codal-big-brainpad/inc/localconf.h",
          "cmake_definitions":{
          },
          "cpu_opts":"-mcpu=cortex-m4 -mthumb -mfloat-abi=softfp -mfpu=fpv4-sp-d16",
          "asm_flags":"-fno-exceptions -fno-unwind-tables",
          "c_flags":"-std=c99 -fwrapv -Warray-bounds",
          "cpp_flags":"-std=c++11 -fwrapv -fno-rtti -fno-threadsafe-statics -fno-exceptions -fno-unwind-tables -Wl,--gc-sections -Wl,--sort-common -Wl,--sort-section=alignment -Wno-array-bounds",
          "linker_flags":"-Wl,--no-wchar-size-warning -Wl,--gc-sections -mcpu=cortex-m4 -mthumb",
          "libraries":[
              {
                  "name":"codal-core",
                  "url":"https://github.com/lancaster-university/codal-core",
                  "branch":"master",
                  "type":"git"

and

https://github.com/lancaster-university/codal-big-brainpad/blob/master/target-locked.json#L28

              "DEVICE_STACK_SIZE": 2048,
              "DEVICE_TAG": 0,
              "DEVICE_USB": 0,
              "EVENT_LISTENER_DEFAULT_FLAGS": "MESSAGE_BUS_LISTENER_QUEUE_IF_BUSY",
              "MESSAGE_BUS_LISTENER_MAX_QUEUE_DEPTH": 10,
              "PROCESSOR_WORD_TYPE": "uint32_t",
              "SCHEDULER_TICK_PERIOD_US": 4000,
              "USE_ACCEL_LSB": 0
          },
          "cpp_flags": "-std=c++11 -fwrapv -fno-rtti -fno-threadsafe-statics -fno-exceptions -fno-unwind-tables -Wl,--gc-sections -Wl,--sort-common -Wl,--sort-section=alignment -Wno-array-bounds",
          "cpu_opts": "-mcpu=cortex-m4 -mthumb -mfloat-abi=softfp -mfpu=fpv4-sp-d16",
          "definitions": "-DSTM32F4 -DSTM32F401xE -include codal-big-brainpad/inc/localconf.h",
          "device": "STM32",
          "generate_bin": true,
          "generate_hex": true,
          "libraries": [
              {
                  "branch": "f7653d0d6fd23e5794a0eec2a034263d4f529ccc",
                  "name": "codal-core",
                  "type": "git",
                  "url": "https://github.com/lancaster-university/codal-core"

vs

https://github.com/lancaster-university/codal-itsybitsy-m4/blob/master/target.json#L51

              "USB_DEFAULT_VID": "0x03eb",
              "USB_DEFAULT_PID": "0x2066",
              "SERCOM_100MHZ_CLOCK": 1,
              "PROTOTYPE_SERCOM_SPI_M_SYNC": "SERCOM0",
              "PROTOTYPE_SERCOM_I2CM_SYNC": "SERCOM1",
              "PROTOTYPE_SERCOM_USART_ASYNC": "SERCOM2"
          },
          "definitions":" -DSAMDX1 -D__SAMD51J19A__",
          "cmake_definitions":{
          },
          "cpu_opts":"-mcpu=cortex-m4 -mthumb -mfloat-abi=softfp -mfpu=fpv4-sp-d16",
          "asm_flags":"-fno-exceptions -fno-unwind-tables",
          "c_flags":"-std=c99 -fwrapv -Warray-bounds",
          "cpp_flags":"-std=c++11 -fwrapv -fno-rtti -fno-threadsafe-statics -fno-exceptions -fno-unwind-tables -Wl,--gc-sections -Wl,--sort-common -Wl,--sort-section=alignment -Wno-array-bounds",
          "linker_flags":"-Wl,--no-wchar-size-warning -Wl,--gc-sections -mcpu=cortex-m4 -mthumb -u Reset_Handler",
          "libraries":[
              {
                  "name":"codal-core",
                  "url":"https://github.com/lancaster-university/codal-core",
                  "branch":"master",
                  "type":"git"

and

https://github.com/lancaster-university/codal-itsybitsy-m4/blob/master/target-locked.json#L45

              "SERCOM_100MHZ_CLOCK": 1,
              "UF2_INFO_ADDR": "(const char*)(*(uint32_t*)(BOOTLOADER_END_ADDR - (16 * 4) + (4 * 4)))",
              "USB_DEFAULT_PID": "0x2066",
              "USB_DEFAULT_VID": "0x03eb",
              "USB_EP_FLAG_NO_AUTO_ZLP": "0x01",
              "USB_MAX_PKT_SIZE": 64,
              "USE_ACCEL_LSB": 0,
              "USE_FULL_ASSERT": 1
          },
          "cpp_flags": "-std=c++11 -fwrapv -fno-rtti -fno-threadsafe-statics -fno-exceptions -fno-unwind-tables -Wl,--gc-sections -Wl,--sort-common -Wl,--sort-section=alignment -Wno-array-bounds",
          "cpu_opts": "-mcpu=cortex-m4 -mthumb -mfloat-abi=softfp -mfpu=fpv4-sp-d16",
          "definitions": " -DSAMDX1 -D__SAMD51J19A__",
          "device": "ITSYBITSY_M4",
          "generate_bin": true,
          "generate_hex": true,
          "libraries": [
              {
                  "branch": "b95d4c21a91ad71d3881eb525fa4bec140965fd5",
                  "name": "codal-core",
                  "type": "git",
                  "url": "https://github.com/lancaster-university/codal-core"

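The setting that decides between library calls and FPU instructions is the pair of GCC flags `-mfloat-abi` and `-mfpu` inside `cpu_opts`. With `-mfloat-abi=soft` the compiler calls emulation routines; with `softfp` or `hard` it emits FPU instructions. The helper below is a hypothetical sketch for reading a `cpu_opts` string, not part of pxt or codal. It assumes the toolchain defaults to `soft` when the flag is absent, and it treats a missing `-mfpu` as "no FPU", which is a simplification.

```typescript
// Hypothetical helper for illustration; not part of pxt or codal.
interface FloatSupport {
    floatAbi: string;             // "soft", "softfp" or "hard"
    fpu: string | null;           // value of -mfpu, or null if the flag is absent
    usesFpuInstructions: boolean; // simplified verdict, see the assumptions above
}

function describeFloatSupport(cpuOpts: string): FloatSupport {
    let floatAbi = "soft"; // assumption: toolchain default when the flag is missing
    let fpu: string | null = null;
    for (const flag of cpuOpts.split(/\s+/)) {
        if (flag.indexOf("-mfloat-abi=") === 0) {
            floatAbi = flag.substring("-mfloat-abi=".length);
        } else if (flag.indexOf("-mfpu=") === 0) {
            fpu = flag.substring("-mfpu=".length);
        }
    }
    return {
        floatAbi: floatAbi,
        fpu: fpu,
        usesFpuInstructions: floatAbi !== "soft" && fpu !== null
    };
}
```

Note that `fpv4-sp-d16` names a single-precision-only FPU, which is why the experiment uses ‘float’: doubles are still handled in software even with these flags.
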
So, I took a mirror of the codal-big-brainpad repo and made its cpu_opts the same as the itsybitsy target's. Then, in a local copy of pxt-arcade, I modified pxtarget.json to point to my mirror instead of Lancaster University’s repo, i.e.

Changed

"url": "https://github.com/lancaster-university/codal-big-brainpad",
"branch": "v1.0.22",

To

"url": "https://github.com/junk100/codal-big-brainpad",
"branch": "master",

Running this locally via “pxt serve” and compiling test code against my extension finally had the desired result: I could compile my inline assembly and run it successfully on hardware, proving that hardware floating point is possible (but requires the codal-big-brainpad target to be patched).

My extension is here: https://github.com/junk100/pxt-shimtest

The test code I wrote is simply:

let f = fpu.createFloat(13)
let f2 = fpu.createFloat(17)
f.addToSelf(f2)
game.splash("Hello" + f.get())

To prove that the native shim is definitely being used, I added a deliberate math mistake to the in-simulator version of the extension, which adds a random number to the result (https://github.com/junk100/pxt-shimtest/blob/master/extension.ts#L33). Running the code in the simulator gives the ‘wrong’ answer, whereas the hardware performs the correct calculation.
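
The deliberate-mistake trick can be sketched as follows. The function name and the exact offset are hypothetical and are not copied from the extension. The sketch relies on one MakeCode convention: a `//% shim=...` annotation on a TypeScript function routes calls to the C++ implementation on hardware, while the TypeScript body is what runs in the simulator.

```typescript
// Hypothetical simulator-side fallback; the real extension's code may differ.
// In a MakeCode extension this body would sit under a "//% shim=..." annotation,
// so it would only ever execute in the simulator.
function simulatorAdd(a: number, b: number, random: () => number = Math.random): number {
    // Deliberately wrong: the result is always off by at least 1, so it can
    // never be mistaken for the answer computed by the native code.
    return a + b + 1 + random();
}
```

If the device shows 30 for 13 + 17 while the simulator shows something between 31 and 32, the native path is the one that ran on the device.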

I did actually suspect that the Meowbit might not have the FPU enabled (i.e. requiring this startup code: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BEHBJHIG.html), but the bootloader must already have this covered.
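
For reference, that startup code sets bits 20 to 23 of the Coprocessor Access Control Register (CPACR, at address 0xE000ED88), granting full access to coprocessors CP10 and CP11. The sketch below only models the bit manipulation in TypeScript with hypothetical function names; on the device this would be a read-modify-write of the register in C or assembly, followed by DSB and ISB barriers.

```typescript
// Bit mask for CP10 and CP11 full access (0b11 for each field, bits 20-23).
const CPACR_CP10_CP11_FULL_ACCESS = 0xF << 20; // 0x00F00000

// Returns the register value with the FPU access bits set.
function withFpuEnabled(cpacr: number): number {
    return (cpacr | CPACR_CP10_CP11_FULL_ACCESS) >>> 0;
}

// True when both coprocessor fields already grant full access.
function isFpuEnabled(cpacr: number): boolean {
    return ((cpacr & CPACR_CP10_CP11_FULL_ACCESS) >>> 0) === CPACR_CP10_CP11_FULL_ACCESS;
}
```

If these bits were clear, the first vldr or vadd would fault instead of executing, so code that runs successfully suggests something earlier in boot has already set them.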

So, now that this overly long post is nearly over:

A question for the Arcade owners
Could/should the codal-big-brainpad target be patched to allow floating point instructions, at least for native extensions? If the “big brainpad” itself doesn’t have an FPU, then perhaps there could be a separate target for the Meowbit (which definitely does).

Thanks for reading!

mmoskal · May 21, 2019, 1:31am

Excellent investigation! The F4 should have FPU enabled - all the chips we use have FPU. I’ll do that tomorrow.

Honestly, I don’t think we will ever go for a chip without an FPU for Arcade anymore. Initially we wanted to use a bigger version of the STM32F103, but they are not big enough or they are more expensive than the F401.

It would be nice to somehow expose the FPU, but I’m not sure how exactly. We need a uniform data representation, so the floats would have to be boxed, probably negating most of the perf advantage.

I do intend to optimize the fx8 math though.
