The CH32V002 is a low-end 32-bit RISC-V microcontroller by WCH. Its core supports the RV32EC instruction set, meaning it has 16 registers and supports a set of 16-bit compressed instructions (in addition to the base ISA’s 32-bit instructions). Each compressed instruction is a shorter encoding of a common 32-bit instruction, so it can be substituted in to reduce code size. Because there are fewer bits to encode operands, many of them only accept a limited range of registers as sources or destinations (s0-s1, a0-a5), a more limited range of immediates, or both. They are an interesting set of instructions to try to fit code into!
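To make those limits concrete, here is a minimal host-side Python sketch (not firmware). It assumes the operand encodings from the RISC-V C extension: 3-bit register fields that reach x8 to x15, a 6-bit signed immediate for `c.li`, and a word-aligned offset from 0 to 124 for `c.sw`. The helper names are made up for illustration.

```python
# Host-side sketch of some RV32C operand limits (helper names are hypothetical).

# Registers reachable through the 3-bit register fields: x8..x15.
COMPRESSED_REGS = {"s0": 8, "s1": 9, "a0": 10, "a1": 11,
                   "a2": 12, "a3": 13, "a4": 14, "a5": 15}

def fits_c_li(imm: int) -> bool:
    """c.li takes a 6-bit sign-extended immediate."""
    return -32 <= imm <= 31

def fits_c_sw_offset(offset: int) -> bool:
    """c.sw takes a 5-bit unsigned offset scaled by 4 (0, 4, ..., 124)."""
    return 0 <= offset <= 124 and offset % 4 == 0

def in_compressed_subset(reg: str) -> bool:
    """True if the register can be named by a 3-bit register field."""
    return reg in COMPRESSED_REGS

print(fits_c_li(0x16), fits_c_li(0x40))                        # True False
print(fits_c_sw_offset(24), fits_c_sw_offset(128))             # True False
print(in_compressed_subset("a3"), in_compressed_subset("ra"))  # True False
```

The constant 0x16 and the register offsets 0, 12, 16, 20 and 24 used in the listings below all pass these checks, which is what lets every store be a 2-byte `c.sw`.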

I set myself the goal of toggling PA1 at approximately 1 Hz by any means necessary, with any other side effects allowed. I found that introducing side effects only became necessary below the 50 to 60 byte mark.

The programs below can be assembled using Bronzebeard.

Initial attempt, 36 bytes:

RCC_BASE = 0x40021000
RCC_PB2PCENR_OFFSET = 24
GPIOA_BASE = 0x40010800
GPIOA_BASE_SHIFTED = (0x40010800 << 1)
GPIOA_CFGLR_OFFSET = 0
GPIOA_BSHR_OFFSET = 16
GPIOA_BCR_OFFSET = 20

init:
	# By default SYSCLK is 24 MHz
	# Enable GPIOA clock (and GPIOC, and write to RO reserved bit as side effect)
	# Side effects are due to abusing the constant in a0 and using it for 3 different things
	lui	a5, %hi(RCC_BASE) 		# equivalent to li a5, RCC_BASE
	c.li 	a0, 0x16
	c.sw	a0, RCC_PB2PCENR_OFFSET(a5)

	# Put PA1 into push-pull output (and modify PA0 mode as side effect)
	# these two instructions save 2 bytes compared to naive "li a5, GPIOA_BASE"
	lui	a5, %hi(GPIOA_BASE_SHIFTED)
	c.srli 	a5, 1
	c.sw	a0, GPIOA_CFGLR_OFFSET(a5)

	# Toggle PA1 (and two other pins as side effect) at roughly 1 Hz
main_loop:
	c.sw	a0, GPIOA_BSHR_OFFSET(a5)
	c.jal 	busy_wait
	# it is odd to me that the chip has both set/reset and reset-only control registers,
	# but it does allow us to do this
	c.sw 	a0, GPIOA_BCR_OFFSET(a5)
	c.jal 	busy_wait
	c.j 	main_loop

busy_wait:
	# use the address 0x40010800 (already in a5) as the source of the loop count
	# 500 ms @ 24 MHz = 12 million cycles
	# would have guessed 2 cycles per loop iteration, so loop needs about 6 million iterations
	# 0x40010800 >> 7 gives us about 8.4 million
	# but in practice this takes about 8 times longer than expected, so bump to >> 10
	# (flash fetch and branching being slow?)
	c.mv 	a3, a5
	c.srli 	a3, 10
busy_wait_1:
	c.addi 	a3, -1
	c.bnez 	a3, busy_wait_1
	c.jr 	ra
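The arithmetic in the comments above can be checked on the host. This is a minimal Python sketch, not part of the firmware. The `lui` helper is a hypothetical model of the instruction, and the split of the CFGLR value into one 4-bit field per pin is an assumption based on the listing's comment that PA0's mode changes as a side effect. The roughly 8x slowdown is an observation from the hardware and is not derived here.

```python
# Host-side check of the constants used in the 36-byte listing.
GPIOA_BASE = 0x40010800
SYSCLK_HZ = 24_000_000  # default clock stated in the listing

def lui(imm20: int) -> int:
    """Model of `lui`: place a 20-bit immediate in bits 31..12 (hypothetical helper)."""
    return (imm20 & 0xFFFFF) << 12

# lui + c.srli trick: load (GPIOA_BASE << 1), then shift right by one.
shifted = (GPIOA_BASE << 1) & 0xFFFFFFFF
assert shifted & 0xFFF == 0   # low 12 bits are zero, so lui alone can load it
a5 = lui(shifted >> 12) >> 1  # c.srli is a logical shift
assert a5 == GPIOA_BASE

# The single constant 0x16 has bits 1, 2 and 4 set ...
bits_set = [b for b in range(32) if (0x16 >> b) & 1]
# ... and, written to CFGLR, splits into one 4-bit field per pin (assumption).
pa0_field = 0x16 & 0xF
pa1_field = (0x16 >> 4) & 0xF

# Delay loop counts derived from the address.
half_period_cycles = SYSCLK_HZ // 2
count_shift_7 = GPIOA_BASE >> 7
count_shift_10 = GPIOA_BASE >> 10

print(hex(a5), bits_set, hex(pa0_field), hex(pa1_field))
print(half_period_cycles, count_shift_7, count_shift_10)
```

Shifting by 10 instead of 7 divides the count by 8, which matches the observed slowdown: roughly 1.05 million iterations per half period instead of 8.4 million.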

I realised it’s possible to get a nice round 32 bytes if the loop toggles the pin instead of doing a separate set and reset, since the busy wait is then only needed in one place and can be inlined:

# Blinky for the CH32V002 (PA1, approx. 1 Hz) in 32 bytes

RCC_BASE = 0x40021000
RCC_PB2PCENR_OFFSET = 24
GPIOA_BASE = 0x40010800
GPIOA_BASE_SHIFTED = (0x40010800 << 1)
GPIOA_CFGLR_OFFSET = 0
GPIOA_OUTDR_OFFSET = 12

init:
	lui	a5, %hi(RCC_BASE)
	c.li 	a0, 0x16
	c.sw	a0, RCC_PB2PCENR_OFFSET(a5)
	lui	a5, %hi(GPIOA_BASE_SHIFTED)
	c.srli 	a5, 1
	c.sw	a0, GPIOA_CFGLR_OFFSET(a5)
	c.li 	a1, 0x2
main_loop:
	c.xor 	a0, a1
	c.sw	a0, GPIOA_OUTDR_OFFSET(a5)
	c.mv 	a3, a5
	c.srli 	a3, 10
busy_wait:
	c.addi 	a3, -1
	c.bnez 	a3, busy_wait
	c.j 	main_loop
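As a rough check of the 32-byte figure and of the toggle logic, here is a small host-side Python sketch. It assumes that `lui` is the only 4-byte instruction in the listing and that every `c.*` instruction is 2 bytes; the variable names simply mirror the registers.

```python
# Size of the 32-byte listing: `lui` is a 32-bit instruction, every `c.*` is 16-bit.
listing = ["lui", "c.li", "c.sw", "lui", "c.srli", "c.sw", "c.li",
           "c.xor", "c.sw", "c.mv", "c.srli", "c.addi", "c.bnez", "c.j"]
total_bytes = sum(2 if op.startswith("c.") else 4 for op in listing)

# Toggle logic: a0 starts at 0x16 and is XORed with a1 = 0x2 on every pass,
# so bit 1 (PA1) of the value written to OUTDR alternates.
a0, a1 = 0x16, 0x2
pa1_levels = []
for _ in range(4):
    a0 ^= a1                          # c.xor a0, a1
    pa1_levels.append((a0 >> 1) & 1)  # bit 1 of the value stored to OUTDR

print(total_bytes)  # 32
print(pa1_levels)   # [0, 1, 0, 1]
```

Bits 2 and 4 of a0 stay set on every write; that is the remaining side effect of reusing 0x16.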

Artful Bytes gets their blinky down to 48 bytes, but they’re targeting a different chip and ISA, so it’s not quite fair.