Show HN: Cicada – A scripting language that integrates with C

57 points by briancr 2 days ago

I wrote a lightweight scripting language that runs together with C. Specifically, it's a C library: you run it through a C function call, and it can call back into your own C functions. Compiles to ~250 kB. No dependencies beyond the C standard library.

Key language features:
* Uses aliases not pointers, so it's memory-safe
* Arrays are N-dimensional and resizable
* Runs scripts or its own 'shell'
* Error trapping
* Methods, inheritance, etc.
* Customizable syntax

smartmic 2 days ago

Cool, I like these kinds of projects. When it comes to embedding a scripting language in C, there are already some excellent options: Notable ones are Janet, Guile, and Lua. Tcl is also worth considering. My personal favorite is still Janet[0]. Others?

[0]: https://janet-lang.org/

forgotpwd16 2 days ago

Io is nice (Smalltalk/Self-like). A mostly comprehensive list: https://dbohdan.github.io/embedded-scripting-languages/
- publicdebates 2 days ago
  
  That list (or any similar list) would be so helpful if it had a health column, something that takes into account number of contributors, time since last commit, number of forks, number of commits, etc. So many projects are effectively dead but it's not obvious at first sight, and it takes 2 or 3 whole minutes to figure out. That seems short but it adds up when evaluating a project, causing people to just go to a well known solution like Lua (and why not? Lua is just fine; in fact it's great).
  
  briancr 2 days ago
  
  Seconded.
- briancr 2 days ago
  
  Should have replied directly. Thanks! That’s a great list.
dualogy 2 days ago

AngelScript. Mature and maintained since 2003, fully typed, with a C-like syntax. https://www.angelcode.com/angelscript/
- briancr 2 days ago
  
  Yes very C-like.. One immediate difference is that in these C-like scripting languages there’s a split between definitions and executable commands. In Cicada there are only executable commands: definitions are done using a define operator. (That’s because everything is on the heap; Cicada functions don’t have access to the stack). I personally think the latter method makes more sense for command-line interactivity, but that’s a matter of taste.
briancr 2 days ago

Thanks! I’m unfamiliar with Janet but I’ve looked into the others you listed.
One personal preference is that a scripting syntax be somewhat ‘C-like’.. which might recommend a straight C embedded implementation although I think that makes some compromises.
zem 2 days ago

squirrel: http://squirrel-lang.org/
- briancr 2 days ago
  
  Yes I like this one. It’s similar and even more C-like, in that it discriminates between classes, class instances, functions, methods vs constructors, etc. (Cicada does not).

publicdebates 2 days ago

The for loop is odd. Why is the word counter in there twice?

    counter :: int

    for counter in <1, 10-counter> (
       print(counter)
       print(" ")
    )

Using backfor to count backwards is an odd choice. Why not overload for?

    backfor counter in <1, 9> print(counter, " ")

This is confusing to me. Maybe I'm misunderstanding the design principles, but the syntax seems unintuitive.

briancr 2 days ago

Yeah this is why the syntax is customizable.. maybe it’s not optimal.
The example I gave was strange and I’ll have to change it. Not sure what I was trying to show there. The basic syntax is just:
for counter in <1, 5> print(counter)
backfor counter in <1, 5> print(counter)
It’s not overloaded because ‘for’ is basically a macro, expanding to ‘iterate, increment counter, break on counter > 5’ where ‘>’ is hard-coded. If ‘for’ was a fundamental operator then yes, there would be a step option and it would be factored into the exit condition.
You’ve got me thinking, there’s probably a way to overload it even as a macro.. hmmm…
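To make the macro point concrete, here is a rough C analogy (not Cicada source). The macro names FOR_RANGE, BACKFOR_RANGE and STEP_RANGE are made up for illustration. The first two hard-code their comparison, which is why one macro can't serve both directions; the third shows one possible way to fold a signed step into the exit test.

```c
#include <assert.h>

/* Hypothetical macros, for illustration only. FOR_RANGE hard-codes '<=' and
   '++', so it can only count up; BACKFOR_RANGE is the mirrored expansion. */
#define FOR_RANGE(i, lo, hi)     for ((i) = (lo); (i) <= (hi); (i)++)
#define BACKFOR_RANGE(i, lo, hi) for ((i) = (hi); (i) >= (lo); (i)--)

/* One way to "overload": the sign of the step selects the exit comparison. */
#define STEP_RANGE(i, from, to, step) \
    for ((i) = (from); (step) > 0 ? (i) <= (to) : (i) >= (to); (i) += (step))

/* Each helper writes the visited values into out[] and returns the count. */
int collect_for(int lo, int hi, int *out)
{
    int i, n = 0;
    FOR_RANGE(i, lo, hi) out[n++] = i;
    return n;
}

int collect_backfor(int lo, int hi, int *out)
{
    int i, n = 0;
    BACKFOR_RANGE(i, lo, hi) out[n++] = i;
    return n;
}

int collect_step(int from, int to, int step, int *out)
{
    int i, n = 0;
    assert(step != 0);   /* a zero step would never terminate */
    STEP_RANGE(i, from, to, step) out[n++] = i;
    return n;
}
```

With this shape, collect_step(5, 1, -1, buf) visits the same values as collect_backfor(1, 5, buf), so a single construct covers both directions.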
- nextaccountic a day ago
  
  Just do for counter in <1, 5>.rev(), which would iterate in a reversed range.
  IMO it's pointless to distinguish syntactically between iterating forwards and backwards, especially if you also support things like for counter in <1, 5>.map({ return args[1] * 2 }) to iterate on even numbers (the double of each number), rather than having to define a fordoubled macro. I mean, adding methods like map and rev to ranges is more orthogonal and composes better. (See for example iterators in Rust.)
  Not that I don't like syntactic flexibility. I am a big fan of Ruby's unless, for example
  
  briancr a day ago
  
  “IMO it's pointless to distinguish syntactically between iterating forwards and backwards” — I completely agree. It’s really a compiler-macro limitation that’s preventing me from doing this.. though I don’t have to go that route.
  I think what you’re suggesting would require the <a, b> syntax to produce a proper iterator type, which it doesn’t currently do. That’s definitely worth considering — then you could attach methods, etc.
  Thanks for the suggestion! I’ll think about the best way to fix this..

newzino 2 days ago

The "aliases not pointers" approach for memory safety is interesting. Curious how you handle the performance implications - traditional aliasing analysis in compilers is expensive because determining whether two aliases point to the same memory is hard.

Are you doing this at runtime (reference counting or similar), or have you found a way to make the static analysis tractable by restricting what aliasing patterns are allowed?

The 250kB size is impressive for a language with inheritance and N-dimensional arrays. For comparison, Lua's VM is around 200-300kB and doesn't include some of those features. What did you have to leave out to hit that size? I assume no JIT, but what about things like regex, IO libraries, etc?

Also - calling back into C functions from the script is a key feature for embeddability. How do you handle type marshalling between the script's type system and C's? Do you expose a C API where I register callbacks with type signatures, or is there reflection/dynamic typing on the boundary?

briancr 2 days ago

Good questions! The short answer to the first is that the language is interpreted, not compiled, so optimizations are moot.
Aliases are strongly typed, which helps avoid some issues. Memory mods come with the territory: if ‘a’ and ‘b’ point to the same array and ‘a’ resizes that array, then the array behind ‘b’ gets resized too. The one tricky situation is when ‘a’ and ‘b’ each reference a range of elements, not the whole array, because a resize of ‘a’ would force a resize of the width of ‘b’. Resizing in this case is usually not allowed.
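A minimal C sketch of the whole-array case described above, assuming aliases behave like handles to one shared array record. The Array and Alias types and the alias_resize function are hypothetical, not Cicada's internals.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types, for illustration only. */
typedef struct {
    int    *data;
    size_t  len;
} Array;

/* An alias is a handle to the shared Array, never a raw element pointer. */
typedef struct {
    Array *target;
} Alias;

/* Resize through any alias; every other alias sees the new length because
   they all share one Array record. Returns 0 on success, -1 on failure. */
int alias_resize(Alias a, size_t new_len)
{
    assert(a.target != NULL && new_len > 0);
    int *p = realloc(a.target->data, new_len * sizeof *p);
    if (p == NULL) return -1;                       /* old data left intact */
    for (size_t i = a.target->len; i < new_len; i++)
        p[i] = 0;                                   /* zero any new slots */
    a.target->data = p;
    a.target->len  = new_len;
    return 0;
}

size_t alias_len(Alias a) { return a.target->len; }
```

Because no alias holds the data pointer itself, the realloc cannot leave another alias pointing at a stale buffer.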
Garbage collection is indeed done (poorly) by reference counting, and also (very well) by a tracing function that Cicada’s command line script runs after every command.
You’re exactly right, the library is lean because I figure it’s easy to add a C function interface for any capability you want. There’s a bit of personal bias as to what I did include - for example all the basic calculator functions are in, right down to atan(), but no regex. Basic IO (save, load, input, print) is included.
Type marshaling — the Cicada int/float types are defined by cicada.h and can be changed! You just have to use the same types in your C code.
When you run Cicada you pass a list of C functions paired with their Cicada names: { "myCfunction", &myCfunction }. Then, in Cicada, $myCfunction() runs the callback.
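As a rough illustration of that name/function pairing, here is a self-contained C sketch of a binding table with a lookup by name. The types below (Binding, Callback, the cut-down argsType) and find_callback are assumptions made for this example; the real declarations live in cicada.h and may look different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the types in cicada.h. */
typedef int ccInt;
typedef struct { int num; void **p; } argsType;
typedef ccInt (*Callback)(argsType);

typedef struct {
    const char *name;   /* name the script uses, e.g. $myCfunction() */
    Callback    fn;     /* C function to run */
} Binding;

/* Example callback: just reports how many arguments it received. */
ccInt myCfunction(argsType args) { return (ccInt) args.num; }

/* Linear search by name; returns NULL for an unknown function name. */
Callback find_callback(const Binding *table, size_t count, const char *name)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].fn;
    return NULL;
}
```

An interpreter built this way would resolve $myCfunction() by looking up "myCfunction" in the table and calling the function pointer it finds.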
Thanks for the questions! This is exactly the sort of feedback that helps me learn more about the landscape..
- newzino 2 days ago
  
  Thanks for the detailed response. The interpreted approach makes sense for the use case - when you're embedding a scripting layer, you usually want simplicity and portability over raw speed anyway.
  The aliasing semantics you describe (resizes propagating through aliases) is an interesting choice. It's closer to how references work in languages like Python than to the "borrow checker" approach Rust takes. Probably more intuitive for users coming from dynamic languages, even if it means some operations need runtime checks.
  The hybrid GC approach (reference counting + periodic tracing) is pragmatic. Reference counting handles the common case cheaply, and the tracing pass catches cycles. That's similar to how CPython handles it.
  The C registration API sounds clean - explicit pairing of names to function pointers is about as simple as it gets. Do you handle varargs on the Cicada side, or does each registered function have a fixed arity that the interpreter enforces?
  
  briancr 2 days ago
  
  Yes there are lots of runtime checks.. unfortunately, but I always fork the time-consuming calculations into C anyway so those checks don’t really affect overall performance much.
  Scripted functions have no set arity, and the same applies to callback C functions. Scripted functions collect their arguments inside an ‘args’ variable. Likewise, each C function has a single ‘argsType’ argument which collects the argument pointers & type info. There are macros to help unpack them, but if you want to do the unpacking manually then the function can be called variadically:
  ccInt myCfunction(argsType args)
  { for (int a = 0; a < args.num; a++) printf("%p\n", args.p[a]); return 0; }
  So all functions are automatically variadic.
  It’s good to know that these GC/etc. solutions are even used by the big languages..
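For anyone wanting to see manual unpacking with type checks, here is a small C sketch. The argsType layout is assumed from the fields named in this thread (num, p, type), and the TYPE_INT / TYPE_FLOAT tags and sumInto function are invented for the example; the real definitions in cicada.h may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, modelled on the fields mentioned above. */
typedef int ccInt;
enum { TYPE_INT, TYPE_FLOAT };   /* hypothetical type tags */

typedef struct {
    int    num;     /* number of arguments passed */
    void **p;       /* pointer to each argument's data */
    int   *type;    /* type tag for each argument */
} argsType;

/* Adds every argument except the last and stores the total in the last one,
   which must be a float. Accepts any number of inputs.
   Returns 0 on success, nonzero on an arity or type error. */
ccInt sumInto(argsType args)
{
    assert(args.num == 0 || (args.p != NULL && args.type != NULL));
    if (args.num < 1 || args.type[args.num - 1] != TYPE_FLOAT) return 1;

    double total = 0.0;
    for (int a = 0; a < args.num - 1; a++) {
        if (args.type[a] == TYPE_INT)        total += *(int *) args.p[a];
        else if (args.type[a] == TYPE_FLOAT) total += *(double *) args.p[a];
        else return 2;                       /* unknown type tag */
    }
    *(double *) args.p[args.num - 1] = total;
    return 0;
}
```

The arity is never declared anywhere: the function inspects num and the type tags at run time and rejects calls it can't handle.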
  
  newzino 21 hours ago
  
  The "all functions are automatically variadic" design is a nice simplicity win. No overloading, no arity mismatches at call sites - just a uniform calling convention.
  The argsType struct with pointer array and count is essentially how varargs works at the ABI level in C anyway, you've just made it explicit. And having the type info alongside the pointers means you get runtime type checking without the caller needing to pass format strings or sentinel values like traditional C varargs.
  The tradeoff is you lose static arity checking at parse time, but for an embedded scripting use case that's probably fine - you're validating at runtime anyway and the error messages can be more helpful than "wrong number of arguments."
  Do you have plans for optional/default arguments, or is that outside the scope? With variadic-by-default it'd be natural to just check args.num and use defaults for missing ones.
  
  briancr 18 hours ago
  
  Yes and the simplicity extends to function definitions too, since you don’t have to specify any type info. E.g.
  f :: { ; print(args) }
  Brevity is especially nice for inline/anonymous functions.
  You can definitely use args.num, args.type[], and args.indices[] to figure out which optional parameters were passed, but I’ve decided that it’s usually easier to pass a full set of parameters into C and have the scripted wrapper handle the optional params. This is easy in Cicada because of ‘code substitution’ (one of the innovations I’m proudest of and if you’ve seen this elsewhere please let me know!). Example:
  callC :: {
      mandatoryArgs :: { int, int }
      optionalArgs :: { str :: string; str = "default" }
      code
      mandatoryArgs = args
      optionalArgs(), (optionalArgs<<args)()   | set default, then allow user to change it
      $Cfunction(mandatoryArgs, optionalArgs)
  }
  Then you can call it with or without modifying the optional parameters from their default values.
  callC(2, 3) | uses the default string
  callC(2, 3; str = "modified param")
  callC() runs its arguments as a function, substituted into the optionalArgs variable, allowing the arguments to modify the optional parameters. This is weird and I haven’t seen it elsewhere, but it’s very useful.
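The closest plain-C analogy to this pattern is probably "set defaults, then run a caller-supplied function over them". The sketch below shows only that analogy. It is not how Cicada implements code substitution, and OptionalArgs, Modifier, set_modified and the stand-in Cfunction are all hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names throughout; this mirrors the wrapper's shape only. */
typedef struct { char str[32]; } OptionalArgs;

/* The caller's "arguments as code": a function that edits the optional params. */
typedef void (*Modifier)(OptionalArgs *opt);

/* Stand-in for $Cfunction: returns a + b + length of the string it was given. */
int Cfunction(int a, int b, const OptionalArgs *opt)
{
    assert(opt != NULL);
    return a + b + (int) strlen(opt->str);
}

int callC(int a, int b, Modifier modify)
{
    OptionalArgs opt;
    strcpy(opt.str, "default");          /* set the default first */
    if (modify != NULL) modify(&opt);    /* then let the caller change it */
    return Cfunction(a, b, &opt);        /* C always sees a full parameter set */
}

/* Example modifier, playing the role of: str = "modified param" */
void set_modified(OptionalArgs *opt) { strcpy(opt->str, "modified param"); }
```

Here callC(2, 3, NULL) uses the default string and callC(2, 3, set_modified) overrides it, while Cfunction itself never has to deal with missing arguments.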

codr7 2 days ago

Nice, the more the merrier!

I've been working on one for Kotlin lately:

https://gitlab.com/codr7/shik

briancr 2 days ago

Very cool! I’ve never used Kotlin..

briancr 2 days ago

Thanks for the references! Writing a language was almost an accident — I worked on a neural networks tool with a scripted interface back around 2000, before I’d ever heard of some of these other languages.. and I’ve been using/updating it ever since.

Beyond NNs, my use case is to embed fast C calculations into the language to make scientific programming easier. But the inspiration was less about the use case and more about certain programming innovations which I’m sure are elsewhere but I’m not sure where — like aliases, callable function arguments, generalized inheritance, etc.

That’s a great list — most of those languages I’ve honestly never heard of..

nextaccountic 2 days ago

> Uses aliases not pointers, so it's memory-safe

How does it deal with use after free? How does it deal with data races?

Memory safety can't be solved by just eliminating pointer arithmetic, there's more stuff needed to achieve it

briancr 2 days ago

There’s no multithreading so race conditions don’t apply. That simplifies things quite a bit.
There’s actually no ‘free’, but in the (member -> variable data) ontology of Cicada there are indeed a few ways memory can become disused: 1) members can be removed; 2) members can be re-aliased; 3) arrays or lists can be resized. In those conditions the automated/manual collection routines will remove the disused memory, and in no case is there any dangling ‘pointer’ (member or alias) pointing to unallocated memory. Does this answer your question?
I agree that my earlier statement wasn’t quite a complete explanation.
Of course, since it interfaces with C, it’s easy to overwrite memory in the callback functions.
- nextaccountic a day ago
  
  I mean, that's a neat tradeoff, however..
  > There’s actually no ‘free’, but in the (member -> variable data) ontology of Cicada there are indeed a few ways memory can become disused: 1) members can be removed; 2) members can be re-aliased; 3) arrays or lists can be resized. In those conditions the automated/manual collection routines will remove the disused memory, and in no case is there any dangling ‘pointer’ (member or alias) pointing to unallocated memory. Does this answer your question?
  Does this mean that Cicada will happily and wildly leak memory if I allocate short lived objects in a loop?
  Why don't you just add some reference counting or tracing GC like everybody else?
  > 1) members can be removed;
  Does this cause a use after free if somebody had access to this member? Or will it give an error during access?
  
  briancr a day ago
  
  No, there are both reference-counting and tracing GC routines that will deallocate short-lived objects. Sorry, I was just trying to enumerate the ways memory goes out of scope to show that none of those ways results in an invalid pointer _within the scripting language_.
  The safety comes because there is no way to access a pointer address within the scripting language. The main functionality of pointers is replaced by aliases (e.g. a = @b.c, a = @array[2], etc.). The only use of pointers is behind the scenes, e.g. when you write ‘b.c’ there is of course pointer arithmetic behind the scenes to find the data in member ‘b’.
  Having said that, it is certainly possible for a C callback routine to store an internal pointer, then on a second callback try to use that pointer after it has fallen out of scope. This is the only use-after-free I can imagine.
  
  nextaccountic 15 hours ago
  
  Okay, this is the usual way to perform safe memory management in managed / high level programming languages.. it was just that your "alias" terminology threw me off
  Note that you can add multithreading later if you adopt message passing / actor model. Even Javascript, which is famously single threaded, gained workers with message passing at some point
  
  briancr 13 hours ago
  
  Yes, multithreading seems to be a consistent theme among the comments.. so I should definitely look into that. Thanks for the comment. (I actually haven’t done much threaded programming myself so this would be a learning experience for me..)
  
  briancr a day ago
  
  Also, if someone else has access to the member, meaning that there is an alias to the member, then the reference count should reflect that. Here’s an example:
  i :: int | 1 reference
  a := @i | 2 references
  remove i | 1 reference
  The data originally allocated for ‘i’ should persist because its reference count hasn’t hit zero yet.
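The three-line example above can be modelled in C roughly as follows. This is only a sketch of reference counting in general; the Data and Member types and the function names are hypothetical, not Cicada's real structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of "member -> variable data". */
typedef struct {
    int refs;    /* how many members/aliases currently point here */
    int value;
} Data;

typedef struct { Data *target; } Member;

/* i :: int   -> allocate data with one reference */
Member define_int(void)
{
    Data *d = calloc(1, sizeof *d);
    assert(d != NULL);
    d->refs = 1;
    return (Member){ d };
}

/* a := @i   -> a second member sharing the same data */
Member alias_of(Member m)
{
    m.target->refs++;
    return m;
}

/* remove m   -> drop one reference; free the data only when none remain.
   Returns the remaining reference count. */
int remove_member(Member *m)
{
    assert(m->target != NULL && m->target->refs > 0);
    int left = --m->target->refs;
    if (left == 0) free(m->target);
    m->target = NULL;       /* the removed member can no longer reach the data */
    return left;
}
```

Removing i leaves the count at 1, so the data stays valid for a; only removing a as well frees it.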

tayistay 2 days ago

Can I call into the interpreter from multiple threads or does it use global state?

briancr 2 days ago

There’s no multithreading capability built into Cicada. So a given instance of the interpreter only has a single concurrent state, and all C callbacks share memory with that global state. Multithreading would require a C-based thread manager.

eps 2 days ago

What's the use case? Clearly, you made it with some specific use in mind, at least initially. What was it?

briancr 2 days ago

To be more specific (see my general comment), I’ve used the language in two open-source projects: 1) a chromosome conformation reconstruction tool, and 2) a fast neural network generator (back end). Re Project 2: I’m also planning to embed the language into results webpages served from the NN generator website.

languagehacker 2 days ago

I've lost count of projects called Cicada

publicdebates 2 days ago

A new one seems to pop up every year, and some every 13 or 17 years.
- briancr 2 days ago
  
  This one’s Brood VI!
briancr 2 days ago

I know, I was dismayed to find out that there’s even another scripting language called Cicada.
The name came when I was living in Seattle and missed the sounds of east coast summer..