Symbol generation with lazily generated names that behave like standard symbols #909

mnieper · 2025-01-31T11:20:16Z

This patch implements the (generate-symbol [pretty-name]) procedure. Like (gensym), it returns a symbol with a lazily generated unique name; however, the resulting symbol is not a gensym but an ordinary symbol.

Such a procedure is useful in situations where it is expected that symbols are identical if and only if their names (as returned by symbol->string) are spelled the same.

In particular, this is the case for R6RS code. Strictly speaking, by section 11.10 of the R6RS, gensyms are non-conformant. This patch improves R6RS compatibility by providing an R6RS version of generate-temporaries where the symbols are generated by generate-symbol and not by gensym. Moreover, ordinary symbols have write/read invariance even when only standard lexical syntax is allowed.

For the context of this patch, see also SRFI 260.

mflatt · 2025-02-02T02:48:50Z

This idea seems ok to me. Some thoughts on the implementation here:

It looks like there's a problem with symbol-hash on a generated symbol before interning is forced, as in (symbol-hash (generate-symbol)).
It seems a little awkward to have two representations of a "plain" symbol: a string as the name or (cons <string> #t) after a generated symbol is forced. That could be addressed with a small change to S_uninterned and how it is used (illustrated by the alternative linked below).
The inlined form of symbol->string ends up being quite a lot of code. Maybe symbol->string isn't used often enough for it to matter, but I tried an alternate encoding to make it simpler.
The code for generate-temporaries and for buildmake-symbol seems unnecessarily duplicated.

For your consideration, here's what I'd do: https://github.com/mflatt/ChezScheme/tree/srfi-260. I was aiming to simplify the encoding so that the hash field is effectively used to indicate whether a gensym or generated symbol is reified, and the name field has the symbol->string result in a more consistent place. The encoding as a combination of name and hash didn't end up as nice as I hoped it would, but I like it a little better. Even if you don't like the alternate encoding, some tests and other changes there may be useful.

FWIW (but you may already know this): Another possible approach to generate-temporaries is to generate the scope part of the name, instead of the symbol part. Racket uses that strategy so that the symbol part can be deterministic, which is part of a general effort toward deterministic builds.

mnieper · 2025-02-03T09:21:30Z

Thank you for this thorough review and the improvements, @mflatt. I went through them and incorporated them in my pull request. I simply forgot about checking for symbol-hash, and fixing the representation for uninterned symbols at the same time is great.

Due to the improvements, I could apply two further simplifications.

You added the "discard" flag back on string->symbol. It has always been on gensym->unique-string. Is there a good reason why these two procedures should behave differently?

It seems unfortunate that symbol-hash forces interning a "generated symbol", which can make them more costly than uninterned symbols. As a way out, symbol-hash could generate a random hash for yet uninterned generated symbols, which would then be incorporated into the generated name. The procedure extracting hash values from symbol names would need a special case that detects names constructed in such a way and calculates, in this case, the hash just by extracting it. What do you think?

> FWIW (but you may already know this): Another possible approach to generate-temporaries is to generate the scope part of the name, instead of the symbol part. Racket uses that strategy so that the symbol part can be deterministic, which is part of a general effort toward deterministic builds.

This is how I would implement them if there were no historical baggage. As R6RS does not have identifier-hash or identifier<?, the symbol part needs to be sufficiently diverse (definitely not constant), as otherwise algorithms using hashtables to map identifiers (e.g. to model environments) behave badly. So we would at least need a symbol counter (so that the symbols can be named, say, tmp0, tmp1, ...). Moreover, the symbols would then likely be interned by generate-temporaries, which needs locking (as would the symbol counter). All in all, I think the current implementation is the simplest one for CS.

mflatt · 2025-02-03T14:18:06Z

> You added the "discard" flag back on string->symbol. It has always been on gensym->unique-string. Is there a good reason why these two procedures should behave differently?

I think it would be better for gensym->unique-string to have a discard flag. If a program calls gensym->unique-string but doesn't use the result, then it would be nice for the call to go away. But there must be some function that has an effect; symbol-hash relies on gensym->unique-string as producing the effect, and there may be other places that rely on it.

Meanwhile, symbol->string is used much more widely and seems more likely to end up in a place where its result is unused, so it seems worth keeping the current potential optimization. The side effect can be deferred to $gensym->pretty-name or some other internal function.

> As a way out, symbol-hash could generate a random hash for yet uninterned generated symbols, which would then be incorporated into the generated name.

I don't immediately see how this would work. My understanding so far is that the interaction below is meant to reliably produce #t, since the same name is written after quote as is printed:

> (define s (generate-symbol))
> s
g-nx26tlsbk3mm7fw2h7u1btx9x
> (eq? s 'g-nx26tlsbk3mm7fw2h7u1btx9x)
#t

To make sure that 'g-nx26tlsbk3mm7fw2h7u1btx9x finds the same symbol, the quoted symbol needs the same hash code. How would it find that? Would interning need to check a table of generated names just before settling on a hash code for a symbol? What would happen if I fasl s to a file, restart Chez Scheme, (define s2 'g-nx26tlsbk3mm7fw2h7u1btx9x), and then load s out of the fasl file? Or do I misunderstand the intended behavior of generated symbols?

mnieper · 2025-02-03T14:50:40Z

> You added the "discard" flag back on string->symbol. It has always been on gensym->unique-string. Is there a good reason why these two procedures should behave differently?

(I meant "missing on gensym->unique-string", of course.)

> I think it would be better for gensym->unique-string to have a discard flag. If a program calls gensym->unique-string but doesn't use the result, then it would be nice for the call to go away. But there must be some function that has an effect; symbol-hash relies on gensym->unique-string as producing the effect, and there may be other places that rely on it.

Okay, but then it would make sense to have a dedicated primitive for that purpose like $force-intern!, which you introduced in 5_7.ms, wouldn't it?

> As a way out, symbol-hash could generate a random hash for yet uninterned generated symbols, which would then be incorporated into the generated name.

> I don't immediately see how this would work. My understanding so far is that the interaction below is meant to reliably produce #t, since the same name is written after quote as is printed:
>
>     > (define s (generate-symbol))
>     > s
>     g-nx26tlsbk3mm7fw2h7u1btx9x
>     > (eq? s 'g-nx26tlsbk3mm7fw2h7u1btx9x)
>     #t
>
> To make sure that 'g-nx26tlsbk3mm7fw2h7u1btx9x finds the same symbol, the quoted symbol needs the same hash code. How would it find that? Would interning need to check a table of generated names just before settling on a hash code for a symbol? What would happen if I fasl s to a file, restart Chez Scheme, (define s2 'g-nx26tlsbk3mm7fw2h7u1btx9x), and then load s out of the fasl file? Or do I misunderstand the intended behavior of generated symbols?

I believe you understand the intended behaviour of generated symbols correctly. My idea is the following: If symbol-hash is called before the symbol is interned, a cheap hash is randomly generated without locking and recorded as part of the hash field. Let us assume for demonstration purposes that the code is the number described by #xdeadbeef.

When the symbol is later interned, the symbol's unique name will then be something like g-nx26tlsbk3mm7fw2h7u1btx9x-generated-deadbeef. The hash function in intern.c detects symbol names of this form (in this example, g-<key>-generated-<hash>) and will return #xdeadbeef instead of taking all characters into account.

When a symbol with a name like g-nx26tlsbk3mm7fw2h7u1btx9x-generated-deadbeef is input directly, the hash function will make no distinction and likewise return #xdeadbeef.

The disadvantage of this approach is that it will be easy to forge given hash values. On the other hand, the symbol hash is not cryptographically secure anyway.
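The scheme above can be sketched in C as follows. This is a minimal illustration, not the code in intern.c: the function names, the fixed eight-digit lowercase hex suffix, and the FNV-1a fallback hash are all assumptions made for the example, and the actual name format and hash in Chez Scheme may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical marker separating the unique key from the embedded hash. */
#define GENERATED_MARKER "-generated-"
#define HASH_DIGITS 8

/* Stand-in for the ordinary hash over all characters (FNV-1a, an assumption). */
static uint32_t hash_all_chars(const char *name) {
    uint32_t h = 2166136261u;
    for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
        h ^= *p;
        h *= 16777619u;
    }
    return h;
}

/* Parse exactly HASH_DIGITS lowercase hex digits; returns 1 on success. */
static int parse_hex32(const char *s, uint32_t *out) {
    uint32_t v = 0;
    for (int i = 0; i < HASH_DIGITS; i++) {
        char c = s[i];
        uint32_t d;
        if (c >= '0' && c <= '9') d = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f') d = (uint32_t)(c - 'a' + 10);
        else return 0;
        v = (v << 4) | d;
    }
    *out = v;
    return 1;
}

/* Build "g-<key>-generated-<hash>" into buf; returns snprintf's result. */
int make_generated_name(char *buf, size_t size, const char *key, uint32_t hash) {
    assert(buf != NULL && key != NULL);
    return snprintf(buf, size, "g-%s" GENERATED_MARKER "%08x", key, (unsigned)hash);
}

/* Return the embedded hash for names of the generated form,
   otherwise fall back to hashing every character. */
uint32_t symbol_name_hash(const char *name) {
    size_t len = strlen(name);
    size_t marker_len = strlen(GENERATED_MARKER);
    size_t tail = marker_len + HASH_DIGITS;
    uint32_t embedded;
    if (len > tail + 2 && name[0] == 'g' && name[1] == '-'
        && strncmp(name + len - tail, GENERATED_MARKER, marker_len) == 0
        && parse_hex32(name + len - HASH_DIGITS, &embedded))
        return embedded;
    return hash_all_chars(name);
}
```

Because the hash is recovered from the spelling alone, a symbol read back from text or from a fasl file in a fresh session gets the same hash as the one assigned before interning, with no table of generated names to consult.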

At the moment, unique_id is a costly operation (in stats.c) because it involves opening /dev/urandom each time it is called. Generating a random hash can be done much faster.

Maybe, if unique_id is replaced by a fast version (e.g. the device should be kept open and buffered so that reads only seldom have to call the OS), the advantage of computing a cheap hash may be reduced enough. I don't know whether unique_id can be made fast on each supported platform. It is okay if it needs locking because interning the symbol (which has to be done once a unique name is calculated) needs locking anyway.
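One way to keep the device open and buffered, sketched under assumptions: the function name and the 4096-byte buffer size are hypothetical, this is not the code in stats.c, and it covers only platforms that provide /dev/urandom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stream opened once and reused. The lazy initialisation below is not
   thread-safe on its own; the caller is assumed to hold the lock that
   interning needs anyway. */
static FILE *random_stream = NULL;
static char random_buffer[4096];

/* Fill out[0..len) with random bytes; returns 0 on success, -1 on failure. */
int buffered_random_bytes(unsigned char *out, size_t len) {
    assert(out != NULL);
    if (random_stream == NULL) {
        random_stream = fopen("/dev/urandom", "rb");
        if (random_stream == NULL) return -1;
        /* Full buffering, so most calls are served from memory
           rather than by a system call. */
        if (setvbuf(random_stream, random_buffer, _IOFBF, sizeof random_buffer) != 0) {
            fclose(random_stream);
            random_stream = NULL;
            return -1;
        }
    }
    return fread(out, 1, len, random_stream) == len ? 0 : -1;
}
```

One caveat with any such buffer: after fork(), parent and child would both hold the same unread bytes, so a real implementation would need to discard the buffer in the child to avoid producing duplicate identifiers.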

mflatt · 2025-02-03T15:24:26Z

> Okay, but then it would make sense to have a dedicated primitive for that purpose like $force-intern!, which you introduced in 5_7.ms, wouldn't it?

Agreed. It just didn't seem worth the effort of tracking down uses of gensym->unique-string to replace them or to figure out whether there's some way that code outside of the implementation could somehow rely on the side effect.

> When the symbol is later interned, the symbol's unique name will then be something like g-nx26tlsbk3mm7fw2h7u1btx9x-generated-deadbeef. The hash function in intern.c detects symbol names of this form (in this example, g-<key>-generated-<hash>) and will return #xdeadbeef instead of taking all characters into account.

Ah, I see now. Yes, that does seem workable.

mnieper and others added 17 commits January 30, 2025 15:25

Add generate-symbol procedure

4c2b15b

Add -symbol->name

7ddb8c8

Correct gensym? in the presence of generated symbols

47e52bc

Add missing

0fb972f

Fix logic of pretty names

30e3932

Improve generated name

367ec49

Add tests (one failing)

8d6153f

Fix tests

7db0600

Add R6RS version of generate-temporaries

2d79d9a

Fix uses of $symbol-name

1bebff7

Document variations on symbol-name

159e153

Update expected errors

ecfd772

Update patch file

f625140

Ensure that bootstrap works with old version

b6604c3

Update documentation

cfbc56b

Update patch files

74f0c2d

adjust symbol encoding to simplify operations

31dbe15

mnieper added 2 commits February 3, 2025 09:00

Restore symbol->lambda-name for new representation

ba94c4d

Remove obsolete real_symname

72b811a

Make use of the list argument of generate-temporaries to build pretty generated symbol names

6f5a62f
