Symbol generation with lazily generated names that behave like standard symbols #909

Open: wants to merge 20 commits into main

mnieper
Contributor

@mnieper mnieper commented Jan 31, 2025

This patch implements the (generate-symbol [pretty-name]) procedure. Like (gensym), it returns a symbol with a lazily generated unique name; however, the resulting symbol is not a gensym but an ordinary symbol.

Such a procedure is useful in situations where it is expected that symbols are identical if and only if their names (as returned by symbol->string) are spelled the same.

In particular, this is the case for R6RS code. Strictly speaking, by section 11.10 of the R6RS, gensyms are non-conformant. This patch improves R6RS compatibility by providing an R6RS version of generate-temporaries where the symbols are generated by generate-symbol and not by gensym. Moreover, ordinary symbols have write/read invariance even when only standard lexical syntax is allowed.

For the context of this patch, see also SRFI 260.
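
As a rough illustration of the behaviour described above, the following C sketch models a generated symbol whose name is produced only on first request and which, once named, is the same object that interning that spelling returns. It is not the code of this patch: the linked-list intern table, the counter-based `g-<n>` names, and every function name below are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct symbol {
    char *name;           /* NULL until a name is first requested */
    struct symbol *next;  /* next symbol in the toy intern table */
} symbol;

static symbol *interned = NULL;    /* hypothetical intern table: a linked list */
static unsigned long next_id = 0;  /* stand-in for a real unique-id source */

/* Look up an interned symbol by spelling; NULL if there is none. */
static symbol *find(const char *name) {
    for (symbol *s = interned; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;
}

/* Give `s` the spelling `name` and add it to the table. */
static void enter(symbol *s, const char *name) {
    s->name = malloc(strlen(name) + 1);
    assert(s->name != NULL);
    strcpy(s->name, name);
    s->next = interned;
    interned = s;
}

static symbol *new_symbol(void) {
    symbol *s = malloc(sizeof *s);
    assert(s != NULL);
    s->name = NULL;
    s->next = NULL;
    return s;
}

/* Like string->symbol: same spelling, same symbol. */
static symbol *intern(const char *name) {
    symbol *s = find(name);
    if (s == NULL) {
        s = new_symbol();
        enter(s, name);
    }
    return s;
}

/* Like generate-symbol: no name is chosen and the table is untouched. */
static symbol *generate_symbol(void) {
    return new_symbol();
}

/* Like symbol->string: pick an unused name on first use, then intern. */
static const char *symbol_name(symbol *s) {
    if (s->name == NULL) {
        char buf[32];
        do {
            snprintf(buf, sizeof buf, "g-%lu", next_id++);
        } while (find(buf) != NULL);
        enter(s, buf);
    }
    return s->name;
}
```

In this model, `intern(symbol_name(s)) == s` holds for any generated symbol `s`, which is the property that a gensym with a separate unique name does not give.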

@mflatt
Contributor

mflatt commented Feb 2, 2025

This idea seems ok to me. Some thoughts on the implementation here:

  • It looks like there's a problem with symbol-hash on a generated symbol before interning is forced, as in (symbol-hash (generate-symbol)).
  • It seems a little awkward to have two representations of a "plain" symbol: a string as the name or (cons <string> #t) after a generated symbol is forced. That could be addressed with a small change to S_uninterned and how it is used (illustrated by the alternative linked below).
  • The inlined form of symbol->string ends up being quite a lot of code. Maybe symbol->string isn't used often enough for it to matter, but I tried an alternate encoding to make it simpler.
  • The code for generate-temporaries and for buildmake-symbol seems unnecessarily duplicated.

For your consideration, here's what I'd do: https://github.com/mflatt/ChezScheme/tree/srfi-260. I was aiming to simplify the encoding so that the hash field is effectively used to indicate whether a gensym or generated symbol is reified, and the name field has the symbol->string result in a more consistent place. The encoding as a combination of name and hash didn't end up as nice as I hoped it would, but I like it a little better. Even if you don't like the alternate encoding, some tests and other changes there may be useful.

FWIW (but you may already know this): Another possible approach to generate-temporaries is to generate the scope part of the name, instead of the symbol part. Racket uses that strategy so that the symbol part can be deterministic, which is part of a general effort toward deterministic builds.

@mnieper
Contributor Author

mnieper commented Feb 3, 2025

Thank you for this thorough review and the improvements, @mflatt. I went through them and incorporated them in my pull request. I simply forgot about checking for symbol-hash, and fixing the representation for uninterned symbols at the same time is great.

Due to the improvements, I could apply two further simplifications.

You added the "discard" flag back on string->symbol. It has always been on gensym->unique-string. Is there a good reason why these two procedures should behave differently?

It seems unfortunate that symbol-hash forces interning a "generated symbol", which can make them more costly than uninterned symbols. As a way out, symbol-hash could generate a random hash for yet uninterned generated symbols, which would then be incorporated into the generated name. The procedure extracting hash values from symbol names would need a special case that detects names constructed in such a way and calculates, in this case, the hash just by extracting it. What do you think?

FWIW (but you may already know this): Another possible approach to generate-temporaries is to generate the scope part of the name, instead of the symbol part. Racket uses that strategy so that the symbol part can be deterministic, which is part of a general effort toward deterministic builds.

This is how I would implement them if there were no historical baggage. As R6RS does not have identifier-hash or identifier<?, the symbol part needs to be sufficiently diverse (definitely not constant), as otherwise algorithms using hashtables to map identifiers (e.g. to model environments) behave badly. So we would at least need a symbol counter (so that the symbols can be named, say, tmp0, tmp1, ...). Moreover, the symbols would then likely be interned by generate-temporaries, which needs locking (as would the symbol counter). All in all, I think the current implementation is the simplest one for CS.
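
To make the hashtable concern concrete, here is a small C sketch that hashes temporaries by their symbol name only, as a portable R6RS program would have to. The FNV-1a hash, the 64-bucket table, and the helper names are assumptions chosen for the example and are not taken from Chez Scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BUCKETS 64
#define NAME_LEN 16

/* FNV-1a over the spelling; a stand-in for any hash of the symbol name. */
static uint32_t name_hash(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Largest number of names that end up in one bucket. */
static size_t max_bucket_load(char names[][NAME_LEN], size_t n) {
    size_t load[BUCKETS] = {0};
    size_t max = 0;
    for (size_t i = 0; i < n; i++) {
        size_t b = name_hash(names[i]) % BUCKETS;
        if (++load[b] > max) max = load[b];
    }
    return max;
}

/* Fill with the constant name "tmp", or with "tmp0", "tmp1", ... */
static void fill_names(char names[][NAME_LEN], size_t n, int numbered) {
    for (size_t i = 0; i < n; i++) {
        if (numbered)
            snprintf(names[i], NAME_LEN, "tmp%zu", i);
        else
            snprintf(names[i], NAME_LEN, "tmp");
    }
}
```

With a constant symbol part every temporary lands in the same bucket, so lookups degrade to a linear scan; a counter spreads the names over the table at the price of shared mutable state.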

@mflatt
Contributor

mflatt commented Feb 3, 2025

You added the "discard" flag back on string->symbol. It has always been on gensym->unique-string. Is there a good reason why these two procedures should behave differently?

I think it would be better for gensym->unique-string to have a discard flag. If a program calls gensym->unique-string but doesn't use the result, then it would be nice for the call to go away. But there must be some function that has an effect; symbol-hash relies on gensym->unique-string as producing the effect, and there may be other places that rely on it.

Meanwhile, symbol->string is used much more widely and seems more likely to end up in a place where its result is unused, so it seems worth keeping the current potential optimization. The side effect can be deferred to $gensym->pretty-name or some other internal function.

As a way out, symbol-hash could generate a random hash for yet uninterned generated symbols, which would then be incorporated into the generated name.

I don't immediately see how this would work. My understanding so far is that the interaction below is meant to reliably produce #t, since the same name is written after quote as is printed:

> (define s (generate-symbol))
> s
g-nx26tlsbk3mm7fw2h7u1btx9x
> (eq? s 'g-nx26tlsbk3mm7fw2h7u1btx9x)
#t

To make sure that 'g-nx26tlsbk3mm7fw2h7u1btx9x finds the same symbol, the quoted symbol needs the same hash code. How would it find that? Would interning need to check a table of generated names just before settling on a hash code for a symbol? What would happen if I fasl s to a file, restart Chez Scheme, (define s2 'g-nx26tlsbk3mm7fw2h7u1btx9x), and then load s out of the fasl file? Or do I misunderstand the intended behavior of generated symbols?

@mnieper
Contributor Author

mnieper commented Feb 3, 2025

You added the "discard" flag back on string->symbol. It has always been on gensym->unique-string. Is there a good reason why these two procedures should behave differently?

(I meant "missing on gensym->unique-string", of course.)

I think it would be better for gensym->unique-string to have a discard flag. If a program calls gensym->unique-string but doesn't use the result, then it would be nice for the call to go away. But there must be some function that has an effect; symbol-hash relies on gensym->unique-string as producing the effect, and there may be other places that rely on it.

Okay, but then it would make sense to have a dedicated primitive for that purpose like $force-intern!, which you introduced in 5_7.ms, wouldn't it?

As a way out, symbol-hash could generate a random hash for yet uninterned generated symbols, which would then be incorporated into the generated name.

I don't immediately see how this would work. My understanding so far is that the interaction below is meant to reliably produce #t, since the same name is written after quote as is printed:

> (define s (generate-symbol))
> s
g-nx26tlsbk3mm7fw2h7u1btx9x
> (eq? s 'g-nx26tlsbk3mm7fw2h7u1btx9x)
#t

To make sure that 'g-nx26tlsbk3mm7fw2h7u1btx9x finds the same symbol, the quoted symbol needs the same hash code. How would it find that? Would interning need to check a table of generated names just before settling on a hash code for a symbol? What would happen if I fasl s to a file, restart Chez Scheme, (define s2 'g-nx26tlsbk3mm7fw2h7u1btx9x), and then load s out of the fasl file? Or do I misunderstand the intended behavior of generated symbols?

I believe you understand the intended behaviour of generated symbols correctly. My idea is the following: If symbol-hash is called before the symbol is interned, a cheap hash is randomly generated without locking and recorded as part of the hash field. Let us assume for demonstration purposes that the code is the number described by #xdeadbeef.

When the symbol is later interned, the symbol's unique name will then be something like g-nx26tlsbk3mm7fw2h7u1btx9x-generated-deadbeef. The hash function in intern.c detects symbol names of this form (in this example, g-<key>-generated-<hash>) and returns #xdeadbeef instead of taking all characters into account.

When a symbol with a name like g-nx26tlsbk3mm7fw2h7u1btx9x-generated-deadbeef is input directly, the hash function makes no distinction and likewise returns #xdeadbeef.

The disadvantage of this approach is that it will be easy to forge given hash values. On the other hand, the symbol hash is not cryptographically secure anyway.
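
A minimal C sketch of this idea follows. It is only an illustration under assumptions: the exact name layout (`g-<key>-generated-` followed by eight lowercase hex digits), the 32-bit hash width, the FNV-1a fallback, and all function names are invented here and do not describe the actual hash function in intern.c.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PREFIX "g-"
#define MARKER "-generated-"
#define HEX_DIGITS 8

/* Ordinary path: hash every character (FNV-1a as a stand-in). */
static uint32_t hash_all_chars(const char *name) {
    uint32_t h = 2166136261u;
    for (; *name != '\0'; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    return h;
}

/* Build "g-<key>-generated-<hash>" from a key and a pre-chosen hash. */
static void generated_name(char *buf, size_t size, const char *key, uint32_t hash) {
    snprintf(buf, size, PREFIX "%s" MARKER "%08" PRIx32, key, hash);
}

/* If `name` has the generated form, store the embedded hash and return 1. */
static int embedded_hash(const char *name, uint32_t *out) {
    size_t len = strlen(name);
    size_t plen = strlen(PREFIX), mlen = strlen(MARKER);
    if (len < plen + mlen + HEX_DIGITS) return 0;
    if (strncmp(name, PREFIX, plen) != 0) return 0;
    const char *digits = name + len - HEX_DIGITS;
    if (strncmp(digits - mlen, MARKER, mlen) != 0) return 0;
    uint32_t h = 0;
    for (int i = 0; i < HEX_DIGITS; i++) {
        char c = digits[i];
        uint32_t d;
        if (c >= '0' && c <= '9') d = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f') d = (uint32_t)(c - 'a' + 10);
        else return 0;  /* not a hex suffix: treat as an ordinary name */
        h = (h << 4) | d;
    }
    *out = h;
    return 1;
}

/* Hash used by interning: extract when possible, otherwise hash the text. */
static uint32_t symbol_name_hash(const char *name) {
    uint32_t h;
    return embedded_hash(name, &h) ? h : hash_all_chars(name);
}
```

Because the hash depends only on the spelling, a symbol typed in by hand with such a name and a generated symbol that was hashed before being interned end up in the same bucket. The same property is what makes hash values easy to forge.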

At the moment, unique_id is a costly operation (in stats.c) because it involves opening /dev/urandom each time it is called. Generating a random hash can be done much faster.

If unique_id were replaced by a fast version (e.g. one that keeps the device open and buffers it, so that reads only seldom have to call the OS), the advantage of computing a cheap hash might shrink enough that it is no longer worth the extra complexity. I don't know whether unique_id can be made fast on each supported platform. It is okay if it needs locking because interning the symbol (which has to be done once a unique name is calculated) needs locking anyway.
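
The buffering idea could look roughly like the following POSIX C sketch. It is not the code in stats.c: the `rand_pool` type, the 256-byte pool size, and the function names are assumptions, and the sketch does no locking of its own, on the assumption that the caller already holds the lock taken for interning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define POOL_SIZE 256

typedef struct {
    int fd;                        /* descriptor kept open across calls */
    unsigned char buf[POOL_SIZE];  /* bytes fetched ahead of demand */
    size_t pos, len;               /* unread bytes are buf[pos..len) */
    unsigned long refills;         /* number of read() calls so far */
} rand_pool;

static void pool_init(rand_pool *p, int fd) {
    assert(p != NULL);
    p->fd = fd;
    p->pos = 0;
    p->len = 0;
    p->refills = 0;
}

/* Copy n bytes into out, calling read() only when the pool is empty.
   Returns 0 on success and -1 if the source fails or is exhausted. */
static int pool_read(rand_pool *p, unsigned char *out, size_t n) {
    assert(p != NULL && out != NULL);
    while (n > 0) {
        if (p->pos == p->len) {
            ssize_t got = read(p->fd, p->buf, sizeof p->buf);
            if (got <= 0) return -1;
            p->len = (size_t)got;
            p->pos = 0;
            p->refills++;
        }
        size_t take = p->len - p->pos;
        if (take > n) take = n;
        memcpy(out, p->buf + p->pos, take);
        p->pos += take;
        out += take;
        n -= take;
    }
    return 0;
}
```

In use, the descriptor would be obtained once, for example by opening /dev/urandom at startup, and passed to `pool_init`; with 8-byte requests and this pool size, only one request in 32 reaches the OS.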

@mflatt
Contributor

mflatt commented Feb 3, 2025

Okay, but then it would make sense to have a dedicated primitive for that purpose like $force-intern!, which you introduced in 5_7.ms, wouldn't it?

Agreed. It just didn't seem worth the effort of tracking down uses of gensym->unique-string to replace them or to figure out whether there's some way that code outside of the implementation could somehow rely on the side effect.

When the symbol is later interned, the symbol's unique name will then be something like g-nx26tlsbk3mm7fw2h7u1btx9x-generated-deadbeef. The hash function in intern.c detects symbol names of this form (in this example, g-<key>-generated-<hash>) and returns #xdeadbeef instead of taking all characters into account.

Ah, I see now. Yes, that does seem workable.
