From b73d45059f162f007991cd94853b011ca7e8a8cd Mon Sep 17 00:00:00 2001 From: Nikita Popov Date: Fri, 14 Feb 2020 16:41:26 +0100 Subject: [PATCH 1/7] Language evolution overview proposal --- rfcs/0000-language-evolution.md | 194 ++++++++++++++++++++++++++++++++ 1 file changed, 194 insertions(+) create mode 100644 rfcs/0000-language-evolution.md diff --git a/rfcs/0000-language-evolution.md b/rfcs/0000-language-evolution.md new file mode 100644 index 0000000..e5be1ca --- /dev/null +++ b/rfcs/0000-language-evolution.md @@ -0,0 +1,194 @@ + * Name: `language_evolution` + * Date: 2020-02-18 + * Author: Nikita Popov + * Proposed Version: PHP 8.0 + * RFC PR: [php-src/rfcs#0002](https://github.com/php-src/rfcs/pull/2) + +# Introduction + +In recent years, there has been an increasing tension in the PHP community on how to handle backwards-incompatible language changes. The PHP programming language has evolved somewhat haphazardly and exhibits many behaviors that are considered undesirable from a contemporary position. + +Fixing these issues benefits development by making behavior more consistent, more predictable and less bugprone. On the other hand, every backwards-incompatible change to the PHP language may require adjustments in hundreds of millions of lines of existing code. This delays the migration to new PHP versions. + +The general solution to this problem is to allow different libraries and applications to keep up with language changes at their own pace, while remaining interoperable. There are quite a few ways in which this general goal can be achieved, which will be discussed in the following. + +# Examples of possible backwards-incompatible changes + +To keep the following discussion grounded in real proposals, this section gives a few examples of possible backwards-incompatible changes based on existing RFCs. These are intended as examples only, this proposal does not endorse all of them. 
+ +## Strict types + +The [scalar type declarations RFC](https://wiki.php.net/rfc/scalar_type_hints_v5) introduced the `strict_types` declare, which allows controlling the behavior of scalar type declarations. If it is enabled, passed types must match the declared type exactly (modulo details). If it is disabled, certain type coercions are permitted. + +While this is an already existing feature, it is worth giving it some consideration in this context, because it is an existing example of opting into a backwards-incompatible change. To provide some historical context: At the time, internal functions already accepted scalar types and validated them according to coercive semantics. To maintain consistency with internal type checks, userland types would have to follow the same, often undesirable, semantics. The `strict_types` declare keeps the behavior of userland and internal functions consistent, while still providing an opt-in to strict type checking. + +I think that overall, the introduction of scalar types together with a `strict_types` directive worked out fairly well. Apart from the common complaint (at least in the early days) of having to specify this option in every single file, the main technical issue people encountered is around the treatment of callbacks: + +Callbacks invoked by internal functions like `array_map()` always have coercive argument semantics, even if invoked from a strictly typed file. Conversely, callbacks that come from a weakly typed file but are invoked in a strictly typed one will use strict argument semantics. This is a shortcoming in the original design of `strict_types`: Callbacks should have been special-cased to use the typing mode at the declaration site, not the call-site. I am mentioning this issue here primarily to illustrate that the idea of making code seamlessly interoperate based on opt-ins may not always work out perfectly in edge-cases. 
+ +## Explicit pass-by-reference + +The [explicit call-site pass-by-reference](https://wiki.php.net/rfc/explicit_send_by_ref) RFC proposes to allow marking arguments that are passed by reference using a `&` at the call-site, next to the existing marker at the declaration-site. The reasons why this is desirable are laid out in the motivation section of that proposal. + +However, only *allowing* the use of `&` at the call-site does not give us many benefits: The use of a call-site marker has to be *required* in order to reap the full benefits for readability, static analysis and performance. + +Unfortunately, requiring the marker results in the worst possible type of backwards-compatibility break: It becomes very hard to write code that is compatible both with PHP 8 (where the call-site marker is required) and PHP 7 (where the call-site marker is forbidden). Such a change could only be rolled out over a very long time frame, by first allowing an optional marker, and only making it required many, many years later. This is the kind of change that benefits most from an opt-in mechanism. + +## Forbidding dynamic object properties + +This is one of the motivating cases mentioned in the [namespace-scoped declares](https://wiki.php.net/rfc/namespace_scoped_declares) RFC, and (in a different form) the [locked classes](https://wiki.php.net/rfc/locked-classes) RFC. Most code nowadays expects that all properties are declared in classes (apart from specific exceptions like `stdClass` and of course excluding magic like `__set()`). Setting a property that has not been declared is very likely a typo, not an intentional act. Unfortunately PHP silently allows it, and there is no good way to disable this behavior (a common workaround is to include a trait with throwing magic methods). + +Having the option to make undeclared property accesses (modulo the mentioned exceptions) throw Error exceptions would be beneficial for modern code. 
However, there is also a lot of old code that does not declare properties, so this needs to be opt-in. + +## Strict operators and friends + +While `strict_types` can be used to disable type coercions for function arguments, there currently is no way to disable the same for basic language constructs, such as arithmetic operators. The [strict operators](https://wiki.php.net/rfc/strict_operators) RFC proposes a `strict_operators` opt-in declare to forbid most type coercions. + +This RFC is interesting in that it not only adds new errors, but also changes behavior in some places. For example, the `switch` statement will use strict comparison (modulo details), while normally `switch` uses weak comparison (`==`). This is an important distinction: If a declare only adds errors, you can always produce valid code by assuming the option is enabled (regardless of whether it actually is). If the declare carries a behavior change, then it is truly important whether the option is enabled or not. + +## Name resolution changes + +A [`function_and_const_lookup='global'`](https://wiki.php.net/rfc/use_global_elements) declare has been proposed and declined recently. While this particular proposal was declined, we may still wish to consider other name resolution changes that bring the rules for functions/constants and classes in line. + +# Approaches + +Three general approaches (more in terms of philosophy than technical detail) have been discussed in the past and are summarized here. + +## New language with common implementation (codename P++) + +This approach has been suggested by Zeev, and there's some existing discussion of this in the [P++ FAQ](https://wiki.php.net/pplusplus/faq) and [P++ Concerns](https://wiki.php.net/pplusplus/concerns). A PHP internals [unofficial straw poll](https://wiki.php.net/rfc/p-plus-plus) was unanimously opposed to the idea. + +The idea behind P++ is that a new language is introduced, which shares an implementation, and is interoperable with PHP. 
However, P++ could have major differences in syntax and behavior, and could pursue different design goals. As the name suggests, this is similar to the situation of C and C++, which are usually both supported by the same compiler and are nominally interoperable. + +I think there are a number of problems with this approach, and they all essentially come down to P++ being "one big change": + + * There is only one chance to introduce backwards-incompatible changes. Once P++ is released, we are back to square once. Assuming that we will not manage to create the perfect language on the first try, it would be better to introduce a more sustainable mechanism. + * A one-time major change places a high upgrade burden. Depending on scope, it might be more akin to switching languages than upgrading the language version. This makes it less likely that old code will switch to P++. + * If the divergence is too large, then it may be hard to ensure interoperability between PHP and P++. For example, while C and C++ are nominally compatible, in practice this requires exporting C++ code through a C-compatible FFI interface. This makes it fairly easy to integrate C in C++, but not the other way around. Hypothetically, if P++ introduced generics but PHP did not, it's not clear how they would interoperate. + * Frankly, we just don't have the development resources to pull this off. This needs a concerted multi-year effort from a larger development team than we have right now; our resources are best invested elsewhere. + +On the positive side, P++ would allow us to make some more radical changes than the approaches discussed below. People sometimes bring up ideas like "drop the `$` from variable names", and we generally just brush this off as the craziness that it is, but P++ would make such changes at least principally feasible. + +## Editions + +"Editions" are a concept popularized by the Rust programming language. 
See the [edition guide](https://doc.rust-lang.org/edition-guide/editions/index.html) and the [epoch RFC](https://github.com/rust-lang/rfcs/blob/master/text/2052-epochs.md) for more information. + +Editions are specified at the package (in Rust: crate) level, and opt-in to a bundled set of backwards-incompatible changes. Different packages using different editions remain compatible. Editions are intended to be supported forever, and there are some limitations on what kind of changes are permitted in editions (one of the significant limitations is that no standard library changes are allowed). + +Editions were also intended to serve as a rallying point from a marketing perspective, though I believe that this coupling between a purely technical mechanism and the marketing angle was found to be confusing and detrimental in hindsight. + +## Fine-grained declares + +An alternative to the "editions" approach is to introduce more fine-grained declare directives for individual changes. This is inspired by the existing `strict_types` directive, and has been brought up in quite a few of the recent proposals in the preceding examples section. + +Some considerations regarding the differences between editions and fine-grained declares: + + * The main advantage of "editions" is that changes are grouped and hierarchical. This reduces the number of language "dialects" that are available at a given time to the number of editions. Fine-grained declares on the other hand create `2^N` different dialects, where N is the number of boolean declares. + * The main advantage of fine-grained declares is that it allows updating code more gradually (by handling changes one by one), and by allowing to opt out of specific changes in parts of the codebase. For example, if there was a hypothetical `no_dynamic_properties` declare, then one may wish to enable it for most code, but disable it in one particular file where one interacts with a legacy library that requires the use of dynamic properties. 
+ * Relatedly, fine-grained declares allow handling cases where we want to leave people with a choice as to whether they want a certain change or not. "Editions" carry a strong implication that you *should* be using the new edition and the changes it entails. For example, would the current `strict_types` option be enabled as part of a new edition? + +# Technical realization of "per-package" options + +Independently of how "fine-grained" the opt-ins are, there are also multiple ways in which they could be specified on a technical level. These will be discussed in the following. + +## Status quo: Declares at top of file + +We already use `declare(strict_types=1);` at the top of the file to enable strict typing, so it would be natural to continue relying on this mechanism. + +This approach has two big advantages: First, it already works, is familiar and does not require the introduction of any new language facilities. Second, the file remains self-contained: The used language dialect can be determined without looking at additional files, which is especially helpful for tooling. + +It also has some disadvantages: First, it doesn't scale. This approach is only compatible with the "editions" approach, where only a single `declare(edition=2020);` line is needed. It is not feasible to use this with the "fine-grained declares" approach. + +Second, packages will very likely want to use one language dialect for the entire package, not mix and match across different files. While having the declare in every single file is nominally more explicit, in practice the programmer will model this as "I'm working on a PHP 2020 project" in their mind, and will not double-check the edition whenever they open a new file. This may lead to surprises if a file forgets to specify the edition by accident. + +## New opening tag + +A minor variation on the preceding variant: Instead of using declares, a new opening tag can be introduced. 
Once again this only works for P++ ` 1, + 'no_dynamic_properties' => 1, + // ... +]); +``` + +The intention is that these will be specified in `composer.json`, and Composer will take care of registering the declares with PHP. + +This approach avoids the disadvantages of specifying declares in each file: It can scale with an arbitrary number of declares, and the programmer can assume that declares hold for the whole project, unless they get explicitly overridden. + +One disadvantage of this approach (and all the other "package-based" approaches discussed in the following) is that you can no longer tell the used language dialect by looking at a single file. While this should not be a problem for humans (for whom a package-oriented approach is more useful), it may be an issue for tooling, as it may not be possible to correctly process files without having a larger context. + +Unlike the two package-based approaches discussed in the following, namespace-scoped declares are based on the existing, well-established and well-understood "namespace" feature. This is both an advantage (it does not introduce any new concepts or need for additional code) and a disadvantage: + +While namespaces commonly map directly to a package, they don't always do so. For example, the main amphp package uses the `Amp\` namespace, while other amphp packages use `Amp\FooBar\`. Here, the `Amp\` namespace cannot really be treated as a single package. There are additional issues (e.g. due to the ability to have multiple namespaces in a single file), but these can be resolved, see the linked RFC for more detailed discussion. + +## Explicit package declaration + +There is no proper RFC for this variant, but a prototype [pull request](https://github.com/php/php-src/pull/4490) is available. This introduces a new "package" concept that is orthogonal to namespaces. 
The package would have to be declared in each file, next to the namespace: + +```php + Date: Tue, 18 Feb 2020 11:57:58 +0100 Subject: [PATCH 2/7] Add string interpolation as an example of a lexer change --- rfcs/0000-language-evolution.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/rfcs/0000-language-evolution.md b/rfcs/0000-language-evolution.md index e5be1ca..c70d6cd 100644 --- a/rfcs/0000-language-evolution.md +++ b/rfcs/0000-language-evolution.md @@ -50,6 +50,12 @@ This RFC is interesting in that it not only adds new errors, but also changes be A [`function_and_const_lookup='global'`](https://wiki.php.net/rfc/use_global_elements) declare has been proposed and declined recently. While this particular proposal was declined, we may still wish to consider other name resolution changes that bring the rules for functions/constants and classes in line. +## String interpolation changes + +The [arbitrary expression interpolation](https://wiki.php.net/rfc/arbitrary_expression_interpolation) RFC proposes to introduce the `"foo #{1 + 1} bar"` syntax for interpolating arbitrary expressions. This poses a minor backwards-compatibility break, because currently `#{}` is allowed inside strings and does not have special meaning. A backwards-compatibility break could be avoided by making the syntax opt-in. It would even be possible to use the `"foo {1 + 1} bar"` syntax, though whether that is a good idea is a different question. + +This example is interesting, because it involves a change to the PHP lexer, while all the previous examples involved changes to the compiler or runtime. + # Approaches Three general approaches (more in terms of philosophy than technical detail) have been discussed in the past and are summarized here. 
From 25994d3e850d48915819bf75e04fc8c93a7bcd37 Mon Sep 17 00:00:00 2001 From: Nikita Popov Date: Tue, 18 Feb 2020 12:02:10 +0100 Subject: [PATCH 3/7] Fix link --- rfcs/0000-language-evolution.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0000-language-evolution.md b/rfcs/0000-language-evolution.md index c70d6cd..4145d92 100644 --- a/rfcs/0000-language-evolution.md +++ b/rfcs/0000-language-evolution.md @@ -2,7 +2,7 @@ * Date: 2020-02-18 * Author: Nikita Popov * Proposed Version: PHP 8.0 - * RFC PR: [php-src/rfcs#0002](https://github.com/php-src/rfcs/pull/2) + * RFC PR: [php/php-rfcs#0002](https://github.com/php/php-rfcs/pull/2) # Introduction From bf1c4a2b3b58a56018539e63852c8831352e3b37 Mon Sep 17 00:00:00 2001 From: Nikita Popov Date: Tue, 18 Feb 2020 14:08:16 +0100 Subject: [PATCH 4/7] Clarify that "supported forever" only applies to Rust --- rfcs/0000-language-evolution.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0000-language-evolution.md b/rfcs/0000-language-evolution.md index 4145d92..494b30f 100644 --- a/rfcs/0000-language-evolution.md +++ b/rfcs/0000-language-evolution.md @@ -79,7 +79,7 @@ On the positive side, P++ would allow us to make some more radical changes than "Editions" are a concept popularized by the Rust programming language. See the [edition guide](https://doc.rust-lang.org/edition-guide/editions/index.html) and the [epoch RFC](https://github.com/rust-lang/rfcs/blob/master/text/2052-epochs.md) for more information. -Editions are specified at the package (in Rust: crate) level, and opt-in to a bundled set of backwards-incompatible changes. Different packages using different editions remain compatible. Editions are intended to be supported forever, and there are some limitations on what kind of changes are permitted in editions (one of the significant limitations is that no standard library changes are allowed). 
+Editions are specified at the package (in Rust: crate) level, and opt-in to a bundled set of backwards-incompatible changes. Different packages using different editions remain compatible. Editions in Rust are intended to be supported forever, though we may adopt a different [support timeline](#maintenance-burden-and-support-timeline). There are also some limitations on what kind of changes are permitted in editions (one of the significant limitations is that no standard library changes are allowed, something we would likely want to adopt as well). Editions were also intended to serve as a rallying point from a marketing perspective, though I believe that this coupling between a purely technical mechanism and the marketing angle was found to be confusing and detrimental in hindsight. From 599fd1c01b045f32128bbcecdb4a5b1ae5367ac5 Mon Sep 17 00:00:00 2001 From: Nikita Popov Date: Wed, 19 Feb 2020 11:23:30 +0100 Subject: [PATCH 5/7] Fix typo --- rfcs/0000-language-evolution.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfcs/0000-language-evolution.md b/rfcs/0000-language-evolution.md index 494b30f..c5d5d6b 100644 --- a/rfcs/0000-language-evolution.md +++ b/rfcs/0000-language-evolution.md @@ -68,7 +68,7 @@ The idea behind P++ is that a new language is introduced, which shares an implem I think there are a number of problems with this approach, and they all essentially come down to P++ being "one big change": - * There is only one chance to introduce backwards-incompatible changes. Once P++ is released, we are back to square once. Assuming that we will not manage to create the perfect language on the first try, it would be better to introduce a more sustainable mechanism. + * There is only one chance to introduce backwards-incompatible changes. Once P++ is released, we are back to square one. Assuming that we will not manage to create the perfect language on the first try, it would be better to introduce a more sustainable mechanism. 
* A one-time major change places a high upgrade burden. Depending on scope, it might be more akin to switching languages than upgrading the language version. This makes it less likely that old code will switch to P++. * If the divergence is too large, then it may be hard to ensure interoperability between PHP and P++. For example, while C and C++ are nominally compatible, in practice this requires exporting C++ code through a C-compatible FFI interface. This makes it fairly easy to integrate C in C++, but not the other way around. Hypothetically, if P++ introduced generics but PHP did not, it's not clear how they would interoperate. * Frankly, we just don't have the development resources to pull this off. This needs a concerted multi-year effort from a larger development team than we have right now; our resources are best invested elsewhere. From e5648065d1147f0f485a793487a1bbe467f8c1d7 Mon Sep 17 00:00:00 2001 From: Nikita Popov Date: Wed, 19 Feb 2020 11:24:09 +0100 Subject: [PATCH 6/7] Replace dot-file with underscore-file --- rfcs/0000-language-evolution.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/rfcs/0000-language-evolution.md b/rfcs/0000-language-evolution.md index c5d5d6b..fa0a2fa 100644 --- a/rfcs/0000-language-evolution.md +++ b/rfcs/0000-language-evolution.md @@ -153,13 +153,13 @@ The advantage of this approach is that it resolves the ambiguities that exist ar ## Filesystem based packages -An alternative to explicit package declarations are filesystem based packages. A package is defined by placing a special file, say `.package.php`, inside a directory, which can then also contain per-package configuration. +An alternative to explicit package declarations are filesystem based packages. A package is defined by placing a special file, say `_package.php`, inside a directory, which can then also contain per-package configuration. 
-Relative to the previous variant, this has the advantage of not needing an explicit package declaration in every single file. Additionally, unlike both namespace-scoped declares and explicit package declarations, it provides a well-defined place where to look for package-related declares (the `.package.php` file). The previous two variants would place the declares inside `composer.json` by convention, but `namespace_declare()` or `package_declare()` could also be called from other places, which makes tooling support more complicated. +Relative to the previous variant, this has the advantage of not needing an explicit package declaration in every single file. Additionally, unlike both namespace-scoped declares and explicit package declarations, it provides a well-defined place where to look for package-related declares (the `_package.php` file). The previous two variants would place the declares inside `composer.json` by convention, but `namespace_declare()` or `package_declare()` could also be called from other places, which makes tooling support more complicated. The disadvantage of this approach is the filesystem coupling, which introduces a number of problems in PHP: -* Cache invalidation: Within one request, PHP would presumably cache both which directories do not contain a `.package.php` file, and the contents of the ones which do. Changes to either of those would be ignored during the request. The more problematic case is how these would interact with opcache. If `validate_timestamps` is enabled, we would have to check all the directories (and parent directories) for added, removed or changed `.package.php` files as well, which may carry an additional performance penalty. +* Cache invalidation: Within one request, PHP would presumably cache both which directories do not contain a `_package.php` file, and the contents of the ones which do. Changes to either of those would be ignored during the request. 
The more problematic case is how these would interact with opcache. If `validate_timestamps` is enabled, we would have to check all the directories (and parent directories) for added, removed or changed `_package.php` files as well, which may carry an additional performance penalty. * Path canonicalization: In order to determine whether a file is part of a package, canonicalized paths need to be available. For example, symlinks must be resolved, and case-(in)sensitivity of the filesystem must be handled correctly. While this functionality is available on the filesystem layer through `realpath()`, this functionality currently does not exist for general PHP streams (not even phars support this properly). This may be solvable with additional stream wrapper hooks. * Directory traversal: In order to find package files, we need to go "upwards" in a given path to find package files. Once again this operation is not supported by stream wrappers. In fact, many stream wrappers do not have a meaningful concept of a "directory". * Generally, it may be hard to make this feature work with arbitrary streams. I have a package that replaces the `file` stream wrapper, in order to intercept all included files and replace them with preprocessed files stored in a temporary directory. I have no idea just how this would interact with a directory-based package system. 
From 741340d51aebd66b3c5dd1897b4a0fb1af530c71 Mon Sep 17 00:00:00 2001 From: Nikita Popov Date: Wed, 19 Feb 2020 11:41:09 +0100 Subject: [PATCH 7/7] Mention new file extension possibility --- rfcs/0000-language-evolution.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfcs/0000-language-evolution.md b/rfcs/0000-language-evolution.md index fa0a2fa..a572943 100644 --- a/rfcs/0000-language-evolution.md +++ b/rfcs/0000-language-evolution.md @@ -111,6 +111,8 @@ Second, packages will very likely want to use one language dialect for the entir A minor variation on the preceding variant: Instead of using declares, a new opening tag can be introduced. Once again this only works for P++ `