Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PropertyString, cleanHtml helper and escapeOrCleanHtml helper. #4004

Draft
wants to merge 14 commits into
base: dev
Choose a base branch
from

Conversation

EreMaijala
Copy link
Contributor

Related to #3998, this is a draft for PropertyString that can carry additional information along with a plain text string. It now has explicit support for HTML content, and there's also a cleanHtml helper to handle sanitization. escapeHtmlExt is a replacement for escapeHtml allowing the HTML version to be returned if desired.

@maccabeelevine What do you think? Obviously this needs some template changes where HTML is desirable, but should otherwise be fairly easy to use.

TODO:

  • Add unit tests

@EreMaijala EreMaijala marked this pull request as draft October 11, 2024 10:20
Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @EreMaijala, looks like a great start! A few minor thoughts/suggestions...

* @license http://opensource.org/licenses/gpl-2.0.php GNU General Public License
* @link https://vufind.org/wiki/development Wiki
*/
class EscapeHtmlExt extends \Laminas\View\Helper\Escaper\AbstractHelper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this name, but I also can't think of a better one... :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me neither. I wanted to extend EscapeHtml, but the silly @Final thing prevents that. And while we could substitute our own EscapeHtml, the Laminas class is used in so many places, that it's not quite straightforward. We could of course change all the references to an interface, but that would require more wide-ranging changes. If you think that'd be a better way forward, I'd be happy to work on that. What would support that is the fact that EscapeHtmlAttr needs a substitute as well (to be able to control the IE-compatibility of the escaping process), so doing both at the same time would probably make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have really strong feelings on this -- no solution feels obviously like the best. Maybe @maccabeelevine will have some thoughts to share... ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And while we could substitute our own EscapeHtml, the Laminas class is used in so many places, that it's not quite straightforward.

This was certainly my original idea with #3998, to make the template changes as simple as possible, and it's safer here than in my PoC implementation since it would do nothing unless you had a PropertyStringInterface value and passed allowHtml. But I'm sure there are complications I'm not seeing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A month later, a fresh thought on the name: would escapeOrCleanHtml be more descriptive than escapeHtmlExt, since that's what this is doing -- escaping the HTML unless told to clean it instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Ere agrees, I like that naming!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it a lot better than escapeHtmlExt!

Copy link
Member

@maccabeelevine maccabeelevine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea here, and it's so much better than the original #3998 PoC. I added some specific questions, but I guess my primary question (and maybe the hardest to answer) is how we deal with generic templates that may or may not have HTML in certain fields for some backends.

So, would it be appropriate in the default record driver's core.html to change

<h1<?=$this->schemaOrg()->getAttributes(['property' => 'name'])?>><?=$this->escapeHtml($this->driver->getShortTitle() . ' ' . $this->driver->getSubtitle() . ' ' . $this->driver->getTitleSection())?></h1>

to

<h1<?=$this->schemaOrg()->getAttributes(['property' => 'name'])?>><?=$this->escapeHtmlExt($this->driver->getShortTitle(), allowHtml: true)?> <?=$this->escapeHtml($this->driver->getSubtitle() . ' ' . $this->driver->getTitleSection())?></h1>

Performance-wise, this should be ok, since for any other record driver the title would not be a PropertyString and so the allowHtml would be ignored. But is it ok from a template complexity standpoint?

* @license http://opensource.org/licenses/gpl-2.0.php GNU General Public License
* @link https://vufind.org/wiki/development Wiki
*/
class EscapeHtmlExt extends \Laminas\View\Helper\Escaper\AbstractHelper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And while we could substitute our own EscapeHtml, the Laminas class is used in so many places, that it's not quite straightforward.

This was certainly my original idea with #3998, to make the template changes as simple as possible, and it's safer here than in my PoC implementation since it would do nothing unless you had a PropertyStringInterface value and passed allowHtml. But I'm sure there are complications I'm not seeing.

@demiankatz
Copy link
Member

Regarding @maccabeelevine's question about changing the default driver's core.html to support HTML, I don't think I would have a problem with that in the interest of greater flexibility....

@EreMaijala
Copy link
Contributor Author

@demiankatz Do we want to support HTML for all fields or for longer textual fields like summary? Different choices will lead to different complications e.g. with search term highlighting.

@demiankatz
Copy link
Member

@demiankatz Do we want to support HTML for all fields or for longer textual fields like summary? Different choices will lead to different complications e.g. with search term highlighting.

If we wanted to start small, I would think the bare minimum would be longer textual fields and titles. But I'm not opposed to rolling out the support more widely if it's easier to do it all at once than piecemeal. I tend to favor the most pragmatic strategy, whatever that turns out to be. :-)

@EreMaijala
Copy link
Contributor Author

Here are a couple of thought:

  1. Could add a default "allow HTML" toggle and/or setting to the escaper so that the parameter doesn't need to be repeated. Downside: easy to accidentally leave enabled outside of the intended scope.
  2. Could add configurable defaults per field type ('title', 'alt-title', 'summary' etc.). This would allow specifying the fields with HTML in configuration.
  3. truncate helper is going to have trouble with HTML. Perhaps we can make it smarter, but it won't be easy.
  4. titles will be a problem with any tags that you can't have inside a heading like <h1>. Perhaps we need the field type from 2. to also indicate which tags are allowed.
  5. escapeHtmlExt needs to deviate a bit further from the invoke params available with Laminas' helper. It seems that it'd be better to decouple.

@demiankatz
Copy link
Member

1. Could add a default "allow HTML" toggle and/or setting to the escaper so that the parameter doesn't need to be repeated. Downside: easy to accidentally leave enabled outside of the intended scope.

I think the downsides probably outweigh the benefits here, since I can't think of any scenario where this doesn't potentially lead to unexpected side effects or confusion. I think it's better to be explicit.

2. Could add configurable defaults per field type ('title', 'alt-title', 'summary' etc.). This would allow specifying the fields with HTML in configuration.

...and this would be a good way to be explicit in a number of contexts.

3. truncate helper is going to have trouble with HTML. Perhaps we can make it smarter, but it won't be easy.

Possible lazy solution: truncate based on a tag-stripped version. If the stripped version is under the limit, display the HTML as-is. If the stripped version is too long, show the truncated, stripped version, and the user will have to "see more" to get the rich version. Not ideal, but maybe a quick place to start.

4. titles will be a problem with any tags that you can't have inside a heading like `<h1>`. Perhaps we need the field type from 2. to also indicate which tags are allowed.

Maybe we need the ability to define multiple named tag allow-lists. Then we could have a "default" list and a "heading" list and the configurable settings could refer to an allow-list, or "none". This would empower users to create their own more granular lists as needed, but we could use these two obvious ones as a starting point.

5. escapeHtmlExt needs to deviate a bit further from the invoke params available with Laminas' helper. It seems that it'd be better to decouple.

Given the way Laminas seems to be gradually cutting off the ability to extend anything, I agree that building tools that meet our needs and wrap around Laminas' public interfaces is better than trying to build upon those interfaces directly.

@demiankatz demiankatz deleted the branch vufind-org:dev November 1, 2024 18:01
@demiankatz demiankatz closed this Nov 1, 2024
@demiankatz demiankatz reopened this Nov 5, 2024
@demiankatz demiankatz changed the base branch from dev-11.0 to dev November 5, 2024 14:33
@demiankatz demiankatz added this to the 11.0 milestone Nov 5, 2024
@demiankatz
Copy link
Member

This PR was accidentally closed by the deletion of the dev-11.0 branch; I have restored and reopened it. Sorry for the inconvenience!

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another look at this and had just a few more minor thoughts and questions...

* @license http://opensource.org/licenses/gpl-2.0.php GNU General Public License
* @link https://vufind.org/wiki/development Wiki
*/
class EscapeHtmlExt extends \Laminas\View\Helper\Escaper\AbstractHelper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A month later, a fresh thought on the name: would escapeOrCleanHtml be more descriptive than escapeHtmlExt, since that's what this is doing -- escaping the HTML unless told to clean it instead?

'<br>',
array_map(
function ($summary) {
$htmlContent = str_starts_with($summary, '<') && str_ends_with($summary, '>');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a way to eliminate this hacky check for angle brackets. Should we do the check in the record driver and wrap the string in a PropertyString? Should the new helper have a switch to enable this check, or even make this behavior part of the standard "allowHtml"? I'm not totally sure, but it feels like we can do better here, especially since as written, there's no safety on the summary in the special case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These actually come from the Summaries content loaders. I've added PropertyString support for the Demo loader (module/VuFind/src/VuFind/Content/Summaries/Demo.php), but the Syndetics loader is a bit complicated. I can move the check there, but reading the code it seems like with Syndetics plus it should just output a <div> with the proper id without subjecting it to cleaning or anything (i.e. it would break unless <div> is allowed in clean HTML). I agree that the mechanism isn't quite optimal, though. Perhaps one option would be adding a flag property to PropertyString indicating that the HTML must not be cleaned. Then we could always pass the result to escapeOrCleanHtml() and it would just output the original HTML when the flag is set. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think adding a flag like "trusted HTML" is probably the best solution. This could be potentially useful in other places where we know HTML has come from a safe source but may have complex characteristics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I can't actually test it because I don't have access to SyndeticsPlus.

@EreMaijala
Copy link
Contributor Author

@demiankatz @maccabeelevine Okay, this should finally be up to date wrt the discussion. What do you think?

Copy link
Member

@maccabeelevine maccabeelevine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I remember correctly that the use case once this is merged would be to edit a specific record driver to return a PropertyString instead of generic string from whatever specific methods we want to support HTML values? If so, I think this looks good, and my work on #3991 can be the test case where we refine the approach. Just one suggestion.

*
* @return static
*/
public function setIds(array $ids): static
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is meant for linked data identifiers (i.e. URIs) for the string, but not sure. Is it practical to include some example use case with the demo driver, or should that wait for future PRs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which Demo driver did you have in mind? I don't think linked data IDs make a lot of sense in the context of something like book summaries, so I'm not sure how we would demonstrate them there. A better use case would be to add support in the appropriate MARC-based record driver traits, but that feels like a big enough change that it would be better as a follow-up than as part of this PR. I'm open to other ideas, though, if I'm misunderstanding your proposal!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I agree, better to wait for another PR.

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @EreMaijala -- I gave this another look and have a few more comments.

I'm not sure if I still have demo access to Syndetics or not; I can try after this round of review is addressed. If I don't, we can see if anyone in the community is currently using it. Might be good to have a wiki page of people willing to help with testing of third-party services for situations like this.

'<br>',
array_map(
function ($summary) {
return $this->escapeOrCleanHtml($summary, allowHtml: true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to have the summary context added to match core.phtml?


; By default HTML is not allowed in record fields, but support for sanitized HTML
; can be enabled per field type.
[HTML_Fields]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might be able to make this a little more clear. A few suggestions:

1.) I wonder if the top comment could be something more like "Data retrieved from records is escaped by default, on the assumption that it does not contain HTML. This section can be used to allow limited HTML in specific contexts; set values to true to allow HTML, and leave false to escape content."

2.) I found the "HTML_Fields" section name a bit confusing. Maybe something like Allowed_HTML_Contexts would be better (not that I love that either). I just wonder if "field" might be too limiting, so I rather prefer "context" -- and since all the settings are about turning something on, having a verb seems potentially helpful.

3.) Should we put comments above all the switches clarifying what the contexts really mean?

<?php if ($summary = isset($summary[0]) ? $this->escapeHtml($summary[0]) : false): ?>
<p><?=$this->truncate($summary, 300)?></p>
<?php if ($summary = $summary[0] ?? null): ?>
<?php $shortSummary = $this->truncate($summary, 300); ?>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to worry about truncate splitting HTML and causing broken markup? I thought we addressed this somewhere/somehow, but in my quick review just now I don't see where.

*
* @return ?bool
*/
public function getHtmlTrusted(): ?bool;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nitpick which you're welcome to ignore, but would isHtmlTrusted be a better name for a boolean method? Or do you think it's more important to have parallel get/set methods?

*
* @return static
*/
public function setIds(array $ids): static;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to also have an addId() method, in case we're not able to set all the IDs at once?

// Limit elements allowed in headings
// (see https://developer.mozilla.org/en-US/docs/Web/HTML/Content_categories#phrasing_content for more
// information):
$config->set('HTML.AllowedElements', 'a,b,br,em,i,span');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this configurable? If not, should we at least reference this factory in config.ini so people know what "sanitized HTML" actually means?

@EreMaijala EreMaijala changed the title Add PropertyString, cleanHtml helper and escapeHtmlExt helper. Add PropertyString, cleanHtml helper and escapeOrCleanHtml helper. Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants