-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PropertyString, cleanHtml helper and escapeOrCleanHtml helper. #4004
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @EreMaijala, looks like a great start! A few minor thoughts/suggestions...
themes/bootstrap5/templates/RecordDriver/DefaultRecord/data-summary.phtml
Outdated
Show resolved
Hide resolved
* @license http://opensource.org/licenses/gpl-2.0.php GNU General Public License | ||
* @link https://vufind.org/wiki/development Wiki | ||
*/ | ||
class EscapeHtmlExt extends \Laminas\View\Helper\Escaper\AbstractHelper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this name, but I also can't think of a better one... :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me neither. I wanted to extend EscapeHtml, but the silly @Final thing prevents that. And while we could substitute our own EscapeHtml, the Laminas class is used in so many places, that it's not quite straightforward. We could of course change all the references to an interface, but that would require more wide-ranging changes. If you think that'd be a better way forward, I'd be happy to work on that. What would support that is the fact that EscapeHtmlAttr needs a substitute as well (to be able to control the IE-compatibility of the escaping process), so doing both at the same time would probably make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have really strong feelings on this -- no solution feels obviously like the best. Maybe @maccabeelevine will have some thoughts to share... ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And while we could substitute our own EscapeHtml, the Laminas class is used in so many places, that it's not quite straightforward.
This was certainly my original idea with #3998, to make the template changes as simple as possible, and it's safer here than in my PoC implementation since it would do nothing unless you had a PropertyStringInterface value and passed allowHtml
. But I'm sure there are complications I'm not seeing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A month later, a fresh thought on the name: would escapeOrCleanHtml
be more descriptive than escapeHtmlExt
, since that's what this is doing -- escaping the HTML unless told to clean it instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Ere agrees, I like that naming!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it a lot better than escapeHtmlExt
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the idea here, and it's so much better than the original #3998 PoC. I added some specific questions, but I guess my primary question (and maybe the hardest to answer) is how we deal with generic templates that may or may not have HTML in certain fields for some backends.
So, would it be appropriate in the default record driver's core.html to change
<h1<?=$this->schemaOrg()->getAttributes(['property' => 'name'])?>><?=$this->escapeHtml($this->driver->getShortTitle() . ' ' . $this->driver->getSubtitle() . ' ' . $this->driver->getTitleSection())?></h1>
to
<h1<?=$this->schemaOrg()->getAttributes(['property' => 'name'])?>><?=$this->escapeHtmlExt($this->driver->getShortTitle(), allowHtml: true)?> <?=$this->escapeHtml($this->driver->getSubtitle() . ' ' . $this->driver->getTitleSection())?></h1>
Performance-wise, this should be ok, since for any other record driver the title would not be a PropertyString and so the allowHtml would be ignored. But is it ok from a template complexity standpoint?
* @license http://opensource.org/licenses/gpl-2.0.php GNU General Public License | ||
* @link https://vufind.org/wiki/development Wiki | ||
*/ | ||
class EscapeHtmlExt extends \Laminas\View\Helper\Escaper\AbstractHelper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And while we could substitute our own EscapeHtml, the Laminas class is used in so many places, that it's not quite straightforward.
This was certainly my original idea with #3998, to make the template changes as simple as possible, and it's safer here than in my PoC implementation since it would do nothing unless you had a PropertyStringInterface value and passed allowHtml
. But I'm sure there are complications I'm not seeing.
Regarding @maccabeelevine's question about changing the default driver's core.html to support HTML, I don't think I would have a problem with that in the interest of greater flexibility.... |
@demiankatz Do we want to support HTML for all fields or for longer textual fields like summary? Different choices will lead to different complications e.g. with search term highlighting. |
Co-authored-by: Demian Katz <[email protected]>
If we wanted to start small, I would think the bare minimum would be longer textual fields and titles. But I'm not opposed to rolling out the support more widely if it's easier to do it all at once than piecemeal. I tend to favor the most pragmatic strategy, whatever that turns out to be. :-) |
Here are a couple of thought:
|
I think the downsides probably outweigh the benefits here, since I can't think of any scenario where this doesn't potentially lead to unexpected side effects or confusion. I think it's better to be explicit.
...and this would be a good way to be explicit in a number of contexts.
Possible lazy solution: truncate based on a tag-stripped version. If the stripped version is under the limit, display the HTML as-is. If the stripped version is too long, show the truncated, stripped version, and the user will have to "see more" to get the rich version. Not ideal, but maybe a quick place to start.
Maybe we need the ability to define multiple named tag allow-lists. Then we could have a "default" list and a "heading" list and the configurable settings could refer to an allow-list, or "none". This would empower users to create their own more granular lists as needed, but we could use these two obvious ones as a starting point.
Given the way Laminas seems to be gradually cutting off the ability to extend anything, I agree that building tools that meet our needs and wrap around Laminas' public interfaces is better than trying to build upon those interfaces directly. |
This PR was accidentally closed by the deletion of the dev-11.0 branch; I have restored and reopened it. Sorry for the inconvenience! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took another look at this and had just a few more minor thoughts and questions...
* @license http://opensource.org/licenses/gpl-2.0.php GNU General Public License | ||
* @link https://vufind.org/wiki/development Wiki | ||
*/ | ||
class EscapeHtmlExt extends \Laminas\View\Helper\Escaper\AbstractHelper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A month later, a fresh thought on the name: would escapeOrCleanHtml
be more descriptive than escapeHtmlExt
, since that's what this is doing -- escaping the HTML unless told to clean it instead?
'<br>', | ||
array_map( | ||
function ($summary) { | ||
$htmlContent = str_starts_with($summary, '<') && str_ends_with($summary, '>'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a way to eliminate this hacky check for angle brackets. Should we do the check in the record driver and wrap the string in a PropertyString? Should the new helper have a switch to enable this check, or even make this behavior part of the standard "allowHtml"? I'm not totally sure, but it feels like we can do better here, especially since as written, there's no safety on the summary in the special case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These actually come from the Summaries content loaders. I've added PropertyString support for the Demo loader (module/VuFind/src/VuFind/Content/Summaries/Demo.php), but the Syndetics loader is a bit complicated. I can move the check there, but reading the code it seems like with Syndetics plus it should just output a <div>
with the proper id without subjecting it to cleaning or anything (i.e. it would break unless <div>
is allowed in clean HTML). I agree that the mechanism isn't quite optimal, though. Perhaps one option would be adding a flag property to PropertyString indicating that the HTML must not be cleaned. Then we could always pass the result to escapeOrCleanHtml() and it would just output the original HTML when the flag is set. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think adding a flag like "trusted HTML" is probably the best solution. This could be potentially useful in other places where we know HTML has come from a safe source but may have complex characteristics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but I can't actually test it because I don't have access to SyndeticsPlus.
@demiankatz @maccabeelevine Okay, this should finally be up to date wrt the discussion. What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I remember correctly that the use case once this is merged would be to edit a specific record driver to return a PropertyString instead of generic string from whatever specific methods we want to support HTML values? If so, I think this looks good, and my work on #3991 can be the test case where we refine the approach. Just one suggestion.
* | ||
* @return static | ||
*/ | ||
public function setIds(array $ids): static |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is meant for linked data identifiers (i.e. URIs) for the string, but not sure. Is it practical to include some example use case with the demo driver, or should that wait for future PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which Demo driver did you have in mind? I don't think linked data IDs make a lot of sense in the context of something like book summaries, so I'm not sure how we would demonstrate them there. A better use case would be to add support in the appropriate MARC-based record driver traits, but that feels like a big enough change that it would be better as a follow-up than as part of this PR. I'm open to other ideas, though, if I'm misunderstanding your proposal!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I agree, better to wait for another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @EreMaijala -- I gave this another look and have a few more comments.
I'm not sure if I still have demo access to Syndetics or not; I can try after this round of review is addressed. If I don't, we can see if anyone in the community is currently using it. Might be good to have a wiki page of people willing to help with testing of third-party services for situations like this.
'<br>', | ||
array_map( | ||
function ($summary) { | ||
return $this->escapeOrCleanHtml($summary, allowHtml: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to have the summary context added to match core.phtml?
|
||
; By default HTML is not allowed in record fields, but support for sanitized HTML | ||
; can be enabled per field type. | ||
[HTML_Fields] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might be able to make this a little more clear. A few suggestions:
1.) I wonder if the top comment could be something more like "Data retrieved from records is escaped by default, on the assumption that it does not contain HTML. This section can be used to allow limited HTML in specific contexts; set values to true to allow HTML, and leave false to escape content."
2.) I found the "HTML_Fields" section name a bit confusing. Maybe something like Allowed_HTML_Contexts would be better (not that I love that either). I just wonder if "field" might be too limiting, so I rather prefer "context" -- and since all the settings are about turning something on, having a verb seems potentially helpful.
3.) Should we put comments above all the switches clarifying what the contexts really mean?
<?php if ($summary = isset($summary[0]) ? $this->escapeHtml($summary[0]) : false): ?> | ||
<p><?=$this->truncate($summary, 300)?></p> | ||
<?php if ($summary = $summary[0] ?? null): ?> | ||
<?php $shortSummary = $this->truncate($summary, 300); ?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to worry about truncate splitting HTML and causing broken markup? I thought we addressed this somewhere/somehow, but in my quick review just now I don't see where.
* | ||
* @return ?bool | ||
*/ | ||
public function getHtmlTrusted(): ?bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nitpick which you're welcome to ignore, but would isHtmlTrusted
be a better name for a boolean method? Or do you think it's more important to have parallel get/set methods?
* | ||
* @return static | ||
*/ | ||
public function setIds(array $ids): static; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also have an addId() method, in case we're not able to set all the IDs at once?
// Limit elements allowed in headings | ||
// (see https://developer.mozilla.org/en-US/docs/Web/HTML/Content_categories#phrasing_content for more | ||
// information): | ||
$config->set('HTML.AllowedElements', 'a,b,br,em,i,span'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this configurable? If not, should we at least reference this factory in config.ini so people know what "sanitized HTML" actually means?
Related to #3998, this is a draft for PropertyString that can carry additional information along with a plain text string. It now has explicit support for HTML content, and there's also a cleanHtml helper to handle sanitization. escapeHtmlExt is a replacement for escapeHtml allowing the HTML version to be returned if desired.
@maccabeelevine What do you think? Obviously this needs some template changes where HTML is desirable, but should otherwise be fairly easy to use.
TODO: