Per-line parser reporting per-line errors #210

MaxGabriel · 2021-12-20T22:19:38Z

Right now, the functions in Data.Csv generally return Either String (Vector a) or a similar variant, which is great in most cases.

To get errors on a per-line basis, you need to use Data.Csv.Incremental*. This is unfortunate because that API is significantly more complex (it also supports interleaving IO and incrementally feeding data to the parser).

I think adding convenience functions for taking a ByteString and returning Vector (Either String a), without going through the Incremental functions, could be a good addition. The main use case I have in mind for those is providing better error messages for user-provided CSVs.

*If my read of the Cassava docs is right.
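A minimal sketch of what such a convenience function might look like when written on top of the existing incremental decoder. The name `decodeRows` is hypothetical and not part of cassava; the sketch assumes the documented behaviour of `Data.Csv.Incremental`, where rows that fail type conversion come back as `Left` inside `Many`/`Done`, while structurally malformed input ends the parse with `Fail`. Here a `Fail` is turned into one final `Left`, which is only one of several possible design choices.

```haskell
module PerRowIncremental (decodeRows) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Csv (FromRecord, HasHeader (..))
import Data.Csv.Incremental (Parser (..), decode)
import qualified Data.Vector as V

-- | Hypothetical helper (not part of cassava): decode every record and
-- keep one 'Either' per row instead of one 'Either' for the whole file.
decodeRows :: FromRecord a => HasHeader -> BL8.ByteString -> V.Vector (Either String a)
decodeRows hasHeader input =
    V.fromList (go (BL8.toChunks input) (decode hasHeader))
  where
    -- Malformed CSV: the parser cannot continue, so report it as a last row.
    go _ (Fail _ err) = [Left err]
    -- Parser finished: these are the remaining converted rows.
    go _ (Done rows) = rows
    -- No chunks left: feeding an empty chunk signals end of input.
    go [] (Many rows k) = rows ++ go [] (k B.empty)
    -- Otherwise feed the next strict chunk of the lazy input.
    go (c : cs) (Many rows k) = rows ++ go cs (k c)
```

With this in place, `decodeRows NoHeader csvData :: V.Vector (Either String (Int, Int))` would report a conversion failure for each bad row while still returning the rows that parsed.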


andreasabel · 2021-12-21T15:15:02Z

> I think adding convenience functions for taking a ByteString and returning Vector (Either String a),

Yes, that is conceivable.
The current parser (e.g. the one for input without a header),

cassava/src/Data/Csv/Parser.hs

Lines 69 to 75 in c821c83

    csv :: DecodeOptions -> AL.Parser Csv
    csv !opts = do
        vals <- sepByEndOfLine1' (record (decDelimiter opts))
        _ <- optional endOfLine
        endOfInput
        let nonEmpty = removeBlankLines vals
        return $! V.fromList nonEmpty

uses the parser modifier sepByEndOfLine1':

cassava/src/Data/Csv/Parser.hs

Lines 92 to 105 in c821c83

    -- | Specialized version of 'sepBy1'' which is faster due to not
    -- accepting an arbitrary separator.
    sepByEndOfLine1' :: AL.Parser a
                     -> AL.Parser [a]
    sepByEndOfLine1' p = liftM2' (:) p loop
      where
        loop = do
            mb <- A.peekWord8
            case mb of
                Just b | b == cr ->
                    liftM2' (:) (A.anyWord8 *> A.word8 newline *> p) loop
                       | b == newline ->
                    liftM2' (:) (A.anyWord8 *> p) loop
                _ -> pure []

So, it is just a single parse with a single error returned.
If you want per-line parsing, you will first have to split the input into records using a variant of sepByEndOfLine1' and then map a line parser over the pieces.
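The split-then-map approach could be sketched roughly as follows. This is an illustration only, not cassava's implementation: `splitRecords`, `decodeLine`, and `decodePerLine` are hypothetical names, the splitter is a simple quote-aware scan rather than a variant of the attoparsec combinator, and each record is handed to the ordinary `Data.Csv.decode`. It has not been benchmarked against the single-parse path.

```haskell
module PerLineDecode (splitRecords, decodeLine, decodePerLine) where

import qualified Data.ByteString.Lazy.Char8 as BL8
import Data.Csv (FromRecord, HasHeader (..), decode)
import Data.Int (Int64)
import qualified Data.Vector as V

-- | Split the input at newlines that are outside double quotes, so that
-- quoted fields containing line breaks stay within one record.
splitRecords :: BL8.ByteString -> [BL8.ByteString]
splitRecords input
    | BL8.null input = []
    | otherwise =
        let (rec, rest) = BL8.splitAt (recordLength input) input
         in stripCR rec : splitRecords (BL8.drop 1 rest)

-- | Number of bytes before the first unquoted newline. An escaped quote
-- ("") toggles the flag twice, so it leaves the quoting state unchanged.
recordLength :: BL8.ByteString -> Int64
recordLength = go 0 False . BL8.unpack
  where
    go n _ [] = n
    go n inQuotes (c : cs)
        | c == '"' = go (n + 1) (not inQuotes) cs
        | c == '\n' && not inQuotes = n
        | otherwise = go (n + 1) inQuotes cs

-- | Drop a trailing carriage return left over from a CRLF line ending.
stripCR :: BL8.ByteString -> BL8.ByteString
stripCR r
    | not (BL8.null r) && BL8.last r == '\r' = BL8.init r
    | otherwise = r

-- | Decode a single record, expecting exactly one row in the result.
decodeLine :: FromRecord a => BL8.ByteString -> Either String a
decodeLine line = do
    rows <- decode NoHeader line
    case V.toList rows of
        [row] -> Right row
        _ -> Left "expected exactly one record"

-- | One 'Either' per non-blank record of the input.
decodePerLine :: FromRecord a => BL8.ByteString -> V.Vector (Either String a)
decodePerLine =
    V.fromList . map decodeLine . filter (not . BL8.null) . splitRecords
```

Because every record gets its own parse, a malformed or unconvertible line only affects its own entry in the result; the trade-off is the cost of the extra pass and of restarting the parser per line.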

PR welcome, if it comes with benchmarks comparing the performance of a single parse versus per-line parsing.

MaxGabriel · 2021-12-21T15:25:57Z

OK, I'm not interested in the performance implications of this, so I'm just going to close this.

andreasabel · 2021-12-21T16:45:41Z

Let's leave it open, maybe others are interested.

MaxGabriel closed this as completed Dec 21, 2021

andreasabel changed the title ~~Convenience function for per-line errors~~ Per-line parser reporting per-line errors Dec 21, 2021

andreasabel added the labels "PR welcome" and "re: error reporting" (concerning error messages delivered by the parser) Dec 21, 2021

andreasabel reopened this Dec 21, 2021
