3 Replies Latest reply on Aug 25, 2011 9:01 AM by rhauch

    Sequencers and handling errors and validation

    lisak

      Hey, 

       

      I see that I can add errors to the sequencer context:

       

      context.getProblems().addError( );

       

      But StreamSequencerAdapter doesn't do anything with them afterwards. Only when the sequencer throws an exception is a SequencerException thrown.

       

      I'm wondering how the errors should be handled, since the sequencer runs in a different thread than the client.

       

       

      I also find it hard to reason about validation, but my guess, considering the purpose of sequencers, is that validation, if necessary, should be done before storing the file in the repository.

       

      So far I receive the uploaded file in a request handler / controller, validate it in a service (FileName, Size, Mime Type + Extension correction, WordCount, Language) and, if it is OK, store it into ModeShape, where the sequencers practically do all the work again plus additional extraction of metadata that did not need to be validated.

       

      Detection and text extraction are too expensive to run twice, so my question is: is it possible and advisable to merge the validation into the sequencers and let the client know the result at the same time? Or is that bad design, and should I really do the validation in the controller? After the validation I practically have this already:

       

      TempFileImpl(fileName, ext, contentType.toString(), tmpFile, fileSize, text, language, wordCount, charCount, null, tmpFile.getPath(), fileId);

       

      So I think I will store as many properties as possible on the parent of the input contentNode, and they will be passed to the sequencer via its context (assuming that is possible). The sequencer does the rest. This is a situation where it would be handier if the sequencer's output could target the parent of the data input node ( jcr:content - jcr:data ), see MODE-1075.

       

       

       

      I appreciate any pointers, Jakub

        • 1. Re: Sequencers and handling errors and validation
          rhauch

          But StreamSequencerAdapter doesn't do anything with them afterwards. Only when the sequencer throws an exception is a SequencerException thrown.

           

          I'm wondering how the errors should be handled, since the sequencer runs in a different thread than the client.

          I think that's a bug. SequencingService (around line 494) is not using the Problems object passed into the sequencer, so all that information is being lost. Would you mind logging this? I think any problems should at least be logged (at the appropriate logging level for each kind of problem), but there's not much else we can do, other than create some sort of sequencing-error-listener framework. That's definitely possible, so if you'd like to see this, please log an enhancement request for it.
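
          A minimal sketch of what logging the collected problems and notifying a listener could look like. It does not use ModeShape's own classes: ProblemReporter, Problem, Severity and SequencingErrorListener are hypothetical stand-ins, and the mapping of severities to java.util.logging levels is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-ins for illustration; these are NOT ModeShape classes. */
final class ProblemReporter {

    enum Severity { ERROR, WARNING, INFO }

    /** One problem recorded by a sequencer while it ran. */
    static final class Problem {
        final Severity severity;
        final String message;

        Problem(Severity severity, String message) {
            this.severity = severity;
            this.message = message;
        }
    }

    /** Hypothetical callback for applications that want to react to sequencing problems. */
    interface SequencingErrorListener {
        void onProblem(String nodePath, Problem problem);
    }

    private static final Logger LOGGER = Logger.getLogger(ProblemReporter.class.getName());
    private final List<SequencingErrorListener> listeners = new ArrayList<>();

    void addListener(SequencingErrorListener listener) {
        listeners.add(listener);
    }

    /** Maps a problem severity to a java.util.logging level (assumed mapping). */
    static Level levelFor(Severity severity) {
        switch (severity) {
            case ERROR:   return Level.SEVERE;
            case WARNING: return Level.WARNING;
            default:      return Level.INFO;
        }
    }

    /** Logs every problem at its level and notifies listeners; returns the number of errors. */
    int report(String nodePath, List<Problem> problems) {
        int errors = 0;
        for (Problem problem : problems) {
            LOGGER.log(levelFor(problem.severity), nodePath + ": " + problem.message);
            for (SequencingErrorListener listener : listeners) {
                listener.onProblem(nodePath, problem);
            }
            if (problem.severity == Severity.ERROR) errors++;
        }
        return errors;
    }
}
```

          Since sequencing runs in a background thread, a listener like this would also be called on that thread, so the application would have to hand the result over to the client thread itself.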

           

          I also find it hard to reason about validation, but my guess, considering the purpose of sequencers, is that validation, if necessary, should be done before storing the file in the repository.

           

          So far I receive the uploaded file in a request handler / controller, validate it in a service (FileName, Size, Mime Type + Extension correction, WordCount, Language) and, if it is OK, store it into ModeShape, where the sequencers practically do all the work again plus additional extraction of metadata that did not need to be validated.

           

          Detection and text extraction are too expensive to run twice, so my question is: is it possible and advisable to merge the validation into the sequencers and let the client know the result at the same time? Or is that bad design, and should I really do the validation in the controller? After the validation I practically have this already:

           

          TempFileImpl(fileName, ext, contentType.toString(), tmpFile, fileSize, text, language, wordCount, charCount, null, tmpFile.getPath(), fileId);

           

          So I think I will store as many properties as possible on the parent of the input contentNode, and they will be passed to the sequencer via its context (assuming that is possible). The sequencer does the rest. This is a situation where it would be handier if the sequencer's output could target the parent of the data input node ( jcr:content - jcr:data ), see MODE-1075.

           

          I appreciate any pointers, Jakub

           

          The primary purpose of sequencers is simply to extract useful structured content from uploaded files (well, strictly speaking, from the properties of changed content; it's just most often used for sequencing the content of uploaded files). I can see why it might also be useful for validation, but that's beyond the original goal. I think it's certainly useful to consider how/whether sequencers can be used for validation, and your applications are certainly welcome to use the sequencers. You can do that by using the List<Sequencer> within the SequencingService (which you can get from the engine), and invoking the sequencer(s) on the content you supply.

           

          But if your application needs to validate content *before* uploading it into JCR, then clearly this has to be done before JCR gets a hold of the content. So the only way to do both validation *and* content extraction in one pass is for your application to do this before uploading the content and, if valid, store the original file content *and* the derived content. Essentially, you'd need to bypass the built-in background sequencing capability.
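
          A rough sketch of such a single pass, assuming the application does its own extraction before touching JCR. SinglePassUpload, Result, extractText and the property names are all hypothetical, and the extraction step is only a placeholder for whatever detector or text extractor the application really uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical single-pass pre-upload step; none of these names come from ModeShape. */
final class SinglePassUpload {

    /** Result of one pass over the file: validity plus the metadata derived along the way. */
    static final class Result {
        final boolean valid;
        final String reason;                 // null when valid
        final Map<String, Object> derived;   // properties to store next to the original content

        Result(boolean valid, String reason, Map<String, Object> derived) {
            this.valid = valid;
            this.reason = reason;
            this.derived = derived;
        }
    }

    /** Placeholder for the expensive extraction step (a real application would call its extractor here). */
    static String extractText(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }

    /** Extracts once, validates against the extracted values, and keeps them for storage. */
    static Result process(String fileName, byte[] content, long maxBytes, int minWords) {
        Map<String, Object> derived = new LinkedHashMap<>();
        derived.put("fileName", fileName);
        derived.put("fileSize", (long) content.length);
        if (content.length > maxBytes) {
            // Cheap check first, so oversized files never reach the expensive step.
            return new Result(false, "file too large", derived);
        }
        String text = extractText(content).trim();
        int wordCount = text.isEmpty() ? 0 : text.split("\\s+").length;
        derived.put("text", text);
        derived.put("wordCount", wordCount);
        derived.put("charCount", text.length());
        if (wordCount < minWords) {
            return new Result(false, "too few words", derived);
        }
        return new Result(true, null, derived);
    }
}
```

          When the result is valid, the application would store the original bytes and the derived map in the same session save. How the map is written to nodes and properties is left to the application.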

           

          However, if you still want to use the built-in background sequencing capability (and don't mind the content being processed a second time), then your application is certainly able to attach more properties to the uploaded content. For example, StreamSequencer implementations are handed a StreamSequencerContext that contains the properties of the node being sequenced. So you can add as properties any additional metadata useful to your sequencer implementation, and your sequencer implementation can simply use those. Our out-of-the-box sequencers just don't do this because they derive all the output from the input content.

           

          As for having sequencer output stored on the input node, this is certainly possible. Yes, there are complications (and bugs) when the sequencers generate properties that are already there (they always overwrite any existing property value). A workaround might be to store the output content as a child node of the input node. WDYT?

          • 2. Re: Sequencers and handling errors and validation
            lisak

            Thank you Randall,

             

             

            I've been out for more than a month, so I'm going to get it done now.

             

            First I validate the file (which includes some processing that would otherwise be done by the sequencer), and I save the results as properties on the node to be sequenced. I can see in StreamSequencerAdapter that the properties of the input Node go into the StreamSequencerContext:

             

              
                protected StreamSequencerContext createStreamSequencerContext( Node input,
                                                                               Property sequencedProperty,
                                                                               SequencerContext context,
                                                                               Problems problems ) {
                    assert input != null;
                    assert sequencedProperty != null;
                    assert context != null;
                    assert problems != null;
                    ValueFactories factories = context.getExecutionContext().getValueFactories();
                    Path path = factories.getPathFactory().create(input.getLocation().getPath());
            
                    Set<org.modeshape.graph.property.Property> props = Collections.<Property>unmodifiableSet(input.getPropertiesByName()
                                                                                                                  .values());
                    Name fileName = path.getLastSegment().getName();
                    if (JcrLexicon.CONTENT.equals(fileName) && !path.isRoot()) {
                        fileName = path.getParent().getLastSegment().getName();
                    }
                    String mimeType = getMimeType(context, sequencedProperty, fileName.getLocalName());
                    return new StreamSequencerContext(context.getExecutionContext(), path, props, mimeType, problems);
                }
            
            

             

            Input Node for

             

            <mode:pathExpression>//(*.(doc|docx|xls|ppt)[*])/jcr:content[@jcr:data] => /whatever/$1</mode:pathExpression>
            

             

            would be jcr:content

            For example, StreamSequencer implementations are handed a StreamSequencerContext that contains the properties of the node being sequenced. So you can add as properties any additional metadata useful to your sequencer implementation, and your sequencer implementation can simply use those. Our out-of-the-box sequencers just don't do this because they derive all the output from the input content.

             

            What do you mean by "Our out-of-the-box sequencers just don't do this"? The StreamSequencers are out-of-the-box sequencers, aren't they?

            • 3. Re: Sequencers and handling errors and validation
              rhauch

              For example, StreamSequencer implementations are handed a StreamSequencerContext that contains the properties of the node being sequenced. So you can add as properties any additional metadata useful to your sequencer implementation, and your sequencer implementation can simply use those. Our out-of-the-box sequencers just don't do this because they derive all the output from the input content.

               

              What do you mean by "Our out-of-the-box sequencers just don't do this"? The StreamSequencers are out-of-the-box sequencers, aren't they?

              I mean that when you upload files into the repository, you can add additional properties to the node that will be sequenced (the "jcr:content" node in this case?). When the sequencers run, they have access to all these other properties on the node (via the StreamSequencerContext, as you mention above). The "out-of-the-box" sequencers are all the concrete implementations that ModeShape provides, and none of these concrete implementations happen to look for any extra properties. Of course, if you implement your own StreamSequencer, you could easily use these property values in your sequencer's logic.
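
              As an illustration only, here is a sketch of a custom sequencer that prefers a precomputed property over recomputing it. The context is modelled as a plain map rather than ModeShape's StreamSequencerContext, and PropertyAwareSequencer and the "wordCount" property name are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; the input properties are a plain map, not ModeShape's StreamSequencerContext. */
final class PropertyAwareSequencer {

    /** Counts how often the expensive fallback ran, for illustration only. */
    int expensiveRuns = 0;

    /** Stand-in for an expensive computation the upload code may already have done. */
    private int countWords(String text) {
        expensiveRuns++;
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Builds the output properties, reusing a precomputed "wordCount" input property
     * (hypothetical name) when the uploader attached one to the sequenced node.
     */
    Map<String, Object> sequence(String text, Map<String, Object> inputProperties) {
        Map<String, Object> output = new HashMap<>();
        Object precomputed = inputProperties.get("wordCount");
        if (precomputed instanceof Integer) {
            output.put("wordCount", precomputed);      // trust the value stored at upload time
        } else {
            output.put("wordCount", countWords(text)); // fall back to computing it
        }
        return output;
    }
}
```

              The fallback keeps the sequencer usable for content that was uploaded without the extra properties.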

               

              Hope that helps.