5 Replies Latest reply on Feb 24, 2009 5:44 AM by Tom Fennelly

    Smooks CSVs 2 XML transformation

    Patrick Pussar Newbie

      Hi all,
      I want to transform several csv files to one xml file.
      Each csv file holds information about one certain entity (customer, address, order, etc).
      The resulting XML should be logically structured, something like this:

      <order>
       <customer ...>
        <address .../>
       </customer>
      </order>
      


      Now I am not sure about the best approach to handle this.
      At the moment I plan a 3-step process:
      1. Convert each CSV to XML
      a,b,c,d,e
      


      becomes:
      <csv-set>
       <csv-record>
       <attr1>a</attr1>
       <attr2>b</attr2>
       <attr3>c</attr3>
       </csv-record>
      </csv-set>
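
      Step 1 is exactly what the Smooks CSV reader does out of the box (its default output element names are csv-set/csv-record). A minimal sketch, assuming the CSV cartridge namespace for the Smooks version in use (the exact version suffix may differ):

      ```xml
      <?xml version="1.0"?>
      <smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                            xmlns:csv="http://www.milyn.org/xsd/smooks/csv-1.1.xsd">
          <!-- Parse each comma-separated line into a <csv-record> with one child element per field -->
          <csv:reader fields="attr1,attr2,attr3,attr4,attr5" separator="," />
      </smooks-resource-list>
      ```

      Feeding "a,b,c,d,e" through a Smooks instance configured this way yields the csv-set/csv-record structure shown above, one record per line.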
      


      2. Aggregate all XML artefacts:
      <envelope>
       <file1>
       <csv-set>
       ...
       </csv-set>
       </file1>
       ....
       <fileX>
       ...
       </fileX>
      </envelope>
      


      3. Transform XML 2 XML
      The XML from step 2 will be transformed into my target format (see above).


      I'm not sure this is the best way to do it, since I'm not familiar with all of Smooks' features. It would be nice if someone could comment on this solution (positive or negative); any feedback is very welcome!




        • 1. Re: Smooks CSVs 2 XML transformation
          Tom Fennelly Master

          Yeah... an interesting one for sure. So ultimately, you're looking to interleave/merge the records from the different CSV streams and produce an XML of this?

          I assume the message coming into the ESB contains the names of the CSV files?

          Something like you suggested would work. Something else that may or may not be an option for you could be to stream the CSV data into a database, performing inserts/updates where appropriate and then generate the final "merged" view by querying the DB.
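
          The DB option Tom describes maps onto Smooks' DB routing support. A rough sketch, assuming the db-routing cartridge namespace and a pre-configured datasource; the datasource name, table, and bean properties here are made up for illustration, and the bean binding that populates the "record" bean is omitted:

          ```xml
          <?xml version="1.0"?>
          <smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                                xmlns:db="http://www.milyn.org/xsd/smooks/db-routing-1.1.xsd">
              <!-- Fire an INSERT each time a csv-record has been read and bound to the "record" bean -->
              <db:executor executeOnElement="csv-record" datasource="OrdersDS">
                  <db:statement>
                      INSERT INTO ORDERS_STAGING (ID, NAME) VALUES (${record.id}, ${record.name})
                  </db:statement>
              </db:executor>
          </smooks-resource-list>
          ```

          Because the records are streamed into the DB one at a time, memory use stays flat regardless of file size, and the final merged view is just a query.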

          • 2. Re: Smooks CSVs 2 XML transformation
            Patrick Pussar Newbie

             

            "tfennelly" wrote:
            Yeah... an interesting one for sure. So ultimately, you're looking to interleave/merge the records from the different CSV streams and produce an XML of this?

            I assume the message coming into the ESB contains the names of the CSV files?


            Yes, exactly: one file describes the delivery, i.e. it lists all the other filenames. That one gets picked up by the ESB.
            Using a DB in the middle could also be a valid approach, but I'm not 100% sure how volatile the files will be, so the DB could be hard to maintain across different formats or versions; I need to check this point.

            Thanks for your feedback.


            • 3. Re: Smooks CSVs 2 XML transformation
              Tom Fennelly Master

               

              "ama1" wrote:
              But I'm not 100% sure how volatile the files will be, so the DB could be hard to maintain across different formats or versions; I need to check this point.


              Well if you can define a canonical data model for these different formats/versions, then you could:

              1. Define a Java model for this canonical data model.
              2. Define Smooks binding configs for the different formats/versions, binding into the canonical form.
              3. Define Smooks DB routing configs to insert/update the data from the populated canonical data model into the DB.

              So, you handle the variations in formats/versions a bit more cleanly. If a new version/format comes into the equation, you create a new binding config for this (to the canonical model). The routing to the DB, from the canonical data model, is the same.
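
              Step 2 of that scheme might look like the sketch below, assuming the javabean cartridge namespace and a hypothetical canonical class com.acme.model.Customer (the class name and field mapping are illustrative, not from the thread). Each format/version gets its own config like this, all binding into the same beanId:

              ```xml
              <?xml version="1.0"?>
              <smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                                    xmlns:jb="http://www.milyn.org/xsd/smooks/javabean-1.1.xsd">
                  <!-- Format-specific binding: map this version's fields onto the canonical Customer -->
                  <jb:bindings beanId="customer" class="com.acme.model.Customer" createOnElement="csv-record">
                      <jb:value property="id"   data="csv-record/attrX1"/>
                      <jb:value property="name" data="csv-record/attrX2"/>
                  </jb:bindings>
              </smooks-resource-list>
              ```

              The downstream DB routing (step 3) only ever sees the "customer" bean, so it is untouched when a new input format arrives.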

              • 4. Re: Smooks CSVs 2 XML transformation
                Patrick Pussar Newbie

                Hi,
                I tried to solve the task without a DB for the moment.
                I used FreeMarker for templating, but I very quickly run into memory problems, which is no big surprise to me. The problem is the input format (step 2), which contains all files as separate XML trees, each entity in a separate tree.
                But I want to transform this into a hierarchical XML structure, i.e. normalise the dataset out of the several trees and end up with a nested, hierarchical XML structure (step 3).

                This is the point that causes trouble: to normalise the entities I need to hold at least two complete csv-set trees in memory to combine them, and this causes an OutOfMemory exception.

                Well, I'm not sure that I use FreeMarker in the most efficient way, so I have some questions:

                My input.xml:

                <envelope>
                 <file1>
                 <csv-set>
                 <csv-record>
                 <attrX1>a</attrX1>
                 <attrX2>b</attrX2>
                 <attrX3>c</attrX3>
                 </csv-record>
                 ....
                 </csv-set>
                 </file1>
                 <file2>
                 <csv-set>
                 <csv-record>
                 <attrY1>a</attrY1>
                 <attrY2>b</attrY2>
                 <attrY3>c</attrY3>
                 </csv-record>
                 ....
                 </csv-set>
                 </file2>
                </envelope>
                


                1. How does the createOnElement attribute work in detail?
                At the moment I use this config:
                <jb:bindings beanId="customer" class="java.util.Hashtable" createOnElement="envelope,file1">
                  <jb:wiring property="records" beanIdRef="file1Records" />
                </jb:bindings>
                <jb:bindings beanId="file1Records" class="java.util.ArrayList" createOnElement="envelope,file1">
                  <jb:wiring beanIdRef="file1Record"/>
                </jb:bindings>
                <jb:bindings beanId="file1Record" class="java.util.Hashtable" createOnElement="csv-record">
                  <jb:value property="attrX1" data="attrX1"/>
                  <jb:value property="attrX2" data="attrX2"/>
                  <jb:value property="attrX3" data="attrX3"/>
                </jb:bindings>
                ...
                


                I think the definition 'createOnElement="envelope,file1"' is not the best one? But when I remove "envelope" there, my entities don't get initialized...?


                2. Template splitting
                I had a look into this example: http://docs.codehaus.org/display/MILYN/Processing+Huge+Messages+with+Smooks

                It looks like the solution to my problem, but it seems to me that only one split is possible per template? Is this correct?
                If so, I can't use it, since I would need one split per entity tree...


                Sorry for my semi-knowledge, but I am really curious whether I can handle this without using a DB :-)



                • 5. Re: Smooks CSVs 2 XML transformation
                  Tom Fennelly Master

                  So first off... 'createOnElement="envelope,file1"' is definitely not correct. That will result in objects being created on both the "envelope" and "file1" elements. You should set it to 'createOnElement="envelope/file1"', i.e. it's a contextual path, not a comma-separated list.
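
                  Applied to the bindings posted above, only the createOnElement paths change; a minimal sketch of the corrected form:

                  ```xml
                  <jb:bindings beanId="customer" class="java.util.Hashtable" createOnElement="envelope/file1">
                    <jb:wiring property="records" beanIdRef="file1Records" />
                  </jb:bindings>
                  <jb:bindings beanId="file1Records" class="java.util.ArrayList" createOnElement="envelope/file1">
                    <jb:wiring beanIdRef="file1Record"/>
                  </jb:bindings>
                  ```

                  With the path form, each bean is created exactly once, when the parser enters the file1 element inside envelope, rather than once per listed element name.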

                  So if you're running into memory issues, it sounds as though you have a huge dataset. My guess is that it has nothing at all to do with FreeMarker. Because of what you're trying to do, it sounds like you need to "sort" the data in one way or another before generating the result. If you can't do that in memory (because there's too much data), then you may need to consider a DB. I'm sure it would be possible for you to use the file system in some way, but a DB has the query, sort and paging capability built in.

                  Re the example you looked at: what do you mean by "only one split is possible per template"? The template is not doing any splitting. The splitting is done by Smooks, and the template just generates the output. You can perform multiple concurrent (and/or conditional) splits.
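
                  In config terms, "multiple concurrent splits" just means multiple templating resources, each applied to its own element. A hedged sketch, assuming the FreeMarker cartridge namespace and hypothetical template paths (in practice you'd combine this with output routing, as in the linked "Processing Huge Messages" example):

                  ```xml
                  <?xml version="1.0"?>
                  <smooks-resource-list xmlns="http://www.milyn.org/xsd/smooks-1.1.xsd"
                                        xmlns:ftl="http://www.milyn.org/xsd/smooks/freemarker-1.1.xsd">
                      <!-- One split per entity tree: each template fires on its own record element -->
                      <ftl:freemarker applyOnElement="file1/csv-set/csv-record">
                          <ftl:template>/templates/file1-record.ftl</ftl:template>
                      </ftl:freemarker>
                      <ftl:freemarker applyOnElement="file2/csv-set/csv-record">
                          <ftl:template>/templates/file2-record.ftl</ftl:template>
                      </ftl:freemarker>
                  </smooks-resource-list>
                  ```

                  Each record is templated and released as it streams past, so no complete csv-set tree ever needs to be held in memory.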

                  Can you provide more detailed info please i.e. the exact format of the input files, as well as the exact format of the required output file, with real sample data. Don't imply anything, or assume I'll read between the lines and fill in the gaps. Don't leave anything to my imagination re what you're starting with and what you want as a result!!