
We often need to process text file data while programming. Here is an example that illustrates how to group and summarize text file data in Java: load employee information from the text file employee.txt, group it by DEPT, and compute COUNT, the number of employees, and the total SALARY of each group.

 

The text file employee.txt has the following format:

Java group1.jpg

In Java, the task is coded as follows:

  1. Import the data from the file row by row and save each row as emp, a Map object, collecting them in sourceList, a List object.
  2. Traverse sourceList, group the records by DEPT, and save the result in group, a Map object whose values are the per-department List objects.
  3. Traverse group, then traverse each department's list and sum up SALARY.
  4. While traversing group, save the values of DEPT, COUNT and SALARY in result, a Map object, and collect the results of the different departments in resultList, a List object.
  5. Print out the data of resultList.

The code is as follows:

public static void myGroup() throws Exception {
    File file = new File("D:\\esProc\\employee.txt");
    FileInputStream fis = new FileInputStream(file);
    InputStreamReader input = new InputStreamReader(fis);
    BufferedReader br = new BufferedReader(input);
    String line = null;
    String[] info = null;
    List<Map<String, String>> sourceList = new ArrayList<Map<String, String>>();
    List<Map<String, Object>> resultList = new ArrayList<Map<String, Object>>();
    try {
        if ((line = br.readLine()) == null) return; // skip the header row; exit if the file is empty
        while ((line = br.readLine()) != null) {
            info = line.split("\t");
            Map<String, String> emp = new HashMap<String, String>();
            emp.put("EID", info[0]);
            emp.put("NAME", info[1]);
            emp.put("SURNAME", info[2]);
            emp.put("GENDER", info[3]);
            emp.put("STATE", info[4]);
            emp.put("BIRTHDAY", info[5]);
            emp.put("HIREDATE", info[6]);
            emp.put("DEPT", info[7]);
            emp.put("SALARY", info[8]);
            sourceList.add(emp);
        }
    } finally {
        br.close();
    }
    Map<String, List<Map<String, String>>> group = new HashMap<String, List<Map<String, String>>>(); // grouping object
    for (int i = 0, len = sourceList.size(); i < len; i++) { // group data by DEPT
        Map<String, String> emp = sourceList.get(i);
        if (group.containsKey(emp.get("DEPT"))) {
            group.get(emp.get("DEPT")).add(emp);
        } else {
            List<Map<String, String>> list = new ArrayList<Map<String, String>>();
            list.add(emp);
            group.put(emp.get("DEPT"), list);
        }
    }
    Set<String> key = group.keySet();
    for (Iterator<String> it = key.iterator(); it.hasNext();) { // summarize the grouped data
        String dept = it.next();
        List<Map<String, String>> list = group.get(dept);
        double salary = 0;
        for (int i = 0, len = list.size(); i < len; i++) {
            salary += Double.parseDouble(list.get(i).get("SALARY")); // parse as double to avoid float rounding
        }
        Map<String, Object> result = new HashMap<String, Object>();
        result.put("DEPT", dept);
        result.put("SALARY", salary);
        result.put("COUNT", list.size());
        resultList.add(result);
    }
    for (int i = 0, len = resultList.size(); i < len; i++) { // print out the resulting data
        System.out.println("dept=" + resultList.get(i).get("DEPT") +
                "||salary=" + resultList.get(i).get("SALARY") +
                "||count=" + resultList.get(i).get("COUNT"));
    }
}

The results after the code is executed are as follows:

dept=Sales||salary=1362500.0||count=187

dept=Finance||salary=177500.0||count=24

dept=Administration||salary=40000.0||count=4

dept=Production||salary=663000.0||count=91

dept=Marketing||salary=733500.0||count=99

 

Here the myGroup function has only one grouping field. With multiple grouping fields, nested collection classes would be needed and the code would become more complicated. Since myGroup fixes its grouping and summarizing fields, any change to them means modifying the program, so the function cannot handle flexible, dynamic grouping and summarizing. To let it handle such situations the way an SQL statement does, we would need to develop an additional program for parsing and evaluating dynamic expressions, which is a rather difficult job.
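To see how far plain Java can go, here is a minimal sketch (the class and method names are illustrative, not from the article) that parameterizes the grouping fields with Java 8 streams. It removes some of the nesting boilerplate, but it still cannot accept a whole summarizing expression such as count(~):count as text, which is the gap the dynamic-expression evaluator would have to fill:

```java
import java.util.*;
import java.util.stream.*;

public class GroupDemo {
    // Group rows (field -> value maps) by an arbitrary list of field names,
    // then compute the row count and the SALARY sum per group. Because the
    // field names are passed in, changing the grouping no longer means
    // rewriting the loop, though the aggregates themselves stay hard-coded.
    public static Map<List<String>, double[]> groupAndSum(
            List<Map<String, String>> rows, List<String> groupFields) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> groupFields.stream().map(r::get).collect(Collectors.toList()),
                Collectors.collectingAndThen(Collectors.toList(), list -> new double[]{
                        list.size(),
                        list.stream()
                            .mapToDouble(m -> Double.parseDouble(m.get("SALARY")))
                            .sum()
                })));
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
                Map.of("DEPT", "Sales", "SALARY", "9000"),
                Map.of("DEPT", "Sales", "SALARY", "11000"),
                Map.of("DEPT", "HR", "SALARY", "7000"));
        // count and salary sum for the Sales group
        System.out.println(Arrays.toString(groupAndSum(rows, List.of("DEPT"))
                .get(List.of("Sales")))); // prints [2.0, 20000.0]
    }
}
```

Passing List.of("DEPT", "GENDER") instead would group by two fields without touching the method body, which is exactly the flexibility the fixed myGroup lacks.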

 

esProc, a programming language specially designed for processing structured (and semi-structured) data, performs dynamic grouping and summarizing easily and serves as a better assistive tool here. It integrates with Java seamlessly, enabling Java to access and process text file data as dynamically as SQL does.

 

For example, to group by DEPT and compute COUNT, the number of employees, and the total SALARY of each group, esProc can accept an external input parameter “groupBy” as the dynamic grouping and summarizing condition. See the following chart:

Java group2.jpg          

The value of “groupBy” is DEPT:dept;count(~):count,sum(SALARY):salary. esProc needs only three lines of code as follows to do this job:

Java group3.jpg

A1: Define a file object and import the data. The first row is the header, and tab is the default field separator. esProc’s IDE can display the imported data visually, as in the right part of the above chart.

 

A2: Group and summarize according to the specified fields. Here a macro is used to parse the expression dynamically, with groupBy as an input parameter. esProc first computes the expression enclosed by ${…}, then replaces ${…} with the computed result as the macro string value, and finally interprets and executes the resulting code. In this example, the code finally executed is =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary).

 

A3: Return the result set to the external program.

When the grouping fields change, there is no need to modify the program; we just change the parameter groupBy. For example, to group by the two fields DEPT and GENDER and compute COUNT, the number of employees, and the total SALARY of each group, the value of parameter groupBy can be written as DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary. After execution, the result set in A2 is as follows:

Java group4.jpg 

Finally, call this piece of esProc code in Java through the JDBC driver provided by esProc to get the grouping and summarizing result. Save the above esProc code as the file test.dfx; the Java calling code is as follows:

// create the esProc JDBC connection
Class.forName("com.esproc.jdbc.InternalDriver");
con = DriverManager.getConnection("jdbc:esproc:local://");
// call the esProc code (the stored procedure); test is the file name of the dfx
com.esproc.jdbc.InternalCStatement st;
st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call test(?)");
// set the parameter: the dynamic grouping and summarizing fields
st.setObject(1, "DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary");
// execute the esProc stored procedure
st.execute();
// get the result set
ResultSet set = st.getResultSet();

For relatively simple esProc code like this, Java can execute the script directly, making it unnecessary to write an esProc script file (like the above test.dfx). The code is as follows:

st = (com.esproc.jdbc.InternalCStatement) con.createStatement();
ResultSet set = st.executeQuery("=file(\"D:\\\\esProc\\\\employee.txt\").import@t().groups(DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary)");

The above Java code directly executes a single line of esProc script: it reads data from the text file, groups and summarizes it by the specified fields, and returns the result to set, a Java ResultSet object.


In the last part we mentioned that the Operation Department of the Web Company brought a new demand: adding new conditions to the way online time is computed. As the IT department was using esProc as its computation tool, such requirement changes are easy to handle. Meanwhile, the growing amount of data can be accommodated by out-of-memory computation with esProc’s file cursor functionality.

 

First, let’s review how the user behavior information is recorded in the Web Company. Data is recorded in log files, with a separate file generated each day. For example, the log file “2014-01-07.log” contains the users’ online actions on January 7, 2014. To compute the online time for each user in the week of 2014-01-05 to 2014-01-11, we need to retrieve data from 7 log files:

logtime                             userid               action

2014-01-07 09:27:56        258872799       login

2014-01-07 09:27:57        264484116       login

2014-01-07 09:27:58        264484279       login

2014-01-07 09:27:58        264548231       login

2014-01-07 09:27:58        248900695       login

2014-01-07 09:28:00        263867071       login

2014-01-07 09:28:01        264548400       login

2014-01-07 09:28:02        264549535       login

2014-01-07 09:28:02        264483234       login

2014-01-07 09:28:03        264484643       login

2014-01-07 09:28:05        308343890       login

2014-01-07 09:28:08        1210636885     post

2014-01-07 09:28:09        263786154       login

2014-01-07 09:28:12        263340514       get

2014-01-07 09:28:13        312717032       login

2014-01-07 09:28:16        263210957       login

2014-01-07 09:28:19        116285288       login

2014-01-07 09:28:22        311560888       login

2014-01-07 09:28:25        652277973       login

2014-01-07 09:28:34        310100518       login

2014-01-07 09:28:38        1513040773     login

2014-01-07 09:28:41        1326724709     logout

2014-01-07 09:28:45        191382377       login

2014-01-07 09:28:46        241719423       login

2014-01-07 09:28:46        245054760       login

2014-01-07 09:28:46        1231483493     get

2014-01-07 09:28:48        266079580       get

2014-01-07 09:28:51        1081189909     post

2014-01-07 09:28:51        312718109       login

2014-01-07 09:29:00        1060091317     login

2014-01-07 09:29:02        1917203557     login

2014-01-07 09:29:16        271415361       login

2014-01-07 09:29:18        277849970       login

 

Log files record, in chronological order, users’ operations (action), user IDs (userid) and the times when the actions took place (logtime). User operations fall into three types: login, logout, and get/post actions.

Previously, the Operation Department provided the following requirements for computing users’ online time:

  1. Login is the starting point of online time, and overnight sessions should be taken into consideration.
  2. If the time interval between any two operations is less than 3 seconds, this interval is not added to the online time.
  3. If, after login, the time interval between two operations is longer than 600 seconds, the user is considered logged out.
  4. If there is a login without a logout, the last operation time is treated as the logout time.

 

Over time, the Operation Department found some “key points” in user behavior: between login and logout, users who perform post actions are more loyal to the online application. Therefore, the Web Company plans to introduce an incentive: on top of the original rules, if a user performs a post operation, his/her online time is tripled in the computation.
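A simplified single-user sketch in Java may make these rules concrete (the Event type and method name are illustrative; the overnight handling and cursor-based batching of the real esProc script are omitted):

```java
import java.util.List;

public class OnlineTime {
    // One log record: time in epoch seconds plus the action name.
    public static class Event {
        final long t;
        final String action;
        public Event(long t, String action) { this.t = t; this.action = action; }
    }

    // Online seconds for one user's chronologically ordered events, following
    // the rules above: gaps under 3 s don't count, a gap over 600 s ends the
    // session, a missing logout uses the last operation, and a session that
    // contains a "post" is counted triple.
    public static long onlineSeconds(List<Event> events) {
        long total = 0, session = 0;
        boolean inSession = false, hasPost = false;
        Event prev = null;
        for (Event e : events) {
            if (inSession && prev != null) {
                long gap = e.t - prev.t;
                if (gap > 600) {            // rule 3: treat the user as logged out
                    total += hasPost ? session * 3 : session;
                    inSession = false;
                } else if (gap >= 3) {      // rule 2: gaps under 3 s are not added
                    session += gap;
                }
            }
            if (!inSession && e.action.equals("login")) {
                inSession = true;
                session = 0;
                hasPost = false;
            } else if (inSession && e.action.equals("post")) {
                hasPost = true;             // the new incentive rule
            } else if (inSession && e.action.equals("logout")) {
                total += hasPost ? session * 3 : session;
                inSession = false;
            }
            prev = e;
        }
        if (inSession) total += hasPost ? session * 3 : session; // rule 4
        return total;
    }
}
```

For example, a session login(t=0), get(t=100), post(t=200), logout(t=300) accumulates 300 seconds and, containing a post, is counted as 900.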

 

After receiving the task, the IT engineer considered the possibility of future adjustments to the computation, plus the need for added conditions, and decided to use an out-of-memory cursor and a for loop to implement it.

 

Analysis shows that most user behavior analysis is done for each user independently. Thus, if the log files are pre-sorted by userid, the performance of the various analytical computations improves, with reduced difficulty and shortened processing time. The pre-processing program is as follows:

user online time II 1.jpg

As we can see, pre-processing means sorting the seven days’ log files and outputting them to a binary file, eliminating the need for subsequent merging and sorting. Meanwhile, the binary file format provided by esProc also helps to raise read/write performance.
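The idea of that pre-sort can be sketched in plain Java (the line format and method name are assumptions; esProc does this with its own sort-and-export primitives over the real seven files):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PreSort {
    // Sort raw log lines of the form "logtime<TAB>userid<TAB>action" by userid
    // first and logtime second, so each user's events become contiguous and can
    // later be processed in a single pass.
    public static List<String> sortByUser(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator
                .comparing((String l) -> l.split("\t")[1])  // userid (string order is enough to make users contiguous)
                .thenComparing(l -> l.split("\t")[0]));     // logtime (ISO format sorts lexically)
        return sorted;
    }
}
```

After this step, every per-user computation becomes a scan over one contiguous run of lines instead of a search across seven files.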

After pre-processing, the code for computing online time can be written as follows:

user online time II 2.jpg

Note that:

  1. The volume of data for one user over seven days is not big, so in cell A5 we can retrieve all log data for a user into memory in one batch.
  2. In the loop that cycles over each user, the code in the red box implements the new business logic: a login session containing a post operation has its online time tripled. Unqualified records are removed in cell B9, and in B10 a serial number (lognum) is calculated for every login. Records are grouped by lognum to compute the sum of onlinetime for each group; if there is at least one post action in the current group of operations, the group’s onlinetime sum is tripled.
  3. Considering the relatively large result, each time the computation has been done for 10,000 users and the result reaches 10,000 lines, we do a batch output of the data from memory to a result file. This improves performance while avoiding memory overflow.
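Point 3 above can be sketched as a small buffered writer (the class name and batch-size parameter are illustrative, not the article's actual code):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class BatchWriter {
    private final List<String> buffer = new ArrayList<>();
    private final Writer out;
    private final int batchSize;

    public BatchWriter(Writer out, int batchSize) {
        this.out = out;
        this.batchSize = batchSize;
    }

    // Buffer result rows and flush every batchSize rows, so the complete
    // result set never needs to be held in memory at once.
    public void add(String row) throws IOException {
        buffer.add(row);
        if (buffer.size() >= batchSize) flush();
    }

    public void flush() throws IOException {
        for (String r : buffer) out.write(r + "\n");
        out.flush();
        buffer.clear();
    }

    // A tiny self-contained demonstration against an in-memory writer.
    public static String demo() {
        try {
            StringWriter w = new StringWriter();
            BatchWriter b = new BatchWriter(w, 2);
            b.add("a");
            b.add("b"); // the second add triggers the first flush
            b.add("c");
            b.flush();  // final flush for the partial batch
            return w.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With batchSize set to 10,000 and a FileWriter in place of the StringWriter, this is the same flush-per-batch pattern the esProc script applies.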

 

After meeting this demand, the IT engineers in the Web Company found that the single-threaded program does not take full advantage of the server’s computing power. Here comes another question: can the engineers leverage esProc’s multi-threaded parallel computing capability to make full use of the server’s quad dual-core CPUs? Is it troublesome to shift from single-threaded to multi-threaded? See “Computing the Online Time for Users with esProc (III)”.

Here I provide a more convenient way to access and process HTTP data, illustrating some use cases in esProc, an innovative data processing language.

t4.jpg

In this example, a servlet provides external queries of employee information in JSON format. The servlet accesses the employee table in the database, which stores employee information as follows:

EID  NAME      SURNAME   GENDER  STATE        BIRTHDAY    HIREDATE    DEPT       SALARY
1    Rebecca   Moore     F       California   1974-11-20  2005-03-11  R&D        7000
2    Ashley    Wilson    F       New York     1980-07-19  2008-03-16  Finance    11000
3    Rachel    Johnson   F       New Mexico   1970-12-17  2010-12-01  Sales      9000
4    Emily     Smith     F       Texas        1985-03-07  2006-08-15  HR         7000
5    Ashley    Smith     F       Texas        1975-05-13  2004-07-30  R&D        16000
6    Matthew   Johnson   M       California   1984-07-07  2005-07-07  Sales      11000
7    Alexis    Smith     F       Illinois     1972-08-16  2002-08-16  Sales      9000
8    Megan     Wilson    F       California   1979-04-19  1984-04-19  Marketing  11000
9    Victoria  Davis     F       Texas        1983-12-07  2009-12-07  HR         3000

The servlet’s doGet function receives employee id strings in JSON format, queries the corresponding employee information from the database, generates an employee information list in JSON format, and returns it. The process of reading the database and generating the employee information is omitted in the following code:

protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    String inputString = (String) req.getParameter("input");
    // inputString's value is "[{EID:8},{EID:32},{EID:44}]"
    if (inputString == null) inputString = "";
    String outputString = "";

    {...} // code for querying the database with inputString and generating outputString is omitted here

    // the generated outputString looks like
    // "[{EID:8,NAME:\"Megan\",SURNAME:\"Wilson\",GENDER:\"F\",STATE:\..."
    resp.setContentType("text/json; charset=GBK"); // set the content type before writing the body
    resp.getOutputStream().println(outputString);
}

 

The following code can be used for esProc to access this http servlet:

HTTP1.jpg

A1: Define the input parameter to be submitted to the servlet, i.e. the employee id list in JSON format.

A2: Define the httpfile object; the URL is http://localhost:8080/demo/testServlet?input=[{EID:8},{EID:32},{EID:44}].

A3: Import the result returned by the httpfile object in A2.

A4: Parse the JSON-format information of each employee by rows and create a sequence.

A5: Compute on the table sequence in A4, combining SURNAME and NAME into FULLNAME.

A6: Export the results of A5 to a text file.

esProc JDBC is like an incomplete database JDBC driver without physical tables. It regards esProc as a database that only has stored procedures (strong computing power, weak storage mechanism). Similar to using a database JDBC driver, esProc JDBC calls an esProc program as a stored procedure. The difference is that esProc JDBC is a fully embedded computing engine: all computations are completed in the embedded package rather than by an independent server as with databases.

 

1 Loading driver

The jars needed by esProc JDBC are dm.jar, poi-3.7-20101029.jar, log4j_128.jar, icu4j_3_4_5.jar and dom4j-1.6.1.jar. These five jars can be found in \esProc\lib under esProc’s IDE installation directory. Load them when starting the Java application; for a web application they can be put in the WEB-INF/lib directory.

Note that esProc JDBC requires JDK 1.6 or higher.

 

2 Modifying configuration files config.xml and dfxConfig.xml

Prepare the file config.xml, which contains esProc’s basic configuration information, such as the registration code, addressing path, main path and data source configuration. The file can be found in the esProc\config directory under esProc’s installation directory, and its configuration can be modified before deployment (for a detailed explanation of the configuration information, see the Appendix).

Configuring authorization information

Configure as follows in file config.xml:

<regCode> license</regCode>

Here license represents the authorization code. esProc currently provides users with free distributions for integration; free authorization codes are available on the official website.

Save config.xml and dfxConfig.xml in classpath of application projects.

Note that the names of the configuration files must be config.xml and dfxConfig.xml and cannot be changed. Also, re-entry is forbidden: when configuring database connection information, esProc JDBC itself cannot be used as a data source and configured as a database connection.

 

3 Deploying esProc program

Put the pre-edited esProc script (dfx file) in classpath of the application project, or put it in the path designated by <paths/> node of dfxConfig.xml file.

 

4 Java’s calling of esProc program

We’ll look at the cases where an esProc program returns a single result set and where it returns multiple result sets.


4.1 Single result set

esProc script

JDBC integration1.jpg  

Java’s calling

public class CallSingleRS {
    public void testDataServer() {
        Connection con = null;
        com.esproc.jdbc.InternalCStatement st;
        try {
            // create a connection
            Class.forName("com.esproc.jdbc.InternalDriver");
            con = (Connection) DriverManager.getConnection("jdbc:esproc:local://");
            // call the stored procedure; the dfx file's name is singleRs
            st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call singleRs(?)");
            // set the parameter
            st.setObject(1, "5");
            // execute the stored procedure
            st.execute();
            // get the result set
            ResultSet rs = (ResultSet) st.getResultSet();
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            // close the connection
            if (con != null) {
                try {
                    con.close();
                } catch (Exception e) {
                    System.out.println(e);
                }
            }
        }
    }
}


4.2 Multiple result sets

esProc script

  JDBC ingegration 2.jpg

Java’s calling

Here only the main code is provided; the rest is the same as in the previous example.

 

// call the stored procedure
st = (com.esproc.jdbc.InternalCStatement) con.prepareCall("call MutipleRs()");
// execute the stored procedure
boolean hasResult = st.execute();
// if the execution returns result sets
if (hasResult) {
    // get the multiple result sets
    ResultSet set = st.getResultSet();
    int csize = set.getMetaData().getColumnCount();
    // the multiple result sets are data rows of one column, so csize is 1
    while (set.next()) {
        Object o = set.getObject(1);
        // each read returns one table sequence; the table sequences of A2 and A3 are retrieved in turn
    }
}

 

 

Appendix Explanation of Configuration Information

config.xml

<?xml version="1.0" encoding="UTF-8"?>
<Config Version="1">
    <regCode>W2r@7-8cLMJ-GVU33-BhTIB3E</regCode>
    <charSet>ISO-8859-1</charSet>
    <!-- Configure the addressing path of dfx files, an absolute path. Multiple paths can be set, separated by semicolons. dfx files can also be put in the classpath of the application project; the classpath takes precedence over the addressing path when loading files -->
    <dfxPathList>
        <dfxPath>D:\files\dfx</dfxPath>
    </dfxPathList>
    <dateFormat>yyyy-MM-dd</dateFormat>
    <timeFormat>HH:mm:ss</timeFormat>
    <dateTimeFormat>yyyy-MM-dd HH:mm:ss</dateTimeFormat>
    <!-- Configuration method one: configure the connection pool in the application server and designate the data source name here -->
    <DBList>
        <!-- The data source name must be consistent with that in the dfx file -->
        <DB name="demo">
            <property name="url" value="jdbc:hsqldb:hsql://127.0.0.1/demo"/>
            <property name="driver" value="org.hsqldb.jdbcDriver"/>
            <property name="type" value="HSQL"/>
            <property name="user" value="sa"/>
            <property name="password" value=""/>
            <property name="batchSize" value="1000"/>
            <!-- Whether to connect automatically. If true, the db.query() function can be used directly to access the database; if false, the connection is not automatic and the connect(db) statement must be used first -->
            <property name="autoConnect" value="true"/>
            <property name="useSchema" value="false"/>
            <property name="addTilde" value="false"/>
        </DB>
    </DBList>
    <mainPath>D:\tools\raqsoft\main</mainPath>
    <tempPath>D:\tools\raqsoft\main\temp</tempPath>
    <bufSize>65536</bufSize>
    <localHost>192.168.0.86</localHost>
    <localPort>8282</localPort>
</Config>

dfxConfig.xml

<?xml version="1.0" encoding="UTF-8"?>
<dfxConfig>
    <!-- Maximum concurrency. Sets the maximum number of jobs allowed to run concurrently; jobs operating simultaneously in the same connection also count as concurrent jobs. The maximum must be within the authorized limit. -->
    <maxConcurrents>10</maxConcurrents>
    <!-- The longest wait time. If the tasks in operation exceed the maximum concurrency, the extra tasks wait; if the wait time reaches this limit, an error message is shown. -->
    <maxWaitTime>5000</maxWaitTime>
    <!-- The log configuration property file -->
    <log>logger.properties</log>
    <!-- Configure the connection pool in the application server and designate the data source name here -->
    <jndi-ds-configs>
        <!-- jndi-prefix -->
        <jndi-prefix>java:comp/env</jndi-prefix>
        <!-- The data source name must be consistent with that in the dfx files -->
        <jndi-ds-config>
            <name>olap</name>
            <dbType>ACCESS</dbType>
            <dbCharset>ISO-8859-1</dbCharset>
            <clientCharset>ISO-8859-1</clientCharset>
            <needTranContent>false</needTranContent>
            <needTranSentence>false</needTranSentence>
            <!-- Whether to connect automatically. If true, the db.query() function can be used directly to access the database; if false, the connection is not automatic and the connect(db) statement must be used first -->
            <autoConnect>true</autoConnect>
        </jndi-ds-config>
    </jndi-ds-configs>
</dfxConfig>

We often encounter situations requiring text file data processing. Here we’ll look at how to perform conditional filtering on text files in Java through an example: read employee information from the text file employee.txt and select the female employees born on or after January 1, 1981.

The text file employee.txt has the following format:

EID  NAME      SURNAME   GENDER  STATE         BIRTHDAY    HIREDATE    DEPT       SALARY
1    Rebecca   Moore     F       California    1974-11-20  2005-03-11  R&D        7000
2    Ashley    Wilson    F       New York      1980-07-19  2008-03-16  Finance    11000
3    Rachel    Johnson   F       New Mexico    1970-12-17  2010-12-01  Sales      9000
4    Emily     Smith     F       Texas         1985-03-07  2006-08-15  HR         7000
5    Ashley    Smith     F       Texas         1975-05-13  2004-07-30  R&D        16000
6    Matthew   Johnson   M       California    1984-07-07  2005-07-07  Sales      11000
7    Alexis    Smith     F       Illinois      1972-08-16  2002-08-16  Sales      9000
8    Megan     Wilson    F       California    1979-04-19  1984-04-19  Marketing  11000
9    Victoria  Davis     F       Texas         1983-12-07  2009-12-07  HR         3000
10   Ryan      Johnson   M       Pennsylvania  1976-03-12  2006-03-12  R&D        13000
11   Jacob     Moore     M       Texas         1974-12-16  2004-12-16  Sales      12000
12   Jessica   Davis     F       New York      1980-09-11  2008-09-11  Sales      7000
13   Daniel    Davis     M       Florida       1982-05-14  2010-05-14  Finance    10000


In Java, the code reads data from the file row by row, saves it in a List object, traverses the List, and saves the eligible records in a resulting List object. Lastly, it prints out the number of eligible employees. The detailed code is as follows:

public static void myFilter() throws Exception {
    File file = new File("D:\\employee.txt");
    FileInputStream fis = new FileInputStream(file);
    InputStreamReader input = new InputStreamReader(fis);
    BufferedReader br = new BufferedReader(input);
    String line = null;
    String[] info = null;
    List<Map<String, String>> sourceList = new ArrayList<Map<String, String>>();
    List<Map<String, String>> resultList = new ArrayList<Map<String, String>>();
    if ((line = br.readLine()) == null) return; // skip the first line; exit if the file is empty
    while ((line = br.readLine()) != null) { // import from the file into memory
        info = line.split("\t");
        Map<String, String> emp = new HashMap<String, String>();
        emp.put("EID", info[0]);
        emp.put("NAME", info[1]);
        emp.put("SURNAME", info[2]);
        emp.put("GENDER", info[3]);
        emp.put("STATE", info[4]);
        emp.put("BIRTHDAY", info[5]);
        sourceList.add(emp);
    }
    br.close();
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    for (int i = 0, len = sourceList.size(); i < len; i++) { // process data by rows
        Map<String, String> emp = sourceList.get(i);
        if (emp.get("GENDER").equals("F")
                && !sdf.parse(emp.get("BIRTHDAY")).before(sdf.parse("1981-01-01"))) {
            // save the eligible records via the conditional statement
            resultList.add(emp);
        }
    }
    System.out.println("count=" + resultList.size()); // print out the number of eligible employees
}


The filtering condition of this function is fixed; if the condition changes, the conditional statement must be modified accordingly. Multiple pieces of code are needed for multiple conditions, and the program cannot handle ad hoc, dynamic conditions. Now we’ll rewrite the code to make it somewhat more universal by slightly changing the loop that traverses sourceList:

for (int i = 0, len = sourceList.size(); i < len; i++) {
    Map<String, String> emp = sourceList.get(i);
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    boolean isRight = true;
    if (gender != null && !emp.get("GENDER").equals(gender)) { // process the gender condition
        isRight = false;
    }
    if (start != null && sdf.parse(emp.get("BIRTHDAY")).before(start)) { // process the starting condition on BIRTHDAY
        isRight = false;
    }
    if (end != null && sdf.parse(emp.get("BIRTHDAY")).after(end)) { // process the end condition on BIRTHDAY
        isRight = false;
    }
    if (isRight) resultList.add(emp); // save the eligible records in the resulting list
}

In the rewritten code, gender, start and end are input parameters of the function myFilter. The program handles the situations where the GENDER field equals the input value gender, and the BIRTHDAY field is greater than or equal to the input value start and less than or equal to the input value end. If any of the input values is null, that condition is ignored. The conditions are joined by AND.


If we want to make myFilter a truly universal function, for example joining conditions with OR or allowing computation between fields, the code becomes much more complicated and requires a program for parsing and evaluating dynamic expressions. Such a program could be as flexible and universal as database SQL, but it is really difficult to develop.
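Short of a full expression evaluator, Java's Predicate composition covers part of this ground: conditions can be assembled with and()/or() at run time, though they still cannot arrive as a text string. A sketch, with all names hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterDemo {
    // Apply any composed condition to the employee rows.
    public static List<Map<String, String>> filter(
            List<Map<String, String>> rows, Predicate<Map<String, String>> cond) {
        return rows.stream().filter(cond).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> female = r -> "F".equals(r.get("GENDER"));
        Predicate<Map<String, String>> born1981on =
                r -> r.get("BIRTHDAY").compareTo("1981-01-01") >= 0; // ISO dates compare lexically
        List<Map<String, String>> rows = List.of(
                Map.of("GENDER", "F", "BIRTHDAY", "1985-03-07"),
                Map.of("GENDER", "F", "BIRTHDAY", "1975-05-13"),
                Map.of("GENDER", "M", "BIRTHDAY", "1984-07-07"));
        // Conditions are combined at run time; and() is used here, or() works the same way.
        System.out.println(filter(rows, female.and(born1981on)).size()); // prints 1
    }
}
```

This removes the per-condition if blocks, but a condition typed in by a user at run time still needs parsing, which is where an expression-evaluating tool earns its keep.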


In view of this, we can turn to esProc for this task. esProc is a programming language designed for processing structured (and semi-structured) data. It performs the above universal query easily and integrates with Java seamlessly, so Java can access and process text file data as flexibly as SQL does. For example, to query the female employees born on or after January 1, 1981, esProc can accept an external input parameter “where” as the dynamic condition; see the following chart:

   java filter1.jpg

The value of "where" is: BIRTHDAY>=date(1981,1,1) && GENDER=="F". esProc needs only three lines of code, as follows:

java filter2.jpg

 

A1: Define a file object and import its data. By default the first row is the header and tab is the field separator. esProc's IDE can display the imported data visually, as shown on the right of the chart above.

A2: Filter according to the condition. A macro is used here to evaluate the expression dynamically. "where" is the input parameter. esProc first computes the expression enclosed in ${…}, then replaces ${…} with the computed result as the macro string value, and finally interprets and executes the resulting code. In this example, the code finally executed is =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F").

A3: Return the eligible result set to the external program.
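The macro substitution in A2 behaves much like plain string splicing before interpretation. As a rough Java analogy (the template marker and method names here are illustrative, not part of esProc's API):

```java
public class MacroDemo {
    // Sketch of what the ${where} macro does in A2: the parameter value is
    // spliced into the expression text before the expression is interpreted.
    static String expand(String template, String where) {
        return template.replace("${where}", where);
    }

    public static void main(String[] args) {
        String where = "BIRTHDAY>=date(1981,1,1) && GENDER==\"F\"";
        System.out.println(expand("=A1.select(${where})", where));
        // prints: =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F")
    }
}
```

The key point is that substitution happens first, so the final text is ordinary esProc code by the time it is executed.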


When the filtering condition changes, we just need to change the parameter "where" without rewriting the code. For example, suppose the condition is modified to query female employees born on or after January 1, 1981, or employees whose NAME+SURNAME equals "RebeccaMoore". The parameter value for where can then be: BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After execution, the result set in A2 is shown in the following chart:

java filter3.jpg

Finally, call this piece of esProc code from Java to get the filtering result, using the JDBC driver provided by esProc. Save the above esProc code as the file test.dfx; the Java code that calls it is as follows:

       // create esProc JDBC connection

       Class.forName("com.esproc.jdbc.InternalDriver");

       con= DriverManager.getConnection("jdbc:esproc:local://");

       //call the esProc program (the stored procedure); test is the file name of the dfx

       st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");

       //set parameters

       st.setObject(1,"BIRTHDAY>=date(1981,1,1) && GENDER==\"F\" || NAME+SURNAME==\"RebeccaMoore\"");//the parameter is the dynamic filtering condition

       // execute esProc stored procedure

       st.execute();

       //get the result set: a set of eligible employees

       ResultSet set = st.getResultSet();


When the script is relatively simple, we may write the esProc code directly into the Java code that calls esProc JDBC. This saves us from having to write the esProc script file (test.dfx):

st=(com.esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet set=st.executeQuery("=file(\"D:\\\\esProc\\\\employee.txt\").import@t().select(BIRTHDAY>=date(1981,1,1)&&GENDER==\"F\" || NAME+SURNAME==\"RebeccaMoore\")");


This piece of Java code directly executes a line of esProc script: get data from the text file, filter it according to the specified condition, and return the result to set, the ResultSet object.

As the operator of an online system, the Web Company regards the time users spend with its online application as a key analysis scenario. Specifically, online time refers to the cumulative time a user spends with the online business application over a certain period.


As the company's online application has evolved, the total number of users has grown and user behavior analysis has become more complex. Here we use the example of computing users' online time to show various computing scenarios, ranging from simple to complex, hoping it can serve as a reference for similar development projects. In fact, the following approach also applies to other categories of user behavior analysis, such as user activity levels, user churn, etc.


Let's start from the time when the application had just gone online. The Operation Department needed to know users' weekly online time with the application. For this, the engineers from the IT Department provided the following solution.


In the Web Company, user behavior information is recorded in log files, with a separate log file generated every day. For example, the following log file, "2014-01-07.log", contains users' online actions on January 7, 2014. To compute users' online time for the week of 2014-01-05 to 2014-01-11, we need to retrieve data from 7 log files:

logtime userid action

2014-01-07 09:27:56 258872799 login

2014-01-07 09:27:57 264484116 login

2014-01-07 09:27:58 264484279 login

2014-01-07 09:27:58 264548231 login

2014-01-07 09:27:58 248900695 login

2014-01-07 09:28:00 263867071 login

2014-01-07 09:28:01 264548400 login

2014-01-07 09:28:02 264549535 login

2014-01-07 09:28:02 264483234 login

2014-01-07 09:28:03 264484643 login

2014-01-07 09:28:05 308343890 login

2014-01-07 09:28:08 1210636885 post

2014-01-07 09:28:09 263786154 login

2014-01-07 09:28:12 263340514 get

2014-01-07 09:28:13 312717032 login

2014-01-07 09:28:16 263210957 login

2014-01-07 09:28:19 116285288 login

2014-01-07 09:28:22 311560888 login

2014-01-07 09:28:25 652277973 login

2014-01-07 09:28:34 310100518 login

2014-01-07 09:28:38 1513040773 login

2014-01-07 09:28:41 1326724709 logout

2014-01-07 09:28:45 191382377 login

2014-01-07 09:28:46 241719423 login

2014-01-07 09:28:46 245054760 login

2014-01-07 09:28:46 1231483493 get

2014-01-07 09:28:48 266079580 get

2014-01-07 09:28:51 1081189909 post

2014-01-07 09:28:51 312718109 login

2014-01-07 09:29:00 1060091317 login

2014-01-07 09:29:02 1917203557 login

2014-01-07 09:29:16 271415361 login

2014-01-07 09:29:18 277849970 login


Log files record, in chronological order, users' operations (action), user IDs (userid) and the times when the actions took place (logtime). User operations fall into three types: login, logout and get/post actions.

The Operation Department provided the following requirements for computing users' online time:

1. Login should be considered the starting point of online time, and overnight sessions should be taken into consideration.

2. If the time interval between two operations is less than 3 seconds, this interval should not be added to the online time.

3. If, after login, the time interval between two operations is longer than 600 seconds, the user should be considered logged out.

4. If there is only a login without a logout, the time of the last operation should be treated as the logout time.


As the online application had just been rolled out, the data volume of the log files was relatively small. To compute on the log data for 2014-01-05 to 2014-01-11, we can retrieve all the data into memory in one batch and write the result out to a file. Thus all the code here is written for in-memory computing.


The IT Department leveraged esProc to meet the above requirements.

The actual code is as follows:

 

esproc online time calculation 1.png

The ideas behind the program design are:

1. First, retrieve all log files for the week (2014-01-05 to 2014-01-11) and merge them in chronological order, sorting them by userid and logtime. Add two extra fields, onlinetime and loginflag, for subsequent calculations.

2. Onlinetime holds the interval between two operations by the same user. If the difference between the operation time of the current line and the last one is less than 3 seconds, or if the userid of the current operation does not equal that of the last one, onlinetime is set directly to 0.

3. Loginflag indicates a valid onlinetime. If onlinetime does not exceed 10 minutes (600 seconds), or the operation type is logout, loginflag is set to true; otherwise it is set to false. For a login operation, loginflag is set directly to true.

4. On the sorted table resulting from the previous steps, compute loginflag again. If loginflag was originally false, leave it false. If it was originally true, the type of the last operation determines the new value: if the last operation was login, loginflag stays true; otherwise it is set to false.

5. On the sorted table resulting from the previous steps, group the data by userid and sum the onlinetime of all records whose loginflag is true. This gives the total online time for each user.

6. Output the result of the last step to a file, onlinetime.data.
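The interval rules at the heart of the steps above can be sketched in plain Java. This is a deliberately simplified in-memory version: events are assumed already sorted by userid and logtime, and an interval between two consecutive events of the same user is counted only when it is at least 3 seconds (rule 2) and at most 600 seconds (rule 3). The class and field names are illustrative, not taken from the esProc script:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;

public class OnlineTime {
    static final DateTimeFormatter F = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Each event is {logtime, userid, action}; returns total online seconds per user.
    static Map<String, Long> compute(List<String[]> events) {
        Map<String, Long> seconds = new LinkedHashMap<>();
        for (int i = 1; i < events.size(); i++) {
            String[] prev = events.get(i - 1), cur = events.get(i);
            if (!prev[1].equals(cur[1])) continue;          // different user: interval set to 0
            long gap = Duration.between(
                    LocalDateTime.parse(prev[0], F),
                    LocalDateTime.parse(cur[0], F)).getSeconds();
            if (gap >= 3 && gap <= 600)                     // rules 2 and 3
                seconds.merge(cur[1], gap, Long::sum);
        }
        return seconds;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[]{"2014-01-07 09:00:00", "1001", "login"},
            new String[]{"2014-01-07 09:00:02", "1001", "get"},    // 2 s gap: ignored
            new String[]{"2014-01-07 09:05:02", "1001", "post"},   // 300 s gap: counted
            new String[]{"2014-01-07 09:30:02", "1001", "get"},    // 1500 s gap: treated as logged out
            new String[]{"2014-01-07 09:30:10", "1001", "logout"}  // 8 s gap: counted
        );
        System.out.println(compute(events)); // {1001=308}
    }
}
```

The esProc version additionally tracks loginflag to handle the login/logout bookkeeping of rules 1 and 4; this sketch only shows the core interval accounting.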

 

The advantage of the above code lies in its step-by-step computation, which is easy to maintain and modify.

After it had run for a while, new problems emerged. On the one hand, the Operation Department said that the original way of computing online time should be adjusted, with new conditions added. On the other hand, with the increase in users, the log files have grown too big to fit into memory in one batch. How should the IT Department cope with these changes in requirements? See "Cases to carry out the statistics on online time of web applications (II)".

esProc can process json data. Here we introduce some of its applications through examples: 1. Analyzing and generating json data with esProc; 2. Interchanging data between esProc and an application program through json; 3. Reading json file data in esProc.

A  Analyzing and generating json data with esProc

Generally speaking, json is a format used by webpage js programs and Java server-side programs (such as servlets) to interchange data, while data access between Java server-side programs and databases uses the SQL result set format. esProc can act as an intermediary in data computation and data interchange between the two formats.

In this example, we use esProc to query the detailed information of a group of designated employees. Both data input and output adopt the json format. Table employee of the database demo contains all employee information:

json data 1.jpg

esProc receives an EID list in json format and returns the corresponding detailed employee information in json format. The code is as follows:

1. esProc program test.dfx receives a parameter: jsonEID.

json data 2.jpg

2. esProc completes json parsing and data processing, and generates results in json format:

json data 3.jpg

json data 5.jpg

 

A1: Connect to database demo.

A2: Retrieve data from table employee.

A3: Use the import@j function to parse the input jsonEID parameter (an EID list in json format) and generate a table sequence containing only one field, EID.

A4: Use the align function to get from the employee data the employee information designated by A3.

A5: Convert the employee information into json strings.

A6: Return the employee information in json format.

B  Interchanging json data between esProc and Java application

In the above example, the esProc program is saved as the file test.dfx to be called by the Java application. The steps for calling it are as follows:

1. Deploy esProc in Java application.

See the esProc Tutorial for details.

2. Call test.dfx in the Java application.

A code example is as follows:

public void testDataServer(){

                   Connection con = null;

                   com.esproc.jdbc.InternalCStatement st;

                   try{

                            // The user id list in json format can be transmitted from the browser side to the program and converted into a string for use. The process of receiving the json data is omitted here and the value is assigned directly

                            String jsonEid="[{EID:8},{EID:32},{EID:44}]";

                            // Create a connection

                            Class.forName("com.esproc.jdbc.InternalDriver");

                            con= DriverManager.getConnection("jdbc:esproc:local://");

                            // Call stored procedure. test is the file name of dfx

                            st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");

                            // Set parameters

                            st.setObject(1,jsonEid);

                            // Execute stored procedure

                            st.execute();

                            // Get result set

                            ResultSet set = st.getResultSet();

                            String jsonEmployee=null;

                            if (set.next()) jsonEmployee=set.getString(1);

                            // After getting detailed user information in json format, convert it into json objects and return them to the browser side. How to use jsonEmployee is omitted here

                   }

                   catch(Exception e){

                            System.out.println(e);

                   }

                   finally{

                            // Close the connection

                            if (con!=null) {

                                     try {

                                               con.close();

                                     }

                                     catch(Exception e) {

                                               System.out.println(e);

                                     }

                            }

                   }

}

C  Reading and processing json file data in esProc

The JSON file test.json contains information including class, serial number, name, subject, score, etc. The format is as follows:

[

    {

        "class": "Class one",

        "id": 1,

        "name": "Emily",

        "subject": "English",

        "score": 84

    },

    {

        "class": "Class one",

        "id": 1,

        "name": "Emily",

        "subject": "Math",

        "score": 77

    },

 

    ......

 

    {

        "class": "Class one",

        "id": 7,

        "name": "Nicholas",

        "subject": "PE",

        "score": 60

    }

]

 

It is convenient for esProc to read and compute on JSON data, after which the result is submitted to the Java application as a JDBC result set. The steps are as follows:

1. Develop the esProc script

Use the esProc editor to develop a script (fromJSON.dfx) that reads the json file, parses it and completes the computation:

 

json data 5.jpg

A1: Use read() to read the json file as a string;

A2: Use the import@j() function to parse the json string into a table sequence;

A3: Group the records by student id, and summarize the total scores in A4;

A5: Sort by total score in descending order, and return the result set through result in A7.
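The computation performed after parsing (group by student id, total the scores, sort descending) can be sketched in plain Java. The JSON parsing step is assumed already done; the records below are hard-coded samples in the same shape as test.json, and all names here are illustrative:

```java
import java.util.*;
import java.util.stream.Collectors;

public class ScoreTotals {
    record Score(String clazz, int id, String name, String subject, int score) {}

    // Group by id, sum scores, and return (id, total) pairs sorted by total descending.
    static List<Map.Entry<Integer, Integer>> totals(List<Score> rows) {
        Map<Integer, Integer> byId = new HashMap<>();
        for (Score s : rows) byId.merge(s.id(), s.score(), Integer::sum);
        return byId.entrySet().stream()
                .sorted(Map.Entry.<Integer, Integer>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Score> rows = List.of(
            new Score("Class one", 1, "Emily", "English", 84),
            new Score("Class one", 1, "Emily", "Math", 77),
            new Score("Class one", 7, "Nicholas", "PE", 60),
            new Score("Class one", 7, "Nicholas", "Math", 92));
        System.out.println(totals(rows)); // [1=161, 7=152]
    }
}
```

This mirrors the A3-A5 steps: the merge call is the grouping and summing, and the sorted stream is the descending order by total.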

 

2. The Java application calls fromJSON.dfx to present the result.

The steps are omitted here as they are almost the same as those in the previous example.