Skip navigation
2014

esProc can process big text files conveniently by providing cursor data object. The following example is to illustrate this.

Cursor.jpg

Lets assume that there is a text file, sales.txt, with ten million sales records. Its main fields include SellerID, OrderDate and Amount. Now compute each salesmans total Amount of big orders in the past four years. The big orders refer to those whose amount is above 2000.

 

esProc code:

cursor 1.jpg

Code interpretation:

A1: If all the ten million records are read into memory simultaneously, memory will overflow. So the job will be done in batches.

A2: Read by looping, 100,000 rows at a time.

B3: Filter each batch of data, select those records whose amount is above 2000 after the year of 2011.

B4: Group and summarize the filtered data, seek the sales of each salesperson in this batch of data.

B5: Add the computed result of this batch of data to a certain variable (B1), and move on to the computation of next batch of data.

B6: After all the computations, sales of each salesperson in every batch of data will be found in B1. Last, group and summarize these sales data and seek each salespersons total sales amount.


Analysis:

In cell A1, esProc cursor is created with function cursor. The cell name is the cursors variable name. When the cursor is created, data will not be read in the memory directly. Read-in will only be executed while fetch operation or other equal operations are going on, e.g., the code for A1,100000 in cell A2 represents reading data from cursor by looping with 100,000 rows at a time. We can see that the data size in memory is always kept in a relatively small level and no overflows will occur.

select and groups are computation functions specially used with structured data. After the data is read in the memory with esProc cursor, they can be processed and analyzed by employing functions of professional structured data computation library. This is more convenient than writing underlying code by hand. 

Equipped with functions and grammar of semi-structured data processing, e.g., function for data split and merging, looping and traversal statement and branch judgment statement, esProc cursor can do complex task of data cleansing and arrangement and form easily computed structured data.   

 

Splitting and analyzing

For instance, the format of weblog is too complex to be computed and analyzed directly. A typical web blog text need to be transformed into a two-dimensional table of standard format in order to be used in structured data computation or be stored in a database.  

A record in the original weblog:


  1. 10.10.10.145 - - [01/May/2013:03:24:56 -0400] "GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100 "http://www.xxx.com/xxxx.html""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" 0 tankLee in 9fc0792eb272b78916c3872e9ad –


Time

url

product

userID

sSign

sID

   pSign

01/May/2013:03:24:56

/product/p0040001/review.jsp?page=5

    p0040001

tankLee

  In

9fc0792eb272b78916c3872e9ad

The following case omits the process of file accesses and the final merging of multiple batches of data(refer to the previous example), and lists the code for splitting and analyzing directly.

cursor 2.jpg

Data cleansing

Lets see a typical example of data cleansing. Since the employee table read in from a file is not a standard format, it need to be reorganized into standard structural data in batches. Data of the current batch will be stored in cell D3 temporarily. The rule for reorganizing is:

  1. The record is invalid if UserID and firstName is null or blank string.
  2. UserID can only contain digits; the record is invalid if letters appear in it.
  3. For repeated UserID, only the last entered record is kept.
  4. Delete possible spacing before and after the data.
  5. Capitalize all the first letters of firstName.
  6. Full name is combined in the form of firstName+.+lastName. But, if lastName is null or blank string, fullname equals to firstName.

The following table also omits the process of file accesses and the merging of multiple batches of data, and only lists the code for data cleansing:

 

cursor 3.jpg

24 point calculation is a common intellectual game, which can be played with poker without jokers. Draw four cards randomly, use the four numbers to work out 24 points with four arithmetic operations: addition, subtraction, multiplication and division. In playing poker, JQK usually correspond numbers 11, 12 and 13.

With esProc, we could program the game more conveniently. Working out a solution with four random numbers becomes easier:

24 point 1.png

Lets analyzes the piece of code in detail.

Four numbers for computation are given in A1. First, we try all possible permutations of the four random cards. To do this, we list the repeatable cases in A2 with the help of a four-digit base-4 number. In B2, we only select those cases which are not repeated. See figure below:

24 piont 2.png

Any three symbols of four arithmetic operations need to be inserted in computation. Each symbol is selected arbitrarily among plus, minus, times and division. Execute in A3 the loop of a three-digit base-4 number, and list all possible combinations:

24 piont any three.pngl

Different computation orders have different computed results. The computing order can be changed by adding brackets. Three operational symbols show that the computation can be divided into three steps, and brackets adding decides the execution order of the three steps. Combinations in B3, selected from the results in A3,contain all three elements 1,2,3, and represent all possible execution orders. See figure below:

24 piont different.png

Since there could be repeated numbers among the four randomly selected cards, all permutations of cards will be listed in A4 and remove those repeated ones in B4 in order to avoid redundant loop. Lets look at [8,3,8,3] in A1, its eligible combinations are as follow:

24 piont for above.png

For each permutation, the programming in line 5 and line 6 executes loop of every symbol selection and every computation order. Then computes by calling subprogram in A8 and with the copied parameters to avoid interfering the subsequent computation.

When the subprogram in A8 is called, numerical sequence, symbol sequence, and sequence of computation order should be filled respectively into A8, B8 and C8; at the same time, expression for computation should be prepared in D8. B9 executes the loop of sequence of computation order until the result is gradually obtained. C9 gets the result of a single step by calling subprogram in A13. The subprogram, which is quite simple, achieves computed results of four arithmetic operations based on two given numbers and one of the symbols. D9 aims at the expression used for single step computing, after which the original two numbers will become one, and expression sequence, symbol sequence and numerical sequence will be modified in E9, C10 and D10. Having done the single step computing, the original total computing steps will be reduced by one, so another modification of sequence of computation order will be required in E10. In line 11, unless it is the last step, brackets will be added to newly-created expression to ensure an appropriate computing order.

When the loop in B9 is over, subprogram in A8 completes its computation of expression for this case. B18 is programmed to decide whether the result is 24. The result will be calculated to three decimal places in consideration of computational error of double-precision number. If the result equals to 24, the current permutation is eligible. The corresponding expression will be put in B1 in C12.

If the loop of all conditions is finished, yet no expression is found in B1 when checking A7, no solution has been obtained.

After all the computation, result can be found in B1. See below:

24 piont after.png

Or cellset parameter will take the place of combination of numerical value entered in A1:

24 point esproc.png

Set cellset parameter before doing computation:

24 point the result above.png

The result displayed in B1 after computation is as follow:

24 point esproc final.png

There are quite a lot of chess problems, among which the Eight Queens Problem is the most famous one. Put eight Queens in a 8×8 chessboard, the requirement is no attack between any of them. How many layouts are there to achieve this?

 

As a Queen’s striking range is limited to a row, a column and an oblique line, only one Queen is allowed to appear in a single line, otherwise it is ineligible.

eight queens 1.png

Since there is only one Queen in any single line, a sequence, 8 in length, could be employed. Set in turn the column number a Queen in each row is located. If there is not a Queen in a certain row, mark with zero; when we place a new Queen in the board, and the column number is not within the sequence, which means there is no Queen in this column.

 

At the same time, we also need to make sure the new Queen has no counterpart in the diagonal. If a Queen is located in row m of column k, there are two places at most in row m+n of the same oblique line. Between the two places and the Queen, the horizontal distance is equal to the vertical distance, which means there are at most two places (m+n,k+n) and (m+n,k-n) are in the same oblique line with the Queen.

 

So, we would know the number of eligible conditions after examining status of each Queen in every line. 

esProc can do all the work with loop computation, as shown in the following form:

eight queens esProc.png

i is employed during the computation to record the serial number of the current row where a Queen is placed. The code in the 2nd line of the above form shows that with each loop of the code, the Queen will move to the next column, and in this way, traversal of every place of the current row will be completed. For the code in the 3rd line, when the Queen move to the 9th column, its traversal in all places in the current row has completed; the record of the current row restores to zero and i equals i-

1 and return to continue with traversal in the last row. Here note that when the entire loop in the first line is done, the traversal is completed, i will be reset as zero, and the loop is over. For the code of in the 4th line, when moving the queen in the first row, we could locate the second Queen without making any judgments. The code of in the 6th line judges whether there is any located Queen in a same column; and the code in the 7th line judges whether there is any located Queen in a same oblique line. If the answers of both judgment are no, we could locate the Queen in the next row. When all the eight Queens are located, record their current places with the code in the 9th line.

 

The computed result in A10 is:

eight queens esProc 92.png

Check detailed result in C1:

eight queens esProc chess.png