Skip navigation

tu51.jpg

In last part we mentioned that the Operation Department of the Web Company brought about a new demand: adding new conditions to the way the online time are computed. As IT department was using esProc as the tool for computation, it’s easy to handle such changes in requirements. On the other hand, the increasing amount of data could be accommodated by out-of-memory computation with esProc’s file cursor functionality.

 

First, let’s review the way the user behavior information is recorded in the Web Company. Data was recorded in the log file. Everyday a separate log file is generated. For example, the following log file, “2014-01-07.log”, contains the users online actions on January 7, 2014. To compute the online time for user in the week of 2014-01-05 to 2014-01-11, we need to retrieve data from 7 log files:

logtime                             userid               action

2014-01-07 09:27:56        258872799       login

2014-01-07 09:27:57        264484116       login

2014-01-07 09:27:58        264484279       login

2014-01-07 09:27:58        264548231       login

2014-01-07 09:27:58        248900695       login

2014-01-07 09:28:00        263867071       login

2014-01-07 09:28:01        264548400       login

2014-01-07 09:28:02        264549535       login

2014-01-07 09:28:02        264483234       login

2014-01-07 09:28:03        264484643       login

2014-01-07 09:28:05        308343890       login

2014-01-07 09:28:08        1210636885     post

2014-01-07 09:28:09        263786154       login

2014-01-07 09:28:12        263340514       get

2014-01-07 09:28:13        312717032       login

2014-01-07 09:28:16        263210957       login

2014-01-07 09:28:19        116285288       login

2014-01-07 09:28:22        311560888       login

2014-01-07 09:28:25        652277973       login

2014-01-07 09:28:34        310100518       login

2014-01-07 09:28:38        1513040773     login

2014-01-07 09:28:41        1326724709     logout

2014-01-07 09:28:45        191382377       login

2014-01-07 09:28:46        241719423       login

2014-01-07 09:28:46        245054760       login

2014-01-07 09:28:46        1231483493     get

2014-01-07 09:28:48        266079580       get

2014-01-07 09:28:51        1081189909     post

2014-01-07 09:28:51        312718109       login

2014-01-07 09:29:00        1060091317     login

2014-01-07 09:29:02        1917203557     login

2014-01-07 09:29:16        271415361       login

2014-01-07 09:29:18        277849970       login

 

Log files record, in chronological order, users’ operation (action), user ID (userid) and the time when the actions took place (logtime) in the application. Users operations include three different types, which are login, logout and get/post actions.

Previously, the Operation Department provided the following requirements for computation of users online time:

  1. Login should be considered as the starting point of online time, and overnight should be take into consideration.
  2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.
  3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.

   4.  If there is only login, without logout, then the last operation time should be treated as time for logout.

 

Over time, the operations department found that there are some "key point" in users behavior: between login and logout, user who conducted post actions are more loyal to the online application. Therefore, the Web Company plans to introduce an incentive: Based on the original rules, if a user conducted a post operation, his/her online time will be tripled in computation.

 

After receiving the task, the IT engineer considered the possibility for future adjustment in the way of computation, plus the need for added conditions. The decision is to use out-memory cursor and for loop to realize the computation.

 

After analysis, it’s found that most user behavior analysis are done for each user independently. Thus, if the log file are pre-sorted according to userid, the performance for various analysis computation will be raised, with reduced difficulty and shortened process time. The pre-processing programming are as following:

user online time II 1.jpg

As we could see, pre-processing means that we sort and output the seven days log files to a binary file. This way we can eliminate the need for subsequent consolidation and sort.

Meanwhile, the binary files provided by esProc can also help to raise the data/write performance for data.

After pre-processing, the codes for online time computation could be written as following:

user online time II 2.jpg

Note that:

  1. The volume of data for one user in seven days is not big. Thus in cell A5 we can retrieve all log data for a user into memory in one batch.
  2. In the one-loop-for-each-user cycle, the codes in red box implemented the computation of the new business logic: for every post operation conducted, the users’ current time online time will be tripled in computation. The removal of unqualified record is done in cell B9, and in B10 we calculate a serial number for every login (lognum). Records are grouped in B10 according to lognum, to compute the sum of onlinetime for each group. If there is at least one "post" action in the current group of operations, then the sum of onlinetime for current group will be tripled.
  3. Considering the relatively large data resulted, when the computation is done for 10,000 users, and the result also reach 10,000 lines, we’ll do a batch output of the data from memory to a result file. This improves the performance while avoiding the memory overflow at the same time.

 

After meeting this demand, the IT engineers in Web Company found that the single-threaded program does not take full advantage of the of the server’s computing power. Here comes another question: can these engineers leverage esProc’s multi-threaded parallel computing capabilities to take full advantages of the server's quad dual core CPUs? Is it troublesome to shift from single-threaded to multiple-threaded? See “Computing the Online Time for users with esProc (III)”.