Cases for the statistics on users’ online time at web applications(II)
Posted by raqsoft Aug 24, 2014In last part we mentioned that the Operation Department of the Web Company brought about a new demand: adding new conditions to the way the online time are computed. As IT department was using esProc as the tool for computation, it’s easy to handle such changes in requirements. On the other hand, the increasing amount of data could be accommodated by out-of-memory computation with esProc’s file cursor functionality.
First, let’s review the way the user behavior information is recorded in the Web Company. Data was recorded in the log file. Everyday a separate log file is generated. For example, the following log file, “2014-01-07.log”, contains the users online actions on January 7, 2014. To compute the online time for user in the week of 2014-01-05 to 2014-01-11, we need to retrieve data from 7 log files:
logtime userid action
2014-01-07 09:27:56 258872799 login
2014-01-07 09:27:57 264484116 login
2014-01-07 09:27:58 264484279 login
2014-01-07 09:27:58 264548231 login
2014-01-07 09:27:58 248900695 login
2014-01-07 09:28:00 263867071 login
2014-01-07 09:28:01 264548400 login
2014-01-07 09:28:02 264549535 login
2014-01-07 09:28:02 264483234 login
2014-01-07 09:28:03 264484643 login
2014-01-07 09:28:05 308343890 login
2014-01-07 09:28:08 1210636885 post
2014-01-07 09:28:09 263786154 login
2014-01-07 09:28:12 263340514 get
2014-01-07 09:28:13 312717032 login
2014-01-07 09:28:16 263210957 login
2014-01-07 09:28:19 116285288 login
2014-01-07 09:28:22 311560888 login
2014-01-07 09:28:25 652277973 login
2014-01-07 09:28:34 310100518 login
2014-01-07 09:28:38 1513040773 login
2014-01-07 09:28:41 1326724709 logout
2014-01-07 09:28:45 191382377 login
2014-01-07 09:28:46 241719423 login
2014-01-07 09:28:46 245054760 login
2014-01-07 09:28:46 1231483493 get
2014-01-07 09:28:48 266079580 get
2014-01-07 09:28:51 1081189909 post
2014-01-07 09:28:51 312718109 login
2014-01-07 09:29:00 1060091317 login
2014-01-07 09:29:02 1917203557 login
2014-01-07 09:29:16 271415361 login
2014-01-07 09:29:18 277849970 login
Log files record, in chronological order, users’ operation (action), user ID (userid) and the time when the actions took place (logtime) in the application. Users operations include three different types, which are login, logout and get/post actions.
Previously, the Operation Department provided the following requirements for computation of users online time:
- Login should be considered as the starting point of online time, and overnight should be take into consideration.
- If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.
- If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.
4. If there is only login, without logout, then the last operation time should be treated as time for logout.
Over time, the operations department found that there are some "key point" in users behavior: between login and logout, user who conducted post actions are more loyal to the online application. Therefore, the Web Company plans to introduce an incentive: Based on the original rules, if a user conducted a post operation, his/her online time will be tripled in computation.
After receiving the task, the IT engineer considered the possibility for future adjustment in the way of computation, plus the need for added conditions. The decision is to use out-memory cursor and for loop to realize the computation.
After analysis, it’s found that most user behavior analysis are done for each user independently. Thus, if the log file are pre-sorted according to userid, the performance for various analysis computation will be raised, with reduced difficulty and shortened process time. The pre-processing programming are as following:
As we could see, pre-processing means that we sort and output the seven days log files to a binary file. This way we can eliminate the need for subsequent consolidation and sort.
Meanwhile, the binary files provided by esProc can also help to raise the data/write performance for data.
After pre-processing, the codes for online time computation could be written as following:
Note that:
- The volume of data for one user in seven days is not big. Thus in cell A5 we can retrieve all log data for a user into memory in one batch.
- In the one-loop-for-each-user cycle, the codes in red box implemented the computation of the new business logic: for every post operation conducted, the users’ current time online time will be tripled in computation. The removal of unqualified record is done in cell B9, and in B10 we calculate a serial number for every login (lognum). Records are grouped in B10 according to lognum, to compute the sum of onlinetime for each group. If there is at least one "post" action in the current group of operations, then the sum of onlinetime for current group will be tripled.
- Considering the relatively large data resulted, when the computation is done for 10,000 users, and the result also reach 10,000 lines, we’ll do a batch output of the data from memory to a result file. This improves the performance while avoiding the memory overflow at the same time.
After meeting this demand, the IT engineers in Web Company found that the single-threaded program does not take full advantage of the of the server’s computing power. Here comes another question: can these engineers leverage esProc’s multi-threaded parallel computing capabilities to take full advantages of the server's quad dual core CPUs? Is it troublesome to shift from single-threaded to multiple-threaded? See “Computing the Online Time for users with esProc (III)”.