Typical temporal profiles of Web users based on large web logs

Schrádi Tamás
BME AAIT

In this paper we discuss the technological problems of creating temporal Web user profiles from web log files. We investigate two possible processing algorithms with respect to execution time, memory usage and other favourable properties such as fault tolerance, and we present the typical user profiles obtained by clustering the individual profiles.

During its everyday operation a Web server stores elementary data about each user request at the lowest abstraction level. Thus, before any further analysis, we have to pre-process, transform, filter and prepare the data.

The large size of web log datasets requires algorithms that are optimised for execution time, while the long execution time in turn requires fault tolerance in order to process the dataset efficiently. The relative notion of a large dataset is defined in this paper as follows: any dataset that does not fit in main memory all at once is regarded as large.

We present a reference algorithm that uses only main memory, and we investigate its theoretical and measured execution time in order to show that this approach is inefficient for datasets several times larger than main memory. We then present two possible extensions of the processing algorithm. Both use secondary storage during processing, which makes it possible to process large datasets even on a computer with an average amount of memory. Besides the theoretical analysis of the execution time of these approaches, we also discuss how their execution time can be optimised under the constraints imposed by the limited-memory environment.

The Periodic Partial Result Merging is a fault-tolerant processing algorithm that propagates the processed log records to secondary storage using a merge step. The k-way merge based technique first processes each log file separately and then generates the temporal profiles from the resulting partial results using a k-way merge.

Because of the large number of temporal profiles, they are not suitable for direct human analysis. However, a data mining task can be performed on these individual profiles in order to detect groups of Web users. Using clustering we can determine the groups in our log-based dataset, and from these clusters we can generate the typical temporal profiles. For providers, the resulting typical profiles can be valuable.
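The k-way merge step mentioned above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper. It assumes that a temporal profile is a 24-bin histogram of requests per hour of day, and that each log record has already been reduced to a (user, hour) pair; the names `partial_profiles` and `kway_merge` are hypothetical. For brevity the partial results are kept in in-memory lists, whereas in the large-dataset setting they would be streamed from secondary storage.

```python
import heapq
from itertools import groupby
from operator import itemgetter

HOURS = 24  # assumption: a profile has one bin per hour of day


def partial_profiles(records):
    """Build the partial result for one log file.

    `records` is an iterable of (user_id, hour) pairs (assumed format).
    Returns a list of (user_id, histogram) pairs sorted by user_id.
    """
    profiles = {}
    for user, hour in records:
        profiles.setdefault(user, [0] * HOURS)[hour] += 1
    return sorted(profiles.items())


def kway_merge(partials):
    """Merge k sorted partial results into final profiles.

    heapq.merge keeps only one pending item per input in memory, so the
    inputs could equally be lazy iterators reading from disk.
    """
    merged = heapq.merge(*partials, key=itemgetter(0))
    for user, group in groupby(merged, key=itemgetter(0)):
        total = [0] * HOURS
        for _, hist in group:
            for h, count in enumerate(hist):
                total[h] += count
        yield user, total


# Two tiny hypothetical "log files", already reduced to (user, hour) pairs.
log_a = [("u1", 9), ("u2", 22), ("u1", 9)]
log_b = [("u2", 23), ("u1", 10)]

profiles = dict(kway_merge([partial_profiles(log_a), partial_profiles(log_b)]))
```

Because every partial result is sorted by user identifier, all histograms belonging to the same user arrive consecutively in the merged stream, so each final profile can be written out as soon as the next user appears.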