
This work is licensed under a Creative Commons Attribution-Share Alike 2.0 France License.
Well ... it seems that planet ocaml is faster then light to index new pages... I often start editing one story to publish it few days later to avoid stupid mistakes. This time I published the page by accident and I think it stayed published for less then two minutes... This is the final versions ...
Today i did a small audit on my code to check which are the functions that are often used and can slow down my code. One in particular took my attention, ExtLib.List.unique . This function (below) takes quadratic time on the length of the input list. On big lists, this function is a killer. The algorithm is very simple. Since it accepts a cmp optional argument, it can be much faster if we use a monomorphic comparing function.
This is a small benchmark (benchmarking is addictive indeed !!) using, the polymorphic and monomorphic variant of List.unique :
However if you want to go even faster, you can use a stupid implementation based on hash tables.
The results are quite clear...
In this test with a long list and many repetition, the difference is remarkable. the function hash_unique is not stable, but if you don't care, it does the job pretty well. If you want an even faster implementation based on list that is also not stable, you can write a small function that remove duplicates on a sorted list.
Comments
BatList.sort_unique
is the unstable version in Batteries.
15 % improvement
Here is a slightly improved version (hash_unique2) that goes through the list only once:
let h = Hashtbl.create (List.length l) in
let add n =
if not(Hashtbl.mem h n) then
Hashtbl.add h n ()
in
List.iter add l;
Hashtbl.fold (fun k _ acc -> k::acc) h []
let hash_unique2 l =
let h = Hashtbl.create (List.length l) in
let rec add acc =
function
| hd :: tl when not (Hashtbl.mem h hd) ->
Hashtbl.add h hd ();
add (hd :: acc) tl
| _ :: tl ->
add acc tl
| [] ->
acc
in
add [] l
module SetString = Set.Make(String)
let set_unique l =
let rec add st acc =
function
| hd :: tl when not (SetString.mem hd st) ->
add (SetString.add hd st) (hd :: acc) tl
| _ :: tl ->
add st acc tl
| [] ->
acc
in
add SetString.empty [] l
let run () =
let rec gen_strings acc = function
|0 -> acc
|n -> gen_strings ((string_of_int(Random.int 100))::acc) (n-1)
in
let a = gen_strings [] 100000
in
let _ =
Benchmark.latencyN (Int64.of_int 100) [
("hash_unique",hash_unique,a);
("hash_unique2",hash_unique2,a);
("set_unique", set_unique,a);
]
in
let lst1 = hash_unique a in
let lst2 = hash_unique2 a in
Printf.eprintf "l1: %d; l2: %d;\n%!" (List.length lst1) (List.length lst2);
assert(List.length lst1 = List.length lst2)
;;
run ();;
And the result:
./test
Latencies for 100 iterations of "hash_unique", "hash_unique2", "set_unique":
hash_unique: 1.05 WALL ( 1.04 usr + 0.01 sys = 1.05 CPU) @ 95.41/s (n=100)
hash_unique2: 0.93 WALL ( 0.92 usr + 0.00 sys = 0.92 CPU) @ 108.22/s (n=100)
(warning: too few iterations for a reliable count)
set_unique: 2.72 WALL ( 2.72 usr + 0.00 sys = 2.72 CPU) @ 36.71/s (n=100)
l1: 100; l2: 100;
even faster ...
If you specialize the hashtbl...
open ExtLib ;;
[...]
let hash_unique2_mono l =
let module StringHash = OCAMLHashtbl.Make(
struct
type t = string
let equal (x : string) (y : string) = x = y
let hash t = Hashtbl.hash t
end)
in
let h = StringHash.create (List.length l) in
let rec add acc =
function
| hd :: tl when not (StringHash.mem h hd) ->
StringHash.add h hd ();
add (hd :: acc) tl
| _ :: tl ->
add acc tl
| [] ->
acc
in
add [] l