Web Document Clustering Data Sets

DMOZ data sets for clustering and categorization

DMOZCC.xml is an XML file containing all the data sets. Each element consists of a data set identifier (DatId), the document URL (DocUrl), the document title (DocTitulo), the document text (DocTexto), and the topic name (TopNombre). A sample document from the first data set:

<DMOZCC>
    <DatId>1</DatId>
    <DocUrl>http://euroflax.com/</DocUrl>
    <DocTitulo>Euroflax Industries, Ltd</DocTitulo>
    <DocTexto>India. International merchants in raw cotton, natural and man-made fibers, yarns and textiles. Part of the KKM Group.</DocTexto>
    <TopNombre>Top/Business/Textiles_and_Nonwovens/Fibers/Wholesale_and_Distribution</TopNombre>
</DMOZCC>
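
Records in this format can be read with a standard XML parser. The following Python sketch is only an illustration: it assumes that the file wraps its DMOZCC elements in a single root element (the name DataSets used here is hypothetical, since the page does not show the root) and that each record carries the five fields listed above. The helper names read_records and group_by_data_set are likewise invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Field names taken from the sample record above.
FIELDS = ("DatId", "DocUrl", "DocTitulo", "DocTexto", "TopNombre")


def read_records(xml_text):
    """Return one dict per <DMOZCC> element found in xml_text."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter("DMOZCC"):
        # A missing field is stored as an empty string.
        records.append(
            {name: elem.findtext(name, default="").strip() for name in FIELDS}
        )
    return records


def group_by_data_set(records):
    """Group records by their DatId (kept as a string)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["DatId"]].append(rec)
    return dict(groups)


# Assumption: a wrapper element around the records; the real root name may differ.
SAMPLE = """<DataSets>
  <DMOZCC>
    <DatId>1</DatId>
    <DocUrl>http://euroflax.com/</DocUrl>
    <DocTitulo>Euroflax Industries, Ltd</DocTitulo>
    <DocTexto>India. International merchants in raw cotton, natural and man-made fibers, yarns and textiles. Part of the KKM Group.</DocTexto>
    <TopNombre>Top/Business/Textiles_and_Nonwovens/Fibers/Wholesale_and_Distribution</TopNombre>
  </DMOZCC>
</DataSets>"""

records = read_records(SAMPLE)
by_data_set = group_by_data_set(records)
print(len(by_data_set["1"]), by_data_set["1"][0]["TopNombre"])
```

For the real file, the same function could be fed the file contents, or ET.parse could be used on the path instead of ET.fromstring.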

Each DMOZ data set in Weka format

Each data set was preprocessed with Lucene.Net (special characters removed, text converted to lower case, stop words removed, and the Porter stemming algorithm applied) and then saved as a Weka file. All preprocessed data sets can be downloaded in DMOZCCPP.rar.
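
The preprocessing steps can be illustrated with a short Python sketch. This is not the Lucene.Net pipeline used to build the files: the stop-word list below is a tiny made-up sample, the function name preprocess is hypothetical, and stemming is left as a pluggable function that defaults to doing nothing, because a Porter stemmer is not part of the Python standard library.

```python
import re

# Assumption: a tiny illustrative stop-word list; the list used by Lucene.Net is larger.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def preprocess(text, stem=lambda token: token):
    """Lowercase, strip non-letters, drop stop words, then stem each token."""
    lowered = text.lower()
    # Replace every run of characters other than ASCII letters with a space.
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    tokens = [t for t in letters_only.split() if t not in STOP_WORDS]
    # The default stem function returns tokens unchanged (no real stemming).
    return [stem(t) for t in tokens]


tokens = preprocess(
    "India. International merchants in raw cotton, natural and man-made "
    "fibers, yarns and textiles."
)
print(tokens)
```

If NLTK is installed, passing stem=nltk.stem.PorterStemmer().stem would add Porter stemming, although its output may not match Lucene.Net's stemmer token for token.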

Statistics

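Summary statistics for the 50 data sets (DS): number of documents (#Doc), topics (#Top), and attributes (#Att), together with classification results for the Weka classifiers J48, NB (Naive Bayes), and SMO: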

                             Classification            Classification
                             (% Instances Correctly    (Weighted Precision)
                             Classified)
DS    #Doc    #Top   #Att    J48     NB      SMO       J48     NB      SMO
1     121     4      559     95.868  93.388  96.694    96.633  93.418  96.865
2     133     7      583     69.173  82.707  89.474    73.624  84.654  90.183
3     129     5      658     75.194  88.372  93.023    82.216  89.288  94.358
4     130     9      689     67.692  73.846  83.846    74.139  74.447  87.601
5     108     6      578     75.926  84.259  81.481    78.218  84.485  84.453
6     131     7      694     84.733  94.656  93.893    88.485  94.775  94.252
7     144     6      675     76.389  94.444  93.056    79.574  94.716  93.984
8     161     7      822     75.155  86.957  91.304    81.542  87.54   92.501
9     135     5      614     82.222  91.852  91.111    83.981  92.059  92.131
10    110     6      650     78.182  87.273  89.091    81.391  86.466  92.629
11    139     7      739     83.453  93.525  93.525    85.509  94.044  94.427
12    131     6      731     81.679  90.076  89.313    82.794  90.269  90.037
13    141     6      732     57.143  69.286  66.429    59.157  69.912  67.174
14    111     5      540     81.081  97.297  93.694    87.322  97.387  95.118
15    112     5      629     83.929  86.607  94.643    89.123  87.013  94.943
16    140     4      624     82.857  92.143  93.571    86.807  92.486  93.786
17    116     5      609     78.448  93.966  92.241    79.049  94.254  92.92
18    136     4      796     86.765  90.441  94.118    89.077  90.693  94.443
19    116     7      623     76.724  89.655  93.966    83.045  89.991  94.31
20    116     6      614     75.862  78.448  84.483    76.879  80.124  88.61
21    118     8      575     63.559  71.186  78.814    68.224  72.925  83.678
22    104     5      495     87.5    89.423  93.269    89.151  89.521  93.477
23    128     7      579     82.031  87.5    90.625    87.689  88.774  92.07
24    128     6      684     71.875  82.031  85.156    75.179  82.188  86.416
25    147     7      808     77.551  85.714  87.075    84.191  85.834  88.835
26    119     4      498     71.429  84.034  88.235    72.095  84.507  89.428
27    121     4      497     75.207  91.736  88.43     80.046  91.962  90.062
28    125     8      507     79.2    88      89.6      83.492  88.97   90.757
29    151     8      763     70.861  84.768  89.404    75.598  86.043  90.694
30    133     6      703     77.444  80.451  85.714    78.744  81.538  87.352
31    164     6      616     82.927  93.293  95.732    84.936  93.181  96.048
32    121     6      609     80.165  92.562  90.909    84.449  92.89   91.982
33    134     6      681     73.881  88.06   88.06     75.183  88.689  87.574
34    141     7      703     79.433  85.816  89.362    82.934  87.222  91.357
35    135     5      636     84.444  96.296  97.778    86.107  96.293  97.9
36    122     4      679     85.246  86.885  95.902    87.645  86.705  96.006
37    118     7      641     70.339  72.034  79.661    78.867  70.982  85.089
38    129     7      601     72.868  82.171  86.822    78.072  82.035  88.219
39    136     5      598     80.882  90.441  94.853    84.782  91.361  95.18
40    153     7      761     76.471  85.621  89.542    82.487  86.072  89.747
41    112     3      585     80.357  91.964  91.071    83.152  93.116  92.695
42    140     8      655     79.286  87.143  87.143    85.913  87.068  87.875
43    119     5      564     83.193  90.756  94.118    85.029  90.558  94.624
44    131     4      593     86.26   94.656  93.13     89.405  94.79   94.445
45    108     5      674     75.926  80.556  80.556    80.313  81.866  80.426
46    129     6      679     84.496  87.597  89.147    87.543  87.499  91.334
47    125     7      606     73.6    83.2    86.4      76.133  83.594  87.263
48    137     8      767     81.022  86.861  90.511    86.617  87.232  91.094
49    138     5      648     78.986  89.13   88.406    86.775  90.085  89.577
50    132     10     632     68.182  78.788  76.515    79.373  81.774  88.509
Avg:  129.16  6.02   643.92  78.06   86.96   89.22     81.97   87.47   90.69
Min:  104.00  3.00   495.00  57.14   69.29   66.43     59.16   69.91   67.17
Max:  164.00  10.00  822.00  95.87   97.30   97.78     96.63   97.39   97.90
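
The page does not define the two measures reported above. Assuming they follow the usual definitions (the share of correctly classified instances, and per-class precision averaged with weights proportional to each class's number of instances, which is how Weka reports its weighted average), they could be computed from a confusion matrix as in the Python sketch below. The function names are hypothetical and the matrix is invented for illustration; it is not taken from any of the data sets.

```python
def accuracy_percent(confusion):
    """Percentage of instances on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * correct / total


def weighted_precision_percent(confusion):
    """Per-class precision, weighted by each class's share of all instances."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    result = 0.0
    for c in range(n):
        predicted_c = sum(confusion[r][c] for r in range(n))  # column sum
        actual_c = sum(confusion[c])                          # row sum
        # Assumption: a class that is never predicted contributes precision 0.
        precision = confusion[c][c] / predicted_c if predicted_c else 0.0
        result += (actual_c / total) * precision
    return 100.0 * result


# Made-up two-topic example: rows are actual topics, columns are predicted topics.
toy = [[8, 2],
       [1, 9]]

print(round(accuracy_percent(toy), 3))
print(round(weighted_precision_percent(toy), 3))
```

Because the weights depend on topic sizes, the two measures can differ noticeably on data sets with unbalanced topics.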

Cite

Cobos, Carlos. Web Document Clustering Data Sets. University of Cauca, June 2011.