Open-Set Web Genre Identification Using Distributional Features and Nearest Neighbors Distance Ratio.

The top k training documents are the k-nearest neighbors of the new document, and these neighbors are used to predict the categories of the new document. ... Used for feature selection, the odds ratio of feature f and category c_i captures the difference between the distribution of feature f on its positive class c_i and the distribution of feature f on its ...

doi:10.1007/11362197_9 fatcat:4nouetuxezb6ri2thtfcbnqadq
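The odds-ratio criterion mentioned in the snippet above can be sketched as follows. The snippet is truncated, so this assumes the commonly used log odds ratio between the positive class and all remaining documents, with additive smoothing; the function name, the `eps` parameter and the toy data are illustrative and not taken from the chapter.

```python
import math


def log_odds_ratio(docs, labels, feature, category, eps=0.5):
    """Log odds ratio of `feature` for `category`.

    docs   : list of sets of tokens (one set per document)
    labels : list of category labels, aligned with docs
    eps    : smoothing constant (an assumption, avoids division by zero)
    """
    pos = [d for d, y in zip(docs, labels) if y == category]
    neg = [d for d, y in zip(docs, labels) if y != category]
    # Smoothed fraction of documents containing the feature in each group.
    p_pos = (sum(feature in d for d in pos) + eps) / (len(pos) + 2 * eps)
    p_neg = (sum(feature in d for d in neg) + eps) / (len(neg) + 2 * eps)
    # Odds in the positive class divided by odds in the rest.
    return math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))


docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "goal"}]
labels = ["sport", "sport", "politics", "politics"]
print(log_odds_ratio(docs, labels, "goal", "sport"))  # positive: typical of "sport"
print(log_odds_ratio(docs, labels, "vote", "sport"))  # negative: typical of the rest
```

Features with the highest scores for a category would be kept; how many to keep is a separate tuning choice.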

After introducing its building process and content, we present an exploratory data analysis with a quantitative description of its main features. ... Many datasets are published in English to get more engagement, popularity and reach within a research community. Indeed, most sciences are language-agnostic and thrive on publicly available data. ... The library uses Levenshtein Distance to calculate the differences between two strings. With a partial ratio set at 75%, the fuzzy string matching process generates an incomplete result. ...

doi:10.5753/jidm.2022.2349 fatcat:hbfbv5c46feptgkqq4tufaicb4
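To illustrate the Levenshtein-based matching described in the snippet above, here is a minimal pure-Python sketch. The snippet does not name the library, so `similarity` and `partial_similarity` are hypothetical helpers that only approximate what a fuzzy-matching library's ratio and partial ratio might compute; the 75% threshold is the one value taken from the text.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Similarity in percent, normalised by the longer string."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def partial_similarity(a, b):
    """Best similarity of the shorter string against any equally long
    window of the longer string (a simplified partial ratio)."""
    short, long_ = sorted((a, b), key=len)
    windows = (long_[i:i + len(short)]
               for i in range(len(long_) - len(short) + 1))
    return max(similarity(short, w) for w in windows)


THRESHOLD = 75.0  # the partial-ratio cut-off mentioned in the snippet
title_a, title_b = "Dom Casmurro", "Dom Casmurro (1899)"
print(partial_similarity(title_a, title_b) >= THRESHOLD)
```

A window-based partial score accepts a title that is fully contained in a longer record, which is one plausible reason a fixed threshold may still miss or over-match some pairs.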

from ScienceDirect and Springer websites, we review the different machine learning algorithms used to categorize web pages. ... Web page classification has many applications, among them the construction of web directories and the building of focused crawlers. ... This is necessary to determine which samples are the nearest neighbors. Distance measures such as Euclidean distance are commonly used. ...

doaj:483a4b9f259046a29c57adc3021a50d0 fatcat:hdznsdeotnhwpgpuigi7iovhja
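The nearest-neighbor step with Euclidean distance mentioned in the snippet above can be sketched as below. This is a generic k-NN majority vote under illustrative assumptions (the function name, the default `k=3` and the toy vectors are not from the survey).

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest
    training samples under Euclidean distance."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional page features (purely illustrative values).
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
train_y = ["blog", "blog", "blog", "shop", "shop", "shop"]
print(knn_predict(train_X, train_y, (0.3, 0.2)))
print(knn_predict(train_X, train_y, (4.9, 5.3)))
```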

In a wide group of languages, the stop words, which have only grammatical roles and do not contribute to information content, may be simply exposed by their relatively higher occurrence frequencies. ... The experiments are conducted on corpora of an agglutinative language, Turkish, in which the amount of inflection is high, and a non-agglutinative language, English, in which the inflection is lower for ... k-Nearest Neighbor (k-NN) The k-Nearest Neighbor algorithm is a non-parametric lazy learning algorithm, originally proposed in [23], in which when an instance in the testing set (whose class is unknown) is to ...

doi:10.18038/aubtda.322136 fatcat:g7mkskb4vja7dkjx3y5shxkkde

WebSty does not require local installation by users, can be used via any web browser, offers rich set-up, and runs on a computing cluster. ... The techniques used for feature weighting and text similarity measuring are also concisely overviewed. ... Acknowledgements Works funded by the Polish Ministry of Science and Higher Education within CLARIN-PL Research Infrastructure. ...

doi:10.12921/cmst.2018.0000007 fatcat:pditv66ns5emzj6xbs4fi46ssu

Experiments were conducted on the open-set identification of four coarse office document genres: technical paper, photo, slide, and table. ... These include selecting features that characterize genre-related information in office documents, examining the utility of text-based features and image-based features, and proposing a simple ensemble method ... Conclusions and future directions are presented in Sect. 7. Related work Many sets of genre categories have been proposed for text genre identification and web genre identification. ...

doi:10.1007/s10032-011-0163-7 fatcat:ogby7vevq5h2lf5kwg55emlkb4

The performances of the chosen classifiers, K-Nearest Neighbor, Naïve Bayes and Neural Networks, are compared. ... Mel-frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) coefficients, generates a high-dimensional feature vector, reduces dimensionality using Principal Component Analysis (PCA) and ... K-Nearest Neighbour (kNN) classification [25] The k-nearest neighbor method originated in the 1950s. ...

doi:10.22266/ijies2017.0630.19 fatcat:2ahmaswhojbwxdgivnpxn42uj4

We conclude our paper by discussing some open research problems in the field of OCC and present our vision for future research. ... In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, which is based on the availability of training data, algorithms used and the application ... Instead of 1-NN, the distance to the kth nearest neighbor or the average of the k distances to the first k neighbors can also be used. ...

doi:10.1017/s026988891300043x fatcat:djdcvpij7jhs7gygtrtihq3dia
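The two scoring variants named in the snippet above (distance to the kth nearest neighbor, or the average of the first k distances) can be sketched as a one-class outlier score. This is a minimal sketch assuming Euclidean distance and a training set containing only target-class samples; the function name and the data are illustrative, and any acceptance threshold would have to be chosen separately.

```python
import math


def knn_distance_score(train, x, k=1, average=False):
    """Outlier score of x relative to target-class samples in `train`.

    average=False : distance to the kth nearest neighbor
    average=True  : mean of the distances to the first k neighbors
    Larger scores mean x is farther from the target class.
    """
    nearest = sorted(math.dist(t, x) for t in train)[:k]
    return sum(nearest) / len(nearest) if average else nearest[-1]


target = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(knn_distance_score(target, (0.0, 0.0), k=2))                # kth-neighbor distance
print(knn_distance_score(target, (0.0, 0.0), k=2, average=True))  # mean of k distances
print(knn_distance_score(target, (3.0, 4.0), k=1))                # far-away point
```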

This approach is the predominant paradigm to extract high-level descriptions from music signals, such as their instrument, genre or mood, and can also be used to compute direct timbre similarity between ... We introduce 2 measures of "hubness", the number of n-occurrences and the mean neighbor angle. ... Acknowledgment The authors would like to thank Anthony Beurivé for helping with the implementation of signal processing algorithms and database metadata management. ...

doi:10.1016/j.patcog.2007.04.012 fatcat:d5fkk4u2qrakbcfbjk3q4guani
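One of the two hubness measures named in the snippet above, the number of n-occurrences, can be sketched as follows. The snippet does not define it, so this assumes the usual reading: for each item, count how often it appears among the n nearest neighbors of the other items. Euclidean distance on toy points stands in for the audio similarity measures of the paper; the names are illustrative.

```python
import math


def n_occurrences(points, n):
    """For each point, count how many other points list it among
    their n nearest neighbors (Euclidean distance)."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(points[j], p))
        for j in others[:n]:
            counts[j] += 1
    return counts


# The central point is everyone's nearest neighbor, i.e. a "hub".
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (10, 0)]
print(n_occurrences(points, 1))
```

The counts always sum to n times the number of points, so a few items with very high counts imply many items that are never retrieved.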

Third, the best features have successfully been used to classify traditional and social media content in both types of content facets. ... Several proposed content facets have successfully been implemented in APA Labs, a Web-based framework for faceted search in traditional and social media. ... To determine the distance between a test vector and its nearest neighbors, several distance measures have been proposed. ...

doi:10.5281/zenodo.1195993 fatcat:ce3ljnthjfhkpir3y4atnnlicy

Third, the best features have successfully been used to classify traditional and social media content in both types of content facets. ... Several proposed content facets have successfully been implemented in APA Labs, a Web-based framework for faceted search in traditional and social media. ... To determine the distance between a test vector and its nearest neighbors, several distance measures have been proposed. ...

doi:10.5281/zenodo.1196397 fatcat:udr3736ejbek5lzl34tu4g4ppq

We present a set of small-scale but reasonable experiments in text genre detection, author identification as well as author verification tasks and show that the performance of the proposed method is better ... In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. ... Acknowledgement We would like to thank the anonymous CL reviewers for their valuable and insightful comments. Their suggestions have greatly improved an earlier draft of this paper. ...

doi:10.1162/089120100750105920 fatcat:ksreq6s6w5ewrgawtxvrbl5tje

In the age of feature extractors, we present work on features to describe sounds and music, especially timbre and tonal aspects. ... Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques. In C. Anagnostopoulou et al. (Eds), "Music and Artificial Intelligence". ... Project PHAROS IST-2006-045035. A demo of such a music recommender/visualization system working on the proposed principles, but taking listening statistics instead of explicitly given preference set ...

doi:10.5281/zenodo.2278110 fatcat:uturvyw2gnfzdgtelvtxot3etq

In the age of feature extractors, we present work on features to describe sounds and music, especially timbre and tonal aspects. ... Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques. In C. Anagnostopoulou et al. (Eds), "Music and Artificial Intelligence". ... Project PHAROS IST-2006-045035. A demo of such a music recommender/visualization system working on the proposed principles, but taking listening statistics instead of explicitly given preference set ...

doi:10.5281/zenodo.1882316 fatcat:6yhrlcyexrgyhhwayeau2gu7f4

PICASSO makes use of genuine samples obtained from first-class contemporary movies. ... We have created a large training set consisting of over 40,000 image/soundtrack samples obtained from 28 movies and evaluated the suitability of PICASSO by means of a user study. ... Selecting the K-Nearest Neighbors (KNN) means that in a multi-dimensional feature space, for a given feature vector the K nearest feature vectors are selected (cf., [24] for an overview). ...

doi:10.1145/2009916.2010012 dblp:conf/sigir/StuparM11 fatcat:2ol5plw6p5fy7e7pds6vdwmnm4

Web Page Classification* [chapter]

Cross-collection Dataset of Public Domain Portuguese-language Works

Machine Learning for Web Page Classification: A Survey

STOP WORD DETECTION AS A BINARY CLASSIFICATION PROBLEM

Open Stylometric System WebSty: Integrated Language Processing, Analysis and Visualisation

Genre identification for office document search and browsing

Neural Network Based Indian Folk Dance Song Classification Using MFCC and LPC

One-class classification: taxonomy of study and review of techniques

A scale-free distribution of false positives for a large class of audio similarity measures

Content Facets For Individual Information Needs In Media

Content Facets For Individual Information Needs In Media

Automatic Text Categorization in Terms of Genre and Author

MIRages: an account of music audio extractors, semantic description and context-awareness, in the three ages of MIR

MIRages: an account of music audio extractors, semantic description and context-awareness, in the three ages of MIR

Picasso - to sing, you must close your eyes and draw
