
Tuesday, May 12, 2015


In university courses on algorithm design, most of us learned to use dynamic programming to compute the minimum edit distance: the smallest number of single-character insertions, deletions, and substitutions needed to transform one string into another.
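As a reminder of how the dynamic programming solution works, here is a minimal sketch. It assumes unit cost for every insertion, deletion, and substitution (the Levenshtein variant); the function name `edit_distance` is only illustrative and does not come from the original post.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions that turn `source` into `target` (unit costs assumed)."""
    # previous[j] is the distance between source[:i-1] and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of `target`, while running time stays proportional to the product of the two lengths.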
Minimum edit distance is commonly used as a similarity function in a variety of practical applications, detailed below. (Note that in Chinese natural language processing, the word is generally taken as the basic processing unit.)
Spelling correction (Spell Correction): A spell checker (Spell Checker) compares each word against the dictionary entries; English words often need normalization such as stemming first. If a word does not exist in the dictionary, it is considered an error, and the checker then tries to suggest the N words the user most likely meant to type, i.e. the spelling suggestions. A common way to produce the suggestions is to list the dictionary entries with the smallest edit distance to the original word.
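The suggestion step can be sketched roughly as follows. This is a brute-force illustration without any optimization; the names `suggest` and `edit_distance` and the sample word list are assumptions made for the example, and unit edit costs are assumed.

```python
import heapq


def edit_distance(source, target):
    """Unit-cost edit distance computed row by row."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(previous[j - 1] + (s_char != t_char),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]


def suggest(word, dictionary, n=10):
    """Return up to n dictionary entries closest to `word`.

    A word found in the dictionary is treated as correct, so no
    suggestions are returned for it. Ties are broken alphabetically.
    """
    if word in dictionary:
        return []
    return heapq.nsmallest(
        n, dictionary, key=lambda entry: (edit_distance(word, entry), entry))


words = {"spell", "spelling", "shell", "smell", "speak"}  # hypothetical dictionary
print(suggest("spel", words, n=2))
```

Every lookup of an unknown word scans the whole dictionary, which is why a real checker would need to cut down the number of comparisons.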
Some readers may ask: if, for every word (of length len) that is not in the dictionary, we compute the minimum edit distance to every dictionary entry, isn't the time complexity too high? It is, so some pruning strategies are generally needed, for example: since spell-check applications usually only need to return the Top-N suggestions (N is typically 10), we can compare against dictionary entries in order of length len, len-1, len+1, len-2, len-3, ...; require that the minimum edit distance from a suggested entry to the word not exceed a certain threshold; stop processing once more than N candidate entries with minimum edit distance 1 have been found; and cache common misspellings together with their suggestions to improve performance.
DNA analysis (DNA Analysis): A major theme of genomics is comparing DNA sequences and trying to find the portions that two sequences have in common. If two DNA sequences share similar common subsequences, they are likely to be homologous. When aligning two sequences, we consider not only exactly matching characters but also spaces or gaps in one sequence (or, conversely, insertions in the other) and mismatches, both of which may indicate a mutation. Sequence alignment requires finding the optimal alignment, which generally means maximizing the number of matches while minimizing the number of spaces and mismatches. More formally, one can define a score that adds points for each matching character and subtracts points for each space and each mismatched character.
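The scoring idea for sequence alignment can be illustrated with a small global-alignment sketch in the style of Needleman-Wunsch. The score values (+1 for a match, -1 for a mismatch, -1 for a space) and the name `alignment_score` are assumptions chosen for the example, not values given in the post.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences.

    Adds `match` for each pair of equal characters, `mismatch` for each
    pair of different characters, and `gap` for each space inserted
    into either sequence (all three values are illustrative defaults).
    """
    # previous[j]: best score for aligning seq_a[:i-1] with seq_b[:j].
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a_char in enumerate(seq_a, start=1):
        current = [i * gap]  # seq_a[:i] aligned against only spaces
        for j, b_char in enumerate(seq_b, start=1):
            pair = previous[j - 1] + (match if a_char == b_char else mismatch)
            space_in_b = previous[j] + gap
            space_in_a = current[j - 1] + gap
            current.append(max(pair, space_in_b, space_in_a))
        previous = current
    return previous[-1]


print(alignment_score("ACGT", "AGT"))
```

The recurrence has the same shape as the one for edit distance, except that it maximizes a score instead of minimizing a cost.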
Entity name extraction: the candidate strings in a text can be compared by edit distance against each entity name in a dictionary; when the edit distance of a string in the text is smaller than a given threshold, it is taken as a candidate entity name. After the candidates are obtained, heuristic rules or a classifier applied to the surrounding context decide whether each candidate really is an entity name.
Entity coreference (Entity Coreference): The minimum edit distance between any two entities can be used to judge whether they refer to the same thing, for example "IBM" and "IBM Inc.", or "Stanford President John Hennessy" and "Stanford University President John Hennessy".
Machine translation (Machine Translation):
Identification of parallel pages: parallel pages generally have the same or a similar interface structure, so their HTML structure should be highly similar. First extract the HTML tags of each page and concatenate them into a string, then measure the similarity of the two strings with minimum edit distance. In practice, this strategy is generally combined with other methods, such as document length ratio and sentence-alignment translation models, to identify the final parallel page pairs.
Automatic evaluation: first store the machine translation source texts along with multiple reference translations of different quality levels; when evaluating, find the reference translation with the smallest edit distance to the automatic translation and use it to estimate the quality of the automatic translation indirectly.
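The automatic evaluation idea can be sketched as follows. The sketch assumes that edit distance is computed over whitespace-separated words rather than characters, and that each reference translation carries a quality label; the function names, the labels, and the sample sentences are all hypothetical.

```python
def edit_distance(source, target):
    """Unit-cost edit distance over any two sequences (here: word lists)."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            current.append(min(previous[j - 1] + (s_item != t_item),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]


def estimate_quality(translation, references):
    """Return the quality label of the reference closest to `translation`.

    `references` is a list of (reference_text, quality_label) pairs.
    """
    tokens = translation.split()
    _closest_text, label = min(
        references, key=lambda ref: edit_distance(tokens, ref[0].split()))
    return label


references = [  # hypothetical reference translations with quality labels
    ("the cat sat on the mat", "good"),
    ("cat the mat sat", "poor"),
]
print(estimate_quality("the cat sat on a mat", references))
```

Whether words or characters are the better unit depends on the language pair; for Chinese output, the text would first need word segmentation.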
dylinshi126