Solr: How a Searcher Instance Is Obtained (Analysis)
from: http://m.tkk7.com/conans/articles/380686.html

Every search request holds a reference to a searcher rather than creating a new one; when the request finishes processing, that reference is released.
When Solr starts up, the SolrCore class does a great deal of initialization work, including loading the contents of solrconfig.xml. The code looks like this:

    booleanQueryMaxClauseCount();  // set the maximum number of clauses in a Boolean query
    initListeners();               // load the searcher listeners declared in the config file
    initDeletionPolicy();
    initIndex();
    initWriters();
    initQParsers();
    initValueSourceParsers();
    this.searchComponents = loadSearchComponents();
    // Processors initialized before the handlers
    updateProcessorChains = loadUpdateProcessorChains();
    reqHandlers = new RequestHandlers(this);
    reqHandlers.initHandlersFromConfig( solrConfig );
    highlighter = initHighLighter();
    // Handle things that should eventually go away
    initDeprecatedSupport();

The loadSearchComponents call above loads the search components; the IndexSearcher instance itself is created through getSearcher, explained in detail below:

getSearcher(forceNew, returnSearcher, waitSearcher-Futures)

Solr calls getSearcher from three places: SolrCore initialization uses (false, false, null); QueryComponent uses (false, true, null) when handling a query request; and UpdateHandler uses (true, false, new Future[1]) when handling a commit request.
---------
1. During SolrCore initialization
Based on the IndexReaderFactory and DirectoryFactory configured in solrconfig, Solr obtains an IndexReader for the index, wraps it in a SolrIndexReader, and then wraps that in a RefCounted (the searcher's reference counter: when a search component acquires the searcher the count is incremented, and when it is done it calls close to decrement; once the count drops to 0 the reference is removed from the core's list of in-use searchers and searcher.close is called to release resources). This reference is added to the core's list of currently used searchers. If firstSearcherListeners is non-empty, those listeners are invoked; the callback runs on a newSingleThreadExecutor owned by the core, to which a task is also submitted that sets this RefCounted as the core's current (latest) searcher reference.
Finally null is returned, because returnSearcher=false.
The purpose of doing this at SolrCore initialization is to load the IndexSearcher up front, so that a search request can be answered immediately when it arrives instead of waiting for the IndexSearcher to load.
---------
2. When QueryComponent handles a query request
Since the core's current searcher reference counter is not null, the request does not force a fresh IndexSearcher, and returnSearcher=true, the core's current searcher reference counter is returned directly, with its count incremented.
There is also a code path for the case where the current searcher reference is null, but no situation has been found that would trigger it, so it is not described here.
---------
3. When UpdateHandler handles a commit request
First the latest searcher is fetched from the core's list of in-use searchers. The index.properties file under the index directory (if it exists) is loaded to read the value of the key 'index', which names where the index is currently stored. If that directory matches the one used by the current searcher and solrConfig.reopenReaders is true, the latest reader is obtained via searcher.reader.reopen and wrapped into a searcher; otherwise the reader is obtained directly with IndexReader.open.
Once the searcher is obtained, the logic [wrapping in RefCounted and adding it to the searchers list] is the same as at core initialization. Then:
If solrConfig.useColdSearcher is TRUE and the current searcher reference is null (which would block requests coming from QueryComponent; no situation has been found that actually makes the reference null), the new searcher's reference is immediately set as the core's current searcher reference counter, so requests from QueryComponent can grab it and return. Note that at this point the newly built searcher has not been warmed from its predecessor's caches, and it will not perform any warming at all.
If solrConfig.useColdSearcher is FALSE, a warming task is added to the thread pool.
If newSearcherListeners is non-empty, those listeners are invoked, also via the thread pool.
Finally, if the new searcher's reference has not already been set as the core's current searcher reference counter, a task is added to the thread pool to do so.
Null is returned at the end, because returnSearcher=false.

from:http://blog.sina.com.cn/s/blog_56fd58ab0100v3tp.html

CONAN 2012-06-13 14:17 Post a comment
Solr Performance Tuning (NO_NORMS)
from: http://m.tkk7.com/conans/articles/380685.html

indexed fields

The number of indexed fields affects:

  •  memory usage while indexing
  •  segment merge time
  •  optimization time
  •  index size

We can use omitNorms="true" to reduce the impact of a growing number of indexed fields.

stored fields

Retrieving the stored fields is indeed a cost. The cost is heavily influenced by the number of bytes stored per document: the more space each document occupies, the more sparsely the documents are laid out, so more disk I/O is needed to read the data. (We usually weigh this when storing large fields, such as the full text of an article.)

Consider keeping very large fields outside of Solr. If that feels awkward, consider compressed fields instead; they shift load onto the CPU when storing and reading the fields, but they do reduce the I/O burden.

If you do not always need the stored fields, lazy loading of stored fields can save a lot of time, especially when compressed fields are used.

Configuration Considerations

mergeFactor

The merge factor roughly determines the number of segments in the index.

It tells Lucene when to merge several segments into one; it works like the base of a number system.

For example, with a merge factor of 10, a new segment is created for every 1,000 documents added to the index. When the 10th segment of size 1,000 is added, all ten are merged into a single segment of size 10,000. When ten segments of size 10,000 have accumulated, they are merged into one segment of size 100,000, and so on.

This value can be set in the *mainIndex* section of solrconfig.xml (the value in indexDefaults is not used).
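For reference, a minimal sketch of what this looks like in solrconfig.xml (values are illustrative; the element names follow the pre-4.x mainIndex section):

```xml
<mainIndex>
  <!-- the "base" of the segment number system described above -->
  <mergeFactor>10</mergeFactor>
  <!-- how many buffered docs accumulate before a new segment is flushed -->
  <maxBufferedDocs>1000</maxBufferedDocs>
</mainIndex>
```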

mergeFactor Tradeoffs

A higher merge factor:

  •  speeds up indexing
  •  merges less frequently, leaving more index files, which slows down searching

A lower merge factor:

  •  leaves fewer index files, which speeds up searching
  •  merges more frequently, which slows down indexing

Cache autoWarm Count Considerations

When a new searcher is opened, its caches can be prewarmed, i.e. "autowarmed" with data from the old searcher's caches. autowarmCount is the parameter for this: the number of objects copied from the old cache into the new one. It therefore affects how long "autowarming" takes, and there is a trade-off between searcher start-up time and how thoroughly the caches are warmed. The more thoroughly the caches are warmed, the longer it takes, and usually we do not want an overly long searcher start-up time. This autowarm parameter can be set in solrconfig.xml.
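For reference, autowarmCount is set per cache in solrconfig.xml, along the lines of the following sketch (the sizes are illustrative):

```xml
<!-- copy the 128 most recently used entries from the old searcher's cache -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
```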

See the Solr wiki for detailed configuration.

Cache hit rate

We can inspect cache statistics in the Solr admin UI. Increasing Solr's cache sizes is often a shortcut to better performance. If you use faceted search, pay particular attention to filterCache, the cache Solr implements for filters.

Explicit Warming of Sort Fields

If many of your fields are used for sorting, you can add explicit warming queries to the "newSearcher" and "firstSearcher" event listeners, so that the FieldCache gets populated for those fields.

Optimization Considerations

Optimizing the index is something we do routinely; for example, once an index is built and will no longer change, we optimize it once.

But if your index changes frequently, you need to weigh the following factors:

  •  As more and more segments are added to the index, query performance drops. Lucene has an upper limit on the number of segments; when it is exceeded, segments are merged automatically.
  •  Under the same caching conditions, an unoptimized index performs about 10% worse than an optimized one.
  •  Autowarming takes longer, because it depends on searching.
  •  Optimization affects index replication (distribution).
  •  During optimization, the files on disk grow to twice the index size, though in the end they return to the original size, or even a bit smaller.

Optimization merges all segments into a single segment, so it also helps avoid the "too many files" problem, an error raised by the file system.

Updates and Commit Frequency Tradeoffs

If slaves fetch updates from the master too frequently, slave performance suffers. To avoid the degradation this causes, we must understand how a slave performs updates, so that we can tune the relevant parameters (commit frequency, snappullers, autowarming/autocount) accurately and keep slave updates from becoming too frequent.

  1.  A commit makes Solr generate a new snapshot. If postCommit is set to true, an optimize also triggers a snapshot.
  2.  The snappuller program on the slave usually runs from crontab. It asks the master whether a new snapshot exists; once it finds a new version, it downloads it and runs snapinstall.
  3.  Each time a new searcher is opened there is a cache-warming phase; only after warming does the new index start serving requests.

Three related parameters are discussed here:

  •  number/frequency of snapshots: how often snapshots are taken.
  •  snappullers run from crontab; they can run every second, every day, or at any other interval. A run only downloads the newest version that the slave does not yet have.
  •  Cache autowarming, configured in solrconfig.xml.

If the effect you want is frequently updated slave indexes, approximating a "real-time index", then make snapshots as frequent as possible and run the snappuller just as frequently. This way an update every 5 minutes is achievable with decent performance. Of course the cache hit rate matters a great deal, and cache warm-up time also constrains how frequent the updates can be.

Caches matter a lot to performance. On the one hand, the new cache must hold enough entries that subsequent queries can benefit from it. On the other hand, warming the cache can take a long stretch of time, especially since it uses only a single thread and a single CPU. If snapinstaller runs too frequently, the Solr slave ends up in a poor state: while it is still warming one new cache, an even newer searcher may be opened.

How do we deal with that? We might discard the first searcher and work on a newer one, the second; but perhaps before the second searcher has even been used, a third arrives. A vicious cycle, isn't it? It is also possible that a new round of warming starts right when the previous one has just finished, so the caches never get to do their job at all. When this happens, lowering the snapshot frequency is the real fix.

Query Response Compression

In some cases it is worth compressing Solr's XML response before sending it. If responses are very large, they may run into NIC I/O limits.

Compression does raise the CPU load, and Solr is a service that typically depends on CPU speed, so compression will cost some query performance. However, the compressed data can be about one sixth the size of the original, at a query-performance cost of roughly 15% for Solr.

How to enable this depends on the server you use; consult the relevant documentation.

Embedded vs HTTP Post

Building the index with an embedded client is about 50% faster than posting XML over HTTP.

RAM Usage Considerations

OutOfMemoryErrors

If your Solr instance is not given enough memory, the Java virtual machine may throw an OutOfMemoryError. This does not corrupt the index data, but while it lasts, no adds/deletes/commits can succeed.

Memory allocated to the Java VM

The simplest fix, assuming the JVM has not already been given all of your memory, is to increase the memory of the JVM running Solr.

Factors affecting memory usage

You may also want to reduce Solr's memory usage.

One factor is the size of the input documents.

When we execute an add with XML, there are two constraints:

  •  All the fields of a document are held in memory. Fields have a maxFieldLength property, which may help here.
  •  Every additional field also increases memory use.
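For reference, maxFieldLength is a solrconfig.xml setting in pre-4.x Solr; it caps how many tokens of a field get indexed, which bounds per-document work at indexing time (the value below is illustrative):

```xml
<!-- index at most the first 10,000 tokens of each field -->
<maxFieldLength>10000</maxFieldLength>
```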


CONAN 2012-06-13 14:16 Post a comment
Solr Cache: Usage Introduction and Analysis
from: http://m.tkk7.com/conans/articles/380684.html

Each core is normally served, at any given moment, by its single current SolrIndexSearcher for the handlers above it (while switching SolrIndexSearchers, two may serve simultaneously). Solr's various caches are attached to the SolrIndexSearcher: while a SolrIndexSearcher lives, its caches live; when it dies, its caches are emptied and closed.

The application-level caches in Solr include filterCache, queryResultCache, documentCache, and others. They are all implementations of SolrCache, held as member variables of SolrIndexSearcher, and each has its own logic and purpose. They are introduced and analyzed below.

1、SolrCache interface implementations

Solr provides two SolrCache implementations: solr.search.LRUCache and solr.search.FastLRUCache. FastLRUCache was introduced in version 1.4 and is, generally speaking, faster than LRUCache.
Below are the main methods of the SolrCache interface:

public interface SolrCache {
  public Object init(Map args, Object persistence, CacheRegenerator regenerator);
  public int size();
  public Object put(Object key, Object value);
  public Object get(Object key);
  public void clear();
  void warm(SolrIndexSearcher searcher, SolrCache old) throws IOException;
  public void close();
}

1.1、solr.search.LRUCache

LRUCache's configurable parameters are:

1) size: the maximum number of entries the cache can hold; default 1024.
2) initialSize: the initial size of the cache; default 1024.
3) autowarmCount: when switching SolrIndexSearchers, the newly created SolrIndexSearcher can be autowarmed (prewarmed). autowarmCount is the number of entries taken from the old SolrIndexSearcher to be regenerated in the new one; how they are regenerated is up to the CacheRegenerator. In the current 1.4 version of Solr, autowarmCount can only be an absolute number of entries; a future version will allow a percentage of the existing cache entries, to better balance the cost and benefit of autowarming. If the parameter is omitted, no autowarming is done.

In implementation, LRUCache caches data directly in a LinkedHashMap, with initialSize bounding the cache size; the eviction policy is LinkedHashMap's built-in LRU. Reads and writes both take a global lock on the map, so concurrency is somewhat poor.
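The LinkedHashMap mechanism just described can be demonstrated with plain JDK classes. This is a simplified stand-in for solr.search.LRUCache, not the real class:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded LRU cache built on LinkedHashMap in access-order mode, the same
// mechanism LRUCache relies on. All access must be externally synchronized,
// which is exactly why LRUCache's concurrency is mediocre.
public class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public SimpleLruCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder=true makes iteration order the LRU order
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize; // evict the least-recently-used entry beyond the cap
    }

    public static void main(String[] args) {
        SimpleLruCache<String, Integer> cache = new SimpleLruCache<String, Integer>(2);
        cache.put("fq1", 1);
        cache.put("fq2", 2);
        cache.get("fq1");       // touch fq1 so fq2 becomes the eldest entry
        cache.put("fq3", 3);    // capacity exceeded: fq2 is evicted
        System.out.println(cache.keySet()); // prints [fq1, fq3]
    }
}
```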

1.2、solr.search.FastLRUCache

For configuration, besides LRUCache's parameters, FastLRUCache optionally accepts:

1) minSize: when the cache reaches its maximum size, the eviction policy shrinks it down to minSize; default 0.9*size.
2) acceptableSize: when evicting, we hope to get down to minSize but may not manage it; shrinking to acceptableSize is then accepted as good enough; default 0.95*size.
3) cleanupThread: whereas LRUCache evicts synchronously inside put, FastLRUCache can optionally use a separate thread, namely when cleanupThread is configured. When the cache is very large, each eviction pass can take a long time, which is bad for the threads serving query requests, so a separate background thread is well worth it.

In implementation, FastLRUCache internally uses a ConcurrentLRUCache, essentially a ConcurrentHashMap with an LRU eviction policy added on top. Its concurrency is therefore much better; this is also the most typical implementation among Java caches.

2、filterCache

filterCache stores unordered sets of Lucene document ids. The cache has three uses:

1) filterCache stores the document-id sets obtained from filter queries (the "fq" parameter). Solr has two query parameters, q and fq. If fq is present, Solr evaluates each fq first (there can be several fq values, and their results are intersected), then intersects the fq result with the q result. In this process, filterCache is a cache whose key is a single fq (of type Query) and whose value is a document-id set (of type DocSet). For range-query fq values in particular, filterCache shows its worth.
2) filterCache is also used by facet queries (http://wiki.apache.org/solr/SolrFacetingOverview). The count for each facet is computed by processing the document-id set matching the query conditions (which may involve filterCache). Since counting the facets may touch every doc id, the filterCache needs to be large enough to hold the index's document count.
3) If <useFilterForSortedQuery/> is configured in solrconfig.xml, and the query carries a filter (here a DocSet to filter by, not an fq; I have yet to see what this is good for), then filterCache is used.
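The intersection of fq and q results described in 1) can be sketched with JDK bit sets. In Solr the role of BitSet here is played by DocSet, and the doc ids below are made up for illustration:

```java
import java.util.BitSet;

// Sketch of how filter-query results combine with the main query: each fq
// yields an unordered doc-id set (a BitSet here); multiple fq sets are
// intersected with each other and then with the q result.
public class FilterIntersectionDemo {
    public static BitSet docSet(int... docIds) {
        BitSet s = new BitSet();
        for (int id : docIds) s.set(id);
        return s;
    }

    public static BitSet search(BitSet qResult, BitSet... fqResults) {
        BitSet result = (BitSet) qResult.clone();
        for (BitSet fq : fqResults) result.and(fq); // intersect with each cached fq DocSet
        return result;
    }

    public static void main(String[] args) {
        BitSet q   = docSet(1, 2, 3, 5, 8); // docs matching q=camera
        BitSet fq1 = docSet(2, 3, 5, 9);    // cached result for fq=price:[400 TO 500]
        BitSet fq2 = docSet(3, 5, 8);       // cached result for fq=inStock:true
        System.out.println(search(q, fq1, fq2)); // prints {3, 5}
    }
}
```

Because each fq is cached independently of q, the same filter set can be reused across many different main queries, which is where filterCache pays off.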

Below is a filterCache configuration example:

<!-- Internal cache used by SolrIndexSearcher for filters (DocSets), unordered sets of *all*
     documents that match a query. When a new searcher is opened, its caches may be prepopulated
     or "autowarmed" using data from caches in the old searcher. autowarmCount is the number of
     items to prepopulate. For LRUCache, the prepopulated items will be the most recently
     accessed items. -->
<filterCache class="solr.LRUCache" size="16384" initialSize="4096"/>

Whether to use filterCache, and how to size it, must be evaluated against the application's characteristics, statistics, observed effect, experience, and so on.

For applications that use fq and facets, tuning filterCache is well worth the effort.

3、queryResultCache

As the name suggests, queryResultCache caches query results (all the SolrIndexSearcher caches store document-id sets); this result is the fully ordered result for the query conditions. Here is a configuration example:

<!-- queryResultCache caches results of searches - ordered lists of document ids (DocList)
     based on a query, a sort, and the range of documents requested. -->
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096"/>

What does the cache key look like? It is the class below (the key's hashcode is QueryResultKey's member variable hc):

public QueryResultKey(Query query, List<Query> filters, Sort sort, int nc_flags) {
  this.query = query;
  this.sort = sort;
  this.filters = filters;
  this.nc_flags = nc_flags;
  int h = query.hashCode();
  if (filters != null) h ^= filters.hashCode();
  sfields = (this.sort != null) ? this.sort.getSort() : defaultSort;
  for (SortField sf : sfields) {
    // mix the bits so that sortFields are position dependent
    // so that a,b won't hash to the same value as b,a
    h ^= (h << 8) | (h >>> 25);  // reversible hash
    if (sf.getField() != null) h += sf.getField().hashCode();
    h += sf.getType();
    if (sf.getReverse()) h = ~h;
    if (sf.getLocale() != null) h += sf.getLocale().hashCode();
    if (sf.getFactory() != null) h += sf.getFactory().hashCode();
  }
  hc = h;
}
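The effect of the bit-mixing line can be checked in isolation. The following is a toy reproduction of just the sort-field loop above, with made-up field names:

```java
// Demonstrates why QueryResultKey mixes the bits before adding each sort
// field's hash: the mix makes the combined hash depend on field *order*,
// so sort "a,b" and sort "b,a" produce different cache keys.
public class SortFieldHashDemo {
    public static int combine(int queryHash, String... sortFields) {
        int h = queryHash;
        for (String field : sortFields) {
            h ^= (h << 8) | (h >>> 25); // position-dependent, reversible mix
            h += field.hashCode();
        }
        return h;
    }

    public static void main(String[] args) {
        int ab = combine(0, "a", "b");
        int ba = combine(0, "b", "a");
        System.out.println(ab != ba); // prints true: field order matters
        // Without the mix, plain addition would collide, since + is commutative:
        System.out.println("a".hashCode() + "b".hashCode()
                == "b".hashCode() + "a".hashCode()); // prints true
    }
}
```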
Since queries carry start and rows parameters, a QueryResultKey may hit the cache while start and rows fall outside the cached document-id set. Naturally, the larger the document-id set, the higher the probability of a hit, but the more memory it costs; this is governed by the parameter queryResultWindowSize, which sets the size of the cached document-id set. Solr defaults it to 50, and it is configurable; the WIKI explains it plainly:

<!-- An optimization for use with the queryResultCache. When a search is requested, a superset
     of the requested number of document ids is collected. For example, if a search for a
     particular query requests matching documents 10 through 19, and queryWindowSize is 50,
     then documents 0 through 50 will be collected and cached. Any further requests in that
     range can be satisfied via the cache. -->
<queryResultWindowSize>50</queryResultWindowSize>

Compared with filterCache, queryResultCache uses less memory, but its payoff is harder to judge. As for index data, we often store only the application's primary key id in the index and fetch the other needed fields from the database or another data source. The query flow then becomes: first get the document-id set from Solr, then the set of application ids, and finally the complete result from the external source. If strict correctness of query results is not required, the complete results can be cached independently outside Solr (and expired periodically), in which case queryResultCache is not very necessary; otherwise it is worth considering. And of course, if within queryResultCache's lifetime the overlap between queries turns out to be very low, it is also not worth keeping on.

4、documentCache

Again as the name suggests, documentCache holds <doc_id, document> pairs. If documentCache is used, make it as large as you can, at least <max_results> * <max_concurrent_queries>; otherwise cache eviction may force documents to be fetched again within a single request. Also watch how many fields are stored per document, to avoid heavy memory consumption.

Below is a documentCache configuration example:

<!-- documentCache caches Lucene Document objects (the stored fields for each document). -->
<documentCache class="solr.LRUCache" size="16384" initialSize="16384"/>
5、User/Generic Caches

Solr supports custom caches; you only need to implement your own regenerator. A configuration example:

<!-- Example of a generic cache. These caches may be accessed by name through
     SolrIndexSearcher.getCache(), cacheLookup(), and cacheInsert(). The purpose is to enable
     easy caching of user/application level data. The regenerator argument should be specified
     as an implementation of solr.search.CacheRegenerator if autowarming is desired. -->
<!-- <cache name="yourCacheNameHere" class="solr.LRUCache" size="4096" initialSize="2048"
     regenerator="org.foo.bar.YourRegenerator"/> -->

6、The Lucene FieldCache

Lucene has the relatively low-level FieldCache, which Solr does not manage; the Lucene FieldCache remains the business of Lucene's IndexSearcher.
7、autowarm

Autowarm was mentioned above. It is triggered at two moments: when the first searcher is created (firstSearcher), and when a new searcher is created (newSearcher) to replace the current one. Before a searcher starts serving requests, each of its caches can be warmed; this normally happens in SolrCache's init method, and the warming strategy differs per cache.

1) filterCache: filterCache registers the CacheRegenerator below, which queries the index with each old key and puts the fresh value into the new cache.

solrConfig.filterCacheConfig.setRegenerator(new CacheRegenerator() {
  public boolean regenerateItem(SolrIndexSearcher newSearcher, SolrCache newCache,
      SolrCache oldCache, Object oldKey, Object oldVal) throws IOException {
    newSearcher.cacheDocSet((Query) oldKey, null, false);
    return true;
  }
});

2) queryResultCache: queryResultCache's autowarm does not happen in SolrCache's init (that is, it does not walk the query keys already in queryResultCache and re-execute them); instead, through the SolrEventListener interface's void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) method, the specific queries given in the configuration are executed, which explicitly prewarms the Lucene FieldCache. A queryResultCache warming configuration example:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- seed common sort fields -->
    <lst><str name="q">anything</str><str name="sort">name desc price desc populartiy desc</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- seed common sort fields -->
    <lst><str name="q">anything</str><str name="sort">name desc, price desc, populartiy desc</str></lst>
    <!-- seed common facets and filter queries -->
    <lst><str name="q">anything</str><str name="facet.field">category</str>
         <str name="fq">inStock:true</str><str name="fq">price:[0 TO 100]</str></lst>
  </arr>
</listener>

3) documentCache: because the mapping between document ids and index documents changes with a new index, documentCache has no warming phase; it starts out completely empty. Autowarming is great, but mind the cost it brings: measure the warming overhead in practice, and watch how often searchers are switched, lest warming and switching interfere with the searcher's normal query service.

8、References
http://wiki.apache.org/solr/SolrCaching


CONAN 2012-06-13 14:12 Post a comment
SolrJ: Building an Index and Paged Queries from the Solr Client
from: http://m.tkk7.com/conans/articles/379556.html (read the full text at the link)

CONAN 2012-05-30 15:05 Post a comment
Solr Facet Queries
from: http://m.tkk7.com/conans/articles/379555.html
For example, when searching for digital cameras, the results column lists the hits by dimensions such as manufacturer and resolution; manufacturer and resolution are each a facet.

Under manufacturer there are values such as nikon, canon, sony; these are called constraints.

Then, based on what is selected, the current navigation path is listed; that is the breadcrumb.

Solr has several kinds of facet:
- plain facets, e.g. faceting on the manufacturer/brand dimension
- query facets, e.g. faceting on price by defining several ranges such as 0-10, 10-20, 20-30
- date facets, a special kind of range query, e.g. faceting by month

The main benefit of facets is that search conditions can be combined arbitrarily, avoiding fruitless searches and improving the search experience.

Facets are specified at query time via parameters. For example, in the HTTP API:
"&facet=true&facet.field=manu"
In Java code:
new SolrQuery("*:*").setFacet(true).addFacetField("manu");
And the XML response comes back like this:
<lst name="facet_fields">
  <lst name="manu">
    <int name="Canon USA">17</int>
    <int name="Olympus">12</int>
    <int name="Sony">12</int>
    <int name="Panasonic">9</int>
    <int name="Nikon">4</int>
  </lst>
</lst>
In Java, the facet results can be retrieved like this:
List<FacetField> facetFields = queryResponse.getFacetFields();
To add a facet query on top of an existing query:
solrQuery.addFacetQuery("quality:[* TO 10]");
For example, to facet prices into specified ranges, append facet parameters like these:

&facet=true&facet.query=price:[* TO 100]
&facet.query=price:[100 TO 200]&facet.query=price:[200 TO 300]
&facet.query=price:[300 TO 400]&facet.query=price:[400 TO 500]
&facet.query=price:[500 TO *]

To drill further into products priced 400 to 500, you can query like this (using Solr's filter query):

http://localhost:8983/solr/select?q=camera&facet=on&facet.field=manu&facet.field=camera_type&fq=price:[400 TO 500]

Note that the facet fields here no longer include price.

To narrow further by camera type, the query becomes:

http://localhost:8983/solr/select?q=camera&facet=on&facet.field=manu&fq=price:[400 TO 500]&fq=camera_type:SLR

Typical uses of facets:
1. category navigation
2. auto-suggest, with the help of a multi-valued tag field
3. hot-keyword ranking, which also needs a tag field








CONAN 2012-05-30 14:52 Post a comment
An Overview of the New SolrCloud
from: http://m.tkk7.com/conans/articles/379553.html
SolrCloud is now mature and supports distributed indexing and distributed search. Below is the deployment layout of one of our projects on the new SolrCloud: [figure: SolrCloud deployment diagram]

Looks quite simple, doesn't it? Now let's look at some of the implementation details.

SolrCloud features and architecture
Some of SolrCloud's notable features:
  • centralized cluster configuration
  • automatic failover
  • near-real-time search
  • leader election
  • durable indexing

SolrCloud can also be configured with:
- sharded indexes
- one or more replicas per shard

Multiple shards and their replicas form a Collection (as the figure shows, a Collection is one SolrCloud index), multiple Collections can be deployed on one SolrCloud cluster, and a single search request can search several Collections at once. The workflow is shown in the figure below. [figure: SolrCloud request workflow]


SolrCloud Shard, Replica, Replication
As in the figure above, when a new doc is sent to any node of a SolrCloud cluster, the shard it should go to is chosen automatically; if the shard has several replicas, the doc is synchronized to them automatically. Unlike the old master/slave structure, this synchronization is real-time (previously it was periodic batch synchronization).

Cluster configuration
All configuration for a SolrCloud cluster is stored in ZooKeeper. As soon as a SolrCloud node starts, it sends its configuration to ZooKeeper for storage.

Besides serving as failover copies, shard replicas also spread query load, raising the query capacity of the whole cluster.

Index handling
Updates to indexed documents propagate between a shard and its replicas automatically and in real time. Because there is no master server, a doc can be sent to any node of the SolrCloud (that is, of a Collection), and SolrCloud takes care of the rest; the old master/slave single point of failure is gone.

Search modes
There are several ways to search:
- on a single Solr instance
- on a single Collection, i.e. across all of that Collection's shards
- on specified shards only
- on several Collections, merging the results before returning them

Operations and administration
Besides the original standard core admin, new operations were added:
- create a shard within a Collection
- create a new Collection
- add nodes

Next steps
The new SolrCloud design proposal is at http://wiki.apache.org/solr/NewSolrCloudDesign.

CONAN 2012-05-30 14:47 Post a comment
[Translated] Lucene & Solr: 2011 in Review
from: http://m.tkk7.com/conans/articles/379552.html
Original: http://java.dzone.com/articles/lucene-solr-year-2011-review

2011 is over; here is a look back at what happened this year in the Lucene and Solr world.

Lucene has been an Apache Software Foundation project for over ten years (in fact Lucene itself has existed for more than ten years), and Solr has been an Apache project for about six; both owe much of their growth to the long-term efforts of Otis (http://twitter.com/otisg).

This year Solr and Lucene changed dramatically and gained a large number of new features; the pace of change arguably exceeded any previous year.

The most exciting of these is Near Real-Time search (http://search-lucene.com/?q=NRT): modifications to a document appear in search results almost immediately. NRT is still being improved, but many users have already adopted it.

Field Collapsing (http://wiki.apache.org/solr/FieldCollapsing), long awaited by the Solr community, also landed this year. Solr and Lucene users can now group results by field or by query, control the grouping, and even facet over groups (where before only documents could be faceted).

This year Lucene also gained a faceting module (https://issues.apache.org/jira/browse/LUCENE-3079); from now on, faceting is no longer Solr's privilege, and Lucene users can run facet computations too.

Starting this year, you can index parent/child documents with the Join module (http://wiki.apache.org/solr/Join), so that parent and child documents can be joined during the query, based on how they were indexed.

In multilingual support (http://wiki.apache.org/solr/LanguageAnalysis#Stemming), Solr and Lucene made major breakthroughs in 2011: the KStemFilter English stemmer was added (http://wiki.apache.org/solr/LanguageAnalysis#Notes_about_solr.KStemFilterFactory), full Unicode 4 support arrived, Chinese and Japanese support was added, a new stemmer-protection mechanism appeared, and the synonym filter's memory consumption dropped. The single biggest enhancement was the integration of Hunspell (http://wiki.apache.org/solr/LanguageAnalysis#Notes_about_solr.HunspellStemFilterFactory), enabling stemming for the languages OpenOffice supports.

The Lucene 3.5.0 release cut the memory consumed by the term dictionary dramatically, to roughly 3 to 5 times less than before when processing the term dictionary.

Previously, paging deep into very large result sets in Lucene was problematic when reading from the start; Lucene 3.5.0 solved this thoroughly by introducing the searchAfter method.

This year Lucene and Solr also gained a new, faster, more reliable highlighter based on term vectors.

Solr integrated the Extended Dismax query parser (http://search-lucene.com/?q=Extended+Dismax), further improving the quality of search results.

This year you can sort search results by function (http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function), for example by the distance from some value to a given point, and a new spatial search filter was added.

Solr also gained a new FST-automaton-based suggest/autocomplete component (http://wiki.apache.org/solr/Suggester) that significantly reduces memory consumption; if this feature interests you, see also the autocomplete search offering from Sematext (http://sematext.com/products/autocomplete/index.html).

Also worth mentioning is Solr's upcoming transaction log support (https://issues.apache.org/jira/browse/SOLR-2700), which will enable real-time get (https://issues.apache.org/jira/browse/SOLR-2656): after adding a document you can immediately fetch it by id. The transaction log will also be used for recovery of SolrCloud's distributed nodes.

Speaking of SolrCloud (http://wiki.apache.org/solr/SolrCloud), there is another introduction here (http://blog.sematext.com/2011/09/14/solr-digest-spring-summer-2011-part-2-solr-cloud-and-near-real-time-search/). In one sentence: SolrCloud applies up-to-date design principles and leans on other software modules (such as ZooKeeper) to stand up a stronger distributed Solr cluster faster. Its core ideas are rejecting single points of failure, centralized cluster and configuration management, breaking away from the old master-slave architecture, automatic failover, and dynamic scaling.

Since the development of the two projects was merged in 2010, both have advanced rapidly. In 2011, backed by their many committers, Lucene and Solr shipped five releases: 3.1 in March, 3.2 three months later in June, 3.3 a month after that in July, 3.4 in September, and 3.5.0 in November.

2011 also had its share of Lucene/Solr conferences. First was Lucene Revolution, held in San Francisco in May, where Otis gave the talk "Search Analytics: What? Why? How?" (http://java.dzone.com/articles/lucene-solr-year-2011-review); more material is at http://lucenerevolution.com/2011/agenda. At the Buzzwords conference in June, Otis gave an updated version of the same talk; related material is on the official site, http://berlinbuzzwords.de. In October, Lucene Eurocon 2011, dedicated to Lucene and Solr, was held in Barcelona; Otis presented "Search Analytics: Business Value & BigData NoSQL Backend" (http://www.lucidimagination.com/sites/default/files/file/Eurocon2011/otis_gospodnetic_search_analytics_lucene_eurocon_2011.ppt), and Rafał (http://twitter.com//kucrafal) presented "Explaining & Visualizing Solr 'explain' information" (http://www.lucidimagination.com/sites/default/files/file/Eurocon2011/Understanding%20and%20Visualizing%20Solr%20Explain%20information%20-%20Solr.pl%20-%20version%202.pdf).

In 2011 Lucene and Solr welcomed another batch of like-minded committers:
• Andi Vajda
• Chris Male
• Dawid Weiss
• Erick Erickson
• Jan Høydahl
• Martin van Groningen
• Stanisław Osiński

For a successful open-source project, books matter to users too. Lucene in Action saw no new edition this year, but Rafał Kuć brought us "Solr 3.1 Cookbook", in which he gives his answers to common Solr problems; and in November David Smiley and Eric Pugh released a new edition of "Apache Solr 3 Enterprise Search Server".

As for what surprises Lucene and Solr will bring in 2012: let's wait and see.

CONAN 2012-05-30 14:44 Post a comment

Generating an Index with SolrJ
from: http://m.tkk7.com/conans/articles/379551.html

The original example, with further explanation, is linked from the source post. It demonstrates two ways to generate a full index:
- one generates the full index from a database via SQL
- one generates the full index by parsing files with Tika
package SolrJExample;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;

/* Example class showing the skeleton of using Tika and
   Sql on the client to index documents from
   both structured documents and a SQL database.

   NOTE: The SQL example and the Tika example are entirely orthogonal.
   Both are included here to make a
   more interesting example, but you can omit either of them.

 
*/
public class SqlTikaExample {
  
private StreamingUpdateSolrServer _server;
  
private long _start = System.currentTimeMillis();
  
private AutoDetectParser _autoParser;
  
private int _totalTika = 0;
  
private int _totalSql = 0;

  
private Collection _docs = new ArrayList();

  
public static void main(String[] args) {
    
try {
      SqlTikaExample idxer 
= new SqlTikaExample("http://localhost:8983/solr");

      idxer.doTikaDocuments(
new File("/Users/Erick/testdocs"));
      idxer.doSqlDocuments();

      idxer.endIndexing();
    } 
catch (Exception e) {
      e.printStackTrace();
    }
  }

  
private SqlTikaExample(String url) throws IOException, SolrServerException {
      
// Create a multi-threaded communications channel to the Solr server.
      
// Could be CommonsHttpSolrServer as well.
      
//
    _server = new StreamingUpdateSolrServer(url, 104);

    _server.setSoTimeout(
1000);  // socket read timeout
    _server.setConnectionTimeout(1000);
    _server.setMaxRetries(
1); // defaults to 0.  > 1 not recommended.
         
// binary parser is used by default for responses
    _server.setParser(new XMLResponseParser()); 

      
// One of the ways Tika can be used to attempt to parse arbitrary files.
    _autoParser = new AutoDetectParser();
  }

    
// Just a convenient place to wrap things up.
  private void endIndexing() throws IOException, SolrServerException {
    
if (_docs.size() > 0) { // Are there any documents left over?
      _server.add(_docs, 300000); // Commit within 5 minutes
    }
    _server.commit(); 
// Only needs to be done at the end,
                      
// commitWithin should do the rest.
                      
// Could even be omitted
                      
// assuming commitWithin was specified.
    long endTime = System.currentTimeMillis();
    log(
"Total Time Taken: " + (endTime - _start) +
         
" milliseconds to index " + _totalSql +
        
" SQL rows and " + _totalTika + " documents");
  }

  
// I hate writing System.out.println() everyplace,
  
// besides this gives a central place to convert to true logging
  
// in a production system.
  private static void log(String msg) {
    System.out.println(msg);
  }

  
/**
   * ***************************Tika processing here
   
*/
  
// Recursively traverse the filesystem, parsing everything found.
  private void doTikaDocuments(File root) throws IOException, SolrServerException {

    
// Simple loop for recursively indexing all the files
    
// in the root directory passed in.
    for (File file : root.listFiles()) {
      
if (file.isDirectory()) {
        doTikaDocuments(file);
        
continue;
      }
        
// Get ready to parse the file.
      ContentHandler textHandler = new BodyContentHandler();
      Metadata metadata 
= new Metadata();
      ParseContext context 
= new ParseContext();

      InputStream input 
= new FileInputStream(file);

        
// Try parsing the file. Note we haven't checked at all to
        
// see whether this file is a good candidate.
      try {
        _autoParser.parse(input, textHandler, metadata, context);
      } 
catch (Exception e) {
          
// Needs better logging of what went wrong in order to
          
// track down "bad" documents.
        log(String.format("File %s failed", file.getCanonicalPath()));
        e.printStackTrace();
        
continue;
      }
      
// Just to show how much meta-data and what form it's in.
      dumpMetadata(file.getCanonicalPath(), metadata);

      
// Index just a couple of the meta-data fields.
      SolrInputDocument doc = new SolrInputDocument();

      doc.addField(
"id", file.getCanonicalPath());

      
// Crude way to get known meta-data fields.
      
// Also possible to write a simple loop to examine all the
      
// metadata returned and selectively index it and/or
      
// just get a list of them.
      
// One can also use the LucidWorks field mapping to
      
// accomplish much the same thing.
      String author = metadata.get("Author");

      
if (author != null) {
        doc.addField(
"author", author);
      }

      doc.addField(
"text", textHandler.toString());

      _docs.add(doc);
      
++_totalTika;

      
// Completely arbitrary, just batch up more than one document
      
// for throughput!
      if (_docs.size() >= 1000) {
          
// Commit within 5 minutes.
        UpdateResponse resp = _server.add(_docs, 300000);
        
if (resp.getStatus() != 0) {
          log(
"Some horrible error has occurred, status is: " +
                  resp.getStatus());
        }
        _docs.clear();
      }
    }
  }

    
// Just to show all the metadata that's available.
  private void dumpMetadata(String fileName, Metadata metadata) {
    log(
"Dumping metadata for file: " + fileName);
    
for (String name : metadata.names()) {
      log(name 
+ ":" + metadata.get(name));
    }
    log(
"\n\n");
  }

  
  /**
   * *************************** SQL processing here
   */
  private void doSqlDocuments() throws SQLException {
    Connection con = null;
    try {
      Class.forName("com.mysql.jdbc.Driver").newInstance();
      log("Driver Loaded");

      con = DriverManager.getConnection("jdbc:mysql://192.168.1.103:3306/test?"
              + "user=testuser&password=test123");

      Statement st = con.createStatement();
      ResultSet rs = st.executeQuery("select id,title,text from test");

      while (rs.next()) {
        // DO NOT move this outside the while loop
        // or be sure to call doc.clear()
        SolrInputDocument doc = new SolrInputDocument();
        String id = rs.getString("id");
        String title = rs.getString("title");
        String text = rs.getString("text");

        doc.addField("id", id);
        doc.addField("title", title);
        doc.addField("text", text);

        _docs.add(doc);
        ++_totalSql;

        // Completely arbitrary, just batch up more than one
        // document for throughput!
        if (_docs.size() > 1000) {
          // Commit within 5 minutes.
          UpdateResponse resp = _server.add(_docs, 300000);
          if (resp.getStatus() != 0) {
            log("Some horrible error has occurred, status is: " +
                resp.getStatus());
          }
          _docs.clear();
        }
      }
    } catch (Exception ex) {
      ex.printStackTrace();
    } finally {
      if (con != null) {
        con.close();
      }
    }
  }
}
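Both methods above share the same batch-and-flush pattern: buffer documents, hand the whole batch to the server once a threshold is reached, then clear. That pattern can be isolated into a small helper. This is an illustrative sketch, not part of the original program: the sender stands in for the `SolrServer.add(docs, commitWithinMs)` call, and all names here are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Generic sketch of the batch-and-flush pattern used by the indexing methods
// above: buffer items, send the whole batch once a threshold is reached,
// then clear the buffer. The sender stands in for SolrServer.add.
class BatchBuffer<T> {
    private final List<T> buffer = new ArrayList<>();
    private final int threshold;
    private final Consumer<List<T>> sender;

    BatchBuffer(int threshold, Consumer<List<T>> sender) {
        this.threshold = threshold;
        this.sender = sender;
    }

    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= threshold) {
            flush();
        }
    }

    // Send whatever is buffered. Call this once more at the end of a run so
    // the final partial batch is not lost (the original code relies on a
    // trailing commit for that).
    void flush() {
        if (!buffer.isEmpty()) {
            sender.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

Isolating the threshold check in one place also makes the off-by-one difference between the two methods above (`>= 1000` versus `> 1000`) easy to spot and unify.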




CONAN 2012-05-30 14:43 Post a comment
]]>
</description></item><item><title>Solr tuning parameters</title><link>http://m.tkk7.com/conans/articles/379550.html</link><dc:creator>CONAN</dc:creator><author>CONAN</author><pubDate>Wed, 30 May 2012 06:40:00 GMT</pubDate><guid>http://m.tkk7.com/conans/articles/379550.html</guid><description><![CDATA[<div><font style="background-color: #cce8cf">Source: <a>http://rdc.taobao.com/team/jm/archives/1753</a><br />
Three parts are collected here: part one covers general Solr tuning and part two covers targeted tuning — the former is fairly universal, the latter is situational. Always tune the parameters for your concrete application characteristics and benchmark the results. Part three covers Solr queries.
<p>A real application has to keep all of these factors in view; they act together.</p>
<p><strong>Part 1: General Solr tuning</strong><br />English original: http://wiki.apache.org/solr/SolrPerformanceFactors</p>
<h2>Schema Design Considerations</h2>
<h3>indexed fields</h3>
<p>The number of indexed fields affects:</p>
<ul><li>memory usage during indexing</li><li>segment merge time</li><li>optimize time</li><li>index size</li></ul>
<p>The impact of a growing number of indexed fields can be reduced with omitNorms="true".</p>
<h3>stored fields</h3>
<p>Retrieving the stored fields is a real cost, and it is strongly affected by the number of bytes stored per document: <strong>the more space each document takes, the more sparsely documents are laid out</strong>, so reading the data from disk needs more I/O operations. (This usually comes up when you consider storing large fields — for instance the full text of an article.)</p>
<p>Consider keeping large fields outside Solr. If that feels awkward, consider compressed fields; they put extra load on the CPU while storing and reading fields, but they do reduce the I/O burden.</p>
<p>If you do not always use the stored fields, lazy loading of stored fields <strong>can save a lot of time</strong>, especially when compressed fields are used.</p>
<h2>Configuration Considerations</h2>
<h3>mergeFactor</h3>
<p>The merge factor <strong>roughly</strong> determines the number of segments. It tells Lucene when to merge several segments into one, and works like the <strong>base</strong> of a number system.</p>
<p>For example, with the merge factor set to 10, a new segment is created for every 1000 documents added to the index. When the tenth segment of size 1000 is added, all ten are merged into a single segment of size 10,000; when ten segments of size 10,000 have been produced, they are merged into one segment of size 100,000 — and so on.</p>
<p>This value can be set in <strong>mainIndex</strong> in solrconfig.xml (the indexDefaults section does not need to be changed).</p>
<h3>mergeFactor tradeoffs</h3>
<p>A higher merge factor:</p>
<ul><li>speeds up indexing</li><li>merges less frequently, leaving more index files, which slows down searching</li></ul>
<p>A lower merge factor:</p>
<ul><li>leaves fewer index files, which speeds up searching</li><li>merges more frequently, which slows down indexing</li></ul>
<h2>HashDocSet Max Size Considerations</h2>
<p>hashDocSet is a custom optimization option in solrconfig.xml, used in filters (docSets); smaller sets mean less memory consumption and faster iteration and insertion.</p>
<p>The hashDocSet value should be based on the total number of indexed documents: the larger the index, the larger the value.</p>
<p>Calculate 0.005 of the total number of documents that you are going to store. Try values on either 'side' of that value to arrive at the best query times. When query times seem to plateau, and performance doesn't show much difference between the higher number and the lower, use the higher.</p>
<p>Note: hashDocSet is no longer part of Solr as of version 1.4.0, see <a><span style="color: windowtext; text-decoration: none">SOLR-1169</span></a>.</p>
<h2>Cache autoWarm Count Considerations</h2>
<p>When a new searcher is opened, its caches can be pre-warmed — <strong>"autowarmed" with data from the old searcher's caches</strong>. autowarmCount is the number of objects copied from the old cache into the new one, and it affects <strong>how long autowarming takes</strong>. There is a tradeoff to strike between searcher startup time and how warm the cache gets: the warmer the cache, the longer warming takes, and an overly long searcher startup time is usually unwelcome. The autowarm parameters are set in solrconfig.xml.</p>
<p>See the Solr wiki for the detailed configuration.</p>
<h2>Cache hit rate</h2>
<p>The cache statistics can be inspected in Solr's admin interface. <strong>Raising Solr's cache sizes is often the quickest route to better performance.</strong> When you use <strong>faceted search</strong>, pay particular attention to the filterCache, which is implemented by Solr itself.</p>
<p>See the SolrCaching wiki page for details.</p>
<h2>Explicit Warming of Sort Fields</h2>
<p>If you sort on many fields, you can add explicit warming queries to the "newSearcher" and "firstSearcher" event listeners, so that the <strong>FieldCache caches that data</strong> in advance.</p>
<h2>Optimization Considerations</h2>
<p>Optimizing the index is something done routinely — for instance, once an index has been built and <strong>will not change again</strong>, run an optimize on it.</p>
<p>If the index changes frequently, however, weigh the following factors:</p>
<ul><li>As more and more segments are added to the index, query performance drops; Lucene has an upper limit on the number of segments, and when it is exceeded, segments are merged automatically.</li><li>Under identical caching, an unoptimized index performs roughly 10% worse than an optimized one.</li><li>Autowarming takes longer, because it depends on searches.</li><li><strong>Optimizing affects index replication.</strong></li><li>During optimization, the file size grows to <strong>twice the index size</strong>, though in the end it returns to the original size, or slightly below it.</li></ul>
<p>Optimizing merges all segments into a single segment, so it also helps avoid the "too many files" problem, an error thrown by the file system.</p>
<h2>Updates and Commit Frequency Tradeoffs</h2>
<p>If slaves receive updates too often, their performance suffers. To avoid the degradation this causes, we must understand how a slave performs an update, so that the related parameters (commit frequency, snappullers, autowarming/autocount) can be tuned accurately and the slave is not updated too frequently.</p>
<ol><li>A commit makes Solr generate a new snapshot. If the postCommit parameter is set to true, an optimization also triggers a snapshot.</li><li>The snappuller program on the slave normally runs from crontab; it asks the master whether a new snapshot version exists. Once it finds a new version, the slave downloads it and runs snapinstall.</li><li>Each time a new searcher is opened, there is a cache warming phase; <strong>only after warming is the new index put into service</strong>.</li></ol>
<p>Three related parameters:</p>
<ul><li><strong>number/frequency of snapshots</strong> — how often snapshots are taken.</li><li><strong>snappullers</strong> run from crontab; they can run once a second, once a day, or at any other interval. On each run they download only the newest version the slave does not yet have.</li><li><strong>Cache autowarming</strong>, configured in solrconfig.xml.</li></ul>
<p>If the intended effect is to update the slave's index frequently so that it behaves more like a "real-time index", then snapshots need to be taken as often as possible, with snappuller running just as often. Updating every 5 minutes this way can still achieve decent performance; naturally the cache hit rate matters a great deal, and the cache warming time also bounds the update frequency.</p>
<p><strong>Cache warmth is critical to performance.</strong> On one hand, the new cache must hold enough entries that subsequent queries benefit from it. On the other hand, warming the cache can occupy a long stretch of time — it effectively runs on one thread and one CPU. If snapinstaller runs too frequently, the Solr slave ends up in an unhealthy state: it may still be warming a new cache when an even newer searcher is opened.</p>
<p>How do we deal with that? We might discard the first searcher and move on to a newer one — the second; yet possibly before the second is ever used, a third arrives. A vicious cycle. It can also happen that a new round of warming starts right as the previous one finishes, in which case the cache never really gets to contribute at all. When that situation appears, lowering the snapshot frequency is the only real cure.</p>
<h2>Query Response Compression</h2>
<p>In some cases it is worth <strong>compressing the Solr XML response before sending it</strong>. A very large response can hit the NIC I/O limit.</p>
<p>Compression does add CPU load, and since Solr is a service that typically depends on CPU speed, adding compression will reduce query performance. However, the compressed data is about <strong>one sixth the size</strong> of the original, at a cost of roughly 15% of Solr's query performance.</p>
<p>How to enable this depends on the server you use; consult its documentation.</p>
<h2>Embedded vs HTTP Post</h2>
<p>Building the index through the embedded interface is about 50% faster than posting XML.</p>
<h2>RAM Usage Considerations</h2>
<h3>OutOfMemoryErrors</h3>
<p>If your Solr instance is not given enough memory, the Java virtual machine may throw an OutOfMemoryError. This <strong>does not damage the index data</strong>, but while it lasts, no adds/deletes/commits can succeed.</p>
<h3>Memory allocated to the Java VM</h3>
<p>The simplest fix — provided the Java virtual machine has not already consumed all of your memory — is to allocate more memory to the JVM running Solr.</p>
<h4>Factors affecting memory usage</h4>
<p>You may also want to reduce Solr's memory usage. One factor is the size of the input documents. When an add is executed with XML, two limits apply:</p>
<ul><li>All fields in a document are held in memory. The field attribute maxFieldLength may help here.</li><li>Each field added also increases memory usage.</li></ul>
<p><strong>Part 2: Solr-specific tuning</strong></p>
<p>1. Multiple cores</p>
<p>With multiple cores, switching several cores at the same moment puts excessive pressure on memory and CPU. The Solr code can be extended to limit how many core switches execute concurrently, guarding against high-load or high-CPU incidents.</p>
<p>2. Higher availability</p>
<p>Keep no fewer than 2 working nodes, with at least 2 nodes spanning different machines.<br />When switching between offline and online, if the data volume is modest, index and search can share a node; if the data is large — beyond about 10 million documents — run indexing offline or on nodes other than the search nodes.</p>
<p>3. Cache configuration</p>
<p>If updates are very frequent, causing frequent commits and reopens, turn the caches off if you can.<br />If your traffic relies on caches for performance: with no facet requirement, it is best to disable cache warming; with facets, and a heavy reliance on the fieldValueCache, enable cache warming.<br />With real-time updates, the document cache hit rate is usually low, so that cache can simply be left off.</p>
<p>4. reopen and commit</p>
<p>If possible, keep the main on-disk index out of segment merging, writing new segments to a different directory, so the main index stays untouched across reopens.</p>
<p>Make commit and reopen asynchronous.</p>
<p>5. For data that does not change, consider a memory cache or a local cache — this improves performance and space overhead, and avoids full GC.</p>
<p>6. Compress and single-instance intermediate values</p>
<p>Throughout querying and indexing, create as few objects as possible; change object values via setters and use singletons to improve performance. Larger intermediate values can be integer-compressed where feasible.</p>
<p>7. Redefine object representations<br />Objects such as dates, regions, URLs, and bytes can be represented via deltas, region codes, distinguishable parts, compression, and similar structures, lowering memory overhead and thereby raising memory utilization for better performance.</p>
<p>8. Separate index from store<br />Let the index deliver its query performance and the store deliver its storage and retrieval performance.<br />That is, do not put all the content into the index; set stored=false on as many field definitions as possible.</p>
<p>9. Use the latest versions of Solr and Lucene.</p>
<p>10. Share analyzer instances<br />Custom analyzers must be singletons. Never create one analyzer object per document.</p>
<p><strong>Part 3: Solr queries</strong></p>
<p>1. Sort on designated fields<br />When displaying numeric sorts, prefer showing only the most recent 3 or 6 months of data — for prices, say — to deter gaming.<br />When dumping or building the index, bounds-check the numbers so that values which are valid numbers yet semantically unreasonable are caught early.</p>
<p>2. Keep ranking adjustable<br />The default ordering must carry its own relevance score and balance the various requirements.<br />The ranking should be tunable, but without wild swings; keep the ranking details private while keeping the ranking results explainable.</p>
<p>3. Online and offline<br />Some scores can be computed offline, others online — it depends on the requirements.</p>
<p>4. Multi-field queries<br />If several fields are queried by default, consider merging them into one combined field and querying only that field.</p>
<p>5. Highlighting<br />Highlighting can run inside or outside Solr — it does not have to happen in Solr.<br />Likewise, tokenization can be done offline in advance, with the dump step doing only simple whitespace tokenization.</p>
<p>6. Statistics<br />Facet counting can combine online and offline computation; it need not rely entirely on live counting.</p>
<p>7. Active search<br />Sanitize active-search query strings strictly: strip invalid query strings, and expand query strings where appropriate.<br />Define the query path and the handling of hit=0 explicitly.</p><br /></font></div><img src ="http://m.tkk7.com/conans/aggbug/379550.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://m.tkk7.com/conans/" target="_blank">CONAN</a> 2012-05-30 14:40 <a href="http://m.tkk7.com/conans/articles/379550.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Solr study notes: configuring Solr on Linux</title><link>http://m.tkk7.com/conans/articles/379549.html</link><dc:creator>CONAN</dc:creator><author>CONAN</author><pubDate>Wed, 30 May 2012 06:38:00 GMT</pubDate><guid>http://m.tkk7.com/conans/articles/379549.html</guid><description><![CDATA[Original article:

http://zhoujianghai.iteye.com/blog/1540176

 

First, a brief introduction to Solr:

Apache Solr (pronounced "SOLer") is an open-source, high-performance, Java-based full-text search server built on Lucene. Documents are added to a search collection over HTTP using XML, and the collection is queried over HTTP as well, returning an XML/JSON response. Resources in Solr are stored as Documents; each Document consists of a series of Fields, and each Field represents one attribute of the resource. Every Document in Solr needs an attribute that uniquely identifies it; by default this attribute is named id, declared in the schema configuration file (schema.xml) as <uniqueKey>id</uniqueKey>. Solr has two core files: solrconfig.xml and schema.xml. solrconfig.xml is Solr's base configuration file, setting up the various web request handlers, response writers, logging, caches, and so on; schema.xml configures how each data type is indexed, the analyzer (tokenizer) setup, and the fields contained in the index documents.

At work it is mainly used for tokenization and search. The basic workflow: an analyzer tokenizes the source data and an index is built from the tokens; at query time the same analyzer tokenizes the query string, the tokens are matched against the index, and the results are returned.
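The tokenize → index → tokenize-query → match flow just described can be sketched as a toy inverted index. This is only an illustration of the idea — whitespace tokenization stands in for a real analyzer and AND-semantics stand in for real query parsing; it is not how Solr is actually implemented.

```java
import java.util.*;

// Toy illustration of the tokenize -> index -> tokenize-query -> match flow:
// the "analyzer" is plain whitespace tokenization, and a query is an AND over
// its terms. Not Solr's implementation, just the shape of the idea.
class ToyIndex {
    private final Map<String, Set<Integer>> inverted = new HashMap<>();

    // Indexing: tokenize the source text and record each term -> docId.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            inverted.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Querying: tokenize the query the same way, then intersect the postings.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\s+")) {
            Set<Integer> postings = inverted.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```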


Enough preamble — let's get started with Solr:

I. Install the JDK and Tomcat

(1) Install the JDK: download the JDK package and unpack it to a jdk-1.x directory.

(2) Install Tomcat: download the Tomcat package and unpack it to an apache-tomcat directory.

Edit server.xml in the conf directory under the Tomcat installation:

Find <Connector port="8080" .../> and add URIEncoding="UTF-8" to it, so that Chinese text is supported.

Set the Java and Tomcat environment variables.


The two steps above are straightforward and only sketched here; look online for details if anything is unclear.


II. Install Solr

Download the Solr package: http://labs.renren.com/apache-mirror/lucene/solr/3.5.0/apache-solr-3.5.0.zip


Unpack it to an apache-solr directory, copy apache-solr/dist/apache-solr-3.5.0.war to the $TOMCAT_HOME/webapps directory, and rename it solr.war.

Copy apache-solr/example/solr to the Tomcat root directory to serve as solr/home (if you want a multi-core setup, copy apache-solr/example/multicore to the Tomcat root instead — no need to copy solr then). Cores can be added to this directory later, and each core can have its own configuration files.

Create solr.xml under apache-tomcat/conf/Catalina/localhost/ (the file name matches the solr directory under webapps), specifying the locations of solr.war and solr/home, so that Tomcat loads the application automatically at startup.

The contents of solr.xml:

<?xml version="1.0" encoding="UTF-8"?>

<Context docBase="/home/zhoujh/java/apache-tomcat7/webapps/solr.war" debug="0" crossContext="true" >

   <Environment name="solr/home" type="java.lang.String" value="/home/zhoujh/java/apache-tomcat7/solr" override="true" />

</Context>

Then run ./startup.sh in Tomcat's bin directory to start Tomcat.

Visit http://localhost:8080/solr/ in your browser's address bar

and the Solr welcome page with the admin entry point will appear.

Note: if the exception org.apache.solr.common.SolrException: Error loading class 'solr.VelocityResponseWriter' appears, the simplest fix is to find $TOMCAT_HOME/solr/conf/solrconfig.xml and either comment out <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" enable="${solr.velocity.enabled:true}"/> or change enable to false. If everything went smoothly, Solr's web admin interface should now be visible. To get Chinese tokenization working, though, a Chinese analyzer still has to be installed; IKAnalyzer and mmseg4j are recommended here.
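For reference, the workaround just described corresponds to a solrconfig.xml fragment along these lines (this is the enable-flag form of the fix; your file may phrase the enable expression differently):

```xml
<!-- Disable the Velocity response writer when its class cannot be loaded.
     Either comment the element out entirely, or flip the default to false: -->
<queryResponseWriter name="velocity"
                     class="solr.VelocityResponseWriter"
                     enable="${solr.velocity.enabled:false}"/>
```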

IKAnalyzer is an open-source, lightweight Chinese tokenization toolkit written in Java. It uses its own "forward-iteration finest-granularity segmentation" algorithm, reaches a high-speed throughput of several hundred thousand characters per second, and uses a multi-subprocessor analysis model that supports segmenting runs of English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantity words, Roman numerals, scientific notation), and Chinese vocabulary (person and place names). Its dictionary storage is optimized for a smaller memory footprint, and user-defined dictionary extensions are supported.

mmseg4j is a Chinese tokenizer implementing Chih-Hao Tsai's MMSeg algorithm (http://technology.chtsai.org/mmseg/); it provides a Lucene analyzer and a Solr TokenizerFactory for convenient use in Lucene and Solr. The MMSeg algorithm offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four disambiguation rules on top. The author reports a correct word identification rate of 98.41%. mmseg4j implements both methods.
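The forward-maximum-matching strategy that both MMSeg methods build on can be sketched in a few lines. This toy uses an ASCII lexicon for clarity and greedily takes the longest dictionary prefix at each position; real mmseg4j layers the Complex mode's disambiguation rules on top of this idea.

```java
import java.util.*;

// Toy forward-maximum-matching (FMM) segmenter: at each position, take the
// longest lexicon word starting there. This is only the greedy core that the
// text says MMSeg's methods are based on, not mmseg4j itself.
class FmmSegmenter {
    private final Set<String> lexicon;
    private final int maxLen;

    FmmSegmenter(Collection<String> words) {
        this.lexicon = new HashSet<>(words);
        int m = 1;
        for (String w : words) {
            m = Math.max(m, w.length());
        }
        this.maxLen = m;
    }

    List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Try the longest candidate first and shrink until the lexicon hits.
            int end = Math.min(text.length(), i + maxLen);
            String match = null;
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (lexicon.contains(cand)) {
                    match = cand;
                    break;
                }
            }
            if (match == null) {
                match = text.substring(i, i + 1); // unknown character: emit as-is
            }
            out.add(match);
            i += match.length();
        }
        return out;
    }
}
```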

III. Configure a Chinese analyzer

Both analyzers are installed below; installing just one of them is also fine.

(1) Installing IKAnalyzer

Download: http://code.google.com/p/ik-analyzer/downloads/list

Create an IKAnalyzer directory under the current directory and unzip the package into it: unzip IKAnalyzer2012_u5.zip -d ./IKAnalyzer

Copy IKAnalyzer2012.jar from the IKAnalyzer directory into $TOMCAT_HOME/webapps/solr/WEB-INF/lib/.

Configure schema.xml: edit $TOMCAT_HOME/solr/conf/schema.xml and add the following fieldType:

Note: the many stray "<span style="font-size: x-small;">" tags that appeared in the original listing were generated automatically by the iteye editor when setting fonts — they are not part of the configuration.

<span style="font-size: x-small;"><span style="font-size: x-small;"><span style="font-size: small;"><fieldType name="text" class="solr.TextField" positionIncrementGap="100">
        
<analyzer type="index">
            
<tokenizer class = "org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false" /> 
            
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />  
            
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />                
            
<filter class="solr.LowerCaseFilterFactory" />  
            
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />      
            
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />  
        
</analyzer>
        
<analyzer type="query">
            
<tokenizer class = "org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true" />  
            
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />  
            
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />  
            
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />  
            
<filter class="solr.LowerCaseFilterFactory" />  
            
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />  
            
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />  
        
</analyzer>
    
</fieldType></span></span></span>

Add an index field that uses the fieldType configured above:

<field name="game_name" type="text" indexed="true" stored="true" required="true" />

 

Then find the line <defaultSearchField>text</defaultSearchField> and change it to <defaultSearchField>game_name</defaultSearchField>.

Open http://localhost:8080/solr/admin/analysis.jsp in a browser and you can try out the tokenization.

Adding custom dictionaries to IKAnalyzer: dictionary files are BOM-less UTF-8 text files (the extension does not matter), and several dictionaries can be added at once, separated by ";". Copy IKAnalyzer.cfg.xml and stopword.dic from the IKAnalyzer directory into $TOMCAT_HOME/webapps/solr/WEB-INF/classes, create your own dictionary file such as mydic.dic, and register it in IKAnalyzer.cfg.xml.
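The registration step might look like the following IKAnalyzer.cfg.xml. This is a sketch of the IKAnalyzer 2012 configuration format — verify the entry keys against the IKAnalyzer.cfg.xml shipped in the download:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- user extension dictionaries; multiple files are separated by ";" -->
    <entry key="ext_dict">mydic.dic</entry>
    <!-- user extension stopword dictionaries -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```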

(2) Installing mmseg4j

Download: http://code.google.com/p/mmseg4j/downloads/list

Create an mmseg4j directory under the current directory and unzip the package into it: unzip mmseg4j-1.8.5.zip -d ./mmseg4j

Copy mmseg4j-all-1.8.5.jar from the mmseg4j directory into $TOMCAT_HOME/webapps/solr/WEB-INF/lib/.


Configure schema.xml: edit $TOMCAT_HOME/solr/conf/schema.xml and add the following fieldtypes:


<fieldtype name="textComplex" class="solr.TextField" positionIncrementGap="100">
        
<analyzer>
            
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/zhoujh/java/apache-tomcat7/solr/dict">
            
</tokenizer>
        
</analyzer>
    
</fieldtype>
    
<fieldtype name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
        
<analyzer>
            
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="/home/zhoujh/java/apache-tomcat7/solr/dict">
            
</tokenizer>
        
</analyzer>
    
</fieldtype>
    
<fieldtype name="textSimple" class="solr.TextField" positionIncrementGap="100">
        
<analyzer>
            
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="/home/zhoujh/java/apache-tomcat7/solr/dict">
            
</tokenizer>
        
</analyzer>
    
</fieldtype>

Note: change the dicPath value to the corresponding directory on your own machine.

Then change the field added earlier so that it uses mmseg4j for tokenization:

<field name="game_name" type="textComplex" indexed="true" stored="true" required="true" />


Configuring the mmseg4j dictionaries: mmseg4j dictionaries can be loaded dynamically, and the dictionary files must be UTF-8 encoded. By default mmseg4j reads the files from a data directory under the current directory, but another directory can be specified — here a custom dict directory is used. Custom dictionary file names must start with "words" and end with ".dic", e.g. /data/words-my.dic.

Here, simply copy all the .dic files under mmseg4j/data into the $TOMCAT_HOME/solr/dict directory. There are four: chars.dic, units.dic, words.dic, and words-my.dic. Their roles, briefly:

1. chars.dic: single characters with their frequencies, one pair per line, the character first and the frequency after, separated by a space. This information is used by complex mode — the frequencies feed the final disambiguation rule.

2. units.dic: unit characters, such as 分 (minute), 秒 (second), and 年 (year).

3. words.dic: the core dictionary file, one word per line, with no other data (such as word length) required.

4. words-my.dic: the custom dictionary file.
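As an illustration, a words-my.dic file is nothing more than UTF-8 text with one term per line (the entries below are arbitrary examples, not from the shipped file):

```
云计算
全文检索
分词器
```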

Open http://localhost:8080/solr/admin/analysis.jsp in a browser to see the segmentation results.

Both tokenization setups are now configured; to use one of them, set the type of the field you query to the corresponding fieldType.




CONAN 2012-05-30 14:38 Post a comment
]]>
</description></item><item><title>Solr: creating an index from a database</title><link>http://m.tkk7.com/conans/articles/379547.html</link><dc:creator>CONAN</dc:creator><author>CONAN</author><pubDate>Wed, 30 May 2012 06:33:00 GMT</pubDate><guid>http://m.tkk7.com/conans/articles/379547.html</guid><description><![CDATA[<a href='http://m.tkk7.com/conans/articles/379547.html'>Read the full article</a>

CONAN 2012-05-30 14:33 Post a comment
]]>
使用Apache SolrҎ(gu)据库建立索引Q包括处理CLOB、CLOBQ?/title><link>http://m.tkk7.com/conans/articles/379546.html</link><dc:creator>CONAN</dc:creator><author>CONAN</author><pubDate>Wed, 30 May 2012 06:23:00 GMT</pubDate><guid>http://m.tkk7.com/conans/articles/379546.html</guid><description><![CDATA[     摘要: 以下资料整理自网l,觉的有必要合q在一Pq样方便查看。主要分Z部分Q第一部分是对《db-data-config.xml》的配置内容的讲解(属于高内容Q,W二部分是DataImportHandlerQ属于基Q?W三部分是对db-data-config.xml的进Ӟq个国内可能q没有h写过啊,我在google、baidu上都没有搜烦(ch)刎ͼ最后可是拔代码Q看solr的英文文找的)(j) W一部分?..  <a href='http://m.tkk7.com/conans/articles/379546.html'>阅读全文</a><img src ="http://m.tkk7.com/conans/aggbug/379546.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://m.tkk7.com/conans/" target="_blank">CONAN</a> 2012-05-30 14:23 <a href="http://m.tkk7.com/conans/articles/379546.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>关于solr schema.xml 和solrconfig.xml的解?/title><link>http://m.tkk7.com/conans/articles/379545.html</link><dc:creator>CONAN</dc:creator><author>CONAN</author><pubDate>Wed, 30 May 2012 06:18:00 GMT</pubDate><guid>http://m.tkk7.com/conans/articles/379545.html</guid><description><![CDATA[<div class="3dn5flt" id="blog_content" class="blog_content"> <p><strong><span style="font-size: medium">一、字D配|(schemaQ?/span> </strong></p> <p> </p> <p>schema.xml位于solr/conf/目录下,cM于数据表配置文gQ?/p> <p>定义?jin)加入?ch)引的数据的数据类型,主要包括type、fields和其他的一些缺省设|?/p> <p> </p> <p>1、先来看下type节点Q这里面定义FieldType子节点,包括name,class,positionIncrementGap{一些参数?/p> <ul><li>nameQ就是这个FieldType的名U?/li><li>classQ指向org.apache.solr.analysis包里面对应的class名称Q用来定义这个类型的行ؓ(f)?/li></ul> <div> <div> <div><a title="view plain" ></a></div></div> <ol><li><span><</span> <span>schema</span> <span> </span> <span>name</span> <span>=</span> <span>"example"</span> <span> </span> <span>version</span> <span>=</span> <span>"1.2"</span> <span>></span> <span>  </span></li><li><span>  <span><</span> <span>types</span> 
<span>></span> <span>  </span> </span></li><li><span>    <span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"string"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.StrField"</span> <span> </span> <span>sortMissingLast</span> <span>=</span> <span>"true"</span> <span> </span> <span>omitNorms</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"boolean"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.BoolField"</span> <span> </span> <span>sortMissingLast</span> <span>=</span> <span>"true"</span> <span> </span> <span>omitNorms</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>fieldtype</span> <span> </span> <span>name</span> <span>=</span> <span>"binary"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.BinaryField"</span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"int"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.TrieIntField"</span> <span> </span> <span>precisionStep</span> <span>=</span> <span>"0"</span> <span> </span> <span>omitNorms</span> <span>=</span> <span>"true"</span> <span>   </span> </span></li><li><span>                                                                <span>positionIncrementGap</span> <span>=</span> <span>"0"</span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"float"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.TrieFloatField"</span> <span> </span> <span>precisionStep</span> <span>=</span> <span>"0"</span> <span> </span> <span>omitNorms</span> <span>=</span> <span>"true"</span> 
<span>   </span> </span></li><li><span>                                                                <span>positionIncrementGap</span> <span>=</span> <span>"0"</span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"long"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.TrieLongField"</span> <span> </span> <span>precisionStep</span> <span>=</span> <span>"0"</span> <span> </span> <span>omitNorms</span> <span>=</span> <span>"true"</span> <span>   </span> </span></li><li><span>                                                                <span>positionIncrementGap</span> <span>=</span> <span>"0"</span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"double"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.TrieDoubleField"</span> <span> </span> <span>precisionStep</span> <span>=</span> <span>"0"</span> <span> </span> <span>omitNorms</span> <span>=</span> <span>"true"</span> <span>   </span> </span></li><li><span>                                                                <span>positionIncrementGap</span> <span>=</span> <span>"0"</span> <span>/></span> <span>  </span> </span></li><li><span>  ...  </span></li><li><span>  <span></</span> <span>types</span> <span>></span> <span>  </span> </span></li><li><span>  ...  </span></li><li><span></</span> <span>schema</span> <span>></span> <span>  </span> </li></ol></div> <p> </p> <p>必要的时候fieldTypeq需要自己定义这个类型的数据在徏立烦(ch)引和q行查询的时候要使用的分析器analyzerQ包括分词和qo(h)Q如下:(x)</p> <div> <div> <div><a title="view plain" >view plain</a> <a title="print" >print</a> <a title="?" 
>?</a> </div></div> <ol><li><span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"text_ws"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.TextField"</span> <span> </span> <span>positionIncrementGap</span> <span>=</span> <span>"100"</span> <span>></span> <span>  </span></li><li><span>  <span><</span> <span>analyzer</span> <span>></span> <span>  </span> </span></li><li><span>    <span><</span> <span>tokenizer</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.WhitespaceTokenizerFactory"</span> <span>/></span> <span>  </span> </span></li><li><span>  <span></</span> <span>analyzer</span> <span>></span> <span>  </span> </span></li><li><span></</span> <span>fieldType</span> <span>></span> <span>  </span></li><li><span><</span> <span>fieldType</span> <span> </span> <span>name</span> <span>=</span> <span>"text"</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.TextField"</span> <span> </span> <span>positionIncrementGap</span> <span>=</span> <span>"100"</span> <span>></span> <span>  </span></li><li><span>  <span><</span> <span>analyzer</span> <span> </span> <span>type</span> <span>=</span> <span>"index"</span> <span>></span> <span>  </span> </span></li><li><span>    <!--q个分词包是I格分词Q在向烦(ch)引库dtextcd的烦(ch)引时QSolr?x)首先用I格q行分词  </span></li><li><span>         然后把分词结果依ơ用指定的qo(h)器进行过滤,最后剩下的l果Q才?x)加入到索引库中以备查询?nbsp; </span></li><li><span>      注意:Solr的analysis包ƈ没有带支持中文的包,需要自己添加中文分词器Qgoogle下?nbsp;   </span></li><li><span>     --<span>></span> <span>  </span> </span></li><li><span>    <span><</span> <span>tokenizer</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.WhitespaceTokenizerFactory"</span> <span>/></span> <span>  </span> </span></li><li><span>        <!-- in this example, we will only use synonyms at query time  </span></li><li><span>        <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.SynonymFilterFactory"</span> 
<span> </span> <span>synonyms</span> <span>=</span> <span>"index_synonyms.txt"</span> <span>   </span> </span></li><li><span>                                                  <span>ignoreCase</span> <span>=</span> <span>"true"</span> <span> </span> <span>expand</span> <span>=</span> <span>"false"</span> <span>/></span> <span>  </span> </span></li><li><span>        --<span>></span> <span>  </span> </span></li><li><span>        <!-- Case insensitive stop word removal.  </span></li><li><span>          add <span>enablePositionIncrements</span> <span>=</span> <span>true</span> <span> in both the index and query  </span> </span></li><li><span>          analyzers to leave a 'gap' for more accurate phrase queries.  </span></li><li><span>        --<span>></span> <span>  </span> </span></li><li><span>      <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.StopFilterFactory"</span> <span>  </span> </span></li><li><span>                <span>ignoreCase</span> <span>=</span> <span>"true"</span> <span>  </span> </span></li><li><span>                <span>words</span> <span>=</span> <span>"stopwords.txt"</span> <span>  </span> </span></li><li><span>                <span>enablePositionIncrements</span> <span>=</span> <span>"true"</span> <span>  </span> </span></li><li><span>                <span>/></span> <span>  </span> </span></li><li><span>      <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.WordDelimiterFilterFactory"</span> <span> </span> <span>generateWordParts</span> <span>=</span> <span>"1"</span> <span>   </span> </span></li><li><span>              <span>generateNumberParts</span> <span>=</span> <span>"1"</span> <span> </span> <span>catenateWords</span> <span>=</span> <span>"1"</span> <span> </span> <span>catenateNumbers</span> <span>=</span> <span>"1"</span> <span>   </span> </span></li><li><span>              <span>catenateAll</span> <span>=</span> <span>"0"</span> <span> 
</span> <span>splitOnCaseChange</span> <span>=</span> <span>"1"</span> <span>/></span> <span>  </span> </span></li><li><span>      <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.LowerCaseFilterFactory"</span> <span>/></span> <span>  </span> </span></li><li><span>      <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.SnowballPorterFilterFactory"</span> <span> </span> <span>language</span> <span>=</span> <span>"English"</span> <span>   </span> </span></li><li><span>                                                       <span>protected</span> <span>=</span> <span>"protwords.txt"</span> <span>/></span> <span>  </span> </span></li><li><span>    <span></</span> <span>analyzer</span> <span>></span> <span>  </span> </span></li><li><span>    <span><</span> <span>analyzer</span> <span> </span> <span>type</span> <span>=</span> <span>"query"</span> <span>></span> <span>  </span> </span></li><li><span>      <span><</span> <span>tokenizer</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.WhitespaceTokenizerFactory"</span> <span>/></span> <span>  </span> </span></li><li><span>        <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.SynonymFilterFactory"</span> <span> </span> <span>synonyms</span> <span>=</span> <span>"synonyms.txt"</span> <span> </span> <span>ignoreCase</span> <span>=</span> <span>"true"</span> <span>   </span> </span></li><li><span>                                                                          <span>expand</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span> </span></li><li><span>        <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.StopFilterFactory"</span> <span>  </span> </span></li><li><span>                <span>ignoreCase</span> <span>=</span> <span>"true"</span> <span>  </span> </span></li><li><span>                
<span>words</span> <span>=</span> <span>"stopwords.txt"</span> <span>  </span> </span></li><li><span>                <span>enablePositionIncrements</span> <span>=</span> <span>"true"</span> <span>  </span> </span></li><li><span>                <span>/></span> <span>  </span> </span></li><li><span>        <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.WordDelimiterFilterFactory"</span> <span> </span> <span>generateWordParts</span> <span>=</span> <span>"1"</span> <span>   </span> </span></li><li><span>                <span>generateNumberParts</span> <span>=</span> <span>"1"</span> <span> </span> <span>catenateWords</span> <span>=</span> <span>"0"</span> <span> </span> <span>catenateNumbers</span> <span>=</span> <span>"0"</span> <span>   </span> </span></li><li><span>                                        <span>catenateAll</span> <span>=</span> <span>"0"</span> <span> </span> <span>splitOnCaseChange</span> <span>=</span> <span>"1"</span> <span>/></span> <span>  </span> </span></li><li><span>        <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.LowerCaseFilterFactory"</span> <span>/></span> <span>  </span> </span></li><li><span>        <span><</span> <span>filter</span> <span> </span> <span>class</span> <span>=</span> <span>"solr.SnowballPorterFilterFactory"</span> <span> </span> <span>language</span> <span>=</span> <span>"English"</span> <span>   </span> </span></li><li><span>                                                         <span>protected</span> <span>=</span> <span>"protwords.txt"</span> <span>/></span> <span>  </span> </span></li><li><span>      <span></</span> <span>analyzer</span> <span>></span> <span>  </span> </span></li><li><span></</span> <span>fieldType</span> <span>></span> <span>  </span> </li></ol></div> <p> </p> <p>2、再来看下fields节点内定义具体的字段Q类似数据库的字D)(j)Q含有以下属性:(x)</p> 
<ul><li>nameQ字D名</li><li>typeQ之前定义过的各UFieldType</li><li>indexedQ是否被索引</li><li>storedQ是否被存储Q如果不需要存储相应字D|量设ؓ(f)falseQ?/li><li>multiValuedQ是否有多个|对可能存在多值的字段量讄为trueQ避免徏索引时抛出错误)(j)</li></ul> <div> <div> <div><a title="view plain" >view plain</a> <a title="print" >print</a> <a title="?" >?</a> </div></div> <ol><li><span><</span> <span>fields</span> <span>></span> <span>  </span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"id"</span> <span> </span> <span>type</span> <span>=</span> <span>"integer"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span> </span> <span>stored</span> <span>=</span> <span>"true"</span> <span> </span> <span>required</span> <span>=</span> <span>"true"</span> <span> </span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"name"</span> <span> </span> <span>type</span> <span>=</span> <span>"text"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span> </span> <span>stored</span> <span>=</span> <span>"true"</span> <span> </span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"summary"</span> <span> </span> <span>type</span> <span>=</span> <span>"text"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span> </span> <span>stored</span> <span>=</span> <span>"true"</span> <span> </span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"author"</span> <span> </span> <span>type</span> <span>=</span> <span>"string"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span> </span> <span>stored</span> <span>=</span> <span>"true"</span> <span> </span> <span>/></span> <span>  </span> 
</span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"date"</span> <span> </span> <span>type</span> <span>=</span> <span>"date"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"false"</span> <span> </span> <span>stored</span> <span>=</span> <span>"true"</span> <span> </span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"content"</span> <span> </span> <span>type</span> <span>=</span> <span>"text"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span> </span> <span>stored</span> <span>=</span> <span>"false"</span> <span> </span> <span>/></span> <span>  </span> </span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"keywords"</span> <span> </span> <span>type</span> <span>=</span> <span>"keyword_text"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span> </span> <span>stored</span> <span>=</span> <span>"false"</span> <span> </span> <span>multiValued</span> <span>=</span> <span>"true"</span> <span> </span> <span>/></span> <span>  </span> </span></li><li><span>    <span><!--拯字段--></span> <span>  </span> </span></li><li><span>    <span><</span> <span>field</span> <span> </span> <span>name</span> <span>=</span> <span>"all"</span> <span> </span> <span>type</span> <span>=</span> <span>"text"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span> </span> <span>stored</span> <span>=</span> <span>"false"</span> <span> </span> <span>multiValued</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span> </span></li><li><span></</span> <span>fields</span> <span>></span> <span>  </span> </li></ol></div> <p> </p> <p>3、徏议徏立一个拷贝字D,所有的 全文?字段复制C个字D中Q以便进行统一的检索:(x)</p> <p>     以下是拷贝设|:(x)</p> <div> <div> <div><a title="view plain" >view plain</a> <a 
title="print" >print</a> <a title="?" >?</a> </div></div> <ol><li><span><</span> <span>copyField</span> <span> </span> <span>source</span> <span>=</span> <span>"name"</span> <span> </span> <span>dest</span> <span>=</span> <span>"all"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>copyField</span> <span> </span> <span>source</span> <span>=</span> <span>"summary"</span> <span> </span> <span>dest</span> <span>=</span> <span>"all"</span> <span>/></span> <span>  </span> </li></ol></div> <p> </p> <p>4、动态字D,没有具体名称的字D,用dynamicField字段</p> <p>如:(x)name?_iQ定义它的type为intQ那么在使用q个字段的时候,d以_il果的字D都被认为符合这个定义。如name_i, school_i</p> <div> <div> <div><a title="view plain" >view plain</a> <a title="print" >print</a> <a title="?" >?</a> </div></div> <ol><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_i"</span> <span>  </span> <span>type</span> <span>=</span> <span>"int"</span> <span>    </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_s"</span> <span>  </span> <span>type</span> <span>=</span> <span>"string"</span> <span>  </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_l"</span> <span>  </span> <span>type</span> <span>=</span> <span>"long"</span> <span>   </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_t"</span> <span>  </span> <span>type</span> 
<span>=</span> <span>"text"</span> <span>    </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_b"</span> <span>  </span> <span>type</span> <span>=</span> <span>"boolean"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_f"</span> <span>  </span> <span>type</span> <span>=</span> <span>"float"</span> <span>  </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_d"</span> <span>  </span> <span>type</span> <span>=</span> <span>"double"</span> <span> </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span></li><li><span><</span> <span>dynamicField</span> <span> </span> <span>name</span> <span>=</span> <span>"*_dt"</span> <span> </span> <span>type</span> <span>=</span> <span>"date"</span> <span>    </span> <span>indexed</span> <span>=</span> <span>"true"</span> <span>  </span> <span>stored</span> <span>=</span> <span>"true"</span> <span>/></span> <span>  </span> </li></ol></div> <p> </p> <p> </p> <p> </p> <p> </p> <p><strong><span style="font-size: medium">schema.xml文档注释中的信息Q?/span> </strong></p> <p> </p> <p> </p> <p> </p> <p>1、ؓ(f)?jin)改q性能Q可以采取以下几U措施:(x)</p> 
<ul><li>所有只用于搜烦(ch)的,而不需要作为结果的fieldQ特别是一些比较大的fieldQ的stored讄为false</li><li>不需要被用于搜烦(ch)的,而只是作为结果返回的field的indexed讄为false</li><li>删除所有不必要的copyField声明</li><li>Z(jin)索引字段的最化和搜索的效率Q将所有的 text fields的index都设|成fieldQ然后用copyField他们都复制C个ȝ text field上,然后对他q行搜烦(ch)?/li><li>Z(jin)最大化搜烦(ch)效率Q用java~写的客L(fng)与solr交互Q用流通信Q?/li><li>在服务器端运行JVMQ省ȝl通信Q,使用可能高的Log输出{Q减日志量?/li></ul> <p>2?span style="color: #0000ff"><</span> <span style="color: #990000"><span>schema</span> <span>name</span> </span><span style="color: #0000ff">="</span> <strong>example</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">version</span> <span style="color: #0000ff">="</span> <strong>1.2</strong> <span style="color: #0000ff"><span>"</span> <span>></span> </span></p> <ul><li>nameQ标识这个schema的名?/li><li>versionQ现在版本是1.2</li></ul> <p>3、filedType</p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">fieldType</span> <span style="color: #990000">name</span> <span style="color: #0000ff">="</span> <strong>string</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">class</span> <span style="color: #0000ff">="</span> <strong>solr.StrField</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">sortMissingLast</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">omitNorms</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <ul><li>nameQ标识而已?/li><li>class和其他属性决定了(jin)q个fieldType的实际行为。(class以solr开始的Q都是在org.appache.solr.analysis包下Q?/li></ul> <p>可选的属性:(x)</p> <ul><li>sortMissingLast和sortMissingFirst两个属性是用在可以内在使用String排序的类型上Q包括:(x)string,boolean,sint,slong,sfloat,sdouble,pdateQ?/li><li>sortMissingLast="true"Q没有该field的数据排在有该field的数据之后,而不请求时的排序规则?/li><li>sortMissingFirst="true"Q跟上面倒过来呗?/li><li>2个值默认是讄成false</li></ul> <p> </p> <p>StrFieldcd不被分析Q而是被逐字地烦(ch)?存储?/p> 
<p>StrField和TextField都有一个可选的属?#8220;compressThreshold”Q保证压~到不小于一个大(单位QcharQ?/p> <p> </p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000"><span>fieldType</span> <span>name</span> </span><span style="color: #0000ff">="</span> <strong>text</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">class</span> <span style="color: #0000ff">="</span> <strong>solr.TextField</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">positionIncrementGap</span> <span style="color: #0000ff">="</span> <strong>100</strong> <span style="color: #0000ff"><span>"</span> <span>></span> </span></p> <p> </p> <p>solr.TextField 允许用户通过分析器来定制索引和查询,分析器包括一个分词器QtokenizerQ和多个qo(h)器(filterQ?/p> <p> </p> <ul><li>positionIncrementGapQ可选属性,定义在同一个文中此类型数据的I白间隔Q避免短语匹配错误?/li></ul> <p>name:    字段cd?nbsp; <br />class:    javacd  <br />indexed:    ~省true?说明q个数据应被搜烦(ch)和排序,如果数据没有indexedQ则stored应是true?nbsp; <br />stored:    ~省true。说明这个字D被包含在搜索结果中是合适的。如果数据没有stored,则indexed应是true?nbsp; <br />sortMissingLast:    指没有该指定字段数据的document排在有该指定字段数据的document的后?nbsp; <br />sortMissingFirst:    指没有该指定字段数据的document排在有该指定字段数据的document的前?nbsp; <br />omitNorms:    字段的长度不影响得分和在索引时不做boostӞ讄它ؓ(f)true。一般文本字D不讄为true?nbsp; <br />termVectors:    如果字段被用来做more like this 和highlight的特性时应设|ؓ(f)true?nbsp; <br />compressed:    字段是压~的。这可能D索引和搜索变慢,但会(x)减少存储I间Q只有StrField和TextField是可以压~,q通常适合字段的长度超q?00个字W?nbsp; <br />multiValued:    字段多于一个值的时候,可设|ؓ(f)true?nbsp; <br />positionIncrementGap:    和multiValued<br />一起用,讄多个g间的虚拟I白的数?<br /></p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">tokenizer</span> <span style="color: #990000">class</span> <span style="color: #0000ff">="</span> <strong>solr.WhitespaceTokenizerFactory</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <p>I格分词Q精匹配?/p> <p> </p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">filter</span> <span style="color: 
#990000">class</span> <span style="color: #0000ff">="</span> <strong>solr.WordDelimiterFilterFactory</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">generateWordParts</span> <span style="color: #0000ff">="</span> <strong>1</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">generateNumberParts</span> <span style="color: #0000ff">="</span> <strong>1</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">catenateWords</span> <span style="color: #0000ff">="</span> <strong>1</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">catenateNumbers</span> <span style="color: #0000ff">="</span> <strong>1</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">catenateAll</span> <span style="color: #0000ff">="</span> <strong>0</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">splitOnCaseChange</span> <span style="color: #0000ff">="</span> <strong>1</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <p>在分词和匚wӞ考虑 "-"q字W,字母数字的界限,非字母数字字W,q样 "wifi"?wi fi"都能匚w"Wi-Fi"?/p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">filter</span> <span style="color: #990000">class</span> <span style="color: #0000ff">="</span> <strong>solr.SynonymFilterFactory</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">synonyms</span> <span style="color: #0000ff">="</span> <strong>synonyms.txt</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">ignoreCase</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">expand</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <p><span style="color: #000000">同义?nbsp;</span> </p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: 
#990000">filter</span> <span style="color: #990000">class</span> <span style="color: #0000ff">="</span> <strong>solr.StopFilterFactory</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">ignoreCase</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">words</span> <span style="color: #0000ff">="</span> <strong>stopwords.txt</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">enablePositionIncrements</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <p>在禁用字QstopwordQ删除后Q在短语间增加间?/p> <p>stopwordQ即在徏立烦(ch)引过E中Q徏立烦(ch)引和搜烦(ch)Q被忽略的词Q比如is this{常用词。在conf/stopwords.txtl护?/p> <p> </p> <p> </p> <p> </p> <p>4、fields</p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">field</span> <span style="color: #990000">name</span> <span style="color: #0000ff">="</span> <strong>id</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">type</span> <span style="color: #0000ff">="</span> <strong>string</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">indexed</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">stored</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">required</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <ul><li>nameQ标识而已?/li><li>typeQ先前定义的cd?/li><li>indexedQ是否被用来建立索引Q关pd搜烦(ch)和排序)(j)</li><li>storedQ是否储?/li><li>compressedQ[false]Q是否用gzip压羃Q只有TextField和StrField可以压羃Q?/li><li>mutiValuedQ是否包含多个?/li><li>omitNormsQ是否忽略掉NormQ可以节省内存空_(d)只有全文本field和need an index-time 
boost的field需要norm。(具体没看懂,注释里有矛盾Q?/li><li>termVectorsQ[false]Q当讄trueQ会(x)存储 term vector。当使用MoreLikeThisQ用来作为相D的field应该存储h?/li><li>termPositionsQ存?term vector中的地址信息Q会(x)消耗存储开销?/li><li>termOffsetsQ存?term vector 的偏U量Q会(x)消耗存储开销?/li><li>defaultQ如果没有属性需要修改,可以用q个标识下?/li></ul> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">field</span> <span style="color: #990000">name</span> <span style="color: #0000ff">="</span> <strong>text</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">type</span> <span style="color: #0000ff">="</span> <strong>text</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">indexed</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">stored</span> <span style="color: #0000ff">="</span> <strong>false</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">multiValued</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <p>包罗万象Q有点夸张)(j)的fieldQ包含所有可搜烦(ch)的text fieldsQ通过copyField实现?/p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">copyField</span> <span style="color: #990000">source</span> <span style="color: #0000ff">="</span> <strong>cat</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">dest</span> <span style="color: #0000ff">="</span> <strong>text</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <div> <div><span><strong><span style="color: #ff0000"> </span> </strong></span><span style="color: #0000ff"><</span> <span style="color: #990000">copyField</span> <span style="color: #990000">source</span> <span style="color: #0000ff">="</span> <strong>name</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">dest</span> <span style="color: #0000ff">="</span> <strong>text</strong> <span 
style="color: #0000ff"><span>"</span> <span>/></span> </span></div></div> <div> <div><span><strong><span style="color: #ff0000"> </span> </strong></span><span style="color: #0000ff"><</span> <span style="color: #990000">copyField</span> <span style="color: #990000">source</span> <span style="color: #0000ff">="</span> <strong>manu</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">dest</span> <span style="color: #0000ff">="</span> <strong>text</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></div></div> <div> <div><span><strong><span style="color: #ff0000"> </span> </strong></span><span style="color: #0000ff"><</span> <span style="color: #990000">copyField</span> <span style="color: #990000">source</span> <span style="color: #0000ff">="</span> <strong>features</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">dest</span> <span style="color: #0000ff">="</span> <strong>text</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></div></div> <div> <div><span><strong><span style="color: #ff0000"> </span> </strong></span><span style="color: #0000ff"><</span> <span style="color: #990000">copyField</span> <span style="color: #990000">source</span> <span style="color: #0000ff">="</span> <strong>includes</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">dest</span> <span style="color: #0000ff">="</span> <strong>text</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></div></div> <p>在添加烦(ch)引时Q将所有被拯fieldQ如catQ中的数据拷贝到text field?/p> <p>作用Q?/p> <ul><li>多个field的数据放在一起同时搜索,提供速度</li><li>一个field的数据拷贝到另一个,可以?U不同的方式来徏立烦(ch)引?/li></ul> <p> </p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">dynamicField</span> <span style="color: #990000">name</span> <span style="color: #0000ff">="</span> <strong>*_i</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">type</span> <span style="color: 
#0000ff">="</span> <strong>int</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">indexed</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">stored</span> <span style="color: #0000ff">="</span> <strong>true</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <p> </p> <p>如果一个field的名字没有匹配到Q那么就?x)用动态field试图匚w定义的各U模式?/p> <ul><li>"*"只能出现在模式的最前和最?/li><li>较长的模式会(x)被先d匚w</li><li>如果2个模式同时匹配上Q最先定义的优先</li></ul> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">dynamicField</span> <span style="color: #990000">name</span> <span style="color: #0000ff">="</span> <strong>*</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">type</span> <span style="color: #0000ff">="</span> <strong>ignored</strong> <span style="color: #0000ff">"</span> <span style="color: #990000">multiValued<span style="color: #0000ff">="</span> <span style="color: #000000"><strong>true</strong> </span><span style="color: #0000ff">"</span> </span><span style="color: #0000ff"><span>/></span> </span></p> <p><span style="color: #0000ff"><span>如果通过上面的匹配都没找刎ͼ可以定义q个Q然后定义个typeQ当String处理。(一般不?x)发生?j)</span> </span></p> <p><span style="color: #0000ff"><span>但若不定义,找不到匹配会(x)报错?/span> </span></p> <p> </p> <p> </p> <p>5、其他一些标{?/p> <p> </p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">uniqueKey</span> <span style="color: #0000ff">></span> <span><strong>id</strong> </span><span style="color: #0000ff"></</span> <span style="color: #990000">uniqueKey</span> <span style="color: #0000ff">></span> </p> <p>文的唯一标识Q?nbsp;必须填写q个fieldQ除非该field被标记required="false"Q,否则solr建立索引报错?/p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">defaultSearchField</span> <span style="color: #0000ff">></span> <span><strong>text</strong> </span><span style="color: #0000ff"></</span> <span style="color: 
#990000">defaultSearchField</span> <span style="color: #0000ff">></span> </p> <p>如果搜烦(ch)参数中没有指定具体的fieldQ那么这是默认的域?/p> <p><span style="color: #0000ff"><</span> <span style="color: #990000">solrQueryParser</span> <span style="color: #990000">defaultOperator</span> <span style="color: #0000ff">="</span> <strong>OR</strong> <span style="color: #0000ff"><span>"</span> <span>/></span> </span></p> <p>配置搜烦(ch)参数短语间的逻辑Q可以是"AND|OR"?/p> <p> </p> <p> </p> <p> </p> <p><strong><span style="font-size: medium">二、solrconfig.xml</span> </strong></p> <p> </p> <p>1、烦(ch)引配|?/p> <p> </p> <p>mainIndex 标记D定义了(jin)控制Solr索引处理的一些因?</p> <ul><li> <p>useCompoundFileQ通过很?Lucene 内部文g整合到单一一个文件来减少使用中的文g的数量。这可有助于减少 Solr 使用的文件句柄数目,代h(hun)是降低了(jin)性能。除非是应用E序用完?jin)文件句柄,否?<code>false</code> 的默认值应该就已经_?/p></li><li>useCompoundFileQ通过很多Lucene内部文g整合C个文Ӟ来减用中的文件的数量。这可有助于减少Solr使用的文件句柄的数目Q代h降低?jin)性能。除非是应用E序用完?jin)文件句柄,否则false的默认值应该就已经_?jin)?/li><li>mergeFacorQ决定LuceneD被合ƈ的频率。较?yu)的|最ؓ(f)2Q用的内存较少但导致的索引旉也更慢。较大的值可使烦(ch)引时间变快但?x)牺牲较多的内存。(典型的时间与I间 的^衡配|)(j)</li><li>maxBufferedDocsQ在合ƈ内存?sh)文和创徏新段之前Q定义所需索引的最文档数。段是用来存储烦(ch)引信息的Lucene文g。较大的值可使烦(ch)引时间变快但?x)牺牲较多内存?/li><li>maxMergeDocsQ控制可由Solr合ƈ?Document 的最大数。较?yu)的|<10,000Q最适合于具有大量更新的应用E序?/li><li>maxFieldLengthQ对于给定的DocumentQ控制可d到Field的最大条目数Q进而阶D该文。如果文可能会(x)很大Q就需要增加这个数倹{然后,若将q个D|得q高?sh)(x)导致内存?sh)错误?/li><li>unlockOnStartupQ告知Solr忽略在多U程环境中用来保护烦(ch)引的锁定机制。在某些情况下,索引可能?x)由于不正确的关机或其他错误而一直处于锁定,q就妨碍?jin)添加和更新。将其设|ؓ(f)true可以用启动索引Q进而允许进行添加和更新。(锁机Ӟ(j)</li></ul> <p> </p> <p> 2、查询处理配|?/p> <p> </p> <p>query标记D中以下一些与~存无关的特性:(x)</p> <ul><li>maxBooleanClausesQ定义可l合在一起Ş成以个查询的字句数量的上限。正常情?024已经_。如果应用程序大量用了(jin)通配W或范围查询Q增加这个限制将能避免当D出时Q抛出TooMangClausesException?/li><li>enableLazyFieldLoadingQ如果应用程序只?x)检索Document上少数几个FieldQ那么可以将q个属性设|ؓ(f) true。懒散加载的一个常见场景大都发生在应用E序q回一些列搜烦(ch)l果的时候,用户常常?x)单d中的一个来查看存储在此索引中的原始文。初始的现实常常只需要现实很短的一D信息。若是检索大型的DocumentQ除非必需Q否则就应该避免加蝲整个文?/li></ul> <p> </p> <p>query部分负责定义与在Solr中发生的旉相关的几个选项Q?/p> <p> </p> <p> </p> 
<p>概念QSolrQ实际上是LuceneQ用称为Searcher的JavacL处理Query实例。Searcher烦(ch)引内容相关的数据加蝲到内存(sh)。根据烦(ch)引、CPU已经可用内存的大,q个q程可能需要较长的一D|间。要改进q一设计和显著提高性能QSolr引入?jin)一?#8220;温暖”{略Q即把这些新的Searcher联机以便为现场用h供查询服务之前,先对它们q行“热n”?/p> <ul><li>newSearcher和firstSearcher事gQ可以用这些事件来制定实例化新Searcher或第一个SearcherӞ应该执行哪些查询。如果应用程序期望请求某些特定的查询Q那么在创徏新Searcher或第一个Searcher时就应该反注释这些部分ƈ执行适当的查询?/li></ul> <p> </p> <p>query中的~存Q?/p> <p> </p> <ul><li>filterCacheQ通过存储一个匹配给定查询的文 id 的无序集Q过滤器?Solr 能够有效提高查询的性能。缓存这些过滤器意味着对Solr的重复调用可以导致结果集的快速查找。更常见的场景是~存?sh)个过滤器Q然后再发v后箋(hu)的精炼查询,q种查询能用过滤器来限制要搜烦(ch)的文档数?/li><li>queryResultCacheQؓ(f)查询、排序条件和所h文档的数量缓存文?id 的有序集合?/li><li>documentCacheQ缓存Lucene DocumentQ用内部Lucene文档idQ以便不与Solr唯一id相؜淆)(j)。由于Lucene的内部Document id 可以因烦(ch)引操作而更改,q种~存?sh)能自热?/li><li>Named cachesQ命名缓存是用户定义的缓存,可被 Solr定制插g 所使用?/li></ul> <p>其中filterCache、queryResultCache、Named cachesQ如果实C(jin)org.apache.solr.search.CacheRegeneratorQ可以自热?/p> <p>每个~存声明都接受最多四个属性:(x)</p> <ul><li>classQ是~存实现的Java?/li><li>sizeQ是最大的条目?/li><li>initialSizeQ是~存的初始大?/li><li>autoWarmCountQ是取自旧缓存(sh)预热新缓存的条目数。如果条目很多,意味着~存的hit?x)更多,只不q需要花更长的预热时间?/li></ul> <p>对于所有缓存模式而言Q在讄~存参数Ӟ都有必要在内存、cpu和磁盘访问之间进行均衡。统计信息管理页Q管理员界面的StatisticsQ对于分析缓存的 hit-to-miss 比例以及(qing)微调~存大小的统计数据都非常有用。而且Qƈ非所有应用程序都?x)从~存受益。实际上Q一些应用程序反而会(x)׃需要将某个永远也用不到的条目存储在~存?sh)这一额外步骤而受到媄(jing)响?/p></div><img src ="http://m.tkk7.com/conans/aggbug/379545.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://m.tkk7.com/conans/" target="_blank">CONAN</a> 2012-05-30 14:18 <a href="http://m.tkk7.com/conans/articles/379545.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>DataImportHandler--remove data from indexhttp://m.tkk7.com/conans/articles/379544.htmlCONANCONANWed, 30 May 2012 06:11:00 GMThttp://m.tkk7.com/conans/articles/379544.htmlDeleting data from an index using DIH incremental indexing, on Solr wiki, is residually treated as something that works similarly to update the records. 
Similarly, in a previous article I took this shortcut, all the more so because the example I gave indexed Wikipedia data, which did not require deleting documents.

With sample data about albums and performers at hand, I decided to show my way of dealing with such cases. For simplicity and clarity, I assume that after the first import the data can only shrink.

Test data

My test data are located in the PostgreSQL database table defined as follows:

Table "public.albums"
Column |  Type   |                      Modifiers
--------+---------+-----------------------------------------------------
id     | integer | not null default nextval('albums_id_seq'::regclass)
name   | text    | not null
author | text    | not null
Indexes:
"albums_pk" PRIMARY KEY, btree (id)

The table has 825,661 records.

Test installation

For testing purposes I used the Solr instance having the following characteristics:

Definition at schema.xml:

<fields>
 
<field name="id" type="string" indexed="true" stored="true" required="true" />
 
<field name="album" type="text" indexed="true" stored="true" multiValued="true"/>
 
<field name="author" type="text" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>album</defaultSearchField>

 

Definition of DIH in solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
 
<lst name="defaults">
  
<str name="config">db-data-config.xml</str>
 
</lst>
</requestHandler>

And the DIH configuration file, db-data-config.xml:
<dataConfig>
 
<dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/shardtest" user="solr" password="secret" />
 
<document>
  
<entity name="album" query="SELECT * from albums">
   
<field column="id" name="id" />
   
<field column="name" name="album" />
   
<field column="author" name="author" />
  
</entity>
 
</document>
</dataConfig>


Deleting Data

Looking at the table, you can see that when a record is removed, it is deleted without leaving a trace; the only way to update our index would be to compare the document identifiers in the index against the identifiers in the database and delete those that no longer exist in the database. Slow and cumbersome. Another way is to add a deleted_at column: instead of physically deleting a record, we only write the deletion time into this column. DIH can then retrieve all records deleted after the date of the last crawl. The disadvantage of this solution is that the application may need to be modified to take this information into account.
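For illustration, the "compare everything" strategy just dismissed boils down to a set difference. Here is a minimal Python sketch, with in-memory id lists standing in for the index and the database; in reality both id sets would have to be fetched in full on every sync, which is exactly what makes the approach slow and cumbersome:

```python
# Sketch of the naive sync strategy: diff the full id sets.
# In a real setup, indexed_ids would have to be paged out of Solr and
# db_ids fetched with "SELECT id FROM albums" -- two full scans per sync.

def ids_to_delete(indexed_ids, db_ids):
    """Return ids still present in the index but gone from the database."""
    return sorted(set(indexed_ids) - set(db_ids))

# Documents 35 and 36 were removed from the albums table:
print(ids_to_delete(indexed_ids=[34, 35, 36, 37], db_ids=[34, 37]))  # -> [35, 36]
```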

I apply a different solution, transparent to applications. Let’s create a new table:

CREATE TABLE deletes
(
  id serial NOT NULL,
  deleted_id bigint,
  deleted_at timestamp without time zone NOT NULL,
  CONSTRAINT deletes_pk PRIMARY KEY (id)
);

This table will automatically receive the identifier of every record removed from the albums table, together with the time of removal.

Now we add the function:

CREATE OR REPLACE FUNCTION insert_after_delete()
RETURNS trigger AS
$BODY$BEGIN
  IF tg_op = 'DELETE' THEN
    INSERT INTO deletes(deleted_id, deleted_at)
    VALUES (old.id, now());
    RETURN old;
  END IF;
END$BODY$
LANGUAGE plpgsql VOLATILE;

and a trigger:

CREATE TRIGGER deleted_trg
BEFORE DELETE
ON albums
FOR EACH ROW
EXECUTE PROCEDURE insert_after_delete();

How it works

Each deletion from the albums table should result in a new row in the deletes table. Let’s check it out. Remove a few records:

=> DELETE FROM albums where id < 37;
DELETE 2
=> SELECT * from deletes;
 id | deleted_id |         deleted_at
----+------------+----------------------------
 26 |         35 | 2010-12-23 13:53:18.034612
 27 |         36 | 2010-12-23 13:53:18.034612
(2 rows)

So the database part works.

We extend the DIH configuration file so that the entity is defined as follows:

<entity name="album" query="SELECT * from albums"
  deletedPkQuery="SELECT deleted_id as id FROM deletes WHERE deleted_at > '${dataimporter.last_index_time}'">

During an incremental import, DIH uses the deletedPkQuery attribute to fetch the identifiers of the documents that should be removed from the index.
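The selection behind deletedPkQuery is easy to mimic. In the Python sketch below, (deleted_id, deleted_at) tuples stand in for rows of the deletes table; DIH itself, of course, runs the SQL shown above against the database:

```python
from datetime import datetime

# Stand-ins for rows of the deletes table: (deleted_id, deleted_at).
deletes = [
    (35, datetime(2010, 12, 23, 13, 53, 18)),
    (36, datetime(2010, 12, 23, 13, 53, 18)),
]

def ids_deleted_since(rows, last_index_time):
    """Mimic: SELECT deleted_id FROM deletes WHERE deleted_at > last_index_time."""
    return [deleted_id for deleted_id, deleted_at in rows
            if deleted_at > last_index_time]

# A crawl that last ran before the deletions picks both ids up:
print(ids_deleted_since(deletes, datetime(2010, 12, 23, 13, 0)))  # -> [35, 36]
# Because the rows are kept, a replacement server can replay the same
# deletions from an even older last_index_time:
print(ids_deleted_since(deletes, datetime(2010, 12, 1)))  # -> [35, 36]
```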

A clever reader will probably wonder whether we really need the column with the date of deletion: we could simply delete every record found in the deletes table and then empty that table. In theory, yes. But with the timestamps kept, if there is ever a problem with the Solr indexing server we can easily replace it with another one, and how far it lags behind the database hardly matters: the next incremental imports will bring it back in sync. If we emptied the deletes table, that possibility would be lost.

We can now do the incremental import by calling the following address:  /solr/dataimport?command=delta-import
In the logs you should see a line similar to this:
INFO: {delete=[35, 36],optimize=} 0 2
This means that DIH correctly removed from the index the documents that had previously been removed from the database.
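If you want to confirm the result programmatically rather than by eye, the delete list can be pulled out of such a summary line with a small parser. This is only a sketch; the regex assumes the exact line format shown above, so treat it as illustrative:

```python
import re

def deleted_ids(log_line):
    """Extract document ids from a DIH summary line such as
    'INFO: {delete=[35, 36],optimize=} 0 2'."""
    match = re.search(r"delete=\[([^\]]*)\]", log_line)
    if not match:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]

print(deleted_ids("INFO: {delete=[35, 36],optimize=} 0 2"))  # -> ['35', '36']
```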













CONAN 2012-05-30 14:11 Post a comment
]]>
Using Log4j with Solrhttp://m.tkk7.com/conans/articles/379541.htmlCONANCONANWed, 30 May 2012 06:01:00 GMThttp://m.tkk7.com/conans/articles/379541.html

As you know, when you unpack Solr's web application (apache-solr-3.2.0.war), its WEB-INF/lib directory contains the two jars slf4j-api-1.5.5.jar and slf4j-jdk14-1.5.5.jar, so by default Solr logs through the JDK's logging facility, and its log output ends up in Tomcat's logs. Solr does this JDK logging via slf4j; slf4j is not a concrete logging system but a logging facade, which lets you conveniently swap the underlying logging system when deploying the final application. So using log4j with Solr is fine too, i.e. replacing the JDK logging output with log4j. The steps are as follows:
1. Delete slf4j-api-1.5.5.jar and slf4j-jdk14-1.5.5.jar from solr/WEB-INF/lib, and add log4j-1.2.15.jar, slf4j-api-1.5.0.jar, and slf4j-log4j12-1.5.0.jar (or the corresponding versions);
2. Create a classes directory under solr/WEB-INF/, since the default package does not contain one;
3. Put your prepared log4j.properties into solr/WEB-INF/classes. Its content is as follows:

log4j.rootLogger=INFO
log4j.logger.org.apache.solr=INFO,ROLLING_FILE

log4j.appender.ROLLING_FILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLING_FILE.Append=false
log4j.appender.ROLLING_FILE.File=/var/log/solr.log
log4j.appender.ROLLING_FILE.MaxBackupIndex=50
log4j.appender.ROLLING_FILE.MaxFileSize=200MB
log4j.appender.ROLLING_FILE.Threshold=INFO
log4j.appender.ROLLING_FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLING_FILE.layout.ConversionPattern=%d{yyyy-MM-dd HH\:mm\:ss} %p [%c]\:%L Line – %m%n

4. Restart Tomcat and you are done.
PS: if you deploy via JNDI, it is best to repackage the above into the war and then replace the old one.



CONAN 2012-05-30 14:01 Post a comment
]]>