This page explains how content and language detection work in Apache Tika, and how to tune Tika's behaviour.
By looking for special ("magic") patterns of bytes near the start of the file, it is often possible to detect the type of the file. For some file types, this is a simple process. For others, typically container-based formats, magic detection may not be enough. (More detail on detecting container formats below.)
Tika is able to make use of a MIME magic info file, in the Freedesktop MIME-info format, to perform MIME magic detection. (Note that Tika supports a few more match types than Freedesktop does.)
Magic detection is provided within Tika by org.apache.tika.detect.MagicDetector. It is most commonly accessed via org.apache.tika.mime.MimeTypes, normally sourced from the tika-mimetypes.xml and custom-mimetypes.xml files. For more information on defining your own custom mimetypes, see the new parser guide.
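To make the idea of magic detection concrete, here is a minimal sketch in plain Java that compares the first bytes of a file against a small table of well-known signatures. It is an illustration of the technique only, not Tika's implementation: the class name `MagicSketch`, the `detect` method and the three-entry table are assumptions made up for this example, and Tika's real rules come from tika-mimetypes.xml and support offsets, masks and other match types.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Apache Tika.
final class MagicSketch {
    // Tiny, hand-written signature table (an assumption for this sketch).
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    // Returns the first type whose signature matches the start of the data,
    // or the generic binary type when nothing matches.
    static String detect(byte[] prefix) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    // True when data begins with exactly the bytes in magic.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

The ZIP entry also shows why magic alone can fall short for container formats: many document formats are packaged as ZIP archives, so a check of the leading bytes can only report the container, not what is inside it.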
Where the name of the file is known, it is sometimes possible to guess the file type from the name or extension. Within the tika-mimetypes.xml file is a list of patterns which are used to identify the type from the filename.
However, because files may be renamed, this method of detection is quick but not always accurate.
This is provided within Tika by org.apache.tika.detect.NameDetector.
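The filename approach can be sketched in a few lines of plain Java. This is an illustration under stated assumptions rather than Tika's own code: the class name `NameSketch`, its `detect` method and the three glob patterns are invented for the example, and only simple `*.ext` patterns are handled, whereas Tika reads its patterns from tika-mimetypes.xml.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Apache Tika.
final class NameSketch {
    // Small pattern table of the "*.ext" form (an assumption for this sketch).
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.png", "image/png");
        GLOBS.put("*.txt", "text/plain");
    }

    // Guesses a type purely from the name; the file contents are never read.
    static String detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : GLOBS.entrySet()) {
            String suffix = entry.getKey().substring(1); // "*.pdf" becomes ".pdf"
            if (lower.endsWith(suffix)) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(detect("Report.PDF"));
    }
}
```

Because only the name is inspected, a PDF that has been renamed to `notes.txt` would be reported as text/plain here, which is the accuracy trade-off described above.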
The simplest way to detect a file's type is through the Tika facade class, which provides methods to detect based on a File, an InputStream, an InputStream together with a filename, a filename alone, and a few others. It works best with a File or TikaInputStream.
Alternatively, detection can be performed with a specific Detector, or with DefaultDetector to have all available Detectors used. A typical pattern would be something like: