
mjsignup/sanskrit

 
 

What

Code to identify the metre of a Sanskrit verse.

The web version is currently served at http://sanskritmetres.appspot.com/

Can also be used as a Python library.

Examples

In the web version, try the following inputs.

kāṣṭhād agnir jāyate mathya-mānād-
bhūmis toyaṃ khanya-mānā dadāti|
sotsāhānāṃ nāsty asādhyaṃ narāṇāṃ
mārgārabdhāḥ sarva-yatnāḥ phalanti||

or (note that this one intentionally has many typos):

काष्ठाद् अग्नि जायते
मथ्यमानाद्भूमिस्तोय खन्यमाना ददाति।
सोत्साहानां नास्त्यसाध्यं
नराणां मार्गारब्धाः सवयत्नाः फलन्ति॥

If using as a library (TODO: document this better):

import identifier_pipeline

verse = r'''kāṣṭhād agnir jāyate mathya-mānād-
bhūmis toyaṃ khanya-mānā dadāti|
sotsāhānāṃ nāsty asādhyaṃ narāṇāṃ
mārgārabdhāḥ sarva-yatnāḥ phalanti||'''

identifier = identifier_pipeline.IdentifierPipeline()
match_results = identifier.IdentifyFromText(verse)

How

The design of the program is as follows.

Transform the input (Read, Scan)

The input passes through the following representations.

The raw input

For the web form, whatever is typed into the textarea. Consider the examples above.

The input in slp1

Whatever the input script (transliteration scheme) used, the input is cleaned up and “read” into a limited Sanskrit alphabet (slp1). For instance, the examples above are read as the following:

kAzWAdagnirjAyatemaTyamAnAd
BUmistoyaMKanyamAnAdadAti
sotsAhAnAMnAstyasADyaMnarARAM
mArgArabDAHsarvayatnAHPalanti

and

kAzWAdagnijAyate
maTyamAnAdBUmistoyaKanyamAnAdadAti
sotsAhAnAMnAstyasADyaM
narARAMmArgArabDAHsavayatnAHPalanti

respectively.
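To make the "read" step concrete, here is a minimal sketch for IAST input only. The function name `iast_to_slp1` and the mapping table are assumptions made for this illustration, not the project's actual code (which also handles other input scripts, such as Devanagari). It also ignores the a+i versus ai ambiguity.

```python
import re
import unicodedata

# Hypothetical IAST -> SLP1 table for this sketch. Letters that are the
# same in both schemes (a, i, u, e, o, k, g, ...) are left out.
_RAW_TABLE = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ṃ": "M", "ṁ": "M", "ḥ": "H", "ṅ": "N", "ñ": "Y",
    "ṭ": "w", "ḍ": "q", "ṇ": "R", "ś": "S", "ṣ": "z",
}
# Normalise the keys so they compare equal to NFC-normalised input.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _RAW_TABLE.items()}

SLP1_ALPHABET = set("aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh")

# Longest keys first, so that digraphs such as "ṭh" win over "ṭ".
_TOKEN = re.compile("|".join(sorted(IAST_TO_SLP1, key=len, reverse=True)))


def iast_to_slp1(line):
    """Transliterate one IAST line to SLP1 and drop everything else."""
    line = unicodedata.normalize("NFC", line).lower()
    converted = _TOKEN.sub(lambda m: IAST_TO_SLP1[m.group(0)], line)
    # Spaces, hyphens, dandas and other junk are not in the SLP1 alphabet.
    return "".join(ch for ch in converted if ch in SLP1_ALPHABET)
```

Transliterating before stripping spaces means that a digraph is never formed across a word boundary.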

The metrical signature of the input

We next scan the input, to reduce it to a pattern of laghu (denoted L) and guru (denoted G) syllables.

Our two examples above are scanned into the lists:

['GGGGGLGGLGG',
 'GGGGGLGGLGL',
 'GGGGGLGGLGG',
 'GGGGGLGGLGL']

and

['GGGLGLG',
 'GLGGGGGLGLGGLGL',
 'GGGGGLGG',
 'LGGGGGGLLGGLGL']

respectively.
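The scan can be sketched as follows. This is a simplified illustration, not the code in scan.py: it assumes a syllable is guru when its vowel is long, when it is followed by anusvāra or visarga, or when it is followed by two or more consonants, and it treats each line in isolation. The name `scan_line` is hypothetical.

```python
VOWELS = set("aAiIuUfFxXeEoO")
SHORT_VOWELS = set("aiufx")
CONSONANTS = set("kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh")


def scan_line(slp1):
    """Reduce one SLP1 line to a string over {'L', 'G'}."""
    pattern = []
    i, n = 0, len(slp1)
    while i < n:
        if slp1[i] not in VOWELS:
            i += 1  # skip consonants before the first vowel
            continue
        # Everything between this vowel and the next one.
        j = i + 1
        while j < n and slp1[j] not in VOWELS:
            j += 1
        tail = slp1[i + 1:j]
        heavy = (
            slp1[i] not in SHORT_VOWELS               # long vowel
            or "M" in tail or "H" in tail             # anusvāra / visarga
            or sum(c in CONSONANTS for c in tail) >= 2  # consonant cluster
        )
        pattern.append("G" if heavy else "L")
        i = j
    return "".join(pattern)
```

Under these assumptions, `scan_line("kAzWAdagnirjAyatemaTyamAnAd")` gives `'GGGGGLGGLGG'`, the first pattern listed above.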

Identify

Finally, we compare this metrical signature (or “pattern lines”) against a database of known patterns.

For example, our database records that Śālinī is a sama-vṛtta metre consisting of 4 lines (pāda-s / quarters), each having the pattern

GGGG—GLGGLGG

Thus Śālinī is recognized as the (probable, best-guess) metre of the input verse.
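Note that the second and fourth lines of the first example end in L rather than G. A minimal sketch of how such lines can still match, assuming the last syllable of a pāda may be either laghu or guru (the helper `pada_regex` is hypothetical, and the pattern is the one quoted above with the break marker dropped):

```python
import re

SALINI_PADA = "GGGGGLGGLGG"


def pada_regex(pattern):
    """Regex for `pattern` with its final syllable allowed to be L or G."""
    return re.compile(re.escape(pattern[:-1]) + "[LG]")


scanned = ["GGGGGLGGLGG", "GGGGGLGGLGL", "GGGGGLGGLGG", "GGGGGLGGLGL"]
salini = pada_regex(SALINI_PADA)
# True when every scanned line fits the pāda pattern.
all_match = all(salini.fullmatch(line) is not None for line in scanned)
```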

Note that in the second example, even though no line matches a line of Śālinī, the program is still clever enough to detect a match.

This is because apart from the full verse, we also try matching:

  • each “half” (half of the syllables),
  • each “quarter”,
  • each line of the input. This is why we take verse lines as input, rather than a single blob of text as a string.

Thus the code can detect partial matches: if there are metrical errors in the verse, but some parts of it are in some metre, then that metre still has a chance of being recognized.
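The sketch below shows why the second example is still recognized. It is an illustration under assumptions, not the project's matching code: the helper names are hypothetical, only Śālinī is in the toy "database", and halves and quarters are cut purely by syllable count.

```python
import re

PADA = "GGGGGLGGLG[LG]"  # one Śālinī pāda, final syllable free
PARTIAL_REGEXES = [
    (re.compile(PADA), "pada"),          # one quarter of the metre
    (re.compile(PADA + PADA), "ardha"),  # one half of the metre
]


def split_evenly(pattern, parts):
    """Cut `pattern` into `parts` pieces of (nearly) equal length."""
    n = len(pattern)
    cuts = [k * n // parts for k in range(parts + 1)]
    return [pattern[a:b] for a, b in zip(cuts, cuts[1:])]


def partial_matches(lines):
    """Return (kind, index, match_type) for every piece that matches."""
    full = "".join(lines)
    candidates = {
        "line": lines,
        "half": split_evenly(full, 2),
        "quarter": split_evenly(full, 4),
    }
    hits = []
    for kind, pieces in candidates.items():
        for index, piece in enumerate(pieces, start=1):
            for regex, match_type in PARTIAL_REGEXES:
                if regex.fullmatch(piece):
                    hits.append((kind, index, match_type))
    return hits


typo_lines = ["GGGLGLG", "GLGGGGGLGLGGLGL", "GGGGGLGG", "LGGGGGGLLGGLGL"]
print(partial_matches(typo_lines))  # [('quarter', 3, 'pada')]
```

No input line has the right length, and three of the four quarters contain a typo, but the third quarter (syllables 23 to 33) is a clean Śālinī pāda.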

We might also return multiple results when multiple metres are guessed, such as when different lines are in different metres.

TODO: Describe the matching heuristics in more detail here.

Display

The detected metre is displayed, along with how the verse fits the metre, and information about the metre.

TODO: Describe this.


(Everything below this line needs even more rewriting.)

Code organization

See deps.png for the dependency graph.

Read

Covered by the files in read and their dependencies.

This stage detects the transliteration format of the input, removes junk characters that are not part of the verse, and transliterates the input to SLP1 (the encoding we use internally).
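As a rough idea of what format detection can look like, the sketch below tells apart only the two input styles used in the examples above. The function name `guess_scheme` and its return labels are assumptions for illustration; the real detection may cover more schemes and work differently.

```python
import unicodedata

# Letters with diacritics that are typical of IAST.
IAST_MARKS = set(unicodedata.normalize("NFC", "āīūṛṝḷḹṃṁḥṅñṭḍṇśṣ"))


def guess_scheme(text):
    """Guess the input scheme: 'devanagari', 'iast' or 'unknown'."""
    text = unicodedata.normalize("NFC", text)
    # The Devanagari Unicode block is U+0900 to U+097F.
    if any("\u0900" <= ch <= "\u097f" for ch in text):
        return "devanagari"
    if any(ch in IAST_MARKS for ch in text.lower()):
        return "iast"
    return "unknown"
```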

Scan

Determining the pattern of gurus and laghus.

The functions in scan.py take this cleaned-up verse, and convert it to a pattern of laghus and gurus. A “pattern” means a sequence over the alphabet {‘L’, ‘G’}.

Identify

Identification algorithm: Given a verse,

  1. Look for the full verse’s pattern in known_metre_patterns.
  2. Loop through known_metre_regexes and see if any match the full verse’s pattern.
  3. Look in known_partial_patterns (then known_partial_regexes) for the whole verse, each line, each half, and each quarter.
  4. [TODO/Maybe] Look for substrings, find closest match, etc.? Might have to restrict to the popular metres for efficiency.
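Steps 1 to 3 can be sketched as below. The table names come from the list above, but their contents are toy entries, the values are plain strings for brevity, and the assumption that a full match ends the search early is ours; `identify` and `split_evenly` are hypothetical helpers.

```python
import re

PADA = "GGGGGLGGLGG"  # Śālinī pāda
known_metre_patterns = {PADA * 4: "Śālinī"}
known_metre_regexes = [(re.compile("(GGGGGLGGLG[LG]){4}"), "Śālinī")]
known_partial_patterns = {PADA: "Śālinī (one pāda)"}
known_partial_regexes = [(re.compile("GGGGGLGGLG[LG]"), "Śālinī (one pāda)")]


def split_evenly(pattern, parts):
    n = len(pattern)
    cuts = [k * n // parts for k in range(parts + 1)]
    return [pattern[a:b] for a, b in zip(cuts, cuts[1:])]


def identify(lines):
    full = "".join(lines)
    # Step 1: exact lookup of the whole verse.
    if full in known_metre_patterns:
        return [known_metre_patterns[full]]
    # Step 2: regexes against the whole verse.
    results = [name for regex, name in known_metre_regexes
               if regex.fullmatch(full)]
    if results:
        return results
    # Step 3: partial tables for the verse, lines, halves and quarters.
    pieces = [full] + list(lines) + split_evenly(full, 2) + split_evenly(full, 4)
    for piece in pieces:
        if piece in known_partial_patterns:
            found = [known_partial_patterns[piece]]
        else:
            found = [name for regex, name in known_partial_regexes
                     if regex.fullmatch(piece)]
        for name in found:
            if name not in results:
                results.append(name)
    return results
```

With these toy tables, the first example from the top of this document is resolved in step 2, and the second (typo-ridden) one only in step 3.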

Metrical data

  • A “pattern” means a sequence over the alphabet {‘L’, ‘G’}.
  • A “regex” (for us) is a regular expression that matches some patterns.

(TODO: This is obsolete.) We use the following data structures:

  • known_metre_patterns, a dict mapping a pattern to a MatchResult.
  • known_metre_regexes, a list of pairs (regex, MatchResult).
  • known_partial_patterns, a dict mapping a pattern to MatchResult-s.
  • known_partial_regexes, a list of pairs (regex, MatchResult).

    A MatchResult is usually arrived at by looking at a pattern (or list of patterns), and can be seen as a tuple (metre_name, match_type):

      • metre_name: the name of the metre.
      • match_type: used to distinguish between matching one pāda (quarter) or one ardha (half) of a metre. In ardha-sama metres, it can also distinguish between odd and even pādas.
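For illustration, these structures could be populated for a sama-vṛtta metre roughly as follows. This is a sketch under assumptions: the `match_type` labels ("full", "ardha", "pada") and the helper `add_sama_vrtta` are invented here, and the description above is itself marked as obsolete.

```python
from collections import namedtuple

MatchResult = namedtuple("MatchResult", ["metre_name", "match_type"])

known_metre_patterns = {}    # pattern -> MatchResult
known_partial_patterns = {}  # pattern -> list of MatchResult


def add_sama_vrtta(name, pada):
    """Register a metre whose four pādas all share the pattern `pada`."""
    known_metre_patterns[pada * 4] = MatchResult(name, "full")
    known_partial_patterns.setdefault(pada * 2, []).append(
        MatchResult(name, "ardha"))
    known_partial_patterns.setdefault(pada, []).append(
        MatchResult(name, "pada"))


add_sama_vrtta("Śālinī", "GGGGGLGGLGG")
```

Keeping a list per partial pattern allows the same pāda pattern to belong to more than one metre.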

Display

Display the list of metres found as possible guesses. For vṛtta metres, we also try to “align” the input verse to the metre, so that it is clearer where to break it, etc. (And when the input verse has metrical errors, it is clear what they are.)
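How the alignment is computed is not described here. As a stand-in, the sketch below uses `difflib.SequenceMatcher` from the Python standard library to line up a scanned pattern with the expected one; the function name `align` is hypothetical.

```python
import difflib


def align(observed, expected):
    """Return (tag, observed_piece, expected_piece) triples.

    `tag` is one of 'equal', 'replace', 'delete' or 'insert', as
    produced by SequenceMatcher.get_opcodes().
    """
    matcher = difflib.SequenceMatcher(None, observed, expected, autojunk=False)
    return [
        (tag, observed[i1:i2], expected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# First quarter of the typo-ridden example against the Śālinī pāda.
for tag, got, want in align("GGGLGLGGLGG", "GGGGGLGGLGG"):
    marker = " " if tag == "equal" else "!"
    print(marker, tag, repr(got), repr(want))
```

Pieces tagged anything other than `equal` mark where the verse departs from the metre.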
