xtract - NCBI Entrez Direct XML conversion and transformation tool
Description
xtract converts an XML document into a table of data values according to user-specified rules.
Name
xtract - NCBI Entrez Direct XML conversion and transformation tool
Notes
String constraints use case-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
Options
Processing Flags
-strict
Remove HTML and MathML tags.
-mixed Allow mixed content XML.
-self Allow detection of empty self-closing tags.
-accent
Delete Unicode accents and diacritical marks.
-ascii Convert Unicode to numeric HTML character entities.
-compress
Compress runs of spaces.
-stops Retain stop words in selected phrases.
Data Source
-input filename
Read XML from file instead of standard input.
-transform filename
File of substitutions for -translate.
-aliases filename
Mappings file for -classify operation.
Exploration Argument Hierarchy
-pattern expr
-group expr
-block expr
-subset expr
Name of record within set. Use of different argument names allows command-line control of nested looping.
Path Navigation
-path path
Explore by list of adjacent object names.
Exploration Constructs
Object DateRevised
Parent/Child Book/AuthorList
Path MedlineCitation/Article/Journal/JournalIssue/PubDate
Heterogeneous "PubmedArticleSet/*"
Exhaustive "History/**"
Nested "*/Taxon"
Conditional Execution
-if expr [constraint]
Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]
Skip if element matches.
-and condition
Preceding and following tests must both pass.
-or condition
Any passing test suffices.
-else Execute if conditional test failed.
-position pos
first/last/outer/inner/even/odd/all.
String Constraints
-equals str
String must match exactly.
-contains str
Substring must be present.
-includes str
Substring must match at word boundaries.
-is-within str
String must be present.
-starts-with str
Substring must be at beginning.
-ends-with str
Substring must be at end.
-is-not str
String must not match.
-is-before str
First string < second string.
-is-after str
First string > second string.
-matches str
Matches without commas or semicolons.
-resembles str
Requires all words, but in any order.
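The difference between -contains, -includes, and -resembles is easiest to see side by side. The following is a minimal Python sketch of the three tests as they are described above, using the case-insensitive comparison mentioned in the Notes. The function names are hypothetical and this is not xtract's own code; in particular, the way words are tokenized is an assumption.

```python
import re


def contains(text: str, query: str) -> bool:
    """Case-insensitive substring test, in the spirit of -contains."""
    return query.lower() in text.lower()


def includes(text: str, query: str) -> bool:
    """Substring test that must begin and end on word boundaries, in the spirit of -includes."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def resembles(text: str, query: str) -> bool:
    """True when every word of the query occurs in the text, in any order."""
    def words(s: str) -> set:
        # Assumption: a "word" is a run of letters, digits, or underscores.
        return set(re.findall(r"\w+", s.lower()))
    return words(query) <= words(text)


if __name__ == "__main__":
    title = "Phospholipase A2 from rattlesnake venom"
    print(contains(title, "lipase"))              # True: substring is present
    print(includes(title, "lipase"))              # False: not at a word boundary
    print(resembles(title, "venom rattlesnake"))  # True: both words are present
```

Under these assumptions, "lipase" satisfies the substring test against "Phospholipase" but fails the word-boundary test, which is the practical reason to choose -includes over -contains.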
Object Constraints
-is-equal-to expr
Object values must match.
-differs-from expr
Object values must differ.
Numeric Constraints
-gt N Greater than.
-ge N Greater than or equal to.
-lt N Less than.
-le N Less than or equal to.
-eq N Equal to.
-ne N Not equal to.
Format Customization
-ret str
Override line break between patterns.
-tab str
Replace tab character between fields.
-sep str
Separator between group members.
-pfx str
Prefix to print before group.
-sfx str
Suffix to print after group.
-rst Reset -sep through -elg.
-clr Clear queued tab separator.
-pfc str
Preface combines -clr and -pfx.
-deq str
Delete and replace queued tab separator.
-def str
Default placeholder for missing fields.
-lbl str
Insert arbitrary text.
XML Generation
-set tag
XML tag for entire set.
-rec tag
XML tag for each record.
-wrp tag
Wrap elements in XML object.
-enc tag
Encase instance in XML object.
-plg str
Prologue to print before instance.
-elg str
Epilogue to print after instance.
-pkg tag
Package subset in XML object.
-fwd str
Foreword to print before subset.
-awd str
Afterword to print after subset.
Tag and Attribute Construction
-tag tag
Start with <tag.
-att key value
Attribute key and value.
-cls Close with >.
-slf Self-close with />.
-end tag
End contents with </tag>.
Element Selection
-element element
Print all items that match tag name.
-first element
Only print value of first item.
-last element
Only print value of last item.
-backward element
Print values in reverse order.
-NAME Record value in named variable.
--STATS
Accumulate values into variable.
-element Constructs
Tag Caption
Group Initials,LastName
Parent/Child MedlineCitation/PMID
Recursive "**/Gene-commentary_accession"
Unrestricted PubDate/*
Attribute DescriptorName@MajorTopicYN
Range MedlineDate[1:4]
Substring "Title[phospholipase|rattlesnake]"
Object Count "#Author"
Item Length "%Title"
Element Depth "^PMID"
Variable "&NAME"
Special -element Operations
Parent Index "+"
Object Name "?"
Object Value "~"
XML Subtree "*"
Children "$"
Attributes "@"
ASN.1 Record "."
JSON Record "%"
Numeric Processing
-num element
Count.
-len element
Length.
-sum element
Sum.
-acc element
Accumulator.
-min element
Minimum.
-max element
Maximum.
-inc element
Increment.
-dec element
Decrement.
-sub element
Difference.
-avg element
Average.
-dev element
Deviation.
-med element
Median.
-mul element
Product.
-div element
Quotient.
-mod element
Remainder.
-bin element
Binary.
-oct element
Octal.
-hex element
Hexadecimal.
-bit element
Bit count.
-pad element
Zero-pad to eight digits.
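The base-conversion and padding operations are terse in the list above, so the following Python sketch shows what each representation looks like for an integer value. The function names are hypothetical, and details such as the letter case of hexadecimal digits are assumptions rather than documented xtract behavior.

```python
def to_binary(n: int) -> str:
    """Binary digits of n, as -bin is described."""
    return format(n, "b")


def to_octal(n: int) -> str:
    """Octal digits of n, as -oct is described."""
    return format(n, "o")


def to_hex(n: int) -> str:
    """Hexadecimal digits of n, as -hex is described (uppercase is an assumption)."""
    return format(n, "X")


def bit_count(n: int) -> int:
    """Number of 1 bits in the binary form of n, as -bit is described."""
    return bin(n).count("1")


def pad8(n: int) -> str:
    """Zero-pad n on the left to eight digits, as -pad is described."""
    return f"{n:08d}"


if __name__ == "__main__":
    for value in (7, 42, 255):
        print(value, to_binary(value), to_octal(value), to_hex(value),
              bit_count(value), pad8(value))
```

Zero-padding to a fixed width is mainly useful when numeric identifiers have to sort correctly as text.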
Character Processing
-encode element
XML-encode <, >, &, ", and ' characters.
-upper element
Convert text to uppercase.
-lower element
Convert text to lowercase.
-chain element
Change spaces to underscores.
-title element
Capitalize initial letters of words.
-mirror element
Reverse order of letters.
-alnum element
Non-alphanumeric characters to space.
String Processing
-basic element
Convert superscripts and subscripts.
-plain element
Remove embedded mixed-content markup tags.
-simple element
Normalize accented letters; spell Greek letters.
-author element
Multi-step author cleanup.
-prose element
Text conversion to ASCII.
Text Processing
-terms element
Partition text at spaces.
-words element
Split at punctuation marks.
-pairs element
Adjacent informative words.
-order element
Rearrange words in sorted order.
-reverse element
Reverse words in string.
-letters element
Separate individual letters.
-clauses element
Break at phrase separators.
Citation Functions
-year element
Extract first 4-digit year from string.
-month element
Match first month name and return a corresponding integer.
-date element
YYYY/MM/DD from -unit "PubDate" -date "*"
-page element
Get digits (and letters) of first page number.
-auth element
Change GenBank authors to Medline form.
-initials element
Parse initials from forename or given name.
-jour element
Clean up journal name punctuation.
-trim element
Remove extra spaces and leading zeros.
-wct element
Count number of -words in a string.
-doi element
Add https://doi.org/ prefix, URL encode.
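As an illustration of the -year and -month descriptions, the Python sketch below pulls the first 4-digit year and the first month name out of a free-text date. The function names are hypothetical, and the exact matching rules (a standalone run of four digits; full month names or three-letter abbreviations) are assumptions, not a statement of how xtract parses dates.

```python
import re
from typing import Optional

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# Map both full names and three-letter abbreviations to month numbers.
MONTH_NUMBER = {}
for number, name in enumerate(MONTHS, start=1):
    MONTH_NUMBER[name] = number
    MONTH_NUMBER[name[:3]] = number


def first_year(text: str) -> Optional[str]:
    """Return the first standalone run of exactly four digits, or None."""
    match = re.search(r"(?<!\d)\d{4}(?!\d)", text)
    return match.group(0) if match else None


def first_month(text: str) -> Optional[int]:
    """Return the number (1-12) of the first month name found, or None."""
    for word in re.findall(r"[A-Za-z]+", text):
        number = MONTH_NUMBER.get(word.lower())
        if number is not None:
            return number
    return None


if __name__ == "__main__":
    medline_date = "2015 Jan-Feb"
    print(first_year(medline_date))   # 2015
    print(first_month(medline_date))  # 1
```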
Value Transformation
-translate element
Substitute values with -transform table.
-classify element
Substring word or phrase matches to -aliases table.
Regular Expression
-replace
Substitute text using regular expressions.
-reg target Target expression.
-exp pattern Replacement pattern.
Sequence Processing
-revcomp
Reverse complement nucleotide sequence.
-nucleic
Subrange determines forward or revcomp.
-fasta Split sequence into blocks of 70 uppercase letters.
-ncbi2na
Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
-ncbi4na
Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
-molwt Calculate molecular weight of peptide.
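Three of the sequence operations above can be sketched briefly in Python: reverse complement, splitting into 70-letter uppercase blocks, and expanding 2-bit packed bases. The function names are hypothetical and this is not xtract's own code. The sketch assumes unambiguous bases only for the reverse complement, and assumes the ncbi2na layout of four bases per byte, highest-order bits first, with A=0, C=1, G=2, T=3. The packed input is taken as raw bytes here; the form in which xtract receives it is not covered.

```python
from typing import List, Optional

# Complement table for unambiguous bases only (assumption: no IUPAC ambiguity codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def fasta_blocks(seq: str, width: int = 70) -> List[str]:
    """Uppercase the sequence and split it into blocks of `width` letters."""
    seq = seq.upper()
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def expand_ncbi2na(data: bytes, length: Optional[int] = None) -> str:
    """Expand 2-bit packed bases, four per byte, high-order bits first."""
    bases = "ACGT"
    letters = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            letters.append(bases[(byte >> shift) & 0b11])
    seq = "".join(letters)
    # The last byte may carry padding, so truncate to the real length if known.
    return seq if length is None else seq[:length]


if __name__ == "__main__":
    print(revcomp("AACGT"))                          # ACGTT
    print(fasta_blocks("acgt" * 20))                 # one block of 70, one of 10
    print(expand_ncbi2na(bytes([0x1B, 0x00]), 5))    # ACGTA
```

The optional `length` argument corresponds to the caveat above: a packed sequence always expands to a multiple of four bases, so the result may need to be truncated to the actual sequence length.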
Sequence Coordinates
-0-based element
Zero-based.
-1-based element
One-based.
-ucsc-based element
Half-open.
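The three coordinate conventions named above differ only in where counting starts and whether the end position is included. The Python sketch below converts one interval among them as a general illustration. The function names are hypothetical, and the flag descriptions do not state which convention xtract assumes for its input, so this is not a description of what each flag does to a given element.

```python
from typing import Tuple

Interval = Tuple[int, int]


def one_based_to_zero_based(start: int, stop: int) -> Interval:
    """Closed one-based interval to closed zero-based: shift both ends down by one."""
    return start - 1, stop - 1


def one_based_to_half_open(start: int, stop: int) -> Interval:
    """Closed one-based interval to zero-based half-open: shift only the start."""
    return start - 1, stop


def half_open_to_one_based(start: int, stop: int) -> Interval:
    """Zero-based half-open interval back to closed one-based."""
    return start + 1, stop


if __name__ == "__main__":
    one_based = (10, 20)  # bases 10 through 20 inclusive, 11 bases in total
    print(one_based_to_zero_based(*one_based))  # (9, 19)
    print(one_based_to_half_open(*one_based))   # (9, 20)
```

In the half-open form the interval length is simply stop minus start (20 - 9 = 11), which is one reason that convention is popular for range arithmetic.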
Command Generator
-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
Descriptor(s) INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
Completeness complete/partial
Feature(s) CDS/mRNA/...[,...]
Qualifier(s) INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]
Frequency Table
-histogram
Collects data for sort-uniq-count(1) on entire set of records.
Entrez Indexing
-e2index [extras]
Create Entrez index XML. extras (true or false; false by default) indicates whether to index extra fields.
-indices element
Index normalized words.
-article element
Title positional index.
-abstract element
Abstract positional index.
-paragraph element
Index text paragraphs.
-stemmed element
Apply Porter2 algorithm.
Output Organization
-head str
Print before everything else.
-tail str
Print after everything else.
-hd str
Print before each record.
-tl str
Print after each record.
Record Selection
-select condition
Select record subset by conditions.
-in filename
File of identifiers to use for selection.
Record Rearrangement
-sort[-fwd] element
Element to use as sort key.
-sort-rev element
Sort records in reverse order.
Reformatting
-format fmt
copy Fast block copy (still applies processing flags).
compact Compress runs of spaces.
flush Suppress line indentation.
indent Indent according to nesting depth.
expand Place each attribute on a separate line.
Validation
-verify
Report XML data integrity problems.
Summary
-outline
Display outline of XML structure.
-synopsis
Display individual XML paths.
-contour [delimiter]
Display XML paths to leaf nodes (delimited by / by default).
Full Exploration Command Precedence
-pattern -path -division -group -branch -block -section -subset -unit
Documentation
-help Print usage information and some example argument combinations.
-examples
Complete usage examples, involving additional Entrez Direct tools.
-unix Illustrate common Unix command arguments.
-version
Print version number.
See Also
archive-pmc(1), archive-pubmed(1), custom-index(1), disambiguate-nucleotides(1), download-ncbi-data(1), ds2pme(1), esample(1), fetch-pmc(1), fetch-pubmed(1), find-in-gene(1), fuse-segments(1), gene2range(1), hgvs2spdi(1), index-extras(1), index-pubmed(1), pma2pme(1), rchive(1), snp2hgvs(1), snp2tbl(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xml2fsa(1), xml2tbl(1), xy-plot(1).
NCBI 2023-03-31 XTRACT(1)
Synopsis
xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename]
[-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr]
[-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else]
[-position pos] [-equals str] [-contains str] [-includes str] [-is-within str] [-starts-with str]
[-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str]
[-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str]
[-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str]
[-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str]
[-tag tag] [-att key value] [-cls] [-slf] [-end tag] [-element element] [-first element] [-last element]
[-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element]
[-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element]
[-med element] [-mul element] [-div element] [-mod element] [-bin element] [-oct element] [-hex element]
[-bit element] [-pad element] [-encode element] [-upper element] [-lower element] [-chain element]
[-title element] [-mirror element] [-alnum element] [-basic element] [-plain element] [-simple element]
[-author element] [-prose element] [-terms element] [-words element] [-pairs element] [-order element]
[-reverse element] [-letters element] [-clauses element] [-year element] [-month element] [-date element]
[-page element] [-auth element] [-initials element] [-jour element] [-trim element] [-wct element]
[-doi element] [-translate element] [-classify element] [-replace -reg target -exp replacement]
[-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element]
[-ucsc-based element] [-insd arg ...] [-histogram] [-e2index [extras]] [-indices element]
[-article element] [-abstract element] [-paragraph element] [-stemmed element] [-head str] [-tail str]
[-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element]
[-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples]
[-unix] [-version]
