[Application-profiles-ig] Python implementation(s)

Sun Dec 6 18:53:51 GMT 2020

Hi Tom,

Thank you for the explanation. I revisited the documentation and realized that I had misunderstood 'restval'; I was probably misled by this [1]. I stand corrected.

I generally perform a set of modifications and cleanup on the dictionary obtained from the CSV reader. At this stage, I usually convert values to their proper types and replace values to match the requirements. Most of the time, I religiously perform a schema validation on the final dictionary to avoid surprises and prevent malformed input. My regular library is Cerberus [2], but schema [3] and good [4] are also worth mentioning.

Since DCTAP has a well-defined schema, and the CSV is relatively small, I think it is a good idea to validate it against the schema before the data structure is exported. That way we can make sure that values like mandatory or repeatable are properly set for truthiness.

About utf-8-sig, I don't have much idea whether it breaks anything. UTF-8 with BOM is not a common use case, since the BOM is not mandatory per the UTF-8 specification. If the file is from Excel, it is advised to set the dialect to excel for the csv reader.

Nishad

[1] https://bugs.python.org/issue40013
[2] https://docs.python-cerberus.org/en/stable/
[3] https://github.com/kolypto/py-good

From: Thomas Baker <tbaker at tombaker.org>
Date: Sunday, December 6, 2020 18:54
To: Nishad Thalhath <nishad at thalhath.org>
Cc: Thomas Baker <tbaker at tombaker.org>, Phil Barker <phil.barker at pjjk.co.uk>, application-profiles-ig at lists.www.voudr.com <application-profiles-ig at lists.www.voudr.com>
Subject: Re: [Application-profiles-ig] Python implementation(s)

Hi Nishad,

On 2020-12-05 04:23, Nishad Thalhath wrote:
> A quick response on CSV reader for empty values: The
> default behavior always depends on the
> implementation/library/subclasses. If we assume that
> we use the built-in CSV module to parse, in
> 'csv.DictReader', the default is 'None', and we can
> override it with 'restval'.

I do use csv.DictReader [1].
In my reading, the restval [2] default of None is used when one of the fields specified in the optional "fieldnames" argument is missing. Rather than enumerate fields in the "fieldnames" argument, however, I simply take the values in the first row as fieldnames. This seems to be clearly the better option given our flexibility about which fields to use and the order in which they are used. So the restval default would never apply because there would never be missing fields, just empty string values for existing fields.

However, the use of None as a default for missing fields does seem like a good hint for using None as a default for the dataclass parameters (instead of a blank string) - even for elements intended to hold Boolean values, such as 'mandatory'. Do you agree?

On the subject of DictReader, note that I open the file object with encoding="utf-8-sig" (instead of just "utf-8"). This means it will handle Excel files, which have a U+FEFF Byte Order Mark, but it seems to work fine with CSV files created by other means, such as text editors. Do you see any reason not to use utf-8-sig?

Tom

[1] https://github.com/tombaker/csv2shex/blob/csvschema/csv2shex/csvreader.py#L22
[2] https://docs.python.org/3/library/csv.html

--
Tom Baker <tom at tombaker.org>
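
For readers following the thread, the DictReader behavior discussed above can be checked with a small standard-library sketch. It is not taken from csv2shex; the sample data, the helper names (read_rows, normalize) and the set of accepted Boolean strings are assumptions made for illustration only. It shows three things: an empty cell arrives as an empty string, a row that is shorter than the header row still gets restval (None by default) even when the fieldnames come from the first row, and utf-8-sig strips a leading BOM while also reading BOM-less input.

```python
import csv
import io

# Hypothetical sample data: UTF-8 bytes that start with a BOM (as Excel writes),
# an empty cell in the first data row, and a second data row one cell short.
RAW = "\ufeffpropertyID,mandatory,repeatable\r\ndc:title,,yes\r\ndc:date,no\r\n".encode("utf-8")

# Assumption: the thread does not say which strings count as true or false.
TRUE_STRINGS = {"true", "yes", "1"}
FALSE_STRINGS = {"false", "no", "0"}


def read_rows(data, encoding):
    """Decode bytes, then parse with the header row supplying the fieldnames."""
    stream = io.StringIO(data.decode(encoding), newline="")
    reader = csv.DictReader(stream, dialect="excel")
    rows = list(reader)
    return reader.fieldnames, rows


def normalize(row, boolean_fields=("mandatory", "repeatable")):
    """Map empty or missing values to None and Boolean-like strings to bool."""
    cleaned = {}
    for key, value in row.items():
        text = value.strip() if isinstance(value, str) else ""
        if not text:
            cleaned[key] = None  # covers both "" (empty cell) and None (restval)
        elif key in boolean_fields:
            lowered = text.lower()
            if lowered in TRUE_STRINGS:
                cleaned[key] = True
            elif lowered in FALSE_STRINGS:
                cleaned[key] = False
            else:
                raise ValueError(f"unexpected value for {key}: {text!r}")
        else:
            cleaned[key] = text
    return cleaned


fieldnames, rows = read_rows(RAW, "utf-8-sig")
cleaned_rows = [normalize(row) for row in rows]
```

With plain "utf-8" the first fieldname would keep the U+FEFF character in front of it, so a lookup by the expected column name would miss. Mapping both the empty string and None to None in one cleanup step keeps later validation simple, whichever library is used for it.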